* [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile
@ 2024-07-26 9:46 Barry Song
2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
` (4 more replies)
0 siblings, 5 replies; 59+ messages in thread
From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
v-songbaohua, willy, xiang, yosryahmed
From: Barry Song <v-songbaohua@oppo.com>
In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app
is switched to the background, most of its memory might be swapped out.
We now have mTHP features, but unfortunately, without support for
large folio swap-in, once those large folios are swapped out, we lose
them immediately because mTHP is a one-way ticket.
This is unacceptable and reduces mTHP to merely a toy on systems
with significant swap utilization.
This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swaps that were likely swapped out from mTHP as
a whole.
Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions
of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.
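To make the "contiguous swaps swapped out from mTHP as a whole" requirement
concrete: patch 3/4 only considers folio orders whose first swap offset is
naturally aligned, and then verifies that every PTE in the candidate range
holds a contiguous swap entry with a consistent SWAP_HAS_CACHE state (see
can_swapin_thp() in patch 3/4). Below is a condensed sketch of the order
filter, simplified from the patch:

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
		unsigned long addr, unsigned long orders)
{
	int order, nr;

	order = highest_order(orders);

	/*
	 * To swap in a THP with nr pages, we require its first swap_offset
	 * to be aligned with nr. This filters out most invalid entries.
	 */
	while (orders) {
		nr = 1 << order;
		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
			break;
		order = next_order(&orders, order);
	}

	return orders;
}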
This approach offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in.
2. Eliminates fragmentation of swap slots and enables THP_SWPOUT to succeed
without fragmentation. Based on the data [1] observed with Chris's and Ryan's
THP swap allocation optimization, aligned swap-in plays a crucial role
in the success of THP_SWPOUT.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
and enhancing compression ratios significantly. We have another patchset
to enable mTHP compression and decompression in zsmalloc/zRAM[2].
Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
to be an optimal approach. There's a critical distinction between pagecache
and anonymous pages: pagecache can be evicted and later retrieved from disk,
potentially becoming an mTHP upon retrieval, whereas anonymous pages must
always reside in memory or in a swapfile. If we swap in small folios first and
only later identify adjacent memory suitable for swapping in as mTHP, the
pages that have already been converted to small folios may never transition
back to mTHP; converting an mTHP into small folios remains irreversible. This
introduces the risk of losing all mTHP over several swap-out and swap-in
cycles, not to mention losing the benefits of defragmentation, improved
compression ratios, and reduced CPU usage from mTHP compression/decompression.
Conversely, having deployed this feature on millions of real-world products
through OPPO's out-of-tree code [3], we haven't observed any significant
increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
[1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
[3] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
-v5:
* Add a swap-in control policy according to Ying's proposal. Right now only
"always" and "never" are supported; we can extend it to "auto" later;
* Fix the comment regarding zswap_never_enabled() according to Yosry;
* Filter out unaligned swp entries earlier;
* Add a mem_cgroup_swapin_uncharge_swap_nr() helper (see the sketch after
this list)
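For reference, the helper as posted in patch 2/4 of this series simply walks
the nr entries one by one; based on Yosry's and Matthew's feedback later in
this thread, v6 instead passes nr_pages straight through to
mem_cgroup_uncharge_swap() so that the page counter, stats, and refcount
updates are batched:

static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr)
{
	int i;

	/* Uncharge each of the nr contiguous swap entries in turn. */
	for (i = 0; i < nr; i++, entry.val++)
		mem_cgroup_swapin_uncharge_swap(entry);
}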
-v4:
https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/
Many parts of v3 have been merged into the mm tree with review help
from Ryan, David, Ying, Chris, and others. Thank you very much!
This is the final part to allocate large folios and map them.
* Use Yosry's zswap_never_enabled(); note there is a bug. I put the bug fix
in this v4 RFC, though it should be fixed in Yosry's patch
* Lots of code improvements (drop the large stack usage, hold the PTL, etc.)
according to Yosry's and Ryan's feedback
* Rebased on top of the latest mm-unstable and used some recently
introduced helpers.
-v3:
https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
* Avoid overwriting err in __swap_duplicate_nr(), pointed out by Yosry,
thanks!
* Fix the issue of the folio being charged twice in do_swap_page by
separating alloc_anon_folio() and alloc_swap_folio(), as they now differ
in several ways:
* memcg charging
* whether the allocated folio is cleared
-v2:
https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
* Lots of code cleanup according to Chris's comments, thanks!
* Collect Chris's Ack tags, thanks!
* Address David's comment on moving to folio_add_new_anon_rmap()
for !folio_test_anon folios in do_swap_page, thanks!
* Remove the MADV_PAGEOUT patch from this series as Ryan will
integrate it into the swap-out series
* Apply Kairui's work "mm/swap: fix race when skipping swapcache"
to large folio swap-in as well
* Fix corrupted (zero-filled) data in two races -- zswap, and the case
where part of the entries are in the swapcache while others are not --
by checking SWAP_HAS_CACHE while swapping in a large folio
-v1:
https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t
Barry Song (3):
mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
large folios swap-in
mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large
folios swap-in
mm: Introduce per-thpsize swapin control policy
Chuanhua Han (1):
mm: support large folios swapin as a whole for zRAM-like swapfile
Documentation/admin-guide/mm/transhuge.rst | 6 +
include/linux/huge_mm.h | 1 +
include/linux/memcontrol.h | 12 ++
include/linux/swap.h | 9 +-
mm/huge_memory.c | 44 +++++
mm/memory.c | 212 ++++++++++++++++++---
mm/swap.h | 10 +-
mm/swapfile.c | 102 ++++++----
8 files changed, 329 insertions(+), 67 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 59+ messages in thread* [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-30 3:00 ` Baolin Wang 2024-07-30 3:11 ` Matthew Wilcox 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song ` (3 subsequent siblings) 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports one entry only, to support large folio swap-in, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate_nr(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/swap.h | 9 +++- mm/swap.h | 10 ++++- mm/swapfile.c | 102 ++++++++++++++++++++++++++----------------- 3 files changed, 77 insertions(+), 44 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ba7ea95d1c57..f1b28fd04533 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -480,7 +480,7 @@ extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t); +extern int swapcache_prepare_nr(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -554,7 +554,7 @@ static inline int swap_duplicate(swp_entry_t swp) return 0; } -static inline int swapcache_prepare(swp_entry_t swp) +static inline int swapcache_prepare_nr(swp_entry_t swp, int nr) { return 0; } @@ -612,6 +612,11 @@ static inline void swap_free(swp_entry_t entry) swap_free_nr(entry, 1); } +static inline int swapcache_prepare(swp_entry_t entry) +{ + return swapcache_prepare_nr(entry, 1); +} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/swap.h b/mm/swap.h index baa1fa946b34..81ff7eb0be9c 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -59,7 +59,7 @@ void __delete_from_swap_cache(struct folio *folio, void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -120,7 +120,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) return 0; } -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +static inline void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { } @@ -172,4 
+172,10 @@ static inline unsigned int folio_swap_flags(struct folio *folio) return 0; } #endif /* CONFIG_SWAP */ + +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ + swapcache_clear_nr(si, entry, 1); +} + #endif /* _MM_SWAP_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 5f73a8553371..e688e46f1c62 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3363,7 +3363,7 @@ void si_swapinfo(struct sysinfo *val) } /* - * Verify that a swap entry is valid and increment its swap map count. + * Verify that nr swap entries are valid and increment their swap map counts. * * Returns error code in following case. * - success -> 0 @@ -3373,66 +3373,88 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. -> ENOMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +static int __swap_duplicate_nr(swp_entry_t entry, unsigned char usage, int nr) { struct swap_info_struct *p; struct swap_cluster_info *ci; unsigned long offset; unsigned char count; unsigned char has_cache; - int err; + int err, i; p = swp_swap_info(entry); offset = swp_offset(entry); + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); ci = lock_cluster_or_swap_info(p, offset); - count = p->swap_map[offset]; + err = 0; + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; - /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. - */ - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { - err = -ENOENT; - goto unlock_out; - } + /* + * swapin_readahead() doesn't check if a swap entry is valid, so the + * swap entry could be SWAP_MAP_BAD. Check here with lock held. 
+ */ + if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { + err = -ENOENT; + goto unlock_out; + } - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - err = 0; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; - if (usage == SWAP_HAS_CACHE) { + if (usage == SWAP_HAS_CACHE) { + /* set SWAP_HAS_CACHE if there is no cache and entry is used */ + if (!has_cache && count) + continue; + else if (has_cache) /* someone else added cache */ + err = -EEXIST; + else /* no users remaining */ + err = -ENOENT; - /* set SWAP_HAS_CACHE if there is no cache and entry is used */ - if (!has_cache && count) - has_cache = SWAP_HAS_CACHE; - else if (has_cache) /* someone else added cache */ - err = -EEXIST; - else /* no users remaining */ - err = -ENOENT; + } else if (count || has_cache) { - } else if (count || has_cache) { + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + continue; + else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) + err = -EINVAL; + else if (swap_count_continued(p, offset + i, count)) + continue; + else + err = -ENOMEM; + } else + err = -ENOENT; /* unused swap entry */ - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if (err) + goto unlock_out; + } + + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (usage == SWAP_HAS_CACHE) + has_cache = SWAP_HAS_CACHE; + else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) - err = -EINVAL; - else if (swap_count_continued(p, offset, count)) - count = COUNT_CONTINUED; else - err = -ENOMEM; - } else - err = -ENOENT; /* unused swap entry */ + count = COUNT_CONTINUED; - if (!err) - WRITE_ONCE(p->swap_map[offset], count | has_cache); + WRITE_ONCE(p->swap_map[offset + i], count | has_cache); + } unlock_out: unlock_cluster_or_swap_info(p, ci); return err; } +static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +{ + return __swap_duplicate_nr(entry, usage, 1); +} + /* * Help swapoff by noting that swap entry belongs to shmem/tmpfs * (in which case its reference count is never incremented). @@ -3459,23 +3481,23 @@ int swap_duplicate(swp_entry_t entry) } /* - * @entry: swap entry for which we allocate swap cache. + * @entry: first swap entry from which we allocate nr swap cache. * - * Called when allocating swap cache for existing swap entry, + * Called when allocating swap cache for existing swap entries, * This can return error codes. Returns 0 at success. * -EEXIST means there is a swap cache. * Note: return code is different from swap_duplicate(). */ -int swapcache_prepare(swp_entry_t entry) +int swapcache_prepare_nr(swp_entry_t entry, int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE); + return __swap_duplicate_nr(entry, SWAP_HAS_CACHE, nr); } -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { unsigned long offset = swp_offset(entry); - cluster_swap_free_nr(si, offset, 1, SWAP_HAS_CACHE); + cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } struct swap_info_struct *swp_swap_info(swp_entry_t entry) -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
@ 2024-07-30 3:00 ` Baolin Wang
  2024-07-30 3:11 ` Matthew Wilcox
  1 sibling, 0 replies; 59+ messages in thread
From: Baolin Wang @ 2024-07-30 3:00 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: ying.huang, chrisl, david, hannes, hughd, kaleshsingh, kasong,
    linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed

Hi Barry,

On 2024/7/26 17:46, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> one entry only, to support large folio swap-in, we need to handle multiple
> swap entries.
>
> To optimize stack usage, we iterate twice in __swap_duplicate_nr(): the
> first time to verify that all entries are valid, and the second time to
> apply the modifications to the entries.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM. Feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

By the way, my shmem swap patchset [1] also relies on this patch, so I
wonder if it's possible to merge this patch into the mm-unstable branch
first (if other patches still need discussion), to make it easier for me
to rebase and resend my patch set? Thanks.

[1] https://lore.kernel.org/all/cover.1720079976.git.baolin.wang@linux.alibaba.com/

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
  2024-07-30 3:00 ` Baolin Wang
@ 2024-07-30 3:11 ` Matthew Wilcox
  2024-07-30 3:15 ` Barry Song
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-30 3:11 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
    hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, yosryahmed

On Fri, Jul 26, 2024 at 09:46:15PM +1200, Barry Song wrote:
> +static inline int swapcache_prepare(swp_entry_t entry)
> +{
> +	return swapcache_prepare_nr(entry, 1);
> +}

Same comment as 2/4 -- there are only two callers of swapcache_prepare().
Just make that take the 'nr' argument and change both callers to pass 1.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-30 3:11 ` Matthew Wilcox
@ 2024-07-30 3:15 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-30 3:15 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
    hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, yosryahmed

On Tue, Jul 30, 2024 at 11:11 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 26, 2024 at 09:46:15PM +1200, Barry Song wrote:
> > +static inline int swapcache_prepare(swp_entry_t entry)
> > +{
> > +	return swapcache_prepare_nr(entry, 1);
> > +}
>
> Same comment as 2/4 -- there are only two callers of swapcache_prepare().
> Just make that take the 'nr' argument and change both callers to pass 1.

Makes sense to me. As Baolin also needs this patch for shmem, I'm going
to separate this one from this series and send a new version with the
suggested change so that Andrew can pull it earlier.

Thanks
Barry

^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song 2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-26 16:30 ` Yosry Ahmed 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song ` (2 subsequent siblings) 4 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> With large folios swap-in, we might need to uncharge multiple entries all together, it is better to introduce a helper for that. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..55958cbce61b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) +{ + int i; + + for (i = 0; i < nr; i++, entry.val++) + mem_cgroup_swapin_uncharge_swap(entry); +} + void __mem_cgroup_uncharge(struct folio *folio); /** @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) { } +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) +{ +} + static inline void mem_cgroup_uncharge(struct folio *folio) { } -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song @ 2024-07-26 16:30 ` Yosry Ahmed 2024-07-29 2:02 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Yosry Ahmed @ 2024-07-26 16:30 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang On Fri, Jul 26, 2024 at 2:47 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > With large folios swap-in, we might need to uncharge multiple entries > all together, it is better to introduce a helper for that. > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/memcontrol.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 1b79760af685..55958cbce61b 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++, entry.val++) > + mem_cgroup_swapin_uncharge_swap(entry); mem_cgroup_swapin_uncharge_swap() calls mem_cgroup_uncharge_swap() which already takes in nr_pages, but we currently only pass 1. Would it be better if we just make mem_cgroup_swapin_uncharge_swap() take in nr_pages as well and pass it along to mem_cgroup_uncharge_swap(), instead of calling it in a loop? This would batch the page counter, stats updates, and refcount updates in mem_cgroup_uncharge_swap(). You may be able to observe a bit of a performance gain with this. > +} > + > void __mem_cgroup_uncharge(struct folio *folio); > > /** > @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > { > } > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > +{ > +} > + > static inline void mem_cgroup_uncharge(struct folio *folio) > { > } > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 16:30 ` Yosry Ahmed @ 2024-07-29 2:02 ` Barry Song 2024-07-29 3:43 ` Matthew Wilcox 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 2:02 UTC (permalink / raw) To: yosryahmed Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang On Sat, Jul 27, 2024 at 4:31 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jul 26, 2024 at 2:47 AM Barry Song <21cnbao@gmail.com> wrote: > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > With large folios swap-in, we might need to uncharge multiple entries > > all together, it is better to introduce a helper for that. > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > include/linux/memcontrol.h | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 1b79760af685..55958cbce61b 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > gfp_t gfp, swp_entry_t entry); > > void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > > > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > > +{ > > + int i; > > + > > + for (i = 0; i < nr; i++, entry.val++) > > + mem_cgroup_swapin_uncharge_swap(entry); > > mem_cgroup_swapin_uncharge_swap() calls mem_cgroup_uncharge_swap() > which already takes in nr_pages, but we currently only pass 1. Would > it be better if we just make mem_cgroup_swapin_uncharge_swap() take in > nr_pages as well and pass it along to mem_cgroup_uncharge_swap(), > instead of calling it in a loop? > > This would batch the page counter, stats updates, and refcount updates > in mem_cgroup_uncharge_swap(). You may be able to observe a bit of a > performance gain with this. Good suggestion. I'll send the v6 version below after waiting for some comments on the other patches. From 92dfbf300fd51b427d2a6833226d1b777e0b5fee Mon Sep 17 00:00:00 2001 From: Barry Song <v-songbaohua@oppo.com> Date: Fri, 26 Jul 2024 14:33:54 +1200 Subject: [PATCH v6 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in With large folios swap-in, we might need to uncharge multiple entries all together, it is better to introduce a helper for that. 
Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 10 ++++++++-- mm/memcontrol.c | 7 ++++--- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..f5dd1e34654a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); + +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages); void __mem_cgroup_uncharge(struct folio *folio); @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, return 0; } -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) { } @@ -1796,6 +1797,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, #endif /* CONFIG_MEMCG */ +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +{ + mem_cgroup_swapin_uncharge_swap_nr(entry, 1); +} + #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP) bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eb92c21615eb..25657d6a133f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4573,14 +4573,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, /* * mem_cgroup_swapin_uncharge_swap - uncharge swap slot - * @entry: swap entry for which the page is charged + * @entry: the first swap entry for which the pages are charged + * @nr_pages: number of pages which will be uncharged * * Call this function after successfully adding the charged page to swapcache. * * Note: This function assumes the page for which swap slot is being uncharged * is order 0 page. */ -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages) { /* * Cgroup1's unified memory+swap counter has been charged with the @@ -4600,7 +4601,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) * let's not wait for it. The page already received a * memory+swap charge, drop the swap entry duplicate. */ - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); } } -- 2.34.1 > > > +} > > + > > void __mem_cgroup_uncharge(struct folio *folio); > > > > /** > > @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > > { > > } > > > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > > +{ > > +} > > + > > static inline void mem_cgroup_uncharge(struct folio *folio) > > { > > } > > -- > > 2.34.1 > > ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in
  2024-07-29 2:02 ` Barry Song
@ 2024-07-29 3:43 ` Matthew Wilcox
  2024-07-29 4:52 ` Barry Song
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-29 3:43 UTC (permalink / raw)
To: Barry Song
Cc: yosryahmed, akpm, baolin.wang, chrisl, david, hannes, hughd,
    kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, ying.huang

On Mon, Jul 29, 2024 at 02:02:22PM +1200, Barry Song wrote:
> -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> +
> +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages);
[...]
> +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> +{
> +	mem_cgroup_swapin_uncharge_swap_nr(entry, 1);
> +}

There are only two callers of mem_cgroup_swapin_uncharge_swap! Just
add an argument to mem_cgroup_swapin_uncharge_swap() and change the two
callers. It would be _less_ code than this extra wrapper, and certainly
less confusing.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in
  2024-07-29 3:43 ` Matthew Wilcox
@ 2024-07-29 4:52 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 4:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: yosryahmed, akpm, baolin.wang, chrisl, david, hannes, hughd,
    kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, ying.huang

On Mon, Jul 29, 2024 at 3:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 02:02:22PM +1200, Barry Song wrote:
> > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> > +
> > +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages);
> [...]
> > +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> > +{
> > +	mem_cgroup_swapin_uncharge_swap_nr(entry, 1);
> > +}
>
> There are only two callers of mem_cgroup_swapin_uncharge_swap! Just
> add an argument to mem_cgroup_swapin_uncharge_swap() and change the two
> callers. It would be _less_ code than this extra wrapper, and certainly
> less confusing.

Sounds good to me. I can totally drop this wrapper -
mem_cgroup_swapin_uncharge_swap() - in v6.

Thanks
Barry

^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song 2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-29 3:51 ` Matthew Wilcox 2024-07-29 14:16 ` Dan Carpenter 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed, Chuanhua Han From: Chuanhua Han <hanchuanhua@oppo.com> In an embedded system like Android, more than half of anonymous memory is actually stored in swap devices such as zRAM. For instance, when an app is switched to the background, most of its memory might be swapped out. Currently, we have mTHP features, but unfortunately, without support for large folio swap-ins, once those large folios are swapped out, we lose them immediately because mTHP is a one-way ticket. This patch introduces mTHP swap-in support. For now, we limit mTHP swap-ins to contiguous swaps that were likely swapped out from mTHP as a whole. Additionally, the current implementation only covers the SWAP_SYNCHRONOUS case. This is the simplest and most common use case, benefiting millions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues. It offers several benefits: 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in. 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT without fragmentation. 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage and enhancing compression ratios significantly. Deploying this on millions of actual products, we haven't observed any noticeable increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64. Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 188 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833d2cad6eb2..14048e9285d4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +/* + * check a range of PTEs are completely swap entries with + * contiguous swap offsets and the same SWAP_HAS_CACHE. 
+ * ptep must be first one in the range + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + char has_cache; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + si = swp_swap_info(entry); + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; + for (i = 1; i < nr_pages; i++) { + /* + * while allocating a large folio and doing swap_read_folio for the + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte + * doesn't have swapcache. We need to ensure all PTEs have no cache + * as well, otherwise, we might go to swap devices while the content + * is in swapcache + */ + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) + return false; + } + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap-in a THP with nr pages, we require its first swap_offset + * is aligned with nr. This can filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} +#endif + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. + */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. 
*/ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + return folio; + order = next_order(&orders, order); + } + +fallback: +#endif + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); +} + + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare_nr(entry, nr_pages)) { + /* Relax a bit to prevent rapid repeated page faults */ + schedule_timeout_uninterruptible(1); + goto out_page; + } + need_clear_cache = true; + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large folios + * which are fully exclusive. If we ever get large folios within + * swapcache here, we have to be careful. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song @ 2024-07-29 3:51 ` Matthew Wilcox 2024-07-29 4:41 ` Barry Song 2024-07-29 6:36 ` Chuanhua Han 2024-07-29 14:16 ` Dan Carpenter 1 sibling, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 3:51 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > - vma, vmf->address, false); > + folio = alloc_swap_folio(vmf); > page = &folio->page; This is no longer correct. You need to set 'page' to the precise page that is being faulted rather than the first page of the folio. It was fine before because it always allocated a single-page folio, but now it must use folio_page() or folio_file_page() (whichever has the correct semantics for you). Also you need to fix your test suite to notice this bug. I suggest doing that first so that you know whether you've got the calculation correct. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 3:51 ` Matthew Wilcox @ 2024-07-29 4:41 ` Barry Song [not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com> 2024-07-29 6:36 ` Chuanhua Han 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 4:41 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > - vma, vmf->address, false); > > + folio = alloc_swap_folio(vmf); > > page = &folio->page; > > This is no longer correct. You need to set 'page' to the precise page > that is being faulted rather than the first page of the folio. It was > fine before because it always allocated a single-page folio, but now it > must use folio_page() or folio_file_page() (whichever has the correct > semantics for you). > > Also you need to fix your test suite to notice this bug. I suggest > doing that first so that you know whether you've got the calculation > correct. I don't understand why the code is designed in the way the page is the first page of this folio. Otherwise, we need lots of changes later while mapping the folio in ptes and rmap. > Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
[parent not found: <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com>]
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile [not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com> @ 2024-07-29 12:49 ` Matthew Wilcox 2024-07-29 13:11 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 12:49 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote: > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > > > - vma, vmf->address, false); > > > > + folio = alloc_swap_folio(vmf); > > > > page = &folio->page; > > > > > > This is no longer correct. You need to set 'page' to the precise page > > > that is being faulted rather than the first page of the folio. It was > > > fine before because it always allocated a single-page folio, but now it > > > must use folio_page() or folio_file_page() (whichever has the correct > > > semantics for you). > > > > > > Also you need to fix your test suite to notice this bug. I suggest > > > doing that first so that you know whether you've got the calculation > > > correct. > > > > I don't understand why the code is designed in the way the page > > is the first page of this folio. Otherwise, we need lots of changes > > later while mapping the folio in ptes and rmap. What? folio = swap_cache_get_folio(entry, vma, vmf->address); if (folio) page = folio_file_page(folio, swp_offset(entry)); page is the precise page, not the first page of the folio. > For both accessing large folios in the swapcache and allocating > new large folios, the page points to the first page of the folio. we > are mapping the whole folio not the specific page. But what address are we mapping the whole folio at? > for swapcache cases, you can find the same thing here, > > if (folio_test_large(folio) && folio_test_swapcache(folio)) { > ... > entry = folio->swap; > page = &folio->page; > } Yes, but you missed some important lines from your quote: page_idx = idx; address = folio_start; ptep = folio_ptep; nr_pages = nr; We deliberate adjust the address so that, yes, we're mapping the entire folio, but we're mapping it at an address that means that the page we actually faulted on ends up at the address that we faulted on. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 12:49 ` Matthew Wilcox @ 2024-07-29 13:11 ` Barry Song 2024-07-29 15:13 ` Matthew Wilcox 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 13:11 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote: > > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > > > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > > > > - vma, vmf->address, false); > > > > > + folio = alloc_swap_folio(vmf); > > > > > page = &folio->page; > > > > > > > > This is no longer correct. You need to set 'page' to the precise page > > > > that is being faulted rather than the first page of the folio. It was > > > > fine before because it always allocated a single-page folio, but now it > > > > must use folio_page() or folio_file_page() (whichever has the correct > > > > semantics for you). > > > > > > > > Also you need to fix your test suite to notice this bug. I suggest > > > > doing that first so that you know whether you've got the calculation > > > > correct. > > > > > > I don't understand why the code is designed in the way the page > > > is the first page of this folio. Otherwise, we need lots of changes > > > later while mapping the folio in ptes and rmap. > > What? > > folio = swap_cache_get_folio(entry, vma, vmf->address); > if (folio) > page = folio_file_page(folio, swp_offset(entry)); > > page is the precise page, not the first page of the folio. this is the case we may get a large folio in swapcache but we result in mapping only one subpage due to the condition to map the whole folio is not met. if we meet the condition, we are going to set page to the head instead and map the whole mTHP: if (folio_test_large(folio) && folio_test_swapcache(folio)) { int nr = folio_nr_pages(folio); unsigned long idx = folio_page_idx(folio, page); unsigned long folio_start = address - idx * PAGE_SIZE; unsigned long folio_end = folio_start + nr * PAGE_SIZE; pte_t *folio_ptep; pte_t folio_pte; if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) goto check_folio; if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) goto check_folio; folio_ptep = vmf->pte - idx; folio_pte = ptep_get(folio_ptep); if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || swap_pte_batch(folio_ptep, nr, folio_pte) != nr) goto check_folio; page_idx = idx; address = folio_start; ptep = folio_ptep; nr_pages = nr; entry = folio->swap; page = &folio->page; } > > > For both accessing large folios in the swapcache and allocating > > new large folios, the page points to the first page of the folio. we > > are mapping the whole folio not the specific page. > > But what address are we mapping the whole folio at? > > > for swapcache cases, you can find the same thing here, > > > > if (folio_test_large(folio) && folio_test_swapcache(folio)) { > > ... 
> > entry = folio->swap; > > page = &folio->page; > > } > > Yes, but you missed some important lines from your quote: > > page_idx = idx; > address = folio_start; > ptep = folio_ptep; > nr_pages = nr; > > We deliberate adjust the address so that, yes, we're mapping the entire > folio, but we're mapping it at an address that means that the page we > actually faulted on ends up at the address that we faulted on. for this zRAM case, it is a new allocated large folio, only while all conditions are met, we will allocate and map the whole folio. you can check can_swapin_thp() and thp_swap_suitable_orders(). static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) { struct swap_info_struct *si; unsigned long addr; swp_entry_t entry; pgoff_t offset; char has_cache; int idx, i; pte_t pte; addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); idx = (vmf->address - addr) / PAGE_SIZE; pte = ptep_get(ptep); if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) return false; entry = pte_to_swp_entry(pte); offset = swp_offset(entry); if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) return false; si = swp_swap_info(entry); has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; for (i = 1; i < nr_pages; i++) { /* * while allocating a large folio and doing swap_read_folio for the * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte * doesn't have swapcache. We need to ensure all PTEs have no cache * as well, otherwise, we might go to swap devices while the content * is in swapcache */ if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) return false; } return true; } and static struct folio *alloc_swap_folio(struct vm_fault *vmf) { .... entry = pte_to_swp_entry(vmf->orig_pte); /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * and suitable for swapping THP. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); .... } static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, unsigned long addr, unsigned long orders) { int order, nr; order = highest_order(orders); /* * To swap-in a THP with nr pages, we require its first swap_offset * is aligned with nr. This can filter out most invalid entries. */ while (orders) { nr = 1 << order; if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) break; order = next_order(&orders, order); } return orders; } A mTHP is swapped out at aligned swap offset. and we only swap in aligned mTHP. if somehow one mTHP is mremap() to unaligned address, we won't swap them in as a large folio. For swapcache case, we are still checking unaligned mTHP, but for new allocated mTHP, it is a different story. There is totally no necessity to support unaligned mTHP and there is no possibility to support unless something is marked in swap devices to say there was a mTHP. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 13:11 ` Barry Song @ 2024-07-29 15:13 ` Matthew Wilcox 2024-07-29 20:03 ` Barry Song 2024-07-30 8:12 ` Ryan Roberts 0 siblings, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 15:13 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote: > for this zRAM case, it is a new allocated large folio, only > while all conditions are met, we will allocate and map > the whole folio. you can check can_swapin_thp() and > thp_swap_suitable_orders(). YOU ARE DOING THIS WRONGLY! All of you anonymous memory people are utterly fixated on TLBs AND THIS IS WRONG. Yes, TLB performance is important, particularly with crappy ARM designs, which I know a lot of you are paid to work on. But you seem to think this is the only consideration, and you're making bad design choices as a result. It's overly complicated, and you're leaving performance on the table. Look back at the results Ryan showed in the early days of working on large anonymous folios. Half of the performance win on his system came from using larger TLBs. But the other half came from _reduced software overhead_. The LRU lock is a huge problem, and using large folios cuts the length of the LRU list, hence LRU lock hold time. Your _own_ data on how hard it is to get hold of a large folio due to fragmentation should be enough to convince you that the more large folios in the system, the better the whole system runs. We should not decline to allocate large folios just because they can't be mapped with a single TLB! ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 15:13 ` Matthew Wilcox @ 2024-07-29 20:03 ` Barry Song 2024-07-29 21:56 ` Barry Song 2024-07-30 8:12 ` Ryan Roberts 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 20:03 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote: > > for this zRAM case, it is a new allocated large folio, only > > while all conditions are met, we will allocate and map > > the whole folio. you can check can_swapin_thp() and > > thp_swap_suitable_orders(). > > YOU ARE DOING THIS WRONGLY! > > All of you anonymous memory people are utterly fixated on TLBs AND THIS > IS WRONG. Yes, TLB performance is important, particularly with crappy > ARM designs, which I know a lot of you are paid to work on. But you > seem to think this is the only consideration, and you're making bad > design choices as a result. It's overly complicated, and you're leaving > performance on the table. > > Look back at the results Ryan showed in the early days of working on > large anonymous folios. Half of the performance win on his system came > from using larger TLBs. But the other half came from _reduced software > overhead_. The LRU lock is a huge problem, and using large folios cuts > the length of the LRU list, hence LRU lock hold time. > > Your _own_ data on how hard it is to get hold of a large folio due to > fragmentation should be enough to convince you that the more large folios > in the system, the better the whole system runs. We should not decline to > allocate large folios just because they can't be mapped with a single TLB! I am not convinced. for a new allocated large folio, even alloc_anon_folio() of do_anonymous_page() does the exactly same thing alloc_anon_folio() { /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * for this vma. Then filter out the orders that can't be allocated over * the faulting address and still be fully contained in the vma. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); } you are not going to allocate a mTHP for an unaligned address for a new PF. Please point out where it is wrong. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 20:03 ` Barry Song
@ 2024-07-29 21:56 ` Barry Song
0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 21:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Tue, Jul 30, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> > > for this zRAM case, it is a new allocated large folio, only
> > > while all conditions are met, we will allocate and map
> > > the whole folio. you can check can_swapin_thp() and
> > > thp_swap_suitable_orders().
> >
> > YOU ARE DOING THIS WRONGLY!
> >
> > All of you anonymous memory people are utterly fixated on TLBs AND THIS
> > IS WRONG.  Yes, TLB performance is important, particularly with crappy
> > ARM designs, which I know a lot of you are paid to work on.  But you
> > seem to think this is the only consideration, and you're making bad
> > design choices as a result.  It's overly complicated, and you're leaving
> > performance on the table.
> >
> > Look back at the results Ryan showed in the early days of working on
> > large anonymous folios.  Half of the performance win on his system came
> > from using larger TLBs.  But the other half came from _reduced software
> > overhead_.  The LRU lock is a huge problem, and using large folios cuts
> > the length of the LRU list, hence LRU lock hold time.
> >
> > Your _own_ data on how hard it is to get hold of a large folio due to
> > fragmentation should be enough to convince you that the more large folios
> > in the system, the better the whole system runs.  We should not decline to
> > allocate large folios just because they can't be mapped with a single TLB!
>
> I am not convinced. For a newly allocated large folio, even alloc_anon_folio()
> in do_anonymous_page() does exactly the same thing:
>
> alloc_anon_folio()
> {
>	/*
>	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>	 * for this vma. Then filter out the orders that can't be allocated over
>	 * the faulting address and still be fully contained in the vma.
>	 */
>	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> }
>
> You are not going to allocate an mTHP for an unaligned address for a new
> PF. Please point out where this is wrong.

Let's assume we have a folio whose virtual address range is
0x500000000000 ~ 0x500000000000 + 64KB, and it is swapped out to
swap offsets 0x10000 ~ 0x10000 + 64KB.

The current code will swap it in as an mTHP if a page fault occurs at any
address within (0x500000000000 ~ 0x500000000000 + 64KB). In this case, the
mTHP enjoys both reduced TLB pressure and reduced software overhead such as
the LRU lock, so it sounds like we lose nothing here.

But if the folio is mremap-ed to an unaligned address like
(0x600000000000 + 16KB ~ 0x600000000000 + 80KB) while its swap offsets are
still (0x10000 ~ 0x10000 + 64KB), the current code won't swap it in as an
mTHP. Sounds like a loss?

If this is the performance problem you are trying to address, my point is
that it is not worth increasing the complexity at this stage, though it
might be doable. We once tracked hundreds of phones running apps randomly
for a couple of days, and we didn't encounter such a case. So this is
pretty much a corner case.

If your concern is more than this, for example, if you want to swap in
large folios even when swaps are completely non-contiguous, that is a
different story. I agree this is a potential optimization direction to go,
but in that case, you still need to find an aligned boundary to handle page
faults just like do_anonymous_page(); otherwise, you may end up with all
kinds of pointless intersections where PFs can cover the address ranges of
other PFs, making PTE checks such as pte_range_none() completely disordered:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	....
	/*
	 * Find the highest order where the aligned range is completely
	 * pte_none(). Note that all remaining orders will be completely
	 * pte_none().
	 */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		if (pte_range_none(pte + pte_index(addr), 1 << order))
			break;
		order = next_order(&orders, order);
	}
}

>
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 59+ messages in thread
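The loop quoted above is the crux of the "aligned boundary" argument: the fault handler only ever probes naturally aligned blocks of PTEs, so concurrent faults cannot partially overlap each other's ranges. Below is a small userspace model of that scan, where a bool array stands in for the PTE page and pick_order()/range_none() are invented names; it only illustrates the shape of the check, not the kernel code.

#include <stdio.h>
#include <stdbool.h>

#define PTRS_PER_PTE	512

/* Userspace stand-in for the PTE page: true means the slot is pte_none(). */
static bool pte_none_slot[PTRS_PER_PTE];

/* Simplified pte_range_none(): all slots in [idx, idx + nr) are empty. */
static bool range_none(unsigned int idx, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		if (!pte_none_slot[idx + i])
			return false;
	return true;
}

/*
 * Walk orders from high to low and pick the first one whose naturally
 * aligned slot range is still fully empty, like the loop quoted above.
 */
static int pick_order(unsigned int fault_idx, int max_order)
{
	for (int order = max_order; order > 0; order--) {
		unsigned int base = fault_idx & ~((1U << order) - 1);

		if (range_none(base, 1U << order))
			return order;
	}
	return 0;
}

int main(void)
{
	for (int i = 0; i < PTRS_PER_PTE; i++)
		pte_none_slot[i] = true;

	pte_none_slot[5] = false;	/* someone already mapped slot 5 */

	/* Fault at slot 20: the aligned 16-slot block [16, 32) is empty. */
	printf("order for slot 20: %d\n", pick_order(20, 4));
	/* Fault at slot 3: aligned blocks containing slot 5 are rejected. */
	printf("order for slot 3:  %d\n", pick_order(3, 4));
	return 0;
}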
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 15:13 ` Matthew Wilcox
2024-07-29 20:03 ` Barry Song
@ 2024-07-30 8:12 ` Ryan Roberts
1 sibling, 0 replies; 59+ messages in thread
From: Ryan Roberts @ 2024-07-30 8:12 UTC (permalink / raw)
To: Matthew Wilcox, Barry Song
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang,
yosryahmed, Chuanhua Han

On 29/07/2024 16:13, Matthew Wilcox wrote:
> On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
>> for this zRAM case, it is a new allocated large folio, only
>> while all conditions are met, we will allocate and map
>> the whole folio. you can check can_swapin_thp() and
>> thp_swap_suitable_orders().
>
> YOU ARE DOING THIS WRONGLY!

I've only scanned the preceding thread, but I think you're talking about the
design decision to only allocate large folios that are naturally aligned in
virtual address space, and you're arguing to remove that restriction?

The main reason we gave ourselves that constraint for anon mTHP was that
allowing it would create the possibility of wandering off the end of the PTE
table and would add significant complexity to managing neighbouring PTE tables
and their respective PTLs. If the proposal is to start doing this, then I don't
agree with that approach.

>
> All of you anonymous memory people are utterly fixated on TLBs AND THIS
> IS WRONG.  Yes, TLB performance is important, particularly with crappy
> ARM designs, which I know a lot of you are paid to work on.  But you
> seem to think this is the only consideration, and you're making bad
> design choices as a result.  It's overly complicated, and you're leaving
> performance on the table.
>
> Look back at the results Ryan showed in the early days of working on
> large anonymous folios.  Half of the performance win on his system came
> from using larger TLBs.  But the other half came from _reduced software
> overhead_.

I would just point out that I think the results you are referring to are for
the kernel compilation workload, and yes, this is indeed what I observed. But
kernel compilation is a bit of an outlier since it does a huge amount of
fork/exec, so the kernel spends a lot of time fiddling with page tables and
faulting. The vast majority of the reduced SW overhead is due to significantly
reducing the number of faults because we map more pages per fault.

But in my experience this workload is a bit of an outlier; most workloads that
I've tested with at least tend to set up their memory at the start and it's
static forever more, which means that those workloads benefit mostly from the
TLB benefits - there are very few existing SW overheads to actually reduce.

> The LRU lock is a huge problem, and using large folios cuts
> the length of the LRU list, hence LRU lock hold time.

I'm sure this is true and you have lots more experience and data than me. And
it makes intuitive sense. But I've never personally seen this in any of the
workloads that I've benchmarked.

Thanks,
Ryan

>
> Your _own_ data on how hard it is to get hold of a large folio due to
> fragmentation should be enough to convince you that the more large folios
> in the system, the better the whole system runs.  We should not decline to
> allocate large folios just because they can't be mapped with a single TLB!
>

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 3:51 ` Matthew Wilcox
2024-07-29 4:41 ` Barry Song
@ 2024-07-29 6:36 ` Chuanhua Han
2024-07-29 12:55 ` Matthew Wilcox
1 sibling, 1 reply; 59+ messages in thread
From: Chuanhua Han @ 2024-07-29 6:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
>
> On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -				vma, vmf->address, false);
> > +		folio = alloc_swap_folio(vmf);
> >  		page = &folio->page;
>
> This is no longer correct.  You need to set 'page' to the precise page
> that is being faulted rather than the first page of the folio.  It was
> fine before because it always allocated a single-page folio, but now it
> must use folio_page() or folio_file_page() (whichever has the correct
> semantics for you).
>
> Also you need to fix your test suite to notice this bug.  I suggest
> doing that first so that you know whether you've got the calculation
> correct.
>

This is not a problem now: we swap large folios in as a whole, so the head
page is used here instead of the page that is being faulted. You can also
see from the current code context that swapping in a whole large folio is
not the same as the previous behaviour, which only swapped in a single
small page.

--
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 6:36 ` Chuanhua Han
@ 2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
0 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-29 12:55 UTC (permalink / raw)
To: Chuanhua Han
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> >
> > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > -				vma, vmf->address, false);
> > > +		folio = alloc_swap_folio(vmf);
> > >  		page = &folio->page;
> >
> > This is no longer correct.  You need to set 'page' to the precise page
> > that is being faulted rather than the first page of the folio.  It was
> > fine before because it always allocated a single-page folio, but now it
> > must use folio_page() or folio_file_page() (whichever has the correct
> > semantics for you).
> >
> > Also you need to fix your test suite to notice this bug.  I suggest
> > doing that first so that you know whether you've got the calculation
> > correct.
> >
> This is not a problem now: we swap large folios in as a whole, so the head
> page is used here instead of the page that is being faulted. You can also
> see from the current code context that swapping in a whole large folio is
> not the same as the previous behaviour, which only swapped in a single
> small page.

You have completely failed to understand the problem.  Let's try it this
way:

We take a page fault at address 0x123456789000.
If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
If you now map page 0 of the folio at 0x123456789000, you've
given the user the wrong page!  That looks like data corruption.

The code in
	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
as Barry pointed out will save you -- but what if those conditions fail?
What if the mmap has been mremap()ed and the folio now crosses a PMD
boundary?  mk_pte() will now be called on the wrong page.

^ permalink raw reply	[flat|nested] 59+ messages in thread
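Matthew's example reduces to a one-line index calculation: the page to map is folio_page(folio, (fault_addr - folio_start) / PAGE_SIZE), never unconditionally the head page. Here is a minimal sketch of that arithmetic with his addresses, using plain integers rather than real folio structures, and assuming the folio is naturally aligned in virtual address space (which is what the series enforces):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	uint64_t fault_addr  = 0x123456789000ULL;	/* faulting address */
	uint64_t folio_size  = 4 * PAGE_SIZE;		/* 16KiB folio */
	uint64_t folio_start = fault_addr & ~(folio_size - 1);
	uint64_t page_idx    = (fault_addr - folio_start) >> PAGE_SHIFT;

	/* folio_start is 0x123456788000, so the faulting page is index 1. */
	printf("folio_start = %#llx, page index in folio = %llu\n",
	       (unsigned long long)folio_start, (unsigned long long)page_idx);
	return 0;
}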
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 12:55 ` Matthew Wilcox
@ 2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
1 sibling, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 13:18 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chuanhua Han, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Tue, Jul 30, 2024 at 12:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -				vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > >  		page = &folio->page;
> > >
> > > This is no longer correct.  You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio.  It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug.  I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> > This is not a problem now: we swap large folios in as a whole, so the head
> > page is used here instead of the page that is being faulted. You can also
> > see from the current code context that swapping in a whole large folio is
> > not the same as the previous behaviour, which only swapped in a single
> > small page.
>
> You have completely failed to understand the problem.  Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page!  That looks like data corruption.
>
> The code in
>	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary?  mk_pte() will now be called on the wrong page.

Chuanhua understood everything correctly. I think you might have missed that
we have very strict checks both before allocating large folios and before
mapping them for this newly allocated mTHP swap-in case.

To allocate a large folio, we check all alignment requirements: the PTEs hold
aligned swap offsets and are all physically contiguous, which is how the mTHP
was swapped out. If an mTHP has been mremap()ed to an unaligned address, we
won't swap it in as an mTHP, for two reasons: 1. we have no way to figure out
the start address of the previous mTHP in the non-swapcache case; 2. mremap()
to unaligned addresses is rare.

To map a large folio, we check that all PTEs are still there by
double-confirming that can_swapin_thp() is true. If the PTEs have changed,
this is a "goto out_nomap" case.

	/* allocated large folios for SWP_SYNCHRONOUS_IO */
	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
		unsigned long nr = folio_nr_pages(folio);
		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
		pte_t *folio_ptep = vmf->pte - idx;

		if (!can_swapin_thp(vmf, folio_ptep, nr))
			goto out_nomap;

		page_idx = idx;
		address = folio_start;
		ptep = folio_ptep;
		goto check_folio;
	}

Thanks
Barry

^ permalink raw reply	[flat|nested] 59+ messages in thread
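Barry's "very strict checks" before mapping boil down to re-validating, under the PT lock, that all nr PTEs still encode the same run of contiguous swap entries the folio was read from; the kernel does this via can_swapin_thp() and swap_pte_batch(). The sketch below is only a conceptual userspace model of that property, with swap entries reduced to bare offsets and an invented still_contiguous() helper, not the actual kernel functions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for a PTE: either "not a swap entry" or a swap offset. */
struct fake_pte {
	bool	 is_swap;
	uint64_t swap_off;
};

/*
 * Check that the nr entries starting at @ptes are still swap PTEs whose
 * offsets run base, base + 1, ..., base + nr - 1.  If anything changed
 * since the folio was allocated, the caller must fall back (out_nomap).
 */
static bool still_contiguous(const struct fake_pte *ptes, unsigned int nr,
			     uint64_t base)
{
	for (unsigned int i = 0; i < nr; i++)
		if (!ptes[i].is_swap || ptes[i].swap_off != base + i)
			return false;
	return true;
}

int main(void)
{
	struct fake_pte ptes[4] = {
		{ true, 0x100 }, { true, 0x101 }, { true, 0x102 }, { true, 0x103 },
	};

	printf("intact:  %d\n", still_contiguous(ptes, 4, 0x100));

	/* A racing fault mapped the third page: the whole batch is refused. */
	ptes[2].is_swap = false;
	printf("changed: %d\n", still_contiguous(ptes, 4, 0x100));
	return 0;
}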
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
@ 2024-07-29 13:32 ` Chuanhua Han
1 sibling, 0 replies; 59+ messages in thread
From: Chuanhua Han @ 2024-07-29 13:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 20:55:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -				vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > >  		page = &folio->page;
> > >
> > > This is no longer correct.  You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio.  It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug.  I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> > This is not a problem now: we swap large folios in as a whole, so the head
> > page is used here instead of the page that is being faulted. You can also
> > see from the current code context that swapping in a whole large folio is
> > not the same as the previous behaviour, which only swapped in a single
> > small page.
>
> You have completely failed to understand the problem.  Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page!  That looks like data corruption.

The user does not get the wrong data, because we map the whole folio: for a
16KiB folio, we map all 16KiB through the page table.

>
> The code in
>	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary?  mk_pte() will now be called on the wrong page.

These special cases have been handled in our patch. For an mTHP large folio,
mk_pte() uses the head page to construct the PTE.

--
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 59+ messages in thread
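Chuanhua's point is that when the whole folio is mapped in one go, the PTE built from the head page is only the starting value: set_ptes() installs nr consecutive entries, advancing the PFN by one per page (as with pte_advance_pfn() in the hunk quoted later), so the faulting virtual address still ends up pointing at its own physical page. Below is a simplified userspace model of that, with PTEs reduced to bare PFNs and set_ptes_model() an invented name; it is a sketch of the idea, not the kernel code.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define NR_PAGES	4			/* 16KiB folio */

/*
 * Model of set_ptes(): install @nr entries starting from the head page's
 * PFN, bumping the PFN for each successive virtual page.
 */
static void set_ptes_model(uint64_t *ptes, uint64_t head_pfn, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		ptes[i] = head_pfn + i;	/* like pte_advance_pfn() */
}

int main(void)
{
	uint64_t ptes[NR_PAGES];
	uint64_t folio_start = 0x123456788000ULL;
	uint64_t fault_addr  = 0x123456789000ULL;
	uint64_t head_pfn    = 0x80000;		/* hypothetical PFN of the head page */
	unsigned int idx;

	set_ptes_model(ptes, head_pfn, NR_PAGES);

	idx = (fault_addr - folio_start) >> PAGE_SHIFT;
	printf("PFN mapped at the faulting address: %#llx (head + %u)\n",
	       (unsigned long long)ptes[idx], idx);
	return 0;
}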
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song 2024-07-29 3:51 ` Matthew Wilcox @ 2024-07-29 14:16 ` Dan Carpenter 1 sibling, 0 replies; 59+ messages in thread From: Dan Carpenter @ 2024-07-29 14:16 UTC (permalink / raw) To: oe-kbuild, Barry Song, akpm, linux-mm Cc: lkp, oe-kbuild-all, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed, Chuanhua Han Hi Barry, kernel test robot noticed the following build warnings: url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20240726094618.401593-4-21cnbao%40gmail.com patch subject: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile config: i386-randconfig-141-20240727 (https://download.01.org/0day-ci/archive/20240727/202407270917.18F5rYPH-lkp@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Reported-by: Dan Carpenter <dan.carpenter@linaro.org> | Closes: https://lore.kernel.org/r/202407270917.18F5rYPH-lkp@intel.com/ smatch warnings: mm/memory.c:4467 do_swap_page() error: uninitialized symbol 'nr_pages'. 
vim +/nr_pages +4467 mm/memory.c 2b7403035459c7 Souptick Joarder 2018-08-23 4143 vm_fault_t do_swap_page(struct vm_fault *vmf) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4144 { 82b0f8c39a3869 Jan Kara 2016-12-14 4145 struct vm_area_struct *vma = vmf->vma; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4146) struct folio *swapcache, *folio = NULL; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4147) struct page *page; 2799e77529c2a2 Miaohe Lin 2021-06-28 4148 struct swap_info_struct *si = NULL; 14f9135d547060 David Hildenbrand 2022-05-09 4149 rmap_t rmap_flags = RMAP_NONE; 13ddaf26be324a Kairui Song 2024-02-07 4150 bool need_clear_cache = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4151 bool exclusive = false; 65500d234e74fc Hugh Dickins 2005-10-29 4152 swp_entry_t entry; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4153 pte_t pte; 2b7403035459c7 Souptick Joarder 2018-08-23 4154 vm_fault_t ret = 0; aae466b0052e18 Joonsoo Kim 2020-08-11 4155 void *shadow = NULL; 508758960b8d89 Chuanhua Han 2024-05-29 4156 int nr_pages; 508758960b8d89 Chuanhua Han 2024-05-29 4157 unsigned long page_idx; 508758960b8d89 Chuanhua Han 2024-05-29 4158 unsigned long address; 508758960b8d89 Chuanhua Han 2024-05-29 4159 pte_t *ptep; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4160 2ca99358671ad3 Peter Xu 2021-11-05 4161 if (!pte_unmap_same(vmf)) 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4162 goto out; 65500d234e74fc Hugh Dickins 2005-10-29 4163 2994302bc8a171 Jan Kara 2016-12-14 4164 entry = pte_to_swp_entry(vmf->orig_pte); d1737fdbec7f90 Andi Kleen 2009-09-16 4165 if (unlikely(non_swap_entry(entry))) { 0697212a411c1d Christoph Lameter 2006-06-23 4166 if (is_migration_entry(entry)) { 82b0f8c39a3869 Jan Kara 2016-12-14 4167 migration_entry_wait(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4168 vmf->address); b756a3b5e7ead8 Alistair Popple 2021-06-30 4169 } else if (is_device_exclusive_entry(entry)) { b756a3b5e7ead8 Alistair Popple 2021-06-30 4170 vmf->page = pfn_swap_entry_to_page(entry); b756a3b5e7ead8 Alistair Popple 2021-06-30 4171 ret = remove_device_exclusive_entry(vmf); 5042db43cc26f5 Jérôme Glisse 2017-09-08 4172 } else if (is_device_private_entry(entry)) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4173 if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4174 /* 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4175 * migrate_to_ram is not yet ready to operate 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4176 * under VMA lock. 
1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4177 */ 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4178 vma_end_read(vma); 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4179 ret = VM_FAULT_RETRY; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4180 goto out; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4181 } 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4182 af5cdaf82238fb Alistair Popple 2021-06-30 4183 vmf->page = pfn_swap_entry_to_page(entry); 16ce101db85db6 Alistair Popple 2022-09-28 4184 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 16ce101db85db6 Alistair Popple 2022-09-28 4185 vmf->address, &vmf->ptl); 3db82b9374ca92 Hugh Dickins 2023-06-08 4186 if (unlikely(!vmf->pte || c33c794828f212 Ryan Roberts 2023-06-12 4187 !pte_same(ptep_get(vmf->pte), c33c794828f212 Ryan Roberts 2023-06-12 4188 vmf->orig_pte))) 3b65f437d9e8dd Ryan Roberts 2023-06-02 4189 goto unlock; 16ce101db85db6 Alistair Popple 2022-09-28 4190 16ce101db85db6 Alistair Popple 2022-09-28 4191 /* 16ce101db85db6 Alistair Popple 2022-09-28 4192 * Get a page reference while we know the page can't be 16ce101db85db6 Alistair Popple 2022-09-28 4193 * freed. 16ce101db85db6 Alistair Popple 2022-09-28 4194 */ 16ce101db85db6 Alistair Popple 2022-09-28 4195 get_page(vmf->page); 16ce101db85db6 Alistair Popple 2022-09-28 4196 pte_unmap_unlock(vmf->pte, vmf->ptl); 4a955bed882e73 Alistair Popple 2022-11-14 4197 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); 16ce101db85db6 Alistair Popple 2022-09-28 4198 put_page(vmf->page); d1737fdbec7f90 Andi Kleen 2009-09-16 4199 } else if (is_hwpoison_entry(entry)) { d1737fdbec7f90 Andi Kleen 2009-09-16 4200 ret = VM_FAULT_HWPOISON; 5c041f5d1f23d3 Peter Xu 2022-05-12 4201 } else if (is_pte_marker_entry(entry)) { 5c041f5d1f23d3 Peter Xu 2022-05-12 4202 ret = handle_pte_marker(vmf); d1737fdbec7f90 Andi Kleen 2009-09-16 4203 } else { 2994302bc8a171 Jan Kara 2016-12-14 4204 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); d99be1a8ecf377 Hugh Dickins 2009-12-14 4205 ret = VM_FAULT_SIGBUS; d1737fdbec7f90 Andi Kleen 2009-09-16 4206 } 0697212a411c1d Christoph Lameter 2006-06-23 4207 goto out; 0697212a411c1d Christoph Lameter 2006-06-23 4208 } 0bcac06f27d752 Minchan Kim 2017-11-15 4209 2799e77529c2a2 Miaohe Lin 2021-06-28 4210 /* Prevent swapoff from happening to us. 
*/ 2799e77529c2a2 Miaohe Lin 2021-06-28 4211 si = get_swap_device(entry); 2799e77529c2a2 Miaohe Lin 2021-06-28 4212 if (unlikely(!si)) 2799e77529c2a2 Miaohe Lin 2021-06-28 4213 goto out; 0bcac06f27d752 Minchan Kim 2017-11-15 4214 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4215) folio = swap_cache_get_folio(entry, vma, vmf->address); 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4216) if (folio) 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4217) page = folio_file_page(folio, swp_offset(entry)); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4218) swapcache = folio; f80207727aaca3 Minchan Kim 2018-01-18 4219 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4220) if (!folio) { a449bf58e45abf Qian Cai 2020-08-14 4221 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && eb085574a7526c Huang Ying 2019-07-11 4222 __swap_count(entry) == 1) { 684d098daf0b3a Chuanhua Han 2024-07-26 4223 /* skip swapcache */ 684d098daf0b3a Chuanhua Han 2024-07-26 4224 folio = alloc_swap_folio(vmf); 684d098daf0b3a Chuanhua Han 2024-07-26 4225 page = &folio->page; 684d098daf0b3a Chuanhua Han 2024-07-26 4226 if (folio) { 684d098daf0b3a Chuanhua Han 2024-07-26 4227 __folio_set_locked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4228 __folio_set_swapbacked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4229 684d098daf0b3a Chuanhua Han 2024-07-26 4230 nr_pages = folio_nr_pages(folio); nr_pages is initialized here 684d098daf0b3a Chuanhua Han 2024-07-26 4231 if (folio_test_large(folio)) 684d098daf0b3a Chuanhua Han 2024-07-26 4232 entry.val = ALIGN_DOWN(entry.val, nr_pages); 13ddaf26be324a Kairui Song 2024-02-07 4233 /* 13ddaf26be324a Kairui Song 2024-02-07 4234 * Prevent parallel swapin from proceeding with 13ddaf26be324a Kairui Song 2024-02-07 4235 * the cache flag. Otherwise, another thread may 13ddaf26be324a Kairui Song 2024-02-07 4236 * finish swapin first, free the entry, and swapout 13ddaf26be324a Kairui Song 2024-02-07 4237 * reusing the same entry. It's undetectable as 13ddaf26be324a Kairui Song 2024-02-07 4238 * pte_same() returns true due to entry reuse. 
13ddaf26be324a Kairui Song 2024-02-07 4239 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4240 if (swapcache_prepare_nr(entry, nr_pages)) { 13ddaf26be324a Kairui Song 2024-02-07 4241 /* Relax a bit to prevent rapid repeated page faults */ 13ddaf26be324a Kairui Song 2024-02-07 4242 schedule_timeout_uninterruptible(1); 684d098daf0b3a Chuanhua Han 2024-07-26 4243 goto out_page; 13ddaf26be324a Kairui Song 2024-02-07 4244 } 13ddaf26be324a Kairui Song 2024-02-07 4245 need_clear_cache = true; 13ddaf26be324a Kairui Song 2024-02-07 4246 6599591816f522 Matthew Wilcox (Oracle 2022-09-02 4247) if (mem_cgroup_swapin_charge_folio(folio, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4248) vma->vm_mm, GFP_KERNEL, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4249) entry)) { 545b1b077ca6b3 Michal Hocko 2020-06-25 4250 ret = VM_FAULT_OOM; 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4251 goto out_page; 545b1b077ca6b3 Michal Hocko 2020-06-25 4252 } 684d098daf0b3a Chuanhua Han 2024-07-26 4253 mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages); 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4254 aae466b0052e18 Joonsoo Kim 2020-08-11 4255 shadow = get_shadow_from_swap_cache(entry); aae466b0052e18 Joonsoo Kim 2020-08-11 4256 if (shadow) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4257) workingset_refault(folio, shadow); 0076f029cb2906 Joonsoo Kim 2020-06-25 4258 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4259) folio_add_lru(folio); 0add0c77a9bd0c Shakeel Butt 2021-04-29 4260 c9bdf768dd9319 Matthew Wilcox (Oracle 2023-12-13 4261) /* To provide entry to swap_read_folio() */ 3d2c9087688777 David Hildenbrand 2023-08-21 4262 folio->swap = entry; b2d1f38b524121 Yosry Ahmed 2024-06-07 4263 swap_read_folio(folio, NULL); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4264) folio->private = NULL; 0bcac06f27d752 Minchan Kim 2017-11-15 4265 } aa8d22a11da933 Minchan Kim 2017-11-15 4266 } else { e9e9b7ecee4a13 Minchan Kim 2018-04-05 4267 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, e9e9b7ecee4a13 Minchan Kim 2018-04-05 4268 vmf); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4269) if (page) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4270) folio = page_folio(page); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4271) swapcache = folio; 0bcac06f27d752 Minchan Kim 2017-11-15 4272 } 0bcac06f27d752 Minchan Kim 2017-11-15 4273 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4274) if (!folio) { ^1da177e4c3f41 Linus Torvalds 2005-04-16 4275 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4276 * Back out if somebody else faulted in this pte 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4277 * while we released the pte lock. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4278 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4279 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4280 vmf->address, &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4281 if (likely(vmf->pte && c33c794828f212 Ryan Roberts 2023-06-12 4282 pte_same(ptep_get(vmf->pte), vmf->orig_pte))) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4283 ret = VM_FAULT_OOM; 65500d234e74fc Hugh Dickins 2005-10-29 4284 goto unlock; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4285 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4286 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4287 /* Had to read the page from swap area: Major fault */ ^1da177e4c3f41 Linus Torvalds 2005-04-16 4288 ret = VM_FAULT_MAJOR; f8891e5e1f93a1 Christoph Lameter 2006-06-30 4289 count_vm_event(PGMAJFAULT); 2262185c5b287f Roman Gushchin 2017-07-06 4290 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); d1737fdbec7f90 Andi Kleen 2009-09-16 4291 } else if (PageHWPoison(page)) { 71f72525dfaaec Wu Fengguang 2009-12-16 4292 /* 71f72525dfaaec Wu Fengguang 2009-12-16 4293 * hwpoisoned dirty swapcache pages are kept for killing 71f72525dfaaec Wu Fengguang 2009-12-16 4294 * owner processes (which may be unknown at hwpoison time) 71f72525dfaaec Wu Fengguang 2009-12-16 4295 */ d1737fdbec7f90 Andi Kleen 2009-09-16 4296 ret = VM_FAULT_HWPOISON; 4779cb31c0ee3b Andi Kleen 2009-10-14 4297 goto out_release; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4298 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4299 fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4300 ret |= folio_lock_or_retry(folio, vmf); fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4301 if (ret & VM_FAULT_RETRY) d065bd810b6deb Michel Lespinasse 2010-10-26 4302 goto out_release; 073e587ec2cc37 KAMEZAWA Hiroyuki 2008-10-18 4303 84d60fdd3733fb David Hildenbrand 2022-03-24 4304 if (swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4305 /* 3b344157c0c15b Matthew Wilcox (Oracle 2022-09-02 4306) * Make sure folio_free_swap() or swapoff did not release the 84d60fdd3733fb David Hildenbrand 2022-03-24 4307 * swapcache from under us. The page pin, and pte_same test 84d60fdd3733fb David Hildenbrand 2022-03-24 4308 * below, are not enough to exclude that. Even if it is still 84d60fdd3733fb David Hildenbrand 2022-03-24 4309 * swapcache, we need to check that the page's swap has not 84d60fdd3733fb David Hildenbrand 2022-03-24 4310 * changed. 4969c1192d15af Andrea Arcangeli 2010-09-09 4311 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4312) if (unlikely(!folio_test_swapcache(folio) || cfeed8ffe55b37 David Hildenbrand 2023-08-21 4313 page_swap_entry(page).val != entry.val)) 4969c1192d15af Andrea Arcangeli 2010-09-09 4314 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4315 84d60fdd3733fb David Hildenbrand 2022-03-24 4316 /* 84d60fdd3733fb David Hildenbrand 2022-03-24 4317 * KSM sometimes has to copy on read faults, for example, if 84d60fdd3733fb David Hildenbrand 2022-03-24 4318 * page->index of !PageKSM() pages would be nonlinear inside the 84d60fdd3733fb David Hildenbrand 2022-03-24 4319 * anon VMA -- PageKSM() is lost on actual swapout. 
84d60fdd3733fb David Hildenbrand 2022-03-24 4320 */ 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4321) folio = ksm_might_need_to_copy(folio, vma, vmf->address); 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4322) if (unlikely(!folio)) { 5ad6468801d28c Hugh Dickins 2009-12-14 4323 ret = VM_FAULT_OOM; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4324) folio = swapcache; 4969c1192d15af Andrea Arcangeli 2010-09-09 4325 goto out_page; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4326) } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) { 6b970599e807ea Kefeng Wang 2022-12-09 4327 ret = VM_FAULT_HWPOISON; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4328) folio = swapcache; 6b970599e807ea Kefeng Wang 2022-12-09 4329 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4330 } 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4331) if (folio != swapcache) 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4332) page = folio_page(folio, 0); c145e0b47c77eb David Hildenbrand 2022-03-24 4333 c145e0b47c77eb David Hildenbrand 2022-03-24 4334 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4335 * If we want to map a page that's in the swapcache writable, we c145e0b47c77eb David Hildenbrand 2022-03-24 4336 * have to detect via the refcount if we're really the exclusive c145e0b47c77eb David Hildenbrand 2022-03-24 4337 * owner. Try removing the extra reference from the local LRU 1fec6890bf2247 Matthew Wilcox (Oracle 2023-06-21 4338) * caches if required. c145e0b47c77eb David Hildenbrand 2022-03-24 4339 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4340) if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4341) !folio_test_ksm(folio) && !folio_test_lru(folio)) c145e0b47c77eb David Hildenbrand 2022-03-24 4342 lru_add_drain(); 84d60fdd3733fb David Hildenbrand 2022-03-24 4343 } 5ad6468801d28c Hugh Dickins 2009-12-14 4344 4231f8425833b1 Kefeng Wang 2023-03-02 4345 folio_throttle_swaprate(folio, GFP_KERNEL); 8a9f3ccd24741b Balbir Singh 2008-02-07 4346 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4347 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4348 * Back out if somebody else already faulted in this pte. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4349 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4350 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, 82b0f8c39a3869 Jan Kara 2016-12-14 4351 &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4352 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) b81074800b98ac Kirill Korotaev 2005-05-16 4353 goto out_nomap; b81074800b98ac Kirill Korotaev 2005-05-16 4354 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4355) if (unlikely(!folio_test_uptodate(folio))) { b81074800b98ac Kirill Korotaev 2005-05-16 4356 ret = VM_FAULT_SIGBUS; b81074800b98ac Kirill Korotaev 2005-05-16 4357 goto out_nomap; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4358 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4359 684d098daf0b3a Chuanhua Han 2024-07-26 4360 /* allocated large folios for SWP_SYNCHRONOUS_IO */ 684d098daf0b3a Chuanhua Han 2024-07-26 4361 if (folio_test_large(folio) && !folio_test_swapcache(folio)) { 684d098daf0b3a Chuanhua Han 2024-07-26 4362 unsigned long nr = folio_nr_pages(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4363 unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); 684d098daf0b3a Chuanhua Han 2024-07-26 4364 unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; 684d098daf0b3a Chuanhua Han 2024-07-26 4365 pte_t *folio_ptep = vmf->pte - idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4366 684d098daf0b3a Chuanhua Han 2024-07-26 4367 if (!can_swapin_thp(vmf, folio_ptep, nr)) 684d098daf0b3a Chuanhua Han 2024-07-26 4368 goto out_nomap; 684d098daf0b3a Chuanhua Han 2024-07-26 4369 684d098daf0b3a Chuanhua Han 2024-07-26 4370 page_idx = idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4371 address = folio_start; 684d098daf0b3a Chuanhua Han 2024-07-26 4372 ptep = folio_ptep; 684d098daf0b3a Chuanhua Han 2024-07-26 4373 goto check_folio; Let's say we hit this goto 684d098daf0b3a Chuanhua Han 2024-07-26 4374 } 684d098daf0b3a Chuanhua Han 2024-07-26 4375 508758960b8d89 Chuanhua Han 2024-05-29 4376 nr_pages = 1; 508758960b8d89 Chuanhua Han 2024-05-29 4377 page_idx = 0; 508758960b8d89 Chuanhua Han 2024-05-29 4378 address = vmf->address; 508758960b8d89 Chuanhua Han 2024-05-29 4379 ptep = vmf->pte; 508758960b8d89 Chuanhua Han 2024-05-29 4380 if (folio_test_large(folio) && folio_test_swapcache(folio)) { 508758960b8d89 Chuanhua Han 2024-05-29 4381 int nr = folio_nr_pages(folio); 508758960b8d89 Chuanhua Han 2024-05-29 4382 unsigned long idx = folio_page_idx(folio, page); 508758960b8d89 Chuanhua Han 2024-05-29 4383 unsigned long folio_start = address - idx * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4384 unsigned long folio_end = folio_start + nr * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4385 pte_t *folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4386 pte_t folio_pte; 508758960b8d89 Chuanhua Han 2024-05-29 4387 508758960b8d89 Chuanhua Han 2024-05-29 4388 if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) 508758960b8d89 Chuanhua Han 2024-05-29 4389 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4390 if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) 508758960b8d89 Chuanhua Han 2024-05-29 4391 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4392 508758960b8d89 Chuanhua Han 2024-05-29 4393 folio_ptep = vmf->pte - idx; 508758960b8d89 Chuanhua Han 2024-05-29 4394 folio_pte = ptep_get(folio_ptep); 508758960b8d89 Chuanhua Han 2024-05-29 4395 if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || 508758960b8d89 Chuanhua Han 2024-05-29 4396 
swap_pte_batch(folio_ptep, nr, folio_pte) != nr) 508758960b8d89 Chuanhua Han 2024-05-29 4397 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4398 508758960b8d89 Chuanhua Han 2024-05-29 4399 page_idx = idx; 508758960b8d89 Chuanhua Han 2024-05-29 4400 address = folio_start; 508758960b8d89 Chuanhua Han 2024-05-29 4401 ptep = folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4402 nr_pages = nr; 508758960b8d89 Chuanhua Han 2024-05-29 4403 entry = folio->swap; 508758960b8d89 Chuanhua Han 2024-05-29 4404 page = &folio->page; 508758960b8d89 Chuanhua Han 2024-05-29 4405 } 508758960b8d89 Chuanhua Han 2024-05-29 4406 508758960b8d89 Chuanhua Han 2024-05-29 4407 check_folio: 78fbe906cc900b David Hildenbrand 2022-05-09 4408 /* 78fbe906cc900b David Hildenbrand 2022-05-09 4409 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte 78fbe906cc900b David Hildenbrand 2022-05-09 4410 * must never point at an anonymous page in the swapcache that is 78fbe906cc900b David Hildenbrand 2022-05-09 4411 * PG_anon_exclusive. Sanity check that this holds and especially, that 78fbe906cc900b David Hildenbrand 2022-05-09 4412 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity 78fbe906cc900b David Hildenbrand 2022-05-09 4413 * check after taking the PT lock and making sure that nobody 78fbe906cc900b David Hildenbrand 2022-05-09 4414 * concurrently faulted in this page and set PG_anon_exclusive. 78fbe906cc900b David Hildenbrand 2022-05-09 4415 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4416) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4417) BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); 78fbe906cc900b David Hildenbrand 2022-05-09 4418 1493a1913e34b0 David Hildenbrand 2022-05-09 4419 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4420 * Check under PT lock (to protect against concurrent fork() sharing 1493a1913e34b0 David Hildenbrand 2022-05-09 4421 * the swap entry concurrently) for certainly exclusive pages. 1493a1913e34b0 David Hildenbrand 2022-05-09 4422 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4423) if (!folio_test_ksm(folio)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4424 exclusive = pte_swp_exclusive(vmf->orig_pte); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4425) if (folio != swapcache) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4426 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4427 * We have a fresh page that is not exposed to the 1493a1913e34b0 David Hildenbrand 2022-05-09 4428 * swapcache -> certainly exclusive. 1493a1913e34b0 David Hildenbrand 2022-05-09 4429 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4430 exclusive = true; 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4431) } else if (exclusive && folio_test_writeback(folio) && eacde32757c756 Miaohe Lin 2022-05-19 4432 data_race(si->flags & SWP_STABLE_WRITES)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4433 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4434 * This is tricky: not all swap backends support 1493a1913e34b0 David Hildenbrand 2022-05-09 4435 * concurrent page modifications while under writeback. 
1493a1913e34b0 David Hildenbrand 2022-05-09 4436 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4437 * So if we stumble over such a page in the swapcache 1493a1913e34b0 David Hildenbrand 2022-05-09 4438 * we must not set the page exclusive, otherwise we can 1493a1913e34b0 David Hildenbrand 2022-05-09 4439 * map it writable without further checks and modify it 1493a1913e34b0 David Hildenbrand 2022-05-09 4440 * while still under writeback. 1493a1913e34b0 David Hildenbrand 2022-05-09 4441 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4442 * For these problematic swap backends, simply drop the 1493a1913e34b0 David Hildenbrand 2022-05-09 4443 * exclusive marker: this is perfectly fine as we start 1493a1913e34b0 David Hildenbrand 2022-05-09 4444 * writeback only if we fully unmapped the page and 1493a1913e34b0 David Hildenbrand 2022-05-09 4445 * there are no unexpected references on the page after 1493a1913e34b0 David Hildenbrand 2022-05-09 4446 * unmapping succeeded. After fully unmapped, no 1493a1913e34b0 David Hildenbrand 2022-05-09 4447 * further GUP references (FOLL_GET and FOLL_PIN) can 1493a1913e34b0 David Hildenbrand 2022-05-09 4448 * appear, so dropping the exclusive marker and mapping 1493a1913e34b0 David Hildenbrand 2022-05-09 4449 * it only R/O is fine. 1493a1913e34b0 David Hildenbrand 2022-05-09 4450 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4451 exclusive = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4452 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4453 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4454 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4455 /* 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4456 * Some architectures may have to restore extra metadata to the page 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4457 * when reading from swap. This metadata may be indexed by swap entry 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4458 * so this must be called before swap_free(). 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4459 */ f238b8c33c6738 Barry Song 2024-03-23 4460 arch_swap_restore(folio_swap(entry, folio), folio); 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4461 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4462 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4463 * Remove the swap entry and conditionally try to free up the swapcache. c145e0b47c77eb David Hildenbrand 2022-03-24 4464 * We're already holding a reference on the page but haven't mapped it c145e0b47c77eb David Hildenbrand 2022-03-24 4465 * yet. 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4466 */ 508758960b8d89 Chuanhua Han 2024-05-29 @4467 swap_free_nr(entry, nr_pages); ^^^^^^^^ Smatch warning. The code is a bit complicated so it could be a false positive. 
a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4468) if (should_try_to_free_swap(folio, vma, vmf->flags)) a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4469) folio_free_swap(folio); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4470 508758960b8d89 Chuanhua Han 2024-05-29 4471 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4472 add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4473 pte = mk_pte(page, vma->vm_page_prot); c18160dba5ff63 Barry Song 2024-06-02 4474 if (pte_swp_soft_dirty(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4475 pte = pte_mksoft_dirty(pte); c18160dba5ff63 Barry Song 2024-06-02 4476 if (pte_swp_uffd_wp(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4477 pte = pte_mkuffd_wp(pte); c145e0b47c77eb David Hildenbrand 2022-03-24 4478 c145e0b47c77eb David Hildenbrand 2022-03-24 4479 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4480 * Same logic as in do_wp_page(); however, optimize for pages that are 1493a1913e34b0 David Hildenbrand 2022-05-09 4481 * certainly not shared either because we just allocated them without 1493a1913e34b0 David Hildenbrand 2022-05-09 4482 * exposing them to the swapcache or because the swap entry indicates 1493a1913e34b0 David Hildenbrand 2022-05-09 4483 * exclusivity. c145e0b47c77eb David Hildenbrand 2022-03-24 4484 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4485) if (!folio_test_ksm(folio) && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4486) (exclusive || folio_ref_count(folio) == 1)) { c18160dba5ff63 Barry Song 2024-06-02 4487 if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && 20dfa5b7adc5a1 Barry Song 2024-06-08 4488 !pte_needs_soft_dirty_wp(vma, pte)) { c18160dba5ff63 Barry Song 2024-06-02 4489 pte = pte_mkwrite(pte, vma); 6c287605fd5646 David Hildenbrand 2022-05-09 4490 if (vmf->flags & FAULT_FLAG_WRITE) { c18160dba5ff63 Barry Song 2024-06-02 4491 pte = pte_mkdirty(pte); 82b0f8c39a3869 Jan Kara 2016-12-14 4492 vmf->flags &= ~FAULT_FLAG_WRITE; 6c287605fd5646 David Hildenbrand 2022-05-09 4493 } c18160dba5ff63 Barry Song 2024-06-02 4494 } 14f9135d547060 David Hildenbrand 2022-05-09 4495 rmap_flags |= RMAP_EXCLUSIVE; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4496 } 508758960b8d89 Chuanhua Han 2024-05-29 4497 folio_ref_add(folio, nr_pages - 1); 508758960b8d89 Chuanhua Han 2024-05-29 4498 flush_icache_pages(vma, page, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4499 vmf->orig_pte = pte_advance_pfn(pte, page_idx); 0bcac06f27d752 Minchan Kim 2017-11-15 4500 0bcac06f27d752 Minchan Kim 2017-11-15 4501 /* ksm created a completely new copy */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4502) if (unlikely(folio != swapcache && swapcache)) { 15bde4abab734c Barry Song 2024-06-18 4503 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4504) folio_add_lru_vma(folio, vma); 9ae2feacedde16 Barry Song 2024-06-18 4505 } else if (!folio_test_anon(folio)) { 9ae2feacedde16 Barry Song 2024-06-18 4506 /* 684d098daf0b3a Chuanhua Han 2024-07-26 4507 * We currently only expect small !anon folios which are either 684d098daf0b3a Chuanhua Han 2024-07-26 4508 * fully exclusive or fully shared, or new allocated large folios 684d098daf0b3a Chuanhua Han 2024-07-26 4509 * which are fully exclusive. If we ever get large folios within 684d098daf0b3a Chuanhua Han 2024-07-26 4510 * swapcache here, we have to be careful. 
9ae2feacedde16 Barry Song 2024-06-18 4511 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4512 VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); 9ae2feacedde16 Barry Song 2024-06-18 4513 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 9ae2feacedde16 Barry Song 2024-06-18 4514 folio_add_new_anon_rmap(folio, vma, address, rmap_flags); 0bcac06f27d752 Minchan Kim 2017-11-15 4515 } else { 508758960b8d89 Chuanhua Han 2024-05-29 4516 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, b832a354d787bf David Hildenbrand 2023-12-20 4517 rmap_flags); 00501b531c4723 Johannes Weiner 2014-08-08 4518 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4519 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4520) VM_BUG_ON(!folio_test_anon(folio) || 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4521) (pte_write(pte) && !PageAnonExclusive(page))); 508758960b8d89 Chuanhua Han 2024-05-29 4522 set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4523 arch_do_swap_page_nr(vma->vm_mm, vma, address, 508758960b8d89 Chuanhua Han 2024-05-29 4524 pte, pte, nr_pages); 1eba86c096e35e Pasha Tatashin 2022-01-14 4525 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4526) folio_unlock(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4527) if (folio != swapcache && swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4528 /* 4969c1192d15af Andrea Arcangeli 2010-09-09 4529 * Hold the lock to avoid the swap entry to be reused 4969c1192d15af Andrea Arcangeli 2010-09-09 4530 * until we take the PT lock for the pte_same() check 4969c1192d15af Andrea Arcangeli 2010-09-09 4531 * (to avoid false positives from pte_same). For 4969c1192d15af Andrea Arcangeli 2010-09-09 4532 * further safety release the lock after the swap_free 4969c1192d15af Andrea Arcangeli 2010-09-09 4533 * so that the swap count won't change under a 4969c1192d15af Andrea Arcangeli 2010-09-09 4534 * parallel locked swapcache. 
4969c1192d15af Andrea Arcangeli 2010-09-09 4535 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4536) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4537) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4538 } c475a8ab625d56 Hugh Dickins 2005-06-21 4539 82b0f8c39a3869 Jan Kara 2016-12-14 4540 if (vmf->flags & FAULT_FLAG_WRITE) { 2994302bc8a171 Jan Kara 2016-12-14 4541 ret |= do_wp_page(vmf); 61469f1d51777f Hugh Dickins 2008-03-04 4542 if (ret & VM_FAULT_ERROR) 61469f1d51777f Hugh Dickins 2008-03-04 4543 ret &= VM_FAULT_ERROR; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4544 goto out; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4545 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4546 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4547 /* No need to invalidate - it was non-present before */ 508758960b8d89 Chuanhua Han 2024-05-29 4548 update_mmu_cache_range(vmf, vma, address, ptep, nr_pages); 65500d234e74fc Hugh Dickins 2005-10-29 4549 unlock: 3db82b9374ca92 Hugh Dickins 2023-06-08 4550 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4551 pte_unmap_unlock(vmf->pte, vmf->ptl); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4552 out: 13ddaf26be324a Kairui Song 2024-02-07 4553 /* Clear the swap cache pin for direct swapin after PTL unlock */ 13ddaf26be324a Kairui Song 2024-02-07 4554 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4555 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4556 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4557 put_swap_device(si); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4558 return ret; b81074800b98ac Kirill Korotaev 2005-05-16 4559 out_nomap: 3db82b9374ca92 Hugh Dickins 2023-06-08 4560 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4561 pte_unmap_unlock(vmf->pte, vmf->ptl); bc43f75cd98158 Johannes Weiner 2009-04-30 4562 out_page: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4563) folio_unlock(folio); 4779cb31c0ee3b Andi Kleen 2009-10-14 4564 out_release: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4565) folio_put(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4566) if (folio != swapcache && swapcache) { d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4567) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4568) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4569 } 13ddaf26be324a Kairui Song 2024-02-07 4570 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4571 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4572 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4573 put_swap_device(si); 65500d234e74fc Hugh Dickins 2005-10-29 4574 return ret; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4575 } -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 59+ messages in thread
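Stripped of the surrounding detail, the control-flow shape smatch is complaining about looks like the fragment below: nr_pages is assigned inside a conditional branch, a later goto can jump over the unconditional nr_pages = 1, and the tool cannot prove that the two conditions always pair up. This is only a minimal reproduction of the warning pattern with invented names; it makes no claim about whether the flagged kernel path is actually reachable.

#include <stdio.h>

/* Minimal shape of the do_swap_page() flow that trips the checker. */
static int model(int skip_swapcache, int large)
{
	int nr_pages;			/* deliberately not initialised */

	if (skip_swapcache)
		nr_pages = 4;		/* like folio_nr_pages() in the alloc path */

	if (large)
		goto check;		/* jumps over the nr_pages = 1 below */

	nr_pages = 1;
check:
	/*
	 * A static checker that cannot prove "large implies skip_swapcache"
	 * reports nr_pages as possibly uninitialised here.
	 */
	return nr_pages;
}

int main(int argc, char **argv)
{
	(void)argv;
	printf("%d\n", model(argc > 1, argc > 2));
	return 0;
}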
* [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song ` (2 preceding siblings ...) 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-27 5:58 ` kernel test robot 2024-07-29 3:52 ` Matthew Wilcox 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> Quote Ying's comment: A user space interface can be implemented to select different swap-in order policies, similar to the mTHP allocation order policy. We need a distinct policy because the performance characteristics of memory allocation differ significantly from those of swap-in. For example, SSD read speeds can be much slower than memory allocation. With policy selection, I believe we can implement mTHP swap-in for non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand the implications of their choices. I think that it's better to start with at least always never. I believe that we will add auto in the future to tune automatically, which can be used as default finally. Suggested-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- Documentation/admin-guide/mm/transhuge.rst | 6 +++ include/linux/huge_mm.h | 1 + mm/huge_memory.c | 44 ++++++++++++++++++++++ mm/memory.c | 3 +- 4 files changed, 53 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 058485daf186..2e94e956ee12 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -144,6 +144,12 @@ hugepage sizes have enabled="never". If enabling multiple hugepage sizes, the kernel will select the most appropriate enabled size for a given allocation. 
+Transparent Hugepage Swap-in for anonymous memory can be disabled or enabled +by per-supported-THP-size with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled + echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled + It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise regions or to never try to defrag memory and simply fallback to regular diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e25d9ebfdf89..25174305b17f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -92,6 +92,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define TVA_SMAPS (1 << 0) /* Will be used for procfs */ #define TVA_IN_PF (1 << 1) /* Page fault handler */ #define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */ +#define TVA_IN_SWAPIN (1 << 3) /* Do swap-in */ #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0167dc27e365..41460847988c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -80,6 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; +unsigned long huge_anon_orders_swapin_always __read_mostly; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, @@ -88,6 +89,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, { bool smaps = tva_flags & TVA_SMAPS; bool in_pf = tva_flags & TVA_IN_PF; + bool in_swapin = tva_flags & TVA_IN_SWAPIN; bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS; unsigned long supported_orders; @@ -100,6 +102,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, supported_orders = THP_ORDERS_ALL_FILE_DEFAULT; orders &= supported_orders; + if (in_swapin) + orders &= READ_ONCE(huge_anon_orders_swapin_always); if (!orders) return 0; @@ -523,8 +527,48 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj, static struct kobj_attribute thpsize_enabled_attr = __ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store); +static DEFINE_SPINLOCK(huge_anon_orders_swapin_lock); + +static ssize_t thpsize_swapin_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int order = to_thpsize(kobj)->order; + const char *output; + + if (test_bit(order, &huge_anon_orders_swapin_always)) + output = "[always] never"; + else + output = "always [never]"; + + return sysfs_emit(buf, "%s\n", output); +} + +static ssize_t thpsize_swapin_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int order = to_thpsize(kobj)->order; + ssize_t ret = count; + + if (sysfs_streq(buf, "always")) { + spin_lock(&huge_anon_orders_swapin_lock); + set_bit(order, &huge_anon_orders_swapin_always); + spin_unlock(&huge_anon_orders_swapin_lock); + } else if (sysfs_streq(buf, "never")) { + spin_lock(&huge_anon_orders_swapin_lock); + clear_bit(order, &huge_anon_orders_swapin_always); + spin_unlock(&huge_anon_orders_swapin_lock); + } else + ret = -EINVAL; + + return ret; +} +static struct kobj_attribute thpsize_swapin_enabled_attr = + __ATTR(swapin_enabled, 0644, thpsize_swapin_enabled_show, thpsize_swapin_enabled_store); + static struct attribute 
*thpsize_attrs[] = { &thpsize_enabled_attr.attr, + &thpsize_swapin_enabled_attr.attr, #ifdef CONFIG_SHMEM &thpsize_shmem_enabled_attr.attr, #endif diff --git a/mm/memory.c b/mm/memory.c index 14048e9285d4..27c77f739a2c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4091,7 +4091,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * and suitable for swapping THP. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS, + BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
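[Editorial note] For readers unfamiliar with how the per-order sysfs toggles in the patch above combine into a single mask, the following standalone C sketch models the logic: each supported order is one bit in an unsigned long, writing "always"/"never" sets or clears that bit, and the swap-in path simply ANDs the caller's candidate orders with the mask (mirroring orders &= READ_ONCE(huge_anon_orders_swapin_always)). This is a userspace model for illustration only, not kernel code; all names in it are invented for the example.

#include <stdio.h>
#include <string.h>

/* Userspace model of the per-order "swapin_enabled" mask from the patch above. */
static unsigned long swapin_always_mask;   /* stands in for huge_anon_orders_swapin_always */

/* Model of the sysfs store: "always" sets the order's bit, "never" clears it. */
static int swapin_enabled_store(int order, const char *buf)
{
	if (strcmp(buf, "always") == 0)
		swapin_always_mask |= 1UL << order;
	else if (strcmp(buf, "never") == 0)
		swapin_always_mask &= ~(1UL << order);
	else
		return -1;		/* -EINVAL in the kernel version */
	return 0;
}

/* Model of the TVA_IN_SWAPIN filtering in __thp_vma_allowable_orders(). */
static unsigned long allowable_swapin_orders(unsigned long candidate_orders)
{
	return candidate_orders & swapin_always_mask;
}

int main(void)
{
	/* Pretend the fault handler considered orders 2..8 (16KiB..1MiB with 4KiB pages). */
	unsigned long candidates = 0;
	for (int order = 2; order <= 8; order++)
		candidates |= 1UL << order;

	swapin_enabled_store(4, "always");	/* echo always > .../hugepages-64kB/swapin_enabled */
	swapin_enabled_store(9, "always");	/* 2MiB, not among the candidates here */

	printf("allowed swap-in orders mask: %#lx\n",
	       allowable_swapin_orders(candidates));	/* only bit 4 survives */
	return 0;
}

In the actual patch the mask is updated under huge_anon_orders_swapin_lock and read locklessly with READ_ONCE() in the fault path, which the toy model above does not attempt to reproduce.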
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song @ 2024-07-27 5:58 ` kernel test robot 2024-07-29 1:37 ` Barry Song 2024-07-29 3:52 ` Matthew Wilcox 1 sibling, 1 reply; 59+ messages in thread From: kernel test robot @ 2024-07-27 5:58 UTC (permalink / raw) To: Barry Song, akpm, linux-mm Cc: oe-kbuild-all, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed Hi Barry, kernel test robot noticed the following build warnings: [auto build test WARNING on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config) compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static? 
mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c 51 52 /* 53 * By default, transparent hugepage support is disabled in order to avoid 54 * risking an increased memory footprint for applications that are not 55 * 
guaranteed to benefit from it. When transparent hugepage support is 56 * enabled, it is for all mappings, and khugepaged scans all mappings. 57 * Defrag is invoked by khugepaged hugepage allocations and by page faults 58 * for all hugepage allocations. 59 */ 60 unsigned long transparent_hugepage_flags __read_mostly = 61 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS 62 (1<<TRANSPARENT_HUGEPAGE_FLAG)| 63 #endif 64 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE 65 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)| 66 #endif 67 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)| 68 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)| 69 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); 70 71 static struct shrinker *deferred_split_shrinker; 72 static unsigned long deferred_split_count(struct shrinker *shrink, 73 struct shrink_control *sc); 74 static unsigned long deferred_split_scan(struct shrinker *shrink, 75 struct shrink_control *sc); 76 77 static atomic_t huge_zero_refcount; 78 struct folio *huge_zero_folio __read_mostly; 79 unsigned long huge_zero_pfn __read_mostly = ~0UL; 80 unsigned long huge_anon_orders_always __read_mostly; 81 unsigned long huge_anon_orders_madvise __read_mostly; 82 unsigned long huge_anon_orders_inherit __read_mostly; > 83 unsigned long huge_anon_orders_swapin_always __read_mostly; 84 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-27 5:58 ` kernel test robot @ 2024-07-29 1:37 ` Barry Song 0 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-07-29 1:37 UTC (permalink / raw) To: lkp Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs, oe-kbuild-all, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Sat, Jul 27, 2024 at 5:58 PM kernel test robot <lkp@intel.com> wrote: > > Hi Barry, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on akpm-mm/mm-everything] Hi Thanks! Would you check if the below patch fixes the problem? diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 41460847988c..06984a325af7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -80,7 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; -unsigned long huge_anon_orders_swapin_always __read_mostly; +static unsigned long huge_anon_orders_swapin_always __read_mostly; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, > > url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything > patch link: https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com > patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy > config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config) > compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0 > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/ > > sparse warnings: (new ones prefixed by >>) > >> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static? 
> mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock > mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock > mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock > mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit > mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit > mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock > mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock > mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock > mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > > vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c > > 51 > 52 /* > 53 * By default, transparent hugepage support is disabled in order to avoid > 54 
* risking an increased memory footprint for applications that are not > 55 * guaranteed to benefit from it. When transparent hugepage support is > 56 * enabled, it is for all mappings, and khugepaged scans all mappings. > 57 * Defrag is invoked by khugepaged hugepage allocations and by page faults > 58 * for all hugepage allocations. > 59 */ > 60 unsigned long transparent_hugepage_flags __read_mostly = > 61 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS > 62 (1<<TRANSPARENT_HUGEPAGE_FLAG)| > 63 #endif > 64 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE > 65 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)| > 66 #endif > 67 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)| > 68 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)| > 69 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); > 70 > 71 static struct shrinker *deferred_split_shrinker; > 72 static unsigned long deferred_split_count(struct shrinker *shrink, > 73 struct shrink_control *sc); > 74 static unsigned long deferred_split_scan(struct shrinker *shrink, > 75 struct shrink_control *sc); > 76 > 77 static atomic_t huge_zero_refcount; > 78 struct folio *huge_zero_folio __read_mostly; > 79 unsigned long huge_zero_pfn __read_mostly = ~0UL; > 80 unsigned long huge_anon_orders_always __read_mostly; > 81 unsigned long huge_anon_orders_madvise __read_mostly; > 82 unsigned long huge_anon_orders_inherit __read_mostly; > > 83 unsigned long huge_anon_orders_swapin_always __read_mostly; > 84 > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki Thanks Barry ^ permalink raw reply related [flat|nested] 59+ messages in thread
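[Editorial note] The sparse warning is purely a linkage issue: a file-scope symbol that is used only inside mm/huge_memory.c and has no extern declaration in any header should have internal linkage, which is what the one-line fix above provides. A minimal standalone illustration of the difference follows; the variable names are made up for the example.

/* one_file.c -- compile with: gcc -Wall one_file.c */
#include <stdio.h>

/* External linkage: visible to every translation unit. sparse expects such a
 * symbol to be declared in a shared header, otherwise it suggests static. */
unsigned long exported_mask;

/* Internal linkage: private to this file, which is what the one-line fix
 * turns huge_anon_orders_swapin_always into. */
static unsigned long private_mask;

int main(void)
{
	private_mask |= 1UL << 4;
	exported_mask |= private_mask;
	printf("%#lx %#lx\n", exported_mask, private_mask);
	return 0;
}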
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song 2024-07-27 5:58 ` kernel test robot @ 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song ` (3 more replies) 1 sibling, 4 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 3:52 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: > A user space interface can be implemented to select different swap-in > order policies, similar to the mTHP allocation order policy. We need > a distinct policy because the performance characteristics of memory > allocation differ significantly from those of swap-in. For example, > SSD read speeds can be much slower than memory allocation. With > policy selection, I believe we can implement mTHP swap-in for > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand > the implications of their choices. I think that it's better to start > with at least always never. I believe that we will add auto in the > future to tune automatically, which can be used as default finally. I strongly disagree. Use the same sysctl as the other anonymous memory allocations. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox @ 2024-07-29 4:49 ` Barry Song 2024-07-29 16:11 ` Christoph Hellwig ` (2 subsequent siblings) 3 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-07-29 4:49 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Jul 29, 2024 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: > > A user space interface can be implemented to select different swap-in > > order policies, similar to the mTHP allocation order policy. We need > > a distinct policy because the performance characteristics of memory > > allocation differ significantly from those of swap-in. For example, > > SSD read speeds can be much slower than memory allocation. With > > policy selection, I believe we can implement mTHP swap-in for > > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand > > the implications of their choices. I think that it's better to start > > with at least always never. I believe that we will add auto in the > > future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. In versions v1-v4, we used the same controls as anonymous memory allocations. Ying expressed concerns that this approach isn't always ideal, especially for non-zRAM devices, as SSD read speeds can be much slower than memory allocation. I think his concern is reasonable to some extent. However, this patchset only addresses scenarios involving zRAM-like devices and will not impact SSDs. I would like to get Ying's feedback on whether it's acceptable to drop this one in v6. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song @ 2024-07-29 16:11 ` Christoph Hellwig 2024-07-29 20:11 ` Barry Song 2024-07-30 2:27 ` Chuanhua Han 2024-07-30 8:36 ` Ryan Roberts 2024-08-05 6:10 ` Huang, Ying 3 siblings, 2 replies; 59+ messages in thread From: Christoph Hellwig @ 2024-07-29 16:11 UTC (permalink / raw) To: Matthew Wilcox Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I agree with Matthew here. We also really need to stop optimizing for this weird zram case and move people to zswap instead after fixing the various issues. A special block device that isn't really a block device and needs various special hooks isn't the right abstraction for different zwap strategies. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 16:11 ` Christoph Hellwig @ 2024-07-29 20:11 ` Barry Song 2024-07-30 16:30 ` Christoph Hellwig 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 20:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 4:11 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > > I strongly disagree. Use the same sysctl as the other anonymous memory > > allocations. > > I agree with Matthew here. The existing anonymous memory allocation control is still honoured here; this knob is only an additional filter on top of it (allocation control & swap-in policy), added primarily to address the SSD concern raised in the comments on v4, not for zRAM. > > We also really need to stop optimizing for this weird zram case and move > people to zswap instead after fixing the various issues. A special > block device that isn't really a block device and needs various special > hooks isn't the right abstraction for different zwap strategies. My understanding is that zRAM is far more widely used in embedded systems than zswap. I seldom (if ever) hear of anyone using zswap on Android. It seems pointless to force people to move to zswap; on embedded systems there is no real backing block device behind zswap. > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 20:11 ` Barry Song @ 2024-07-30 16:30 ` Christoph Hellwig 2024-07-30 19:28 ` Nhat Pham 2024-08-01 20:55 ` Chris Li 0 siblings, 2 replies; 59+ messages in thread From: Christoph Hellwig @ 2024-07-30 16:30 UTC (permalink / raw) To: Barry Song Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote: > > We also really need to stop optimizing for this weird zram case and move > > people to zswap instead after fixing the various issues. A special > > block device that isn't really a block device and needs various special > > hooks isn't the right abstraction for different zwap strategies. > > My understanding is zRAM is much more popularly used in embedded > systems than zswap. I seldomly(or never) hear who is using zswap > in Android. it seems pointless to force people to move to zswap, in > embedded systems we don't have a backend real block disk device > after zswap. Well, that is the point. zram is a horrible hack that abuses a block device to implement a feature missing the VM layer. Right now people have a reason for it because zswap requires a "real" backing device and that's fine for them and for now. But instead of building VM infrastructure around these kinds of hacks we need to fix the VM infrastructure. Chris Li has been talking about and working towards a proper swap abstraction and that needs to happen. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 16:30 ` Christoph Hellwig @ 2024-07-30 19:28 ` Nhat Pham 2024-07-30 21:06 ` Barry Song 2024-08-01 20:55 ` Chris Li 1 sibling, 1 reply; 59+ messages in thread From: Nhat Pham @ 2024-07-30 19:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > > Well, that is the point. zram is a horrible hack that abuses a block > device to implement a feature missing the VM layer. Right now people > have a reason for it because zswap requires a "real" backing device > and that's fine for them and for now. But instead of building VM I completely agree with this assessment. > infrastructure around these kinds of hacks we need to fix the VM > infrastructure. Chris Li has been talking about and working towards > a proper swap abstraction and that needs to happen. I'm also working towards something along this line. My design would add a "virtual" swap ID that will be stored in the page table, and can refer to either a real, storage-backed swap entry, or a zswap entry. zswap can then exist without any backing swap device. There are several additional benefits of this approach: 1. We can optimize swapoff as well - the page table can still refer to the swap ID, but the ID now points to a physical page frame. swapoff code just needs to sever the link from the swap ID to the physical swap entry (which either just requires a swap ID mapping walk, or even faster if we have a reverse mapping mechanism), and update the link to the page frame instead. 2. We can take this opportunity to clean up the swap count code. I'd be happy to collaborate/compare notes :) ^ permalink raw reply [flat|nested] 59+ messages in thread
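[Editorial note] Nhat's "virtual swap ID" idea is easiest to picture as an extra level of indirection between what the page table stores and where the data actually lives. Below is a rough standalone C sketch of such a descriptor; the type names, fields, and layout are invented for illustration and do not come from any posted patch.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative model of a "virtual swap ID": the page table would store only
 * the ID, and a descriptor records where the data currently lives. */
enum vswap_backing {
	VSWAP_NONE,		/* unused slot */
	VSWAP_SWAPDEV,		/* a real slot on a swap device */
	VSWAP_ZSWAP,		/* a compressed in-memory object */
	VSWAP_PAGE,		/* an in-memory page, e.g. after swapoff */
};

struct vswap_desc {
	enum vswap_backing backing;
	union {
		struct { int type; long offset; } slot;	/* swap device slot */
		void *zswap_entry;			/* opaque zswap handle */
		void *page;				/* page frame after swapoff */
	};
	int swap_count;		/* one place to keep the count, per point 2 above */
};

/* Swapoff in this model: retarget the descriptor from the device slot to the
 * page that was read back in -- page tables holding the ID need no change. */
static void vswap_swapoff_one(struct vswap_desc *d, void *in_memory_page)
{
	if (d->backing != VSWAP_SWAPDEV)
		return;
	d->backing = VSWAP_PAGE;
	d->page = in_memory_page;
}

int main(void)
{
	struct vswap_desc d = {
		.backing = VSWAP_SWAPDEV,
		.slot = { .type = 0, .offset = 12345 },
		.swap_count = 1,
	};
	void *fake_page = malloc(4096);

	vswap_swapoff_one(&d, fake_page);
	printf("backing=%d page=%p count=%d\n", d.backing, d.page, d.swap_count);
	free(fake_page);
	return 0;
}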
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 19:28 ` Nhat Pham @ 2024-07-30 21:06 ` Barry Song 2024-07-31 18:35 ` Nhat Pham 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-30 21:06 UTC (permalink / raw) To: Nhat Pham Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Wed, Jul 31, 2024 at 7:28 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > Well, that is the point. zram is a horrible hack that abuses a block > > device to implement a feature missing the VM layer. Right now people > > have a reason for it because zswap requires a "real" backing device > > and that's fine for them and for now. But instead of building VM > > I completely agree with this assessment. > > > infrastructure around these kinds of hacks we need to fix the VM > > infrastructure. Chris Li has been talking about and working towards > > a proper swap abstraction and that needs to happen. > > I'm also working towards something along this line. My design would > add a "virtual" swap ID that will be stored in the page table, and can > refer to either a real, storage-backed swap entry, or a zswap entry. > zswap can then exist without any backing swap device. > > There are several additional benefits of this approach: > > 1. We can optimize swapoff as well - the page table can still refer to > the swap ID, but the ID now points to a physical page frame. swapoff > code just needs to sever the link from the swap ID to the physical > swap entry (which either just requires a swap ID mapping walk, or even > faster if we have a reverse mapping mechanism), and update the link to > the page frame instead. > > 2. We can take this opportunity to clean up the swap count code. > > I'd be happy to collaborate/compare notes :) I appreciate that you have a good plan, and I welcome the improvements in zswap. However, we need to face reality. Having a good plan doesn't mean we should wait for you to proceed. In my experience, I've never heard of anyone using zswap in an embedded system, especially among the billions of Android devices.(Correct me if you know one.) How soon do you expect embedded systems and Android to adopt zswap? In one year, two years, five years, or ten years? Have you asked if Google plans to use zswap in Android? Currently, zswap does not support large folios, which is why Yosry has introduced an API like zswap_never_enabled() to allow others to explore parallel options like mTHP swap. Meanwhile, If zswap encounters large folios, it will trigger a SIGBUS error. I believe you were involved in those discussions: mm: zswap: add zswap_never_enabled() https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6 mm: zswap: handle incorrect attempts to load large folios https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1 Should everyone around the world hold off on working on mTHP swap until zswap has addressed the issue to support large folios? Not to mention whether people are ready and happy to switch to zswap. I don't see any reason why we should wait and not start implementing something that could benefit billions of devices worldwide. Parallel exploration leads to human progress in different fields. 
That's why I believe Yosry's patch, which allows others to move forward, is a more considerate approach. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 21:06 ` Barry Song @ 2024-07-31 18:35 ` Nhat Pham 2024-08-01 3:00 ` Sergey Senozhatsky 0 siblings, 1 reply; 59+ messages in thread From: Nhat Pham @ 2024-07-31 18:35 UTC (permalink / raw) To: Barry Song Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 2:06 PM Barry Song <21cnbao@gmail.com> wrote: > > > I'd be happy to collaborate/compare notes :) > > I appreciate that you have a good plan, and I welcome the improvements in zswap. > However, we need to face reality. Having a good plan doesn't mean we should > wait for you to proceed. > > In my experience, I've never heard of anyone using zswap in an embedded > system, especially among the billions of Android devices.(Correct me if you > know one.) How soon do you expect embedded systems and Android to adopt > zswap? In one year, two years, five years, or ten years? Have you asked if > Google plans to use zswap in Android? Well, no one uses zswap in an embedded environment precisely because of the aforementioned issues, which we are working to resolve :) > > Currently, zswap does not support large folios, which is why Yosry has > introduced > an API like zswap_never_enabled() to allow others to explore parallel > options like > mTHP swap. Meanwhile, If zswap encounters large folios, it will trigger a SIGBUS > error. I believe you were involved in those discussions: > > mm: zswap: add zswap_never_enabled() > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6 > mm: zswap: handle incorrect attempts to load large folios > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1 > I am, and for the record I reviewed and/or ack-ed all of these patches, and provided my inputs on how to move forward with zswap's support for large folios. I do not want zswap to prevent the development of the rest of the swap ecosystem. > Should everyone around the world hold off on working on mTHP swap until > zswap has addressed the issue to support large folios? Not to mention whether > people are ready and happy to switch to zswap. > I think you misunderstood my intention. For the record, I'm not trying to stop you from improving zram, and I'm not proposing that we kill zram right away. Well, at least not until zswap reaches feature parity with zram, which, as you point out, will take awhile. Both support for large folios and swap/zswap decoupling are on our agenda, and you're welcome to participate in the discussion - for what it's worth, your attempt with zram (+zstd) is the basis/proof-of-concept for our future efforts :) That said, I believe that there is a fundamental redundancy here, which we (zram and zswap developers) should resolve at some point by unifying the two memory compression systems. The sooner we can unify these two, the less effort we will have to spend on developing and maintaining two separate mechanisms for the same (or very similar) purpose. For instance, large folio support has to be done twice. Same goes with writeback/offloading to backend storage, etc. And I (admittedly with a bias), agree with Christoph that zswap is the way to go moving forwards. 
I will not address the rest - it seems there isn't anything down there to disagree with or discuss :) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-31 18:35 ` Nhat Pham @ 2024-08-01 3:00 ` Sergey Senozhatsky 0 siblings, 0 replies; 59+ messages in thread From: Sergey Senozhatsky @ 2024-08-01 3:00 UTC (permalink / raw) To: Nhat Pham Cc: Barry Song, Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On (24/07/31 11:35), Nhat Pham wrote: > > I'm not proposing that we kill zram right away. > Just for the record, zram is a generic block device and has use-cases outside of swap. Just mkfs on /dev/zram0, mount it and do whatever. The "kill zram" thing is not going to fly. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 16:30 ` Christoph Hellwig 2024-07-30 19:28 ` Nhat Pham @ 2024-08-01 20:55 ` Chris Li 2024-08-12 8:27 ` Christoph Hellwig 1 sibling, 1 reply; 59+ messages in thread From: Chris Li @ 2024-08-01 20:55 UTC (permalink / raw) To: Christoph Hellwig Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote: > > > We also really need to stop optimizing for this weird zram case and move > > > people to zswap instead after fixing the various issues. A special > > > block device that isn't really a block device and needs various special > > > hooks isn't the right abstraction for different zwap strategies. > > > > My understanding is zRAM is much more popularly used in embedded > > systems than zswap. I seldomly(or never) hear who is using zswap > > in Android. it seems pointless to force people to move to zswap, in > > embedded systems we don't have a backend real block disk device > > after zswap. > > Well, that is the point. zram is a horrible hack that abuses a block > device to implement a feature missing the VM layer. Right now people > have a reason for it because zswap requires a "real" backing device > and that's fine for them and for now. But instead of building VM > infrastructure around these kinds of hacks we need to fix the VM > infrastructure. Chris Li has been talking about and working towards > a proper swap abstraction and that needs to happen. Yes, I have been working on the swap allocator for the mTHP usage case. Haven't got to the zswap vs zram yet. Currently there is a feature gap between zswap and zram, so zswap doesn't do all the stuff zram does. For the zswap "real" backend issue, Google has been using the ghost swapfile for many years. That can be one way to get around that. The patch is much smaller than overhauling the swap back end abstraction. Currently Android uses zram and it needs to be the Android team's decision to move from zram to something else. I don't see that happening any time soon. There are practical limitations. Personally I have been using zram as some way to provide a block like device as my goto route for testing the swap stack. I still do an SSD drive swap test, but at the same time I want to reduce the SSD swap usage to avoid the wear on my SSD drive. I already destroyed two of my old HDD drives during the swap testing. The swap random seek is very unfriendly to HDD, not sure who is still using HDD for swap any more. Anyway, removing zram is never a goal of the swap abstraction because I am still using it. We can start with reducing the feature gap between zswap and ZRAM. The end of the day, it is the Android team's call using zram or not. Chris ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-08-01 20:55 ` Chris Li @ 2024-08-12 8:27 ` Christoph Hellwig 2024-08-12 8:44 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Christoph Hellwig @ 2024-08-12 8:27 UTC (permalink / raw) To: Chris Li Cc: Christoph Hellwig, Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote: > Currently Android uses zram and it needs to be the Android team's > decision to move from zram to something else. I don't see that > happening any time soon. There are practical limitations. No one can tell anyone to stop using things. But we can stop adding new hacks for this, and especially user facing controls. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-08-12 8:27 ` Christoph Hellwig @ 2024-08-12 8:44 ` Barry Song 0 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-08-12 8:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Li, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote: > > Currently Android uses zram and it needs to be the Android team's > > decision to move from zram to something else. I don't see that > > happening any time soon. There are practical limitations. > > No one can tell anyone to stop using things. But we can stop adding > new hacks for this, and especially user facing controls. Well, this user-facing control has absolutely nothing to do with zram-related hacks. It's meant to address a general issue, mainly concerning slow-speed swap devices like SSDs, as suggested in Ying's comment on v4. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 16:11 ` Christoph Hellwig 2024-07-29 20:11 ` Barry Song @ 2024-07-30 2:27 ` Chuanhua Han 1 sibling, 0 replies; 59+ messages in thread From: Chuanhua Han @ 2024-07-30 2:27 UTC (permalink / raw) To: Christoph Hellwig Cc: Matthew Wilcox, Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed Christoph Hellwig <hch@infradead.org> 于2024年7月30日周二 00:11写道: > > On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > > I strongly disagree. Use the same sysctl as the other anonymous memory > > allocations. > > I agree with Matthew here. > > We also really need to stop optimizing for this weird zram case and move > people to zswap instead after fixing the various issues. A special > block device that isn't really a block device and needs various special > hooks isn't the right abstraction for different zwap strategies. I disagree, zram is most popular in embedded systems (like Android). > > -- Thanks, Chuanhua ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song 2024-07-29 16:11 ` Christoph Hellwig @ 2024-07-30 8:36 ` Ryan Roberts 2024-07-30 8:47 ` David Hildenbrand 2024-08-05 6:10 ` Huang, Ying 3 siblings, 1 reply; 59+ messages in thread From: Ryan Roberts @ 2024-07-30 8:36 UTC (permalink / raw) To: Matthew Wilcox, Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On 29/07/2024 04:52, Matthew Wilcox wrote: > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >> A user space interface can be implemented to select different swap-in >> order policies, similar to the mTHP allocation order policy. We need >> a distinct policy because the performance characteristics of memory >> allocation differ significantly from those of swap-in. For example, >> SSD read speeds can be much slower than memory allocation. With >> policy selection, I believe we can implement mTHP swap-in for >> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >> the implications of their choices. I think that it's better to start >> with at least always never. I believe that we will add auto in the >> future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I vaguely recall arguing in the past that just because the user has requested 2M THP that doesn't mean its the right thing to do for performance to swap-in the whole 2M in one go. That's potentially a pretty huge latency, depending on where the backend is, and it could be a waste of IO if the application never touches most of the 2M. Although the fact that the application hinted for a 2M THP in the first place hopefully means that they are storing objects that need to be accessed at similar times. Today it will be swapped in page-by-page then eventually collapsed by khugepaged. But I think those arguments become weaker as the THP size gets smaller. 16K/64K swap-in will likely yield significant performance improvements, and I think Barry has numbers for this? So I guess we have a few options: - Just use the same sysfs interface as for anon allocation, And see if anyone reports performance regressions. Investigate one of the options below if an issue is raised. That's the simplest and cleanest approach, I think. - New sysfs interface as Barry has implemented; nobody really wants more controls if it can be helped. - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts and never got any traction. - Secret option 4: Can we allocate a full-size folio but only choose to swap-in to it bit-by-bit? You would need a way to mark which pages of the folio are valid (e.g. per-page flag) but guess that's a non-starter given the strategy to remove per-page flags? Thanks, Ryan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 8:36 ` Ryan Roberts @ 2024-07-30 8:47 ` David Hildenbrand 0 siblings, 0 replies; 59+ messages in thread From: David Hildenbrand @ 2024-07-30 8:47 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox, Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On 30.07.24 10:36, Ryan Roberts wrote: > On 29/07/2024 04:52, Matthew Wilcox wrote: >> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >>> A user space interface can be implemented to select different swap-in >>> order policies, similar to the mTHP allocation order policy. We need >>> a distinct policy because the performance characteristics of memory >>> allocation differ significantly from those of swap-in. For example, >>> SSD read speeds can be much slower than memory allocation. With >>> policy selection, I believe we can implement mTHP swap-in for >>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >>> the implications of their choices. I think that it's better to start >>> with at least always never. I believe that we will add auto in the >>> future to tune automatically, which can be used as default finally. >> >> I strongly disagree. Use the same sysctl as the other anonymous memory >> allocations. > > I vaguely recall arguing in the past that just because the user has requested 2M > THP that doesn't mean its the right thing to do for performance to swap-in the > whole 2M in one go. That's potentially a pretty huge latency, depending on where > the backend is, and it could be a waste of IO if the application never touches > most of the 2M. Although the fact that the application hinted for a 2M THP in > the first place hopefully means that they are storing objects that need to be > accessed at similar times. Today it will be swapped in page-by-page then > eventually collapsed by khugepaged. > > But I think those arguments become weaker as the THP size gets smaller. 16K/64K > swap-in will likely yield significant performance improvements, and I think > Barry has numbers for this? > > So I guess we have a few options: > > - Just use the same sysfs interface as for anon allocation, And see if anyone > reports performance regressions. Investigate one of the options below if an > issue is raised. That's the simplest and cleanest approach, I think. > > - New sysfs interface as Barry has implemented; nobody really wants more > controls if it can be helped. > > - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts > and never got any traction. > > - Secret option 4: Can we allocate a full-size folio but only choose to swap-in > to it bit-by-bit? You would need a way to mark which pages of the folio are > valid (e.g. per-page flag) but guess that's a non-starter given the strategy to > remove per-page flags? Maybe we could allocate for folios in the swapcache a bitmap to store that information (folio->private). But I am not convinced that is the right thing to do. If we know some basic properties of the backend, can't we automatically make a pretty good decision regarding the folio size to use? E.g., slow disk, avoid 2M ... Avoiding sysctls if possible here would really be preferable... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 59+ messages in thread
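[Editorial note] Ryan's "option 4" and David's folio->private suggestion both boil down to tracking which subpages of a large folio have actually been read back from the backend. The standalone sketch below models that bookkeeping with a plain bitmap; the names, and keeping the bitmap in a separate struct rather than in folio->private, are illustrative assumptions, not a proposed implementation.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NR_PAGES 64		/* enough for a 256KiB folio with 4KiB pages */

/* Model of a large folio swapped in piecemeal: one bit per subpage records
 * whether that subpage's contents have been read from the backend yet. */
struct partial_folio {
	int nr_pages;
	unsigned long valid;	/* bit i set => subpage i is populated */
};

static void swapin_one_subpage(struct partial_folio *f, int i)
{
	/* Real code would issue the read for subpage i here. */
	f->valid |= 1UL << i;
}

static bool subpage_uptodate(const struct partial_folio *f, int i)
{
	return f->valid & (1UL << i);
}

static bool folio_fully_populated(const struct partial_folio *f)
{
	unsigned long all = (f->nr_pages == MAX_NR_PAGES) ? ~0UL
							  : (1UL << f->nr_pages) - 1;
	return (f->valid & all) == all;
}

int main(void)
{
	struct partial_folio f = { .nr_pages = 16, .valid = 0 }; /* 64KiB folio */

	swapin_one_subpage(&f, 3);	/* fault touched subpage 3 first */
	printf("subpage 3 uptodate: %d, whole folio: %d\n",
	       subpage_uptodate(&f, 3), folio_fully_populated(&f));
	return 0;
}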
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox ` (2 preceding siblings ...) 2024-07-30 8:36 ` Ryan Roberts @ 2024-08-05 6:10 ` Huang, Ying 3 siblings, 0 replies; 59+ messages in thread From: Huang, Ying @ 2024-08-05 6:10 UTC (permalink / raw) To: Matthew Wilcox, Christoph Hellwig Cc: Barry Song, akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed Matthew Wilcox <willy@infradead.org> writes: > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >> A user space interface can be implemented to select different swap-in >> order policies, similar to the mTHP allocation order policy. We need >> a distinct policy because the performance characteristics of memory >> allocation differ significantly from those of swap-in. For example, >> SSD read speeds can be much slower than memory allocation. With >> policy selection, I believe we can implement mTHP swap-in for >> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >> the implications of their choices. I think that it's better to start >> with at least always never. I believe that we will add auto in the >> future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I still believe we have some reasons for this tunable. 1. As Ryan pointed out in [1], swap-in with large mTHP orders may cause long latency, which some users might want to avoid. [1] https://lore.kernel.org/lkml/f0c7f061-6284-4fe5-8cbf-93281070895b@arm.com/ 2. We have readahead information available for swap-in, which is unavailable for anonymous page allocation. This enables us to build an automatic swap-in order policy similar to that for page cache order based on readahead. 3. Swap-out/swap-in cycles present an opportunity to identify hot pages. In many use cases, we can utilize mTHP for hot pages and order-0 page for cold pages, especially under memory pressure. When an mTHP has been swapped out, it indicates that it could be a cold page. Converting it to order-0 pages might be a beneficial policy. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 59+ messages in thread
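[Editorial note] Ying's second point, about using readahead information to choose the swap-in order automatically, can be sketched as a simple heuristic: the more recent readahead hits, the larger the order we are willing to try, capped by what the VMA and sysfs configuration already allow. Everything below, including the thresholds, is a hypothetical illustration of such an "auto" mode, not code from this series.

#include <stdio.h>

/* Hypothetical "auto" swap-in order policy: scale the attempted order with
 * recent readahead success, never exceeding what the caller already allows. */
static int auto_swapin_order(unsigned long allowed_orders, int readahead_hits)
{
	int want;

	if (readahead_hits >= 8)
		want = 6;		/* 256KiB worth of 4KiB pages */
	else if (readahead_hits >= 4)
		want = 4;		/* 64KiB */
	else if (readahead_hits >= 2)
		want = 2;		/* 16KiB */
	else
		want = 0;		/* cold: fall back to a single page */

	/* Pick the largest allowed order that does not exceed the target. */
	for (int order = want; order > 0; order--)
		if (allowed_orders & (1UL << order))
			return order;
	return 0;
}

int main(void)
{
	unsigned long allowed = (1UL << 2) | (1UL << 4);	/* 16KiB and 64KiB enabled */

	printf("hits=1  -> order %d\n", auto_swapin_order(allowed, 1));
	printf("hits=5  -> order %d\n", auto_swapin_order(allowed, 5));
	printf("hits=10 -> order %d\n", auto_swapin_order(allowed, 10));
	return 0;
}

This also illustrates Ying's third point in spirit: a low hit count (a likely cold region) degrades gracefully to order-0, while hot, contiguously accessed regions get larger folios.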
* [PATCH v6 0/2] mm: Ignite large folios swap-in support 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song ` (3 preceding siblings ...) 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch From: Barry Song <v-songbaohua@oppo.com> Currently, we support mTHP swapout but not swapin. This means that once mTHP is swapped out, it will come back as small folios when swapped in. This is particularly detrimental for devices like Android, where more than half of the memory is in swap. The lack of mTHP swapin functionality makes mTHP a showstopper in scenarios that heavily rely on swap. This patchset introduces mTHP swap-in support. It starts with synchronous devices similar to zRAM, aiming to benefit as many users as possible with minimal changes. -v6: * remove the swapin control added in v5, per Willy, Christoph; The original reason for adding the swpin_enabled control was primarily to address concerns for slower devices. Currently, since we only support fast sync devices, swap-in size is less of a concern. We’ll gain a clearer understanding of the next steps while more devices begin to support mTHP swap-in. * add nr argument in mem_cgroup_swapin_uncharge_swap() instead of adding new API, Willy; * swapcache_prepare() and swapcache_clear() large folios support is also removed as it has been separated per Baolin's request, right now has been in mm-unstable. * provide more data in changelog. -v5: https://lore.kernel.org/linux-mm/20240726094618.401593-1-21cnbao@gmail.com/ * Add swap-in control policy according to Ying's proposal. Right now only "always" and "never" are supported, later we can extend to "auto"; * Fix the comment regarding zswap_never_enabled() according to Yosry; * Filter out unaligned swp entries earlier; * add mem_cgroup_swapin_uncharge_swap_nr() helper -v4: https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/ Many parts of v3 have been merged into the mm tree with the help on reviewing from Ryan, David, Ying and Chris etc. Thank you very much! This is the final part to allocate large folios and map them. * Use Yosry's zswap_never_enabled(), notice there is a bug. I put the bug fix in this v4 RFC though it should be fixed in Yosry's patch * lots of code improvement (drop large stack, hold ptl etc) according to Yosry's and Ryan's feedback * rebased on top of the latest mm-unstable and utilized some new helpers introduced recently. -v3: https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ * avoid over-writing err in __swap_duplicate_nr, pointed out by Yosry, thanks! 
* fix the issue folio is charged twice for do_swap_page, separating alloc_anon_folio and alloc_swap_folio as they have many differences now on * memcg charing * clearing allocated folio or not -v2: https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/ * lots of code cleanup according to Chris's comments, thanks! * collect Chris's ack tags, thanks! * address David's comment on moving to use folio_add_new_anon_rmap for !folio_test_anon in do_swap_page, thanks! * remove the MADV_PAGEOUT patch from this series as Ryan will intergrate it into swap-out series * Apply Kairui's work of "mm/swap: fix race when skipping swapcache" on large folios swap-in as well * fixed corrupted data(zero-filled data) in two races: zswap and a part of entries are in swapcache while some others are not in by checking SWAP_HAS_CACHE while swapping in a large folio -v1: https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t Barry Song (1): mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Chuanhua Han (1): mm: support large folios swap-in for zRAM-like devices include/linux/memcontrol.h | 5 +- mm/memcontrol.c | 7 +- mm/memory.c | 211 +++++++++++++++++++++++++++++++++---- mm/swap_state.c | 2 +- 4 files changed, 196 insertions(+), 29 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-02 17:29 ` Chris Li 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch From: Barry Song <v-songbaohua@oppo.com> With large folios swap-in, we might need to uncharge multiple entries all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). For the existing two users, just pass nr=1. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 5 +++-- mm/memcontrol.c | 7 ++++--- mm/memory.c | 2 +- mm/swap_state.c | 2 +- 4 files changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..44f7fb7dc0c8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); + +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); void __mem_cgroup_uncharge(struct folio *folio); @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, return 0; } -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b889a7fbf382..5d763c234c44 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, /* * mem_cgroup_swapin_uncharge_swap - uncharge swap slot - * @entry: swap entry for which the page is charged + * @entry: the first swap entry for which the pages are charged + * @nr_pages: number of pages which will be uncharged * * Call this function after successfully adding the charged page to swapcache. * * Note: This function assumes the page for which swap slot is being uncharged * is order 0 page. */ -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) { /* * Cgroup1's unified memory+swap counter has been charged with the @@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) * let's not wait for it. The page already received a * memory+swap charge, drop the swap entry duplicate. 
*/ - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); } } diff --git a/mm/memory.c b/mm/memory.c index 4c8716cb306c..4cf4902db1ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); shadow = get_shadow_from_swap_cache(entry); if (shadow) diff --git a/mm/swap_state.c b/mm/swap_state.c index 293ff1afdca4..1159e3225754 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) goto fail_unlock; - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); if (shadow) workingset_refault(new_folio, shadow); -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song @ 2024-08-02 17:29 ` Chris Li 0 siblings, 0 replies; 59+ messages in thread From: Chris Li @ 2024-08-02 17:29 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch Acked-by: Chris Li <chrisl@kernel.org> Chris On Fri, Aug 2, 2024 at 5:21 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > With large folios swap-in, we might need to uncharge multiple entries > all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). > > For the existing two users, just pass nr=1. > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/memcontrol.h | 5 +++-- > mm/memcontrol.c | 7 ++++--- > mm/memory.c | 2 +- > mm/swap_state.c | 2 +- > 4 files changed, 9 insertions(+), 7 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 1b79760af685..44f7fb7dc0c8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > > int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > + > +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); > > void __mem_cgroup_uncharge(struct folio *folio); > > @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > return 0; > } > > -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr) > { > } > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b889a7fbf382..5d763c234c44 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > /* > * mem_cgroup_swapin_uncharge_swap - uncharge swap slot > - * @entry: swap entry for which the page is charged > + * @entry: the first swap entry for which the pages are charged > + * @nr_pages: number of pages which will be uncharged > * > * Call this function after successfully adding the charged page to swapcache. > * > * Note: This function assumes the page for which swap slot is being uncharged > * is order 0 page. > */ > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) > { > /* > * Cgroup1's unified memory+swap counter has been charged with the > @@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > * let's not wait for it. The page already received a > * memory+swap charge, drop the swap entry duplicate. 
> */ > - mem_cgroup_uncharge_swap(entry, 1); > + mem_cgroup_uncharge_swap(entry, nr_pages); > } > } > > diff --git a/mm/memory.c b/mm/memory.c > index 4c8716cb306c..4cf4902db1ec 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > ret = VM_FAULT_OOM; > goto out_page; > } > - mem_cgroup_swapin_uncharge_swap(entry); > + mem_cgroup_swapin_uncharge_swap(entry, 1); > > shadow = get_shadow_from_swap_cache(entry); > if (shadow) > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 293ff1afdca4..1159e3225754 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) > goto fail_unlock; > > - mem_cgroup_swapin_uncharge_swap(entry); > + mem_cgroup_swapin_uncharge_swap(entry, 1); > > if (shadow) > workingset_refault(new_folio, shadow); > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-03 19:08 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han From: Chuanhua Han <hanchuanhua@oppo.com> Currently, we have mTHP features, but unfortunately, without support for large folio swap-ins, once these large folios are swapped out, they are lost because mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents mTHP from being used on devices like Android that heavily rely on swap. This patch introduces mTHP swap-in support. It starts from sync devices such as zRAM. This is probably the simplest and most common use case, benefiting billions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues. It offers several benefits: 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in. Large folios in the buddy system are also preserved as much as possible, rather than being fragmented due to swap-in. 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT. w/o this patch (Refer to the data from Chris's and Kairui's latest swap allocator optimization while running ./thp_swap_allocator_test w/o "-a" option [1]): ./thp_swap_allocator_test Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% w/ this patch (always 0%): Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback 
percentage: 0.00% Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% ... 3. With both mTHP swap-out and swap-in supported, we offer the option to enable zsmalloc compression/decompression with larger granularity[2]. The upcoming optimization in zsmalloc will significantly increase swap speed and improve compression efficiency. Tested by running 100 iterations of swapping 100MiB of anon memory, the swap speed improved dramatically: time consumption of swapin(ms) time consumption of swapout(ms) lz4 4k 45274 90540 lz4 64k 22942 55667 zstdn 4k 85035 186585 zstdn 64k 46558 118533 The compression ratio also improved, as evaluated with 1 GiB of data: granularity orig_data_size compr_data_size 4KiB-zstd 1048576000 246876055 64KiB-zstd 1048576000 199763892 Without mTHP swap-in, the potential optimizations in zsmalloc cannot be realized. 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor of nr_pages. Swapping in content filled with the same data 0x11, w/o and w/ the patch for five rounds (Since the content is the same, decompression will be very fast. This primarily assesses the impact of reduced page faults): swp in bandwidth(bytes/ms) w/o w/ round1 624152 1127501 round2 631672 1127501 round3 620459 1139756 round4 606113 1139756 round5 624152 1152281 avg 621310 1137359 +83% [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 188 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4cf4902db1ec..07029532469a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +/* + * check a range of PTEs are completely swap entries with + * contiguous swap offsets and the same SWAP_HAS_CACHE. 
+ * ptep must be first one in the range + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + char has_cache; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + si = swp_swap_info(entry); + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; + for (i = 1; i < nr_pages; i++) { + /* + * while allocating a large folio and doing swap_read_folio for the + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte + * doesn't have swapcache. We need to ensure all PTEs have no cache + * as well, otherwise, we might go to swap devices while the content + * is in swapcache + */ + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) + return false; + } + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap-in a THP with nr pages, we require its first swap_offset + * is aligned with nr. This can filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} +#endif + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. + */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. 
*/ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + return folio; + order = next_order(&orders, order); + } + +fallback: +#endif + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); +} + + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry, 1)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare(entry, nr_pages)) { + /* Relax a bit to prevent rapid repeated page faults */ + schedule_timeout_uninterruptible(1); + goto out_page; + } + need_clear_cache = true; + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry, 1); + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large folios + * which are fully exclusive. If we ever get large folios within + * swapcache here, we have to be careful. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
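A small standalone illustration (userspace C, not kernel code, with
made-up numbers) of the alignment filter used by thp_swap_suitable_orders()
in the patch above: an order of nr pages is only kept when the faulting
page index and its swap offset agree modulo nr, so that the folio-aligned
virtual range maps to swap slots starting at an nr-aligned offset.

#include <stdio.h>

int main(void)
{
	unsigned long page_idx = 0x1003; /* vmf->address >> PAGE_SHIFT (hypothetical) */
	unsigned long swp_off  = 0x203;  /* swp_offset(entry) of the faulting PTE (hypothetical) */
	unsigned int nr = 16;            /* 64KiB folio with 4KiB base pages */

	if (page_idx % nr == swp_off % nr) {
		unsigned long idx = page_idx % nr;

		/* Prints 0x200 here: the folio's 16 swap slots start nr-aligned. */
		printf("order kept, first swap offset = 0x%lx\n", swp_off - idx);
	} else {
		printf("order filtered out, try a smaller order\n");
	}
	return 0;
}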
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song @ 2024-08-03 19:08 ` Andrew Morton 2024-08-12 8:26 ` Christoph Hellwig 2024-08-15 9:47 ` Kairui Song 2 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2024-08-03 19:08 UTC (permalink / raw) To: Barry Song Cc: linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han On Sat, 3 Aug 2024 00:20:31 +1200 Barry Song <21cnbao@gmail.com> wrote: > From: Chuanhua Han <hanchuanhua@oppo.com> > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > mTHP from being used on devices like Android that heavily rely on swap. > > This patch introduces mTHP swap-in support. It starts from sync devices such > as zRAM. This is probably the simplest and most common use case, benefiting > billions of Android phones and similar devices with minimal implementation > cost. In this straightforward scenario, large folios are always exclusive, > eliminating the need to handle complex rmap and swapcache issues. > > It offers several benefits: > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after > swap-out and swap-in. Large folios in the buddy system are also > preserved as much as possible, rather than being fragmented due > to swap-in. > > 2. Eliminates fragmentation in swap slots and supports successful > THP_SWPOUT. > > w/o this patch (Refer to the data from Chris's and Kairui's latest > swap allocator optimization while running ./thp_swap_allocator_test > w/o "-a" option [1]): > > ... > > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > ... > > +#endif > + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > +} Generates an unused-variable warning with allnoconfig. Because vma_alloc_folio_noprof() was implemented as a macro instead of an inlined C function. Why do we keep doing this. 
Please check: From: Andrew Morton <akpm@linux-foundation.org> Subject: mm-support-large-folios-swap-in-for-zram-like-devices-fix Date: Sat Aug 3 11:59:00 AM PDT 2024 fix unused var warning mm/memory.c: In function 'alloc_swap_folio': mm/memory.c:4062:32: warning: unused variable 'vma' [-Wunused-variable] 4062 | struct vm_area_struct *vma = vmf->vma; | ^~~ Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/memory.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) --- a/mm/memory.c~mm-support-large-folios-swap-in-for-zram-like-devices-fix +++ a/mm/memory.c @@ -4059,8 +4059,8 @@ static inline bool can_swapin_thp(struct static struct folio *alloc_swap_folio(struct vm_fault *vmf) { - struct vm_area_struct *vma = vmf->vma; #ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct vm_area_struct *vma = vmf->vma; unsigned long orders; struct folio *folio; unsigned long addr; @@ -4128,7 +4128,8 @@ static struct folio *alloc_swap_folio(st fallback: #endif - return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, + vmf->address, false); } _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 2024-08-03 19:08 ` Andrew Morton @ 2024-08-12 8:26 ` Christoph Hellwig 2024-08-12 8:53 ` Barry Song 2024-08-15 9:47 ` Kairui Song 2 siblings, 1 reply; 59+ messages in thread From: Christoph Hellwig @ 2024-08-12 8:26 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han The subject feels wrong. Nothing particular about zram, it is all about SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that. On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote: > From: Chuanhua Han <hanchuanhua@oppo.com> > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents Please wrap your commit logs after 73 characters to make them readable. > +/* > + * check a range of PTEs are completely swap entries with > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > + * ptep must be first one in the range > + */ Please capitalize the first character of block comments, make them full sentences and use up all 80 characters. > + for (i = 1; i < nr_pages; i++) { > + /* > + * while allocating a large folio and doing swap_read_folio for the And also do not go over 80 characters for them, which renders them really hard to read. > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE Please stub out the entire function. ^ permalink raw reply [flat|nested] 59+ messages in thread
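As a concrete illustration of the block-comment feedback, one possible
rewording of the first comment quoted above (capitalized, full sentences,
kept within 80 columns); the wording is only a suggestion, not taken from
the posted series.

/*
 * Check that a range of PTEs contains only swap entries with contiguous
 * swap offsets and the same SWAP_HAS_CACHE state. ptep must point to the
 * first PTE in the range.
 */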
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-12 8:26 ` Christoph Hellwig @ 2024-08-12 8:53 ` Barry Song 2024-08-12 11:38 ` Christoph Hellwig 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-08-12 8:53 UTC (permalink / raw) To: Christoph Hellwig Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, Chuanhua Han On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote: > > The subject feels wrong. Nothing particular about zram, it is all about > SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that. right. This is absolutely for sync io, zram is the most typical one which is widely used in Android and embedded systems. Others could be nvdimm, brd. > > On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> > > > > Currently, we have mTHP features, but unfortunately, without support for large > > folio swap-ins, once these large folios are swapped out, they are lost because > > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > > Please wrap your commit logs after 73 characters to make them readable. ack. > > > +/* > > + * check a range of PTEs are completely swap entries with > > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > > + * ptep must be first one in the range > > + */ > > Please capitalize the first character of block comments, make them full > sentences and use up all 80 characters. ack. > > > + for (i = 1; i < nr_pages; i++) { > > + /* > > + * while allocating a large folio and doing swap_read_folio for the > > And also do not go over 80 characters for them, which renders them > really hard to read. > > > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > +{ > > + struct vm_area_struct *vma = vmf->vma; > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > Please stub out the entire function. I assume you mean the below? #ifdef CONFIG_TRANSPARENT_HUGEPAGE static struct folio *alloc_swap_folio(struct vm_fault *vmf) { } #else static struct folio *alloc_swap_folio(struct vm_fault *vmf) { } #endif If so, this is fine to me. the only reason I am using the current pattern is that i am trying to follow the same pattern with static struct folio *alloc_anon_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; #ifdef CONFIG_TRANSPARENT_HUGEPAGE #endif ... } Likely we also want to change that one? Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-12  8:53       ` Barry Song
@ 2024-08-12 11:38         ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2024-08-12 11:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Christoph Hellwig, akpm, linux-mm, baolin.wang, chrisl, david,
	hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan,
	nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	Chuanhua Han

On Mon, Aug 12, 2024 at 08:53:06PM +1200, Barry Song wrote:
> On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> I assume you mean the below?
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #else
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #endif

Yes.

> If so, this is fine to me. the only reason I am using the current
> pattern is that i am trying to follow the same pattern with
>
> static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #endif
> ...
> }
>
> Likely we also want to change that one?

It would be nice to fix that as well, probably not in this series,
though.

^ permalink raw reply	[flat|nested] 59+ messages in thread
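For concreteness, a minimal sketch of the stubbed-out layout being agreed
on here, assuming the helpers used by the posted patch; the THP order
selection is elided, so this is only an illustration of the #ifdef split,
not the merged code. It also keeps vma out of the !THP build, which is
the unused-variable warning Andrew's fix-up above addresses.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* ... uffd/zswap checks, order selection, large-folio allocation ... */

	/* No suitable large order: fall back to an order-0 folio. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
}
#else
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma,
			       vmf->address, false);
}
#endif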
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 2024-08-03 19:08 ` Andrew Morton 2024-08-12 8:26 ` Christoph Hellwig @ 2024-08-15 9:47 ` Kairui Song 2024-08-15 13:27 ` Kefeng Wang 2 siblings, 1 reply; 59+ messages in thread From: Kairui Song @ 2024-08-15 9:47 UTC (permalink / raw) To: Chuanhua Han, Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> Hi Chuanhua, > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > mTHP from being used on devices like Android that heavily rely on swap. > > This patch introduces mTHP swap-in support. It starts from sync devices such > as zRAM. This is probably the simplest and most common use case, benefiting > billions of Android phones and similar devices with minimal implementation > cost. In this straightforward scenario, large folios are always exclusive, > eliminating the need to handle complex rmap and swapcache issues. > > It offers several benefits: > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after > swap-out and swap-in. Large folios in the buddy system are also > preserved as much as possible, rather than being fragmented due > to swap-in. > > 2. Eliminates fragmentation in swap slots and supports successful > THP_SWPOUT. 
> > w/o this patch (Refer to the data from Chris's and Kairui's latest > swap allocator optimization while running ./thp_swap_allocator_test > w/o "-a" option [1]): > > ./thp_swap_allocator_test > Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% > Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% > Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% > Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% > Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% > Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% > Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% > Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% > Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% > Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% > Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% > Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% > Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% > Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% > Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% > Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% > > w/ this patch (always 0%): > Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > ... > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > zsmalloc compression/decompression with larger granularity[2]. The upcoming > optimization in zsmalloc will significantly increase swap speed and improve > compression efficiency. Tested by running 100 iterations of swapping 100MiB > of anon memory, the swap speed improved dramatically: > time consumption of swapin(ms) time consumption of swapout(ms) > lz4 4k 45274 90540 > lz4 64k 22942 55667 > zstdn 4k 85035 186585 > zstdn 64k 46558 118533 > > The compression ratio also improved, as evaluated with 1 GiB of data: > granularity orig_data_size compr_data_size > 4KiB-zstd 1048576000 246876055 > 64KiB-zstd 1048576000 199763892 > > Without mTHP swap-in, the potential optimizations in zsmalloc cannot be > realized. > > 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor > of nr_pages. 
Swapping in content filled with the same data 0x11, w/o > and w/ the patch for five rounds (Since the content is the same, > decompression will be very fast. This primarily assesses the impact of > reduced page faults): > > swp in bandwidth(bytes/ms) w/o w/ > round1 624152 1127501 > round2 631672 1127501 > round3 620459 1139756 > round4 606113 1139756 > round5 624152 1152281 > avg 621310 1137359 +83% > > [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> > Co-developed-by: Barry Song <v-songbaohua@oppo.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 188 insertions(+), 23 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 4cf4902db1ec..07029532469a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > +/* > + * check a range of PTEs are completely swap entries with > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > + * ptep must be first one in the range > + */ > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > +{ > + struct swap_info_struct *si; > + unsigned long addr; > + swp_entry_t entry; > + pgoff_t offset; > + char has_cache; > + int idx, i; > + pte_t pte; > + > + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); > + idx = (vmf->address - addr) / PAGE_SIZE; > + pte = ptep_get(ptep); > + > + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) > + return false; > + entry = pte_to_swp_entry(pte); > + offset = swp_offset(entry); > + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) > + return false; > + > + si = swp_swap_info(entry); > + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; > + for (i = 1; i < nr_pages; i++) { > + /* > + * while allocating a large folio and doing swap_read_folio for the > + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte > + * doesn't have swapcache. We need to ensure all PTEs have no cache > + * as well, otherwise, we might go to swap devices while the content > + * is in swapcache > + */ > + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) > + return false; > + } > + > + return true; > +} > + > +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > + unsigned long addr, unsigned long orders) > +{ > + int order, nr; > + > + order = highest_order(orders); > + > + /* > + * To swap-in a THP with nr pages, we require its first swap_offset > + * is aligned with nr. This can filter out most invalid entries. 
> + */ > + while (orders) { > + nr = 1 << order; > + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) > + break; > + order = next_order(&orders, order); > + } > + > + return orders; > +} > +#else > +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > +{ > + return false; > +} > +#endif > + > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + unsigned long orders; > + struct folio *folio; > + unsigned long addr; > + swp_entry_t entry; > + spinlock_t *ptl; > + pte_t *pte; > + gfp_t gfp; > + int order; > + > + /* > + * If uffd is active for the vma we need per-page fault fidelity to > + * maintain the uffd semantics. > + */ > + if (unlikely(userfaultfd_armed(vma))) > + goto fallback; > + > + /* > + * A large swapped out folio could be partially or fully in zswap. We > + * lack handling for such cases, so fallback to swapping in order-0 > + * folio. > + */ > + if (!zswap_never_enabled()) > + goto fallback; > + > + entry = pte_to_swp_entry(vmf->orig_pte); > + /* > + * Get a list of all the (large) orders below PMD_ORDER that are enabled > + * and suitable for swapping THP. > + */ > + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > + > + if (!orders) > + goto fallback; > + > + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > + if (unlikely(!pte)) > + goto fallback; > + > + /* > + * For do_swap_page, find the highest order where the aligned range is > + * completely swap entries with contiguous swap offsets. > + */ > + order = highest_order(orders); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > + break; > + order = next_order(&orders, order); > + } > + > + pte_unmap_unlock(pte, ptl); > + > + /* Try allocating the highest of the remaining orders. */ > + gfp = vma_thp_gfp_mask(vma); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + return folio; > + order = next_order(&orders, order); > + } > + > +fallback: > +#endif > + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > +} > + > + > /* > * We enter with non-exclusive mmap_lock (to exclude vma changes, > * but allow concurrent faults), and pte mapped but not yet locked. > @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > - /* > - * Prevent parallel swapin from proceeding with > - * the cache flag. Otherwise, another thread may > - * finish swapin first, free the entry, and swapout > - * reusing the same entry. It's undetectable as > - * pte_same() returns true due to entry reuse. 
> - */ > - if (swapcache_prepare(entry, 1)) { > - /* Relax a bit to prevent rapid repeated page faults */ > - schedule_timeout_uninterruptible(1); > - goto out; > - } > - need_clear_cache = true; > - > /* skip swapcache */ > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > - vma, vmf->address, false); > + folio = alloc_swap_folio(vmf); > page = &folio->page; > if (folio) { > __folio_set_locked(folio); > __folio_set_swapbacked(folio); > > + nr_pages = folio_nr_pages(folio); > + if (folio_test_large(folio)) > + entry.val = ALIGN_DOWN(entry.val, nr_pages); > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry, nr_pages)) { > + /* Relax a bit to prevent rapid repeated page faults */ > + schedule_timeout_uninterruptible(1); > + goto out_page; > + } > + need_clear_cache = true; > + > if (mem_cgroup_swapin_charge_folio(folio, > vma->vm_mm, GFP_KERNEL, > entry)) { > ret = VM_FAULT_OOM; > goto out_page; > } After your patch, with build kernel test, I'm seeing kernel log spamming like this: [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF ............ And heavy performance loss with workloads limited by memcg, mTHP enabled. After some debugging, the problematic part is the mem_cgroup_swapin_charge_folio call above. When under pressure, cgroup charge fails easily for mTHP. One 64k swapin will require a much more aggressive reclaim to success. If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is gone and mTHP swapin should have a much higher swapin success rate. But this might not be the right way. For this particular issue, maybe you can change the charge order, try charging first, if successful, use mTHP. if failed, fallback to 4k? 
> - mem_cgroup_swapin_uncharge_swap(entry, 1); > + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); > > shadow = get_shadow_from_swap_cache(entry); > if (shadow) > @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > goto out_nomap; > } > > + /* allocated large folios for SWP_SYNCHRONOUS_IO */ > + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { > + unsigned long nr = folio_nr_pages(folio); > + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); > + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; > + pte_t *folio_ptep = vmf->pte - idx; > + > + if (!can_swapin_thp(vmf, folio_ptep, nr)) > + goto out_nomap; > + > + page_idx = idx; > + address = folio_start; > + ptep = folio_ptep; > + goto check_folio; > + } > + > nr_pages = 1; > page_idx = 0; > address = vmf->address; > @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_add_lru_vma(folio, vma); > } else if (!folio_test_anon(folio)) { > /* > - * We currently only expect small !anon folios, which are either > - * fully exclusive or fully shared. If we ever get large folios > - * here, we have to be careful. > + * We currently only expect small !anon folios which are either > + * fully exclusive or fully shared, or new allocated large folios > + * which are fully exclusive. If we ever get large folios within > + * swapcache here, we have to be careful. > */ > - VM_WARN_ON_ONCE(folio_test_large(folio)); > + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); > VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); > folio_add_new_anon_rmap(folio, vma, address, rmap_flags); > } else { > @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > out: > /* Clear the swap cache pin for direct swapin after PTL unlock */ > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_put(swapcache); > } > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > -- > 2.34.1 > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 9:47 ` Kairui Song @ 2024-08-15 13:27 ` Kefeng Wang 2024-08-15 23:06 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Kefeng Wang @ 2024-08-15 13:27 UTC (permalink / raw) To: Kairui Song, Chuanhua Han, Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch On 2024/8/15 17:47, Kairui Song wrote: > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: >> >> From: Chuanhua Han <hanchuanhua@oppo.com> > > Hi Chuanhua, > >> ... >> + >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) >> +{ >> + struct vm_area_struct *vma = vmf->vma; >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + unsigned long orders; >> + struct folio *folio; >> + unsigned long addr; >> + swp_entry_t entry; >> + spinlock_t *ptl; >> + pte_t *pte; >> + gfp_t gfp; >> + int order; >> + >> + /* >> + * If uffd is active for the vma we need per-page fault fidelity to >> + * maintain the uffd semantics. >> + */ >> + if (unlikely(userfaultfd_armed(vma))) >> + goto fallback; >> + >> + /* >> + * A large swapped out folio could be partially or fully in zswap. We >> + * lack handling for such cases, so fallback to swapping in order-0 >> + * folio. >> + */ >> + if (!zswap_never_enabled()) >> + goto fallback; >> + >> + entry = pte_to_swp_entry(vmf->orig_pte); >> + /* >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled >> + * and suitable for swapping THP. >> + */ >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); >> + >> + if (!orders) >> + goto fallback; >> + >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); >> + if (unlikely(!pte)) >> + goto fallback; >> + >> + /* >> + * For do_swap_page, find the highest order where the aligned range is >> + * completely swap entries with contiguous swap offsets. >> + */ >> + order = highest_order(orders); >> + while (orders) { >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) >> + break; >> + order = next_order(&orders, order); >> + } >> + >> + pte_unmap_unlock(pte, ptl); >> + >> + /* Try allocating the highest of the remaining orders. */ >> + gfp = vma_thp_gfp_mask(vma); >> + while (orders) { >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + return folio; >> + order = next_order(&orders, order); >> + } >> + >> +fallback: >> +#endif >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); >> +} >> + >> + >> /* >> * We enter with non-exclusive mmap_lock (to exclude vma changes, >> * but allow concurrent faults), and pte mapped but not yet locked. >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> if (!folio) { >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >> __swap_count(entry) == 1) { >> - /* >> - * Prevent parallel swapin from proceeding with >> - * the cache flag. Otherwise, another thread may >> - * finish swapin first, free the entry, and swapout >> - * reusing the same entry. 
It's undetectable as >> - * pte_same() returns true due to entry reuse. >> - */ >> - if (swapcache_prepare(entry, 1)) { >> - /* Relax a bit to prevent rapid repeated page faults */ >> - schedule_timeout_uninterruptible(1); >> - goto out; >> - } >> - need_clear_cache = true; >> - >> /* skip swapcache */ >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, >> - vma, vmf->address, false); >> + folio = alloc_swap_folio(vmf); >> page = &folio->page; >> if (folio) { >> __folio_set_locked(folio); >> __folio_set_swapbacked(folio); >> >> + nr_pages = folio_nr_pages(folio); >> + if (folio_test_large(folio)) >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); >> + /* >> + * Prevent parallel swapin from proceeding with >> + * the cache flag. Otherwise, another thread may >> + * finish swapin first, free the entry, and swapout >> + * reusing the same entry. It's undetectable as >> + * pte_same() returns true due to entry reuse. >> + */ >> + if (swapcache_prepare(entry, nr_pages)) { >> + /* Relax a bit to prevent rapid repeated page faults */ >> + schedule_timeout_uninterruptible(1); >> + goto out_page; >> + } >> + need_clear_cache = true; >> + >> if (mem_cgroup_swapin_charge_folio(folio, >> vma->vm_mm, GFP_KERNEL, >> entry)) { >> ret = VM_FAULT_OOM; >> goto out_page; >> } > > After your patch, with build kernel test, I'm seeing kernel log > spamming like this: > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > ............ > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > After some debugging, the problematic part is the > mem_cgroup_swapin_charge_folio call above. > When under pressure, cgroup charge fails easily for mTHP. One 64k > swapin will require a much more aggressive reclaim to success. > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > gone and mTHP swapin should have a much higher swapin success rate. > But this might not be the right way. > > For this particular issue, maybe you can change the charge order, try > charging first, if successful, use mTHP. if failed, fallback to 4k? 
This is what we did in alloc_anon_folio(), see 085ff35e7636 ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), 1) fallback earlier 2) using same GFP flags for allocation and charge but it seems that there is a little complicated for swapin charge > >> - mem_cgroup_swapin_uncharge_swap(entry, 1); >> + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); >> >> shadow = get_shadow_from_swap_cache(entry); >> if (shadow) >> @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> goto out_nomap; >> } >> >> + /* allocated large folios for SWP_SYNCHRONOUS_IO */ >> + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { >> + unsigned long nr = folio_nr_pages(folio); >> + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); >> + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; >> + pte_t *folio_ptep = vmf->pte - idx; >> + >> + if (!can_swapin_thp(vmf, folio_ptep, nr)) >> + goto out_nomap; >> + >> + page_idx = idx; >> + address = folio_start; >> + ptep = folio_ptep; >> + goto check_folio; >> + } >> + >> nr_pages = 1; >> page_idx = 0; >> address = vmf->address; >> @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> folio_add_lru_vma(folio, vma); >> } else if (!folio_test_anon(folio)) { >> /* >> - * We currently only expect small !anon folios, which are either >> - * fully exclusive or fully shared. If we ever get large folios >> - * here, we have to be careful. >> + * We currently only expect small !anon folios which are either >> + * fully exclusive or fully shared, or new allocated large folios >> + * which are fully exclusive. If we ever get large folios within >> + * swapcache here, we have to be careful. >> */ >> - VM_WARN_ON_ONCE(folio_test_large(folio)); >> + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); >> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); >> folio_add_new_anon_rmap(folio, vma, address, rmap_flags); >> } else { >> @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> out: >> /* Clear the swap cache pin for direct swapin after PTL unlock */ >> if (need_clear_cache) >> - swapcache_clear(si, entry, 1); >> + swapcache_clear(si, entry, nr_pages); >> if (si) >> put_swap_device(si); >> return ret; >> @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> folio_put(swapcache); >> } >> if (need_clear_cache) >> - swapcache_clear(si, entry, 1); >> + swapcache_clear(si, entry, nr_pages); >> if (si) >> put_swap_device(si); >> return ret; >> -- >> 2.34.1 >> >> > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 13:27 ` Kefeng Wang @ 2024-08-15 23:06 ` Barry Song 2024-08-16 16:50 ` Kairui Song 2024-08-16 21:16 ` Matthew Wilcox 0 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-08-15 23:06 UTC (permalink / raw) To: wangkefeng.wang Cc: akpm, baolin.wang, chrisl, david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > On 2024/8/15 17:47, Kairui Song wrote: > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > >> > >> From: Chuanhua Han <hanchuanhua@oppo.com> > > > > Hi Chuanhua, > > > >> > ... > > >> + > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > >> +{ > >> + struct vm_area_struct *vma = vmf->vma; > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >> + unsigned long orders; > >> + struct folio *folio; > >> + unsigned long addr; > >> + swp_entry_t entry; > >> + spinlock_t *ptl; > >> + pte_t *pte; > >> + gfp_t gfp; > >> + int order; > >> + > >> + /* > >> + * If uffd is active for the vma we need per-page fault fidelity to > >> + * maintain the uffd semantics. > >> + */ > >> + if (unlikely(userfaultfd_armed(vma))) > >> + goto fallback; > >> + > >> + /* > >> + * A large swapped out folio could be partially or fully in zswap. We > >> + * lack handling for such cases, so fallback to swapping in order-0 > >> + * folio. > >> + */ > >> + if (!zswap_never_enabled()) > >> + goto fallback; > >> + > >> + entry = pte_to_swp_entry(vmf->orig_pte); > >> + /* > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled > >> + * and suitable for swapping THP. > >> + */ > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > >> + > >> + if (!orders) > >> + goto fallback; > >> + > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > >> + if (unlikely(!pte)) > >> + goto fallback; > >> + > >> + /* > >> + * For do_swap_page, find the highest order where the aligned range is > >> + * completely swap entries with contiguous swap offsets. > >> + */ > >> + order = highest_order(orders); > >> + while (orders) { > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > >> + break; > >> + order = next_order(&orders, order); > >> + } > >> + > >> + pte_unmap_unlock(pte, ptl); > >> + > >> + /* Try allocating the highest of the remaining orders. */ > >> + gfp = vma_thp_gfp_mask(vma); > >> + while (orders) { > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > >> + if (folio) > >> + return folio; > >> + order = next_order(&orders, order); > >> + } > >> + > >> +fallback: > >> +#endif > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > >> +} > >> + > >> + > >> /* > >> * We enter with non-exclusive mmap_lock (to exclude vma changes, > >> * but allow concurrent faults), and pte mapped but not yet locked. 
> >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> if (!folio) { > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > >> __swap_count(entry) == 1) { > >> - /* > >> - * Prevent parallel swapin from proceeding with > >> - * the cache flag. Otherwise, another thread may > >> - * finish swapin first, free the entry, and swapout > >> - * reusing the same entry. It's undetectable as > >> - * pte_same() returns true due to entry reuse. > >> - */ > >> - if (swapcache_prepare(entry, 1)) { > >> - /* Relax a bit to prevent rapid repeated page faults */ > >> - schedule_timeout_uninterruptible(1); > >> - goto out; > >> - } > >> - need_clear_cache = true; > >> - > >> /* skip swapcache */ > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > >> - vma, vmf->address, false); > >> + folio = alloc_swap_folio(vmf); > >> page = &folio->page; > >> if (folio) { > >> __folio_set_locked(folio); > >> __folio_set_swapbacked(folio); > >> > >> + nr_pages = folio_nr_pages(folio); > >> + if (folio_test_large(folio)) > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); > >> + /* > >> + * Prevent parallel swapin from proceeding with > >> + * the cache flag. Otherwise, another thread may > >> + * finish swapin first, free the entry, and swapout > >> + * reusing the same entry. It's undetectable as > >> + * pte_same() returns true due to entry reuse. > >> + */ > >> + if (swapcache_prepare(entry, nr_pages)) { > >> + /* Relax a bit to prevent rapid repeated page faults */ > >> + schedule_timeout_uninterruptible(1); > >> + goto out_page; > >> + } > >> + need_clear_cache = true; > >> + > >> if (mem_cgroup_swapin_charge_folio(folio, > >> vma->vm_mm, GFP_KERNEL, > >> entry)) { > >> ret = VM_FAULT_OOM; > >> goto out_page; > >> } > > > > After your patch, with build kernel test, I'm seeing kernel log > > spamming like this: > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > ............ > > > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > > > After some debugging, the problematic part is the > > mem_cgroup_swapin_charge_folio call above. > > When under pressure, cgroup charge fails easily for mTHP. One 64k > > swapin will require a much more aggressive reclaim to success. > > > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > > gone and mTHP swapin should have a much higher swapin success rate. > > But this might not be the right way. > > > > For this particular issue, maybe you can change the charge order, try > > charging first, if successful, use mTHP. if failed, fallback to 4k? > > This is what we did in alloc_anon_folio(), see 085ff35e7636 > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), > 1) fallback earlier > 2) using same GFP flags for allocation and charge > > but it seems that there is a little complicated for swapin charge Kefeng, thanks! 
I guess we can continue using the same approach and it's not too
complicated.

Kairui, sorry for the trouble and thanks for the report! Could you
check if the solution below resolves the issue? On phones, we don't
encounter the scenarios you're facing.

From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Fri, 16 Aug 2024 10:51:48 +1200
Subject: [PATCH] mm: fallback to next_order if charging mTHP fails

When memcg approaches its limit, charging mTHP becomes difficult.
At this point, when the charge fails, we fall back to the next order
to avoid repeatedly retrying larger orders.

Reported-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ed3603aaf31..6cba28ef91e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
-		if (folio)
-			return folio;
+		if (folio) {
+			if (!mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, gfp, entry))
+				return folio;
+			folio_put(folio);
+		}
 		order = next_order(&orders, order);
 	}
 
@@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		}
 		need_clear_cache = true;
 
-		if (mem_cgroup_swapin_charge_folio(folio,
+		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
 					vma->vm_mm, GFP_KERNEL,
 					entry)) {
 			ret = VM_FAULT_OOM;
-- 
2.34.1

Thanks
Barry
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 23:06 ` Barry Song @ 2024-08-16 16:50 ` Kairui Song 2024-08-16 20:34 ` Andrew Morton 2024-08-16 21:16 ` Matthew Wilcox 1 sibling, 1 reply; 59+ messages in thread From: Kairui Song @ 2024-08-16 16:50 UTC (permalink / raw) To: Barry Song Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@gmail.com> wrote: > > On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > > > > > On 2024/8/15 17:47, Kairui Song wrote: > > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > > >> > > >> From: Chuanhua Han <hanchuanhua@oppo.com> > > > > > > Hi Chuanhua, > > > > > >> > > ... > > > > >> + > > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > >> +{ > > >> + struct vm_area_struct *vma = vmf->vma; > > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > >> + unsigned long orders; > > >> + struct folio *folio; > > >> + unsigned long addr; > > >> + swp_entry_t entry; > > >> + spinlock_t *ptl; > > >> + pte_t *pte; > > >> + gfp_t gfp; > > >> + int order; > > >> + > > >> + /* > > >> + * If uffd is active for the vma we need per-page fault fidelity to > > >> + * maintain the uffd semantics. > > >> + */ > > >> + if (unlikely(userfaultfd_armed(vma))) > > >> + goto fallback; > > >> + > > >> + /* > > >> + * A large swapped out folio could be partially or fully in zswap. We > > >> + * lack handling for such cases, so fallback to swapping in order-0 > > >> + * folio. > > >> + */ > > >> + if (!zswap_never_enabled()) > > >> + goto fallback; > > >> + > > >> + entry = pte_to_swp_entry(vmf->orig_pte); > > >> + /* > > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled > > >> + * and suitable for swapping THP. > > >> + */ > > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > > >> + > > >> + if (!orders) > > >> + goto fallback; > > >> + > > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > > >> + if (unlikely(!pte)) > > >> + goto fallback; > > >> + > > >> + /* > > >> + * For do_swap_page, find the highest order where the aligned range is > > >> + * completely swap entries with contiguous swap offsets. > > >> + */ > > >> + order = highest_order(orders); > > >> + while (orders) { > > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > > >> + break; > > >> + order = next_order(&orders, order); > > >> + } > > >> + > > >> + pte_unmap_unlock(pte, ptl); > > >> + > > >> + /* Try allocating the highest of the remaining orders. 
*/ > > >> + gfp = vma_thp_gfp_mask(vma); > > >> + while (orders) { > > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > > >> + if (folio) > > >> + return folio; > > >> + order = next_order(&orders, order); > > >> + } > > >> + > > >> +fallback: > > >> +#endif > > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > > >> +} > > >> + > > >> + > > >> /* > > >> * We enter with non-exclusive mmap_lock (to exclude vma changes, > > >> * but allow concurrent faults), and pte mapped but not yet locked. > > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > >> if (!folio) { > > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > >> __swap_count(entry) == 1) { > > >> - /* > > >> - * Prevent parallel swapin from proceeding with > > >> - * the cache flag. Otherwise, another thread may > > >> - * finish swapin first, free the entry, and swapout > > >> - * reusing the same entry. It's undetectable as > > >> - * pte_same() returns true due to entry reuse. > > >> - */ > > >> - if (swapcache_prepare(entry, 1)) { > > >> - /* Relax a bit to prevent rapid repeated page faults */ > > >> - schedule_timeout_uninterruptible(1); > > >> - goto out; > > >> - } > > >> - need_clear_cache = true; > > >> - > > >> /* skip swapcache */ > > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > >> - vma, vmf->address, false); > > >> + folio = alloc_swap_folio(vmf); > > >> page = &folio->page; > > >> if (folio) { > > >> __folio_set_locked(folio); > > >> __folio_set_swapbacked(folio); > > >> > > >> + nr_pages = folio_nr_pages(folio); > > >> + if (folio_test_large(folio)) > > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); > > >> + /* > > >> + * Prevent parallel swapin from proceeding with > > >> + * the cache flag. Otherwise, another thread may > > >> + * finish swapin first, free the entry, and swapout > > >> + * reusing the same entry. It's undetectable as > > >> + * pte_same() returns true due to entry reuse. > > >> + */ > > >> + if (swapcache_prepare(entry, nr_pages)) { > > >> + /* Relax a bit to prevent rapid repeated page faults */ > > >> + schedule_timeout_uninterruptible(1); > > >> + goto out_page; > > >> + } > > >> + need_clear_cache = true; > > >> + > > >> if (mem_cgroup_swapin_charge_folio(folio, > > >> vma->vm_mm, GFP_KERNEL, > > >> entry)) { > > >> ret = VM_FAULT_OOM; > > >> goto out_page; > > >> } > > > > > > After your patch, with build kernel test, I'm seeing kernel log > > > spamming like this: > > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > ............ > > > > > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > > > > > After some debugging, the problematic part is the > > > mem_cgroup_swapin_charge_folio call above. > > > When under pressure, cgroup charge fails easily for mTHP. 
One 64k > > > swapin will require a much more aggressive reclaim to success. > > > > > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > > > gone and mTHP swapin should have a much higher swapin success rate. > > > But this might not be the right way. > > > > > > For this particular issue, maybe you can change the charge order, try > > > charging first, if successful, use mTHP. if failed, fallback to 4k? > > > > This is what we did in alloc_anon_folio(), see 085ff35e7636 > > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), > > 1) fallback earlier > > 2) using same GFP flags for allocation and charge > > > > but it seems that there is a little complicated for swapin charge > > Kefeng, thanks! I guess we can continue using the same approach and > it's not too complicated. > > Kairui, sorry for the trouble and thanks for the report! could you > check if the solution below resolves the issue? On phones, we don't > encounter the scenarios you’re facing. > > From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001 > From: Barry Song <v-songbaohua@oppo.com> > Date: Fri, 16 Aug 2024 10:51:48 +1200 > Subject: [PATCH] mm: fallback to next_order if charing mTHP fails > > When memcg approaches its limit, charging mTHP becomes difficult. > At this point, when the charge fails, we fallback to the next order > to avoid repeatedly retrying larger orders. > > Reported-by: Kairui Song <ryncsn@gmail.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 0ed3603aaf31..6cba28ef91e7 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > while (orders) { > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > folio = vma_alloc_folio(gfp, order, vma, addr, true); > - if (folio) > - return folio; > + if (folio) { > + if (!mem_cgroup_swapin_charge_folio(folio, > + vma->vm_mm, gfp, entry)) > + return folio; > + folio_put(folio); > + } > order = next_order(&orders, order); > } > > @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > } > need_clear_cache = true; > > - if (mem_cgroup_swapin_charge_folio(folio, > + if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio, > vma->vm_mm, GFP_KERNEL, > entry)) { > ret = VM_FAULT_OOM; > -- > 2.34.1 > Hi Barry After the fix the spamming log is gone, thanks for the fix. > > Thanks > Barry > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 16:50 ` Kairui Song
@ 2024-08-16 20:34 ` Andrew Morton
  2024-08-27  3:41 ` Chuanhua Han
  0 siblings, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2024-08-16 20:34 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, wangkefeng.wang, baolin.wang, chrisl, david,
    hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel,
    linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang,
    ying.huang, yosryahmed

On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> > --
> > 2.34.1
> >
>
> Hi Barry
>
> After the fix the spamming log is gone, thanks for the fix.
>

Thanks, I'll drop the v6 series.
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 20:34 ` Andrew Morton
@ 2024-08-27  3:41 ` Chuanhua Han
  0 siblings, 0 replies; 59+ messages in thread
From: Chuanhua Han @ 2024-08-27 3:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Barry Song, wangkefeng.wang, baolin.wang, chrisl,
    david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel,
    linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang,
    ying.huang, yosryahmed

On Sat, Aug 17, 2024 at 04:35, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > > --
> > > 2.34.1
> > >
> >
> > Hi Barry
> >
> > After the fix the spamming log is gone, thanks for the fix.
> >
>
> Thanks, I'll drop the v6 series.

Hi, Andrew

Can you please queue v7 for testing:
https://lore.kernel.org/linux-mm/20240821074541.516249-1-hanchuanhua@oppo.com/

V7 has addressed all comments regarding the changelog, the subject and
the order-0 charge from Christoph, Kairui and Willy.

>
--
Thanks,
Chuanhua
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-15 23:06 ` Barry Song
  2024-08-16 16:50 ` Kairui Song
@ 2024-08-16 21:16 ` Matthew Wilcox
  2024-08-16 21:39 ` Barry Song
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-08-16 21:16 UTC (permalink / raw)
To: Barry Song
Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua,
    hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko,
    minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
    shy828301, surenb, v-songbaohua, xiang, ying.huang, yosryahmed

On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fallback to the next order
> to avoid repeatedly retrying larger orders.

Why do you always find the ugliest possible solution to a problem?

> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> 		}
> 		need_clear_cache = true;
>
> -		if (mem_cgroup_swapin_charge_folio(folio,
> +		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> 					vma->vm_mm, GFP_KERNEL,
> 					entry)) {
> 			ret = VM_FAULT_OOM;

Just make alloc_swap_folio() always charge the folio, even for order-0.

And you'll have to uncharge it in the swapcache_prepare() failure case.
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 21:16 ` Matthew Wilcox
@ 2024-08-16 21:39 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-08-16 21:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua,
    hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko,
    minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
    shy828301, surenb, v-songbaohua, xiang, ying.huang, yosryahmed

On Sat, Aug 17, 2024 at 9:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> > When memcg approaches its limit, charging mTHP becomes difficult.
> > At this point, when the charge fails, we fallback to the next order
> > to avoid repeatedly retrying larger orders.
>
> Why do you always find the ugliest possible solution to a problem?
>

I had definitely thought about charging order-0 as well in
alloc_swap_folio() when sending this quick fix, which was mainly for
quick verification that it can fix the problem. v7 will definitely
charge order-0 in alloc_swap_folio().

> > @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > 		}
> > 		need_clear_cache = true;
> >
> > -		if (mem_cgroup_swapin_charge_folio(folio,
> > +		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> > 					vma->vm_mm, GFP_KERNEL,
> > 					entry)) {
> > 			ret = VM_FAULT_OOM;
>
> Just make alloc_swap_folio() always charge the folio, even for order-0.
>
> And you'll have to uncharge it in the swapcache_prepare() failure case.

I suppose this is done by folio_put() automatically.

Thanks
Barry
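[Editor's note: a hedged sketch of the direction agreed above for v7 — alloc_swap_folio() charges every folio it returns, including the order-0 fallback, so do_swap_page() no longer needs its own mem_cgroup_swapin_charge_folio() call. The structure follows the v6 code quoted earlier in this thread, not the final upstream patch, and the mTHP branch is elided.]

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
	struct folio *folio;

	/*
	 * mTHP path: pick an order, allocate and charge with the same gfp,
	 * falling back to smaller orders on failure (see the earlier sketch).
	 */

	/* the order-0 fallback is charged here as well */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
	if (folio && mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
						    GFP_KERNEL, entry)) {
		folio_put(folio);
		return NULL;
	}
	return folio;
}

[With this shape, the swapcache_prepare() failure path needs no explicit uncharge: folio_put() drops the last reference and the memcg charge is released when the folio is freed, which is the point Barry makes above.]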
Thread overview: 59+ messages (newest: 2024-08-27 3:41 UTC)
2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song
2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
2024-07-30 3:00 ` Baolin Wang
2024-07-30 3:11 ` Matthew Wilcox
2024-07-30 3:15 ` Barry Song
2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song
2024-07-26 16:30 ` Yosry Ahmed
2024-07-29 2:02 ` Barry Song
2024-07-29 3:43 ` Matthew Wilcox
2024-07-29 4:52 ` Barry Song
2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song
2024-07-29 3:51 ` Matthew Wilcox
2024-07-29 4:41 ` Barry Song
[not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com>
2024-07-29 12:49 ` Matthew Wilcox
2024-07-29 13:11 ` Barry Song
2024-07-29 15:13 ` Matthew Wilcox
2024-07-29 20:03 ` Barry Song
2024-07-29 21:56 ` Barry Song
2024-07-30 8:12 ` Ryan Roberts
2024-07-29 6:36 ` Chuanhua Han
2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
2024-07-29 14:16 ` Dan Carpenter
2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song
2024-07-27 5:58 ` kernel test robot
2024-07-29 1:37 ` Barry Song
2024-07-29 3:52 ` Matthew Wilcox
2024-07-29 4:49 ` Barry Song
2024-07-29 16:11 ` Christoph Hellwig
2024-07-29 20:11 ` Barry Song
2024-07-30 16:30 ` Christoph Hellwig
2024-07-30 19:28 ` Nhat Pham
2024-07-30 21:06 ` Barry Song
2024-07-31 18:35 ` Nhat Pham
2024-08-01 3:00 ` Sergey Senozhatsky
2024-08-01 20:55 ` Chris Li
2024-08-12 8:27 ` Christoph Hellwig
2024-08-12 8:44 ` Barry Song
2024-07-30 2:27 ` Chuanhua Han
2024-07-30 8:36 ` Ryan Roberts
2024-07-30 8:47 ` David Hildenbrand
2024-08-05 6:10 ` Huang, Ying
2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song
2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song
2024-08-02 17:29 ` Chris Li
2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song
2024-08-03 19:08 ` Andrew Morton
2024-08-12 8:26 ` Christoph Hellwig
2024-08-12 8:53 ` Barry Song
2024-08-12 11:38 ` Christoph Hellwig
2024-08-15 9:47 ` Kairui Song
2024-08-15 13:27 ` Kefeng Wang
2024-08-15 23:06 ` Barry Song
2024-08-16 16:50 ` Kairui Song
2024-08-16 20:34 ` Andrew Morton
2024-08-27 3:41 ` Chuanhua Han
2024-08-16 21:16 ` Matthew Wilcox
2024-08-16 21:39 ` Barry Song