* [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile
@ 2024-07-26 9:46 Barry Song
2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
` (4 more replies)
0 siblings, 5 replies; 59+ messages in thread
From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
v-songbaohua, willy, xiang, yosryahmed
From: Barry Song <v-songbaohua@oppo.com>
In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app
is switched to the background, most of its memory might be swapped out.
We now have mTHP features, but unfortunately, without support for
large folio swap-in, once those large folios are swapped out, we lose
them immediately because mTHP is a one-way ticket.
This is unacceptable and reduces mTHP to merely a toy on systems
with significant swap utilization.
This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swaps that were likely swapped out from mTHP as
a whole.
Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions
of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.
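To make the "contiguous swaps swapped out from mTHP as a whole" requirement
concrete: patch 3/4 only considers folio orders whose first swap offset is
naturally aligned, and then verifies that every PTE in the candidate range
holds a contiguous swap entry with a consistent SWAP_HAS_CACHE state (see
can_swapin_thp() in patch 3/4). Below is a condensed sketch of the order
filter, simplified from the patch:

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
		unsigned long addr, unsigned long orders)
{
	int order, nr;

	order = highest_order(orders);

	/*
	 * To swap in a THP with nr pages, we require its first swap_offset
	 * to be aligned with nr. This filters out most invalid entries.
	 */
	while (orders) {
		nr = 1 << order;
		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
			break;
		order = next_order(&orders, order);
	}

	return orders;
}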
This approach offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in.
2. Eliminates fragmentation of swap slots and enables THP_SWPOUT to succeed
without fragmentation. Based on the data [1] observed with Chris's and Ryan's
THP swap allocation optimization, aligned swap-in plays a crucial role
in the success of THP_SWPOUT.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
and enhancing compression ratios significantly. We have another patchset
to enable mTHP compression and decompression in zsmalloc/zRAM[2].
Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
to be an optimal approach. There's a critical distinction between pagecache
and anonymous pages: pagecache can be evicted and later retrieved from disk,
potentially becoming an mTHP upon retrieval, whereas anonymous pages must
always reside in memory or in a swapfile. If we swap in small folios first and
only later identify adjacent memory suitable for swapping in as mTHP, the
pages that have already been converted to small folios may never transition
back to mTHP; converting an mTHP into small folios remains irreversible. This
introduces the risk of losing all mTHP over several swap-out and swap-in
cycles, not to mention losing the benefits of defragmentation, improved
compression ratios, and reduced CPU usage from mTHP compression/decompression.
Conversely, having deployed this feature on millions of real-world products
through OPPO's out-of-tree code [3], we haven't observed any significant
increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
[1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
[3] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
-v5:
* Add a swap-in control policy according to Ying's proposal. Right now only
"always" and "never" are supported; we can extend it to "auto" later;
* Fix the comment regarding zswap_never_enabled() according to Yosry;
* Filter out unaligned swp entries earlier;
* Add a mem_cgroup_swapin_uncharge_swap_nr() helper (see the sketch after
this list)
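For reference, the helper as posted in patch 2/4 of this series simply walks
the nr entries one by one; based on Yosry's and Matthew's feedback later in
this thread, v6 instead passes nr_pages straight through to
mem_cgroup_uncharge_swap() so that the page counter, stats, and refcount
updates are batched:

static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr)
{
	int i;

	/* Uncharge each of the nr contiguous swap entries in turn. */
	for (i = 0; i < nr; i++, entry.val++)
		mem_cgroup_swapin_uncharge_swap(entry);
}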
-v4:
https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/
Many parts of v3 have been merged into the mm tree with review help
from Ryan, David, Ying, Chris, and others. Thank you very much!
This is the final part to allocate large folios and map them.
* Use Yosry's zswap_never_enabled(); note there is a bug. I put the bug fix
in this v4 RFC, though it should be fixed in Yosry's patch
* Lots of code improvements (drop the large stack usage, hold the PTL, etc.)
according to Yosry's and Ryan's feedback
* Rebased on top of the latest mm-unstable and used some recently
introduced helpers.
-v3:
https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
* Avoid overwriting err in __swap_duplicate_nr(), pointed out by Yosry,
thanks!
* Fix the issue of the folio being charged twice in do_swap_page by
separating alloc_anon_folio() and alloc_swap_folio(), as they now differ
in several ways:
* memcg charging
* whether the allocated folio is cleared
-v2:
https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
* Lots of code cleanup according to Chris's comments, thanks!
* Collect Chris's Ack tags, thanks!
* Address David's comment on moving to folio_add_new_anon_rmap()
for !folio_test_anon folios in do_swap_page, thanks!
* Remove the MADV_PAGEOUT patch from this series as Ryan will
integrate it into the swap-out series
* Apply Kairui's work "mm/swap: fix race when skipping swapcache"
to large folio swap-in as well
* Fix corrupted (zero-filled) data in two races -- zswap, and the case
where part of the entries are in the swapcache while others are not --
by checking SWAP_HAS_CACHE while swapping in a large folio
-v1:
https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t
Barry Song (3):
mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
large folios swap-in
mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large
folios swap-in
mm: Introduce per-thpsize swapin control policy
Chuanhua Han (1):
mm: support large folios swapin as a whole for zRAM-like swapfile
Documentation/admin-guide/mm/transhuge.rst | 6 +
include/linux/huge_mm.h | 1 +
include/linux/memcontrol.h | 12 ++
include/linux/swap.h | 9 +-
mm/huge_memory.c | 44 +++++
mm/memory.c | 212 ++++++++++++++++++---
mm/swap.h | 10 +-
mm/swapfile.c | 102 ++++++----
8 files changed, 329 insertions(+), 67 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 59+ messages in thread* [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-30 3:00 ` Baolin Wang 2024-07-30 3:11 ` Matthew Wilcox 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song ` (3 subsequent siblings) 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports one entry only, to support large folio swap-in, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate_nr(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/swap.h | 9 +++- mm/swap.h | 10 ++++- mm/swapfile.c | 102 ++++++++++++++++++++++++++----------------- 3 files changed, 77 insertions(+), 44 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ba7ea95d1c57..f1b28fd04533 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -480,7 +480,7 @@ extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t); +extern int swapcache_prepare_nr(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -554,7 +554,7 @@ static inline int swap_duplicate(swp_entry_t swp) return 0; } -static inline int swapcache_prepare(swp_entry_t swp) +static inline int swapcache_prepare_nr(swp_entry_t swp, int nr) { return 0; } @@ -612,6 +612,11 @@ static inline void swap_free(swp_entry_t entry) swap_free_nr(entry, 1); } +static inline int swapcache_prepare(swp_entry_t entry) +{ + return swapcache_prepare_nr(entry, 1); +} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/swap.h b/mm/swap.h index baa1fa946b34..81ff7eb0be9c 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -59,7 +59,7 @@ void __delete_from_swap_cache(struct folio *folio, void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -120,7 +120,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) return 0; } -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +static inline void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { } @@ -172,4 
+172,10 @@ static inline unsigned int folio_swap_flags(struct folio *folio) return 0; } #endif /* CONFIG_SWAP */ + +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +{ + swapcache_clear_nr(si, entry, 1); +} + #endif /* _MM_SWAP_H */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 5f73a8553371..e688e46f1c62 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3363,7 +3363,7 @@ void si_swapinfo(struct sysinfo *val) } /* - * Verify that a swap entry is valid and increment its swap map count. + * Verify that nr swap entries are valid and increment their swap map counts. * * Returns error code in following case. * - success -> 0 @@ -3373,66 +3373,88 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. -> ENOMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +static int __swap_duplicate_nr(swp_entry_t entry, unsigned char usage, int nr) { struct swap_info_struct *p; struct swap_cluster_info *ci; unsigned long offset; unsigned char count; unsigned char has_cache; - int err; + int err, i; p = swp_swap_info(entry); offset = swp_offset(entry); + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); ci = lock_cluster_or_swap_info(p, offset); - count = p->swap_map[offset]; + err = 0; + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; - /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. - */ - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { - err = -ENOENT; - goto unlock_out; - } + /* + * swapin_readahead() doesn't check if a swap entry is valid, so the + * swap entry could be SWAP_MAP_BAD. Check here with lock held. 
+ */ + if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { + err = -ENOENT; + goto unlock_out; + } - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - err = 0; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; - if (usage == SWAP_HAS_CACHE) { + if (usage == SWAP_HAS_CACHE) { + /* set SWAP_HAS_CACHE if there is no cache and entry is used */ + if (!has_cache && count) + continue; + else if (has_cache) /* someone else added cache */ + err = -EEXIST; + else /* no users remaining */ + err = -ENOENT; - /* set SWAP_HAS_CACHE if there is no cache and entry is used */ - if (!has_cache && count) - has_cache = SWAP_HAS_CACHE; - else if (has_cache) /* someone else added cache */ - err = -EEXIST; - else /* no users remaining */ - err = -ENOENT; + } else if (count || has_cache) { - } else if (count || has_cache) { + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + continue; + else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) + err = -EINVAL; + else if (swap_count_continued(p, offset + i, count)) + continue; + else + err = -ENOMEM; + } else + err = -ENOENT; /* unused swap entry */ - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if (err) + goto unlock_out; + } + + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (usage == SWAP_HAS_CACHE) + has_cache = SWAP_HAS_CACHE; + else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) - err = -EINVAL; - else if (swap_count_continued(p, offset, count)) - count = COUNT_CONTINUED; else - err = -ENOMEM; - } else - err = -ENOENT; /* unused swap entry */ + count = COUNT_CONTINUED; - if (!err) - WRITE_ONCE(p->swap_map[offset], count | has_cache); + WRITE_ONCE(p->swap_map[offset + i], count | has_cache); + } unlock_out: unlock_cluster_or_swap_info(p, ci); return err; } +static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +{ + return __swap_duplicate_nr(entry, usage, 1); +} + /* * Help swapoff by noting that swap entry belongs to shmem/tmpfs * (in which case its reference count is never incremented). @@ -3459,23 +3481,23 @@ int swap_duplicate(swp_entry_t entry) } /* - * @entry: swap entry for which we allocate swap cache. + * @entry: first swap entry from which we allocate nr swap cache. * - * Called when allocating swap cache for existing swap entry, + * Called when allocating swap cache for existing swap entries, * This can return error codes. Returns 0 at success. * -EEXIST means there is a swap cache. * Note: return code is different from swap_duplicate(). */ -int swapcache_prepare(swp_entry_t entry) +int swapcache_prepare_nr(swp_entry_t entry, int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE); + return __swap_duplicate_nr(entry, SWAP_HAS_CACHE, nr); } -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { unsigned long offset = swp_offset(entry); - cluster_swap_free_nr(si, offset, 1, SWAP_HAS_CACHE); + cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } struct swap_info_struct *swp_swap_info(swp_entry_t entry) -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
@ 2024-07-30 3:00 ` Baolin Wang
  2024-07-30 3:11 ` Matthew Wilcox
  1 sibling, 0 replies; 59+ messages in thread
From: Baolin Wang @ 2024-07-30 3:00 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: ying.huang, chrisl, david, hannes, hughd, kaleshsingh, kasong,
    linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed

Hi Barry,

On 2024/7/26 17:46, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
> one entry only, to support large folio swap-in, we need to handle multiple
> swap entries.
>
> To optimize stack usage, we iterate twice in __swap_duplicate_nr(): the
> first time to verify that all entries are valid, and the second time to
> apply the modifications to the entries.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM. Feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

By the way, my shmem swap patchset [1] also relies on this patch, so I
wonder if it's possible to merge this patch into the mm-unstable branch
first (if other patches still need discussion), to make it easier for me
to rebase and resend my patch set? Thanks.

[1] https://lore.kernel.org/all/cover.1720079976.git.baolin.wang@linux.alibaba.com/

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
  2024-07-30 3:00 ` Baolin Wang
@ 2024-07-30 3:11 ` Matthew Wilcox
  2024-07-30 3:15 ` Barry Song
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-30 3:11 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
    hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, yosryahmed

On Fri, Jul 26, 2024 at 09:46:15PM +1200, Barry Song wrote:
> +static inline int swapcache_prepare(swp_entry_t entry)
> +{
> +	return swapcache_prepare_nr(entry, 1);
> +}

Same comment as 2/4 -- there are only two callers of swapcache_prepare().
Just make that take the 'nr' argument and change both callers to pass 1.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
  2024-07-30 3:11 ` Matthew Wilcox
@ 2024-07-30 3:15 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-30 3:15 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
    hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, yosryahmed

On Tue, Jul 30, 2024 at 11:11 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 26, 2024 at 09:46:15PM +1200, Barry Song wrote:
> > +static inline int swapcache_prepare(swp_entry_t entry)
> > +{
> > +	return swapcache_prepare_nr(entry, 1);
> > +}
>
> Same comment as 2/4 -- there are only two callers of swapcache_prepare().
> Just make that take the 'nr' argument and change both callers to pass 1.

Makes sense to me. As Baolin also needs this patch for shmem, I'm going
to separate this one from this series and send a new version with the
suggested change so that Andrew can pull it earlier.

Thanks
Barry

^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song 2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-26 16:30 ` Yosry Ahmed 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song ` (2 subsequent siblings) 4 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> With large folios swap-in, we might need to uncharge multiple entries all together, it is better to introduce a helper for that. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..55958cbce61b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) +{ + int i; + + for (i = 0; i < nr; i++, entry.val++) + mem_cgroup_swapin_uncharge_swap(entry); +} + void __mem_cgroup_uncharge(struct folio *folio); /** @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) { } +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) +{ +} + static inline void mem_cgroup_uncharge(struct folio *folio) { } -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song @ 2024-07-26 16:30 ` Yosry Ahmed 2024-07-29 2:02 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Yosry Ahmed @ 2024-07-26 16:30 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang On Fri, Jul 26, 2024 at 2:47 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > With large folios swap-in, we might need to uncharge multiple entries > all together, it is better to introduce a helper for that. > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/memcontrol.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 1b79760af685..55958cbce61b 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++, entry.val++) > + mem_cgroup_swapin_uncharge_swap(entry); mem_cgroup_swapin_uncharge_swap() calls mem_cgroup_uncharge_swap() which already takes in nr_pages, but we currently only pass 1. Would it be better if we just make mem_cgroup_swapin_uncharge_swap() take in nr_pages as well and pass it along to mem_cgroup_uncharge_swap(), instead of calling it in a loop? This would batch the page counter, stats updates, and refcount updates in mem_cgroup_uncharge_swap(). You may be able to observe a bit of a performance gain with this. > +} > + > void __mem_cgroup_uncharge(struct folio *folio); > > /** > @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > { > } > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > +{ > +} > + > static inline void mem_cgroup_uncharge(struct folio *folio) > { > } > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in 2024-07-26 16:30 ` Yosry Ahmed @ 2024-07-29 2:02 ` Barry Song 2024-07-29 3:43 ` Matthew Wilcox 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 2:02 UTC (permalink / raw) To: yosryahmed Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang On Sat, Jul 27, 2024 at 4:31 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jul 26, 2024 at 2:47 AM Barry Song <21cnbao@gmail.com> wrote: > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > With large folios swap-in, we might need to uncharge multiple entries > > all together, it is better to introduce a helper for that. > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > include/linux/memcontrol.h | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 1b79760af685..55958cbce61b 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -684,6 +684,14 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > gfp_t gfp, swp_entry_t entry); > > void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > > > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > > +{ > > + int i; > > + > > + for (i = 0; i < nr; i++, entry.val++) > > + mem_cgroup_swapin_uncharge_swap(entry); > > mem_cgroup_swapin_uncharge_swap() calls mem_cgroup_uncharge_swap() > which already takes in nr_pages, but we currently only pass 1. Would > it be better if we just make mem_cgroup_swapin_uncharge_swap() take in > nr_pages as well and pass it along to mem_cgroup_uncharge_swap(), > instead of calling it in a loop? > > This would batch the page counter, stats updates, and refcount updates > in mem_cgroup_uncharge_swap(). You may be able to observe a bit of a > performance gain with this. Good suggestion. I'll send the v6 version below after waiting for some comments on the other patches. From 92dfbf300fd51b427d2a6833226d1b777e0b5fee Mon Sep 17 00:00:00 2001 From: Barry Song <v-songbaohua@oppo.com> Date: Fri, 26 Jul 2024 14:33:54 +1200 Subject: [PATCH v6 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in With large folios swap-in, we might need to uncharge multiple entries all together, it is better to introduce a helper for that. 
Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 10 ++++++++-- mm/memcontrol.c | 7 ++++--- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..f5dd1e34654a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); + +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages); void __mem_cgroup_uncharge(struct folio *folio); @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, return 0; } -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) { } @@ -1796,6 +1797,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, #endif /* CONFIG_MEMCG */ +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +{ + mem_cgroup_swapin_uncharge_swap_nr(entry, 1); +} + #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP) bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index eb92c21615eb..25657d6a133f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4573,14 +4573,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, /* * mem_cgroup_swapin_uncharge_swap - uncharge swap slot - * @entry: swap entry for which the page is charged + * @entry: the first swap entry for which the pages are charged + * @nr_pages: number of pages which will be uncharged * * Call this function after successfully adding the charged page to swapcache. * * Note: This function assumes the page for which swap slot is being uncharged * is order 0 page. */ -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages) { /* * Cgroup1's unified memory+swap counter has been charged with the @@ -4600,7 +4601,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) * let's not wait for it. The page already received a * memory+swap charge, drop the swap entry duplicate. */ - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); } } -- 2.34.1 > > > +} > > + > > void __mem_cgroup_uncharge(struct folio *folio); > > > > /** > > @@ -1185,6 +1193,10 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > > { > > } > > > > +static inline void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, int nr) > > +{ > > +} > > + > > static inline void mem_cgroup_uncharge(struct folio *folio) > > { > > } > > -- > > 2.34.1 > > ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in
  2024-07-29 2:02 ` Barry Song
@ 2024-07-29 3:43 ` Matthew Wilcox
  2024-07-29 4:52 ` Barry Song
  0 siblings, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-29 3:43 UTC (permalink / raw)
To: Barry Song
Cc: yosryahmed, akpm, baolin.wang, chrisl, david, hannes, hughd,
    kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, ying.huang

On Mon, Jul 29, 2024 at 02:02:22PM +1200, Barry Song wrote:
> -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> +
> +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages);
[...]
> +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> +{
> +	mem_cgroup_swapin_uncharge_swap_nr(entry, 1);
> +}

There are only two callers of mem_cgroup_swapin_uncharge_swap! Just
add an argument to mem_cgroup_swapin_uncharge_swap() and change the two
callers. It would be _less_ code than this extra wrapper, and certainly
less confusing.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper for large folios swap-in
  2024-07-29 3:43 ` Matthew Wilcox
@ 2024-07-29 4:52 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 4:52 UTC (permalink / raw)
To: Matthew Wilcox
Cc: yosryahmed, akpm, baolin.wang, chrisl, david, hannes, hughd,
    kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs,
    ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
    v-songbaohua, xiang, ying.huang

On Mon, Jul 29, 2024 at 3:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 02:02:22PM +1200, Barry Song wrote:
> > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> > +
> > +void mem_cgroup_swapin_uncharge_swap_nr(swp_entry_t entry, unsigned int nr_pages);
> [...]
> > +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> > +{
> > +	mem_cgroup_swapin_uncharge_swap_nr(entry, 1);
> > +}
>
> There are only two callers of mem_cgroup_swapin_uncharge_swap! Just
> add an argument to mem_cgroup_swapin_uncharge_swap() and change the two
> callers. It would be _less_ code than this extra wrapper, and certainly
> less confusing.

Sounds good to me. I can totally drop this wrapper -
mem_cgroup_swapin_uncharge_swap() - in v6.

Thanks
Barry

^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song 2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song 2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-29 3:51 ` Matthew Wilcox 2024-07-29 14:16 ` Dan Carpenter 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed, Chuanhua Han From: Chuanhua Han <hanchuanhua@oppo.com> In an embedded system like Android, more than half of anonymous memory is actually stored in swap devices such as zRAM. For instance, when an app is switched to the background, most of its memory might be swapped out. Currently, we have mTHP features, but unfortunately, without support for large folio swap-ins, once those large folios are swapped out, we lose them immediately because mTHP is a one-way ticket. This patch introduces mTHP swap-in support. For now, we limit mTHP swap-ins to contiguous swaps that were likely swapped out from mTHP as a whole. Additionally, the current implementation only covers the SWAP_SYNCHRONOUS case. This is the simplest and most common use case, benefiting millions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues. It offers several benefits: 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in. 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT without fragmentation. 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage and enhancing compression ratios significantly. Deploying this on millions of actual products, we haven't observed any noticeable increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64. Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 188 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833d2cad6eb2..14048e9285d4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +/* + * check a range of PTEs are completely swap entries with + * contiguous swap offsets and the same SWAP_HAS_CACHE. 
+ * ptep must be first one in the range + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + char has_cache; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + si = swp_swap_info(entry); + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; + for (i = 1; i < nr_pages; i++) { + /* + * while allocating a large folio and doing swap_read_folio for the + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte + * doesn't have swapcache. We need to ensure all PTEs have no cache + * as well, otherwise, we might go to swap devices while the content + * is in swapcache + */ + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) + return false; + } + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap-in a THP with nr pages, we require its first swap_offset + * is aligned with nr. This can filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} +#endif + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. + */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. 
*/ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + return folio; + order = next_order(&orders, order); + } + +fallback: +#endif + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); +} + + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare_nr(entry, nr_pages)) { + /* Relax a bit to prevent rapid repeated page faults */ + schedule_timeout_uninterruptible(1); + goto out_page; + } + need_clear_cache = true; + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large folios + * which are fully exclusive. If we ever get large folios within + * swapcache here, we have to be careful. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song @ 2024-07-29 3:51 ` Matthew Wilcox 2024-07-29 4:41 ` Barry Song 2024-07-29 6:36 ` Chuanhua Han 2024-07-29 14:16 ` Dan Carpenter 1 sibling, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 3:51 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > - vma, vmf->address, false); > + folio = alloc_swap_folio(vmf); > page = &folio->page; This is no longer correct. You need to set 'page' to the precise page that is being faulted rather than the first page of the folio. It was fine before because it always allocated a single-page folio, but now it must use folio_page() or folio_file_page() (whichever has the correct semantics for you). Also you need to fix your test suite to notice this bug. I suggest doing that first so that you know whether you've got the calculation correct. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 3:51 ` Matthew Wilcox @ 2024-07-29 4:41 ` Barry Song [not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com> 2024-07-29 6:36 ` Chuanhua Han 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 4:41 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > - vma, vmf->address, false); > > + folio = alloc_swap_folio(vmf); > > page = &folio->page; > > This is no longer correct. You need to set 'page' to the precise page > that is being faulted rather than the first page of the folio. It was > fine before because it always allocated a single-page folio, but now it > must use folio_page() or folio_file_page() (whichever has the correct > semantics for you). > > Also you need to fix your test suite to notice this bug. I suggest > doing that first so that you know whether you've got the calculation > correct. I don't understand why the code is designed in the way the page is the first page of this folio. Otherwise, we need lots of changes later while mapping the folio in ptes and rmap. > Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
[parent not found: <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com>]
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile [not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com> @ 2024-07-29 12:49 ` Matthew Wilcox 2024-07-29 13:11 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 12:49 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote: > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > > > - vma, vmf->address, false); > > > > + folio = alloc_swap_folio(vmf); > > > > page = &folio->page; > > > > > > This is no longer correct. You need to set 'page' to the precise page > > > that is being faulted rather than the first page of the folio. It was > > > fine before because it always allocated a single-page folio, but now it > > > must use folio_page() or folio_file_page() (whichever has the correct > > > semantics for you). > > > > > > Also you need to fix your test suite to notice this bug. I suggest > > > doing that first so that you know whether you've got the calculation > > > correct. > > > > I don't understand why the code is designed in the way the page > > is the first page of this folio. Otherwise, we need lots of changes > > later while mapping the folio in ptes and rmap. What? folio = swap_cache_get_folio(entry, vma, vmf->address); if (folio) page = folio_file_page(folio, swp_offset(entry)); page is the precise page, not the first page of the folio. > For both accessing large folios in the swapcache and allocating > new large folios, the page points to the first page of the folio. we > are mapping the whole folio not the specific page. But what address are we mapping the whole folio at? > for swapcache cases, you can find the same thing here, > > if (folio_test_large(folio) && folio_test_swapcache(folio)) { > ... > entry = folio->swap; > page = &folio->page; > } Yes, but you missed some important lines from your quote: page_idx = idx; address = folio_start; ptep = folio_ptep; nr_pages = nr; We deliberate adjust the address so that, yes, we're mapping the entire folio, but we're mapping it at an address that means that the page we actually faulted on ends up at the address that we faulted on. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 12:49 ` Matthew Wilcox @ 2024-07-29 13:11 ` Barry Song 2024-07-29 15:13 ` Matthew Wilcox 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 13:11 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote: > > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote: > > > > > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > > > > - vma, vmf->address, false); > > > > > + folio = alloc_swap_folio(vmf); > > > > > page = &folio->page; > > > > > > > > This is no longer correct. You need to set 'page' to the precise page > > > > that is being faulted rather than the first page of the folio. It was > > > > fine before because it always allocated a single-page folio, but now it > > > > must use folio_page() or folio_file_page() (whichever has the correct > > > > semantics for you). > > > > > > > > Also you need to fix your test suite to notice this bug. I suggest > > > > doing that first so that you know whether you've got the calculation > > > > correct. > > > > > > I don't understand why the code is designed in the way the page > > > is the first page of this folio. Otherwise, we need lots of changes > > > later while mapping the folio in ptes and rmap. > > What? > > folio = swap_cache_get_folio(entry, vma, vmf->address); > if (folio) > page = folio_file_page(folio, swp_offset(entry)); > > page is the precise page, not the first page of the folio. this is the case we may get a large folio in swapcache but we result in mapping only one subpage due to the condition to map the whole folio is not met. if we meet the condition, we are going to set page to the head instead and map the whole mTHP: if (folio_test_large(folio) && folio_test_swapcache(folio)) { int nr = folio_nr_pages(folio); unsigned long idx = folio_page_idx(folio, page); unsigned long folio_start = address - idx * PAGE_SIZE; unsigned long folio_end = folio_start + nr * PAGE_SIZE; pte_t *folio_ptep; pte_t folio_pte; if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) goto check_folio; if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) goto check_folio; folio_ptep = vmf->pte - idx; folio_pte = ptep_get(folio_ptep); if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || swap_pte_batch(folio_ptep, nr, folio_pte) != nr) goto check_folio; page_idx = idx; address = folio_start; ptep = folio_ptep; nr_pages = nr; entry = folio->swap; page = &folio->page; } > > > For both accessing large folios in the swapcache and allocating > > new large folios, the page points to the first page of the folio. we > > are mapping the whole folio not the specific page. > > But what address are we mapping the whole folio at? > > > for swapcache cases, you can find the same thing here, > > > > if (folio_test_large(folio) && folio_test_swapcache(folio)) { > > ... 
> > entry = folio->swap; > > page = &folio->page; > > } > > Yes, but you missed some important lines from your quote: > > page_idx = idx; > address = folio_start; > ptep = folio_ptep; > nr_pages = nr; > > We deliberate adjust the address so that, yes, we're mapping the entire > folio, but we're mapping it at an address that means that the page we > actually faulted on ends up at the address that we faulted on. for this zRAM case, it is a new allocated large folio, only while all conditions are met, we will allocate and map the whole folio. you can check can_swapin_thp() and thp_swap_suitable_orders(). static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) { struct swap_info_struct *si; unsigned long addr; swp_entry_t entry; pgoff_t offset; char has_cache; int idx, i; pte_t pte; addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); idx = (vmf->address - addr) / PAGE_SIZE; pte = ptep_get(ptep); if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) return false; entry = pte_to_swp_entry(pte); offset = swp_offset(entry); if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) return false; si = swp_swap_info(entry); has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; for (i = 1; i < nr_pages; i++) { /* * while allocating a large folio and doing swap_read_folio for the * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte * doesn't have swapcache. We need to ensure all PTEs have no cache * as well, otherwise, we might go to swap devices while the content * is in swapcache */ if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) return false; } return true; } and static struct folio *alloc_swap_folio(struct vm_fault *vmf) { .... entry = pte_to_swp_entry(vmf->orig_pte); /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * and suitable for swapping THP. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); .... } static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, unsigned long addr, unsigned long orders) { int order, nr; order = highest_order(orders); /* * To swap-in a THP with nr pages, we require its first swap_offset * is aligned with nr. This can filter out most invalid entries. */ while (orders) { nr = 1 << order; if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) break; order = next_order(&orders, order); } return orders; } A mTHP is swapped out at aligned swap offset. and we only swap in aligned mTHP. if somehow one mTHP is mremap() to unaligned address, we won't swap them in as a large folio. For swapcache case, we are still checking unaligned mTHP, but for new allocated mTHP, it is a different story. There is totally no necessity to support unaligned mTHP and there is no possibility to support unless something is marked in swap devices to say there was a mTHP. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 13:11 ` Barry Song @ 2024-07-29 15:13 ` Matthew Wilcox 2024-07-29 20:03 ` Barry Song 2024-07-30 8:12 ` Ryan Roberts 0 siblings, 2 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 15:13 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote: > for this zRAM case, it is a new allocated large folio, only > while all conditions are met, we will allocate and map > the whole folio. you can check can_swapin_thp() and > thp_swap_suitable_orders(). YOU ARE DOING THIS WRONGLY! All of you anonymous memory people are utterly fixated on TLBs AND THIS IS WRONG. Yes, TLB performance is important, particularly with crappy ARM designs, which I know a lot of you are paid to work on. But you seem to think this is the only consideration, and you're making bad design choices as a result. It's overly complicated, and you're leaving performance on the table. Look back at the results Ryan showed in the early days of working on large anonymous folios. Half of the performance win on his system came from using larger TLBs. But the other half came from _reduced software overhead_. The LRU lock is a huge problem, and using large folios cuts the length of the LRU list, hence LRU lock hold time. Your _own_ data on how hard it is to get hold of a large folio due to fragmentation should be enough to convince you that the more large folios in the system, the better the whole system runs. We should not decline to allocate large folios just because they can't be mapped with a single TLB! ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-29 15:13 ` Matthew Wilcox @ 2024-07-29 20:03 ` Barry Song 2024-07-29 21:56 ` Barry Song 2024-07-30 8:12 ` Ryan Roberts 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 20:03 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote: > > for this zRAM case, it is a new allocated large folio, only > > while all conditions are met, we will allocate and map > > the whole folio. you can check can_swapin_thp() and > > thp_swap_suitable_orders(). > > YOU ARE DOING THIS WRONGLY! > > All of you anonymous memory people are utterly fixated on TLBs AND THIS > IS WRONG. Yes, TLB performance is important, particularly with crappy > ARM designs, which I know a lot of you are paid to work on. But you > seem to think this is the only consideration, and you're making bad > design choices as a result. It's overly complicated, and you're leaving > performance on the table. > > Look back at the results Ryan showed in the early days of working on > large anonymous folios. Half of the performance win on his system came > from using larger TLBs. But the other half came from _reduced software > overhead_. The LRU lock is a huge problem, and using large folios cuts > the length of the LRU list, hence LRU lock hold time. > > Your _own_ data on how hard it is to get hold of a large folio due to > fragmentation should be enough to convince you that the more large folios > in the system, the better the whole system runs. We should not decline to > allocate large folios just because they can't be mapped with a single TLB! I am not convinced. for a new allocated large folio, even alloc_anon_folio() of do_anonymous_page() does the exactly same thing alloc_anon_folio() { /* * Get a list of all the (large) orders below PMD_ORDER that are enabled * for this vma. Then filter out the orders that can't be allocated over * the faulting address and still be fully contained in the vma. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); } you are not going to allocate a mTHP for an unaligned address for a new PF. Please point out where it is wrong. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 20:03 ` Barry Song
@ 2024-07-29 21:56 ` Barry Song
0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 21:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb,
v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Tue, Jul 30, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> > > for this zRAM case, it is a new allocated large folio, only
> > > while all conditions are met, we will allocate and map
> > > the whole folio. you can check can_swapin_thp() and
> > > thp_swap_suitable_orders().
> >
> > YOU ARE DOING THIS WRONGLY!
> >
> > All of you anonymous memory people are utterly fixated on TLBs AND THIS
> > IS WRONG.  Yes, TLB performance is important, particularly with crappy
> > ARM designs, which I know a lot of you are paid to work on.  But you
> > seem to think this is the only consideration, and you're making bad
> > design choices as a result.  It's overly complicated, and you're leaving
> > performance on the table.
> >
> > Look back at the results Ryan showed in the early days of working on
> > large anonymous folios.  Half of the performance win on his system came
> > from using larger TLBs.  But the other half came from _reduced software
> > overhead_.  The LRU lock is a huge problem, and using large folios cuts
> > the length of the LRU list, hence LRU lock hold time.
> >
> > Your _own_ data on how hard it is to get hold of a large folio due to
> > fragmentation should be enough to convince you that the more large folios
> > in the system, the better the whole system runs.  We should not decline to
> > allocate large folios just because they can't be mapped with a single TLB!
>
> I am not convinced. For a newly allocated large folio, even alloc_anon_folio()
> in do_anonymous_page() does exactly the same thing:
>
> alloc_anon_folio()
> {
>	/*
>	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>	 * for this vma. Then filter out the orders that can't be allocated over
>	 * the faulting address and still be fully contained in the vma.
>	 */
>	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
>	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> }
>
> You are not going to allocate an mTHP for an unaligned address for a new
> PF. Please point out where this is wrong.

Let's assume we have a folio whose virtual address range is
0x500000000000 ~ 0x500000000000 + 64KB, and it is swapped out to
swap offsets 0x10000 ~ 0x10000 + 64KB.

The current code will swap it in as an mTHP if a page fault occurs at any
address within (0x500000000000 ~ 0x500000000000 + 64KB). In this case, the
mTHP enjoys both reduced TLB pressure and reduced software overhead such as
the LRU lock, so it sounds like we lose nothing here.

But if the folio is mremap-ed to an unaligned address like
(0x600000000000 + 16KB ~ 0x600000000000 + 80KB) while its swap offsets are
still (0x10000 ~ 0x10000 + 64KB), the current code won't swap it in as an
mTHP. Sounds like a loss?

If this is the performance problem you are trying to address, my point is
that it is not worth increasing the complexity at this stage, though it
might be doable. We once tracked hundreds of phones running apps randomly
for a couple of days, and we didn't encounter such a case. So this is
pretty much a corner case.

If your concern is more than this, for example, if you want to swap in
large folios even when swaps are completely non-contiguous, that is a
different story. I agree this is a potential optimization direction to go,
but in that case, you still need to find an aligned boundary to handle page
faults just like do_anonymous_page(); otherwise, you may end up with all
kinds of pointless intersections where PFs can cover the address ranges of
other PFs, making PTE checks such as pte_range_none() completely disordered:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	....
	/*
	 * Find the highest order where the aligned range is completely
	 * pte_none(). Note that all remaining orders will be completely
	 * pte_none().
	 */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		if (pte_range_none(pte + pte_index(addr), 1 << order))
			break;
		order = next_order(&orders, order);
	}
}

>
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 59+ messages in thread
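The loop quoted above is the crux of the "aligned boundary" argument: the fault handler only ever probes naturally aligned blocks of PTEs, so concurrent faults cannot partially overlap each other's ranges. Below is a small userspace model of that scan, where a bool array stands in for the PTE page and pick_order()/range_none() are invented names; it only illustrates the shape of the check, not the kernel code.

#include <stdio.h>
#include <stdbool.h>

#define PTRS_PER_PTE	512

/* Userspace stand-in for the PTE page: true means the slot is pte_none(). */
static bool pte_none_slot[PTRS_PER_PTE];

/* Simplified pte_range_none(): all slots in [idx, idx + nr) are empty. */
static bool range_none(unsigned int idx, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		if (!pte_none_slot[idx + i])
			return false;
	return true;
}

/*
 * Walk orders from high to low and pick the first one whose naturally
 * aligned slot range is still fully empty, like the loop quoted above.
 */
static int pick_order(unsigned int fault_idx, int max_order)
{
	for (int order = max_order; order > 0; order--) {
		unsigned int base = fault_idx & ~((1U << order) - 1);

		if (range_none(base, 1U << order))
			return order;
	}
	return 0;
}

int main(void)
{
	for (int i = 0; i < PTRS_PER_PTE; i++)
		pte_none_slot[i] = true;

	pte_none_slot[5] = false;	/* someone already mapped slot 5 */

	/* Fault at slot 20: the aligned 16-slot block [16, 32) is empty. */
	printf("order for slot 20: %d\n", pick_order(20, 4));
	/* Fault at slot 3: aligned blocks containing slot 5 are rejected. */
	printf("order for slot 3:  %d\n", pick_order(3, 4));
	return 0;
}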
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 15:13 ` Matthew Wilcox
2024-07-29 20:03 ` Barry Song
@ 2024-07-30 8:12 ` Ryan Roberts
1 sibling, 0 replies; 59+ messages in thread
From: Ryan Roberts @ 2024-07-30 8:12 UTC (permalink / raw)
To: Matthew Wilcox, Barry Song
Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes,
hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang,
yosryahmed, Chuanhua Han

On 29/07/2024 16:13, Matthew Wilcox wrote:
> On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
>> for this zRAM case, it is a new allocated large folio, only
>> while all conditions are met, we will allocate and map
>> the whole folio. you can check can_swapin_thp() and
>> thp_swap_suitable_orders().
>
> YOU ARE DOING THIS WRONGLY!

I've only scanned the preceding thread, but I think you're talking about the
design decision to only allocate large folios that are naturally aligned in
virtual address space, and you're arguing to remove that restriction?

The main reason we gave ourselves that constraint for anon mTHP was that
allowing it would create the possibility of wandering off the end of the PTE
table and would add significant complexity to managing neighbouring PTE tables
and their respective PTLs. If the proposal is to start doing this, then I don't
agree with that approach.

>
> All of you anonymous memory people are utterly fixated on TLBs AND THIS
> IS WRONG.  Yes, TLB performance is important, particularly with crappy
> ARM designs, which I know a lot of you are paid to work on.  But you
> seem to think this is the only consideration, and you're making bad
> design choices as a result.  It's overly complicated, and you're leaving
> performance on the table.
>
> Look back at the results Ryan showed in the early days of working on
> large anonymous folios.  Half of the performance win on his system came
> from using larger TLBs.  But the other half came from _reduced software
> overhead_.

I would just point out that I think the results you are referring to are for
the kernel compilation workload, and yes, this is indeed what I observed. But
kernel compilation is a bit of an outlier since it does a huge amount of
fork/exec, so the kernel spends a lot of time fiddling with page tables and
faulting. The vast majority of the reduced SW overhead is due to significantly
reducing the number of faults because we map more pages per fault.

But in my experience this workload is a bit of an outlier; most workloads that
I've tested with at least tend to set up their memory at the start and it's
static forever more, which means that those workloads benefit mostly from the
TLB benefits - there are very few existing SW overheads to actually reduce.

> The LRU lock is a huge problem, and using large folios cuts
> the length of the LRU list, hence LRU lock hold time.

I'm sure this is true and you have lots more experience and data than me. And
it makes intuitive sense. But I've never personally seen this in any of the
workloads that I've benchmarked.

Thanks,
Ryan

>
> Your _own_ data on how hard it is to get hold of a large folio due to
> fragmentation should be enough to convince you that the more large folios
> in the system, the better the whole system runs.  We should not decline to
> allocate large folios just because they can't be mapped with a single TLB!
>

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 3:51 ` Matthew Wilcox
2024-07-29 4:41 ` Barry Song
@ 2024-07-29 6:36 ` Chuanhua Han
2024-07-29 12:55 ` Matthew Wilcox
1 sibling, 1 reply; 59+ messages in thread
From: Chuanhua Han @ 2024-07-29 6:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
>
> On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -				vma, vmf->address, false);
> > +		folio = alloc_swap_folio(vmf);
> >  		page = &folio->page;
>
> This is no longer correct.  You need to set 'page' to the precise page
> that is being faulted rather than the first page of the folio.  It was
> fine before because it always allocated a single-page folio, but now it
> must use folio_page() or folio_file_page() (whichever has the correct
> semantics for you).
>
> Also you need to fix your test suite to notice this bug.  I suggest
> doing that first so that you know whether you've got the calculation
> correct.
>

This is not a problem now: we swap large folios in as a whole, so the head
page is used here instead of the page that is being faulted. You can also
see from the current code context that swapping in a whole large folio is
not the same as the previous behaviour, which only swapped in a single
small page.

--
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 6:36 ` Chuanhua Han
@ 2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
0 siblings, 2 replies; 59+ messages in thread
From: Matthew Wilcox @ 2024-07-29 12:55 UTC (permalink / raw)
To: Chuanhua Han
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> >
> > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > -				vma, vmf->address, false);
> > > +		folio = alloc_swap_folio(vmf);
> > >  		page = &folio->page;
> >
> > This is no longer correct.  You need to set 'page' to the precise page
> > that is being faulted rather than the first page of the folio.  It was
> > fine before because it always allocated a single-page folio, but now it
> > must use folio_page() or folio_file_page() (whichever has the correct
> > semantics for you).
> >
> > Also you need to fix your test suite to notice this bug.  I suggest
> > doing that first so that you know whether you've got the calculation
> > correct.
> >
> This is not a problem now: we swap large folios in as a whole, so the head
> page is used here instead of the page that is being faulted. You can also
> see from the current code context that swapping in a whole large folio is
> not the same as the previous behaviour, which only swapped in a single
> small page.

You have completely failed to understand the problem.  Let's try it this
way:

We take a page fault at address 0x123456789000.
If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
If you now map page 0 of the folio at 0x123456789000, you've
given the user the wrong page!  That looks like data corruption.

The code in
	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
as Barry pointed out will save you -- but what if those conditions fail?
What if the mmap has been mremap()ed and the folio now crosses a PMD
boundary?  mk_pte() will now be called on the wrong page.

^ permalink raw reply	[flat|nested] 59+ messages in thread
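Matthew's example reduces to a one-line index calculation: the page to map is folio_page(folio, (fault_addr - folio_start) / PAGE_SIZE), never unconditionally the head page. Here is a minimal sketch of that arithmetic with his addresses, using plain integers rather than real folio structures, and assuming the folio is naturally aligned in virtual address space (which is what the series enforces):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	uint64_t fault_addr  = 0x123456789000ULL;	/* faulting address */
	uint64_t folio_size  = 4 * PAGE_SIZE;		/* 16KiB folio */
	uint64_t folio_start = fault_addr & ~(folio_size - 1);
	uint64_t page_idx    = (fault_addr - folio_start) >> PAGE_SHIFT;

	/* folio_start is 0x123456788000, so the faulting page is index 1. */
	printf("folio_start = %#llx, page index in folio = %llu\n",
	       (unsigned long long)folio_start, (unsigned long long)page_idx);
	return 0;
}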
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 12:55 ` Matthew Wilcox
@ 2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
1 sibling, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-07-29 13:18 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chuanhua Han, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

On Tue, Jul 30, 2024 at 12:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -				vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > >  		page = &folio->page;
> > >
> > > This is no longer correct.  You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio.  It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug.  I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> > This is not a problem now: we swap large folios in as a whole, so the head
> > page is used here instead of the page that is being faulted. You can also
> > see from the current code context that swapping in a whole large folio is
> > not the same as the previous behaviour, which only swapped in a single
> > small page.
>
> You have completely failed to understand the problem.  Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page!  That looks like data corruption.
>
> The code in
>	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary?  mk_pte() will now be called on the wrong page.

Chuanhua understood everything correctly. I think you might have missed that
we have very strict checks both before allocating large folios and before
mapping them for this newly allocated mTHP swap-in case.

To allocate a large folio, we check all alignment requirements: the PTEs hold
aligned swap offsets and are all physically contiguous, which is how the mTHP
was swapped out. If an mTHP has been mremap()ed to an unaligned address, we
won't swap it in as an mTHP, for two reasons: 1. we have no way to figure out
the start address of the previous mTHP in the non-swapcache case; 2. mremap()
to unaligned addresses is rare.

To map a large folio, we check that all PTEs are still there by
double-confirming that can_swapin_thp() is true. If the PTEs have changed,
this is a "goto out_nomap" case.

	/* allocated large folios for SWP_SYNCHRONOUS_IO */
	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
		unsigned long nr = folio_nr_pages(folio);
		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
		pte_t *folio_ptep = vmf->pte - idx;

		if (!can_swapin_thp(vmf, folio_ptep, nr))
			goto out_nomap;

		page_idx = idx;
		address = folio_start;
		ptep = folio_ptep;
		goto check_folio;
	}

Thanks
Barry

^ permalink raw reply	[flat|nested] 59+ messages in thread
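Barry's "very strict checks" before mapping boil down to re-validating, under the PT lock, that all nr PTEs still encode the same run of contiguous swap entries the folio was read from; the kernel does this via can_swapin_thp() and swap_pte_batch(). The sketch below is only a conceptual userspace model of that property, with swap entries reduced to bare offsets and an invented still_contiguous() helper, not the actual kernel functions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for a PTE: either "not a swap entry" or a swap offset. */
struct fake_pte {
	bool	 is_swap;
	uint64_t swap_off;
};

/*
 * Check that the nr entries starting at @ptes are still swap PTEs whose
 * offsets run base, base + 1, ..., base + nr - 1.  If anything changed
 * since the folio was allocated, the caller must fall back (out_nomap).
 */
static bool still_contiguous(const struct fake_pte *ptes, unsigned int nr,
			     uint64_t base)
{
	for (unsigned int i = 0; i < nr; i++)
		if (!ptes[i].is_swap || ptes[i].swap_off != base + i)
			return false;
	return true;
}

int main(void)
{
	struct fake_pte ptes[4] = {
		{ true, 0x100 }, { true, 0x101 }, { true, 0x102 }, { true, 0x103 },
	};

	printf("intact:  %d\n", still_contiguous(ptes, 4, 0x100));

	/* A racing fault mapped the third page: the whole batch is refused. */
	ptes[2].is_swap = false;
	printf("changed: %d\n", still_contiguous(ptes, 4, 0x100));
	return 0;
}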
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
@ 2024-07-29 13:32 ` Chuanhua Han
1 sibling, 0 replies; 59+ messages in thread
From: Chuanhua Han @ 2024-07-29 13:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl,
david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko,
minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, xiang, yosryahmed, Chuanhua Han

Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 20:55:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -				vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > >  		page = &folio->page;
> > >
> > > This is no longer correct.  You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio.  It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug.  I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> > This is not a problem now: we swap large folios in as a whole, so the head
> > page is used here instead of the page that is being faulted. You can also
> > see from the current code context that swapping in a whole large folio is
> > not the same as the previous behaviour, which only swapped in a single
> > small page.
>
> You have completely failed to understand the problem.  Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page!  That looks like data corruption.

The user does not get the wrong data, because we map the whole folio: for a
16KiB folio, we map all 16KiB through the page table.

>
> The code in
>	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary?  mk_pte() will now be called on the wrong page.

These special cases have been handled in our patch. For an mTHP large folio,
mk_pte() uses the head page to construct the PTE.

--
Thanks,
Chuanhua

^ permalink raw reply	[flat|nested] 59+ messages in thread
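Chuanhua's point is that when the whole folio is mapped in one go, the PTE built from the head page is only the starting value: set_ptes() installs nr consecutive entries, advancing the PFN by one per page (as with pte_advance_pfn() in the hunk quoted later), so the faulting virtual address still ends up pointing at its own physical page. Below is a simplified userspace model of that, with PTEs reduced to bare PFNs and set_ptes_model() an invented name; it is a sketch of the idea, not the kernel code.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define NR_PAGES	4			/* 16KiB folio */

/*
 * Model of set_ptes(): install @nr entries starting from the head page's
 * PFN, bumping the PFN for each successive virtual page.
 */
static void set_ptes_model(uint64_t *ptes, uint64_t head_pfn, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		ptes[i] = head_pfn + i;	/* like pte_advance_pfn() */
}

int main(void)
{
	uint64_t ptes[NR_PAGES];
	uint64_t folio_start = 0x123456788000ULL;
	uint64_t fault_addr  = 0x123456789000ULL;
	uint64_t head_pfn    = 0x80000;		/* hypothetical PFN of the head page */
	unsigned int idx;

	set_ptes_model(ptes, head_pfn, NR_PAGES);

	idx = (fault_addr - folio_start) >> PAGE_SHIFT;
	printf("PFN mapped at the faulting address: %#llx (head + %u)\n",
	       (unsigned long long)ptes[idx], idx);
	return 0;
}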
* Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song 2024-07-29 3:51 ` Matthew Wilcox @ 2024-07-29 14:16 ` Dan Carpenter 1 sibling, 0 replies; 59+ messages in thread From: Dan Carpenter @ 2024-07-29 14:16 UTC (permalink / raw) To: oe-kbuild, Barry Song, akpm, linux-mm Cc: lkp, oe-kbuild-all, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed, Chuanhua Han Hi Barry, kernel test robot noticed the following build warnings: url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20240726094618.401593-4-21cnbao%40gmail.com patch subject: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile config: i386-randconfig-141-20240727 (https://download.01.org/0day-ci/archive/20240727/202407270917.18F5rYPH-lkp@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Reported-by: Dan Carpenter <dan.carpenter@linaro.org> | Closes: https://lore.kernel.org/r/202407270917.18F5rYPH-lkp@intel.com/ smatch warnings: mm/memory.c:4467 do_swap_page() error: uninitialized symbol 'nr_pages'. 
vim +/nr_pages +4467 mm/memory.c 2b7403035459c7 Souptick Joarder 2018-08-23 4143 vm_fault_t do_swap_page(struct vm_fault *vmf) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4144 { 82b0f8c39a3869 Jan Kara 2016-12-14 4145 struct vm_area_struct *vma = vmf->vma; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4146) struct folio *swapcache, *folio = NULL; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4147) struct page *page; 2799e77529c2a2 Miaohe Lin 2021-06-28 4148 struct swap_info_struct *si = NULL; 14f9135d547060 David Hildenbrand 2022-05-09 4149 rmap_t rmap_flags = RMAP_NONE; 13ddaf26be324a Kairui Song 2024-02-07 4150 bool need_clear_cache = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4151 bool exclusive = false; 65500d234e74fc Hugh Dickins 2005-10-29 4152 swp_entry_t entry; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4153 pte_t pte; 2b7403035459c7 Souptick Joarder 2018-08-23 4154 vm_fault_t ret = 0; aae466b0052e18 Joonsoo Kim 2020-08-11 4155 void *shadow = NULL; 508758960b8d89 Chuanhua Han 2024-05-29 4156 int nr_pages; 508758960b8d89 Chuanhua Han 2024-05-29 4157 unsigned long page_idx; 508758960b8d89 Chuanhua Han 2024-05-29 4158 unsigned long address; 508758960b8d89 Chuanhua Han 2024-05-29 4159 pte_t *ptep; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4160 2ca99358671ad3 Peter Xu 2021-11-05 4161 if (!pte_unmap_same(vmf)) 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4162 goto out; 65500d234e74fc Hugh Dickins 2005-10-29 4163 2994302bc8a171 Jan Kara 2016-12-14 4164 entry = pte_to_swp_entry(vmf->orig_pte); d1737fdbec7f90 Andi Kleen 2009-09-16 4165 if (unlikely(non_swap_entry(entry))) { 0697212a411c1d Christoph Lameter 2006-06-23 4166 if (is_migration_entry(entry)) { 82b0f8c39a3869 Jan Kara 2016-12-14 4167 migration_entry_wait(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4168 vmf->address); b756a3b5e7ead8 Alistair Popple 2021-06-30 4169 } else if (is_device_exclusive_entry(entry)) { b756a3b5e7ead8 Alistair Popple 2021-06-30 4170 vmf->page = pfn_swap_entry_to_page(entry); b756a3b5e7ead8 Alistair Popple 2021-06-30 4171 ret = remove_device_exclusive_entry(vmf); 5042db43cc26f5 Jérôme Glisse 2017-09-08 4172 } else if (is_device_private_entry(entry)) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4173 if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4174 /* 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4175 * migrate_to_ram is not yet ready to operate 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4176 * under VMA lock. 
1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4177 */ 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4178 vma_end_read(vma); 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4179 ret = VM_FAULT_RETRY; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4180 goto out; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4181 } 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4182 af5cdaf82238fb Alistair Popple 2021-06-30 4183 vmf->page = pfn_swap_entry_to_page(entry); 16ce101db85db6 Alistair Popple 2022-09-28 4184 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 16ce101db85db6 Alistair Popple 2022-09-28 4185 vmf->address, &vmf->ptl); 3db82b9374ca92 Hugh Dickins 2023-06-08 4186 if (unlikely(!vmf->pte || c33c794828f212 Ryan Roberts 2023-06-12 4187 !pte_same(ptep_get(vmf->pte), c33c794828f212 Ryan Roberts 2023-06-12 4188 vmf->orig_pte))) 3b65f437d9e8dd Ryan Roberts 2023-06-02 4189 goto unlock; 16ce101db85db6 Alistair Popple 2022-09-28 4190 16ce101db85db6 Alistair Popple 2022-09-28 4191 /* 16ce101db85db6 Alistair Popple 2022-09-28 4192 * Get a page reference while we know the page can't be 16ce101db85db6 Alistair Popple 2022-09-28 4193 * freed. 16ce101db85db6 Alistair Popple 2022-09-28 4194 */ 16ce101db85db6 Alistair Popple 2022-09-28 4195 get_page(vmf->page); 16ce101db85db6 Alistair Popple 2022-09-28 4196 pte_unmap_unlock(vmf->pte, vmf->ptl); 4a955bed882e73 Alistair Popple 2022-11-14 4197 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); 16ce101db85db6 Alistair Popple 2022-09-28 4198 put_page(vmf->page); d1737fdbec7f90 Andi Kleen 2009-09-16 4199 } else if (is_hwpoison_entry(entry)) { d1737fdbec7f90 Andi Kleen 2009-09-16 4200 ret = VM_FAULT_HWPOISON; 5c041f5d1f23d3 Peter Xu 2022-05-12 4201 } else if (is_pte_marker_entry(entry)) { 5c041f5d1f23d3 Peter Xu 2022-05-12 4202 ret = handle_pte_marker(vmf); d1737fdbec7f90 Andi Kleen 2009-09-16 4203 } else { 2994302bc8a171 Jan Kara 2016-12-14 4204 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); d99be1a8ecf377 Hugh Dickins 2009-12-14 4205 ret = VM_FAULT_SIGBUS; d1737fdbec7f90 Andi Kleen 2009-09-16 4206 } 0697212a411c1d Christoph Lameter 2006-06-23 4207 goto out; 0697212a411c1d Christoph Lameter 2006-06-23 4208 } 0bcac06f27d752 Minchan Kim 2017-11-15 4209 2799e77529c2a2 Miaohe Lin 2021-06-28 4210 /* Prevent swapoff from happening to us. 
*/ 2799e77529c2a2 Miaohe Lin 2021-06-28 4211 si = get_swap_device(entry); 2799e77529c2a2 Miaohe Lin 2021-06-28 4212 if (unlikely(!si)) 2799e77529c2a2 Miaohe Lin 2021-06-28 4213 goto out; 0bcac06f27d752 Minchan Kim 2017-11-15 4214 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4215) folio = swap_cache_get_folio(entry, vma, vmf->address); 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4216) if (folio) 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4217) page = folio_file_page(folio, swp_offset(entry)); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4218) swapcache = folio; f80207727aaca3 Minchan Kim 2018-01-18 4219 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4220) if (!folio) { a449bf58e45abf Qian Cai 2020-08-14 4221 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && eb085574a7526c Huang Ying 2019-07-11 4222 __swap_count(entry) == 1) { 684d098daf0b3a Chuanhua Han 2024-07-26 4223 /* skip swapcache */ 684d098daf0b3a Chuanhua Han 2024-07-26 4224 folio = alloc_swap_folio(vmf); 684d098daf0b3a Chuanhua Han 2024-07-26 4225 page = &folio->page; 684d098daf0b3a Chuanhua Han 2024-07-26 4226 if (folio) { 684d098daf0b3a Chuanhua Han 2024-07-26 4227 __folio_set_locked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4228 __folio_set_swapbacked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4229 684d098daf0b3a Chuanhua Han 2024-07-26 4230 nr_pages = folio_nr_pages(folio); nr_pages is initialized here 684d098daf0b3a Chuanhua Han 2024-07-26 4231 if (folio_test_large(folio)) 684d098daf0b3a Chuanhua Han 2024-07-26 4232 entry.val = ALIGN_DOWN(entry.val, nr_pages); 13ddaf26be324a Kairui Song 2024-02-07 4233 /* 13ddaf26be324a Kairui Song 2024-02-07 4234 * Prevent parallel swapin from proceeding with 13ddaf26be324a Kairui Song 2024-02-07 4235 * the cache flag. Otherwise, another thread may 13ddaf26be324a Kairui Song 2024-02-07 4236 * finish swapin first, free the entry, and swapout 13ddaf26be324a Kairui Song 2024-02-07 4237 * reusing the same entry. It's undetectable as 13ddaf26be324a Kairui Song 2024-02-07 4238 * pte_same() returns true due to entry reuse. 
13ddaf26be324a Kairui Song 2024-02-07 4239 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4240 if (swapcache_prepare_nr(entry, nr_pages)) { 13ddaf26be324a Kairui Song 2024-02-07 4241 /* Relax a bit to prevent rapid repeated page faults */ 13ddaf26be324a Kairui Song 2024-02-07 4242 schedule_timeout_uninterruptible(1); 684d098daf0b3a Chuanhua Han 2024-07-26 4243 goto out_page; 13ddaf26be324a Kairui Song 2024-02-07 4244 } 13ddaf26be324a Kairui Song 2024-02-07 4245 need_clear_cache = true; 13ddaf26be324a Kairui Song 2024-02-07 4246 6599591816f522 Matthew Wilcox (Oracle 2022-09-02 4247) if (mem_cgroup_swapin_charge_folio(folio, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4248) vma->vm_mm, GFP_KERNEL, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4249) entry)) { 545b1b077ca6b3 Michal Hocko 2020-06-25 4250 ret = VM_FAULT_OOM; 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4251 goto out_page; 545b1b077ca6b3 Michal Hocko 2020-06-25 4252 } 684d098daf0b3a Chuanhua Han 2024-07-26 4253 mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages); 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4254 aae466b0052e18 Joonsoo Kim 2020-08-11 4255 shadow = get_shadow_from_swap_cache(entry); aae466b0052e18 Joonsoo Kim 2020-08-11 4256 if (shadow) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4257) workingset_refault(folio, shadow); 0076f029cb2906 Joonsoo Kim 2020-06-25 4258 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4259) folio_add_lru(folio); 0add0c77a9bd0c Shakeel Butt 2021-04-29 4260 c9bdf768dd9319 Matthew Wilcox (Oracle 2023-12-13 4261) /* To provide entry to swap_read_folio() */ 3d2c9087688777 David Hildenbrand 2023-08-21 4262 folio->swap = entry; b2d1f38b524121 Yosry Ahmed 2024-06-07 4263 swap_read_folio(folio, NULL); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4264) folio->private = NULL; 0bcac06f27d752 Minchan Kim 2017-11-15 4265 } aa8d22a11da933 Minchan Kim 2017-11-15 4266 } else { e9e9b7ecee4a13 Minchan Kim 2018-04-05 4267 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, e9e9b7ecee4a13 Minchan Kim 2018-04-05 4268 vmf); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4269) if (page) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4270) folio = page_folio(page); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4271) swapcache = folio; 0bcac06f27d752 Minchan Kim 2017-11-15 4272 } 0bcac06f27d752 Minchan Kim 2017-11-15 4273 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4274) if (!folio) { ^1da177e4c3f41 Linus Torvalds 2005-04-16 4275 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4276 * Back out if somebody else faulted in this pte 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4277 * while we released the pte lock. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4278 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4279 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4280 vmf->address, &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4281 if (likely(vmf->pte && c33c794828f212 Ryan Roberts 2023-06-12 4282 pte_same(ptep_get(vmf->pte), vmf->orig_pte))) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4283 ret = VM_FAULT_OOM; 65500d234e74fc Hugh Dickins 2005-10-29 4284 goto unlock; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4285 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4286 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4287 /* Had to read the page from swap area: Major fault */ ^1da177e4c3f41 Linus Torvalds 2005-04-16 4288 ret = VM_FAULT_MAJOR; f8891e5e1f93a1 Christoph Lameter 2006-06-30 4289 count_vm_event(PGMAJFAULT); 2262185c5b287f Roman Gushchin 2017-07-06 4290 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); d1737fdbec7f90 Andi Kleen 2009-09-16 4291 } else if (PageHWPoison(page)) { 71f72525dfaaec Wu Fengguang 2009-12-16 4292 /* 71f72525dfaaec Wu Fengguang 2009-12-16 4293 * hwpoisoned dirty swapcache pages are kept for killing 71f72525dfaaec Wu Fengguang 2009-12-16 4294 * owner processes (which may be unknown at hwpoison time) 71f72525dfaaec Wu Fengguang 2009-12-16 4295 */ d1737fdbec7f90 Andi Kleen 2009-09-16 4296 ret = VM_FAULT_HWPOISON; 4779cb31c0ee3b Andi Kleen 2009-10-14 4297 goto out_release; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4298 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4299 fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4300 ret |= folio_lock_or_retry(folio, vmf); fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4301 if (ret & VM_FAULT_RETRY) d065bd810b6deb Michel Lespinasse 2010-10-26 4302 goto out_release; 073e587ec2cc37 KAMEZAWA Hiroyuki 2008-10-18 4303 84d60fdd3733fb David Hildenbrand 2022-03-24 4304 if (swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4305 /* 3b344157c0c15b Matthew Wilcox (Oracle 2022-09-02 4306) * Make sure folio_free_swap() or swapoff did not release the 84d60fdd3733fb David Hildenbrand 2022-03-24 4307 * swapcache from under us. The page pin, and pte_same test 84d60fdd3733fb David Hildenbrand 2022-03-24 4308 * below, are not enough to exclude that. Even if it is still 84d60fdd3733fb David Hildenbrand 2022-03-24 4309 * swapcache, we need to check that the page's swap has not 84d60fdd3733fb David Hildenbrand 2022-03-24 4310 * changed. 4969c1192d15af Andrea Arcangeli 2010-09-09 4311 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4312) if (unlikely(!folio_test_swapcache(folio) || cfeed8ffe55b37 David Hildenbrand 2023-08-21 4313 page_swap_entry(page).val != entry.val)) 4969c1192d15af Andrea Arcangeli 2010-09-09 4314 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4315 84d60fdd3733fb David Hildenbrand 2022-03-24 4316 /* 84d60fdd3733fb David Hildenbrand 2022-03-24 4317 * KSM sometimes has to copy on read faults, for example, if 84d60fdd3733fb David Hildenbrand 2022-03-24 4318 * page->index of !PageKSM() pages would be nonlinear inside the 84d60fdd3733fb David Hildenbrand 2022-03-24 4319 * anon VMA -- PageKSM() is lost on actual swapout. 
84d60fdd3733fb David Hildenbrand 2022-03-24 4320 */ 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4321) folio = ksm_might_need_to_copy(folio, vma, vmf->address); 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4322) if (unlikely(!folio)) { 5ad6468801d28c Hugh Dickins 2009-12-14 4323 ret = VM_FAULT_OOM; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4324) folio = swapcache; 4969c1192d15af Andrea Arcangeli 2010-09-09 4325 goto out_page; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4326) } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) { 6b970599e807ea Kefeng Wang 2022-12-09 4327 ret = VM_FAULT_HWPOISON; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4328) folio = swapcache; 6b970599e807ea Kefeng Wang 2022-12-09 4329 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4330 } 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4331) if (folio != swapcache) 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4332) page = folio_page(folio, 0); c145e0b47c77eb David Hildenbrand 2022-03-24 4333 c145e0b47c77eb David Hildenbrand 2022-03-24 4334 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4335 * If we want to map a page that's in the swapcache writable, we c145e0b47c77eb David Hildenbrand 2022-03-24 4336 * have to detect via the refcount if we're really the exclusive c145e0b47c77eb David Hildenbrand 2022-03-24 4337 * owner. Try removing the extra reference from the local LRU 1fec6890bf2247 Matthew Wilcox (Oracle 2023-06-21 4338) * caches if required. c145e0b47c77eb David Hildenbrand 2022-03-24 4339 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4340) if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4341) !folio_test_ksm(folio) && !folio_test_lru(folio)) c145e0b47c77eb David Hildenbrand 2022-03-24 4342 lru_add_drain(); 84d60fdd3733fb David Hildenbrand 2022-03-24 4343 } 5ad6468801d28c Hugh Dickins 2009-12-14 4344 4231f8425833b1 Kefeng Wang 2023-03-02 4345 folio_throttle_swaprate(folio, GFP_KERNEL); 8a9f3ccd24741b Balbir Singh 2008-02-07 4346 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4347 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4348 * Back out if somebody else already faulted in this pte. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4349 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4350 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, 82b0f8c39a3869 Jan Kara 2016-12-14 4351 &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4352 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) b81074800b98ac Kirill Korotaev 2005-05-16 4353 goto out_nomap; b81074800b98ac Kirill Korotaev 2005-05-16 4354 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4355) if (unlikely(!folio_test_uptodate(folio))) { b81074800b98ac Kirill Korotaev 2005-05-16 4356 ret = VM_FAULT_SIGBUS; b81074800b98ac Kirill Korotaev 2005-05-16 4357 goto out_nomap; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4358 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4359 684d098daf0b3a Chuanhua Han 2024-07-26 4360 /* allocated large folios for SWP_SYNCHRONOUS_IO */ 684d098daf0b3a Chuanhua Han 2024-07-26 4361 if (folio_test_large(folio) && !folio_test_swapcache(folio)) { 684d098daf0b3a Chuanhua Han 2024-07-26 4362 unsigned long nr = folio_nr_pages(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4363 unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); 684d098daf0b3a Chuanhua Han 2024-07-26 4364 unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; 684d098daf0b3a Chuanhua Han 2024-07-26 4365 pte_t *folio_ptep = vmf->pte - idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4366 684d098daf0b3a Chuanhua Han 2024-07-26 4367 if (!can_swapin_thp(vmf, folio_ptep, nr)) 684d098daf0b3a Chuanhua Han 2024-07-26 4368 goto out_nomap; 684d098daf0b3a Chuanhua Han 2024-07-26 4369 684d098daf0b3a Chuanhua Han 2024-07-26 4370 page_idx = idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4371 address = folio_start; 684d098daf0b3a Chuanhua Han 2024-07-26 4372 ptep = folio_ptep; 684d098daf0b3a Chuanhua Han 2024-07-26 4373 goto check_folio; Let's say we hit this goto 684d098daf0b3a Chuanhua Han 2024-07-26 4374 } 684d098daf0b3a Chuanhua Han 2024-07-26 4375 508758960b8d89 Chuanhua Han 2024-05-29 4376 nr_pages = 1; 508758960b8d89 Chuanhua Han 2024-05-29 4377 page_idx = 0; 508758960b8d89 Chuanhua Han 2024-05-29 4378 address = vmf->address; 508758960b8d89 Chuanhua Han 2024-05-29 4379 ptep = vmf->pte; 508758960b8d89 Chuanhua Han 2024-05-29 4380 if (folio_test_large(folio) && folio_test_swapcache(folio)) { 508758960b8d89 Chuanhua Han 2024-05-29 4381 int nr = folio_nr_pages(folio); 508758960b8d89 Chuanhua Han 2024-05-29 4382 unsigned long idx = folio_page_idx(folio, page); 508758960b8d89 Chuanhua Han 2024-05-29 4383 unsigned long folio_start = address - idx * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4384 unsigned long folio_end = folio_start + nr * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4385 pte_t *folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4386 pte_t folio_pte; 508758960b8d89 Chuanhua Han 2024-05-29 4387 508758960b8d89 Chuanhua Han 2024-05-29 4388 if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) 508758960b8d89 Chuanhua Han 2024-05-29 4389 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4390 if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) 508758960b8d89 Chuanhua Han 2024-05-29 4391 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4392 508758960b8d89 Chuanhua Han 2024-05-29 4393 folio_ptep = vmf->pte - idx; 508758960b8d89 Chuanhua Han 2024-05-29 4394 folio_pte = ptep_get(folio_ptep); 508758960b8d89 Chuanhua Han 2024-05-29 4395 if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || 508758960b8d89 Chuanhua Han 2024-05-29 4396 
swap_pte_batch(folio_ptep, nr, folio_pte) != nr) 508758960b8d89 Chuanhua Han 2024-05-29 4397 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4398 508758960b8d89 Chuanhua Han 2024-05-29 4399 page_idx = idx; 508758960b8d89 Chuanhua Han 2024-05-29 4400 address = folio_start; 508758960b8d89 Chuanhua Han 2024-05-29 4401 ptep = folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4402 nr_pages = nr; 508758960b8d89 Chuanhua Han 2024-05-29 4403 entry = folio->swap; 508758960b8d89 Chuanhua Han 2024-05-29 4404 page = &folio->page; 508758960b8d89 Chuanhua Han 2024-05-29 4405 } 508758960b8d89 Chuanhua Han 2024-05-29 4406 508758960b8d89 Chuanhua Han 2024-05-29 4407 check_folio: 78fbe906cc900b David Hildenbrand 2022-05-09 4408 /* 78fbe906cc900b David Hildenbrand 2022-05-09 4409 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte 78fbe906cc900b David Hildenbrand 2022-05-09 4410 * must never point at an anonymous page in the swapcache that is 78fbe906cc900b David Hildenbrand 2022-05-09 4411 * PG_anon_exclusive. Sanity check that this holds and especially, that 78fbe906cc900b David Hildenbrand 2022-05-09 4412 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity 78fbe906cc900b David Hildenbrand 2022-05-09 4413 * check after taking the PT lock and making sure that nobody 78fbe906cc900b David Hildenbrand 2022-05-09 4414 * concurrently faulted in this page and set PG_anon_exclusive. 78fbe906cc900b David Hildenbrand 2022-05-09 4415 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4416) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4417) BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); 78fbe906cc900b David Hildenbrand 2022-05-09 4418 1493a1913e34b0 David Hildenbrand 2022-05-09 4419 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4420 * Check under PT lock (to protect against concurrent fork() sharing 1493a1913e34b0 David Hildenbrand 2022-05-09 4421 * the swap entry concurrently) for certainly exclusive pages. 1493a1913e34b0 David Hildenbrand 2022-05-09 4422 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4423) if (!folio_test_ksm(folio)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4424 exclusive = pte_swp_exclusive(vmf->orig_pte); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4425) if (folio != swapcache) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4426 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4427 * We have a fresh page that is not exposed to the 1493a1913e34b0 David Hildenbrand 2022-05-09 4428 * swapcache -> certainly exclusive. 1493a1913e34b0 David Hildenbrand 2022-05-09 4429 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4430 exclusive = true; 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4431) } else if (exclusive && folio_test_writeback(folio) && eacde32757c756 Miaohe Lin 2022-05-19 4432 data_race(si->flags & SWP_STABLE_WRITES)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4433 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4434 * This is tricky: not all swap backends support 1493a1913e34b0 David Hildenbrand 2022-05-09 4435 * concurrent page modifications while under writeback. 
1493a1913e34b0 David Hildenbrand 2022-05-09 4436 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4437 * So if we stumble over such a page in the swapcache 1493a1913e34b0 David Hildenbrand 2022-05-09 4438 * we must not set the page exclusive, otherwise we can 1493a1913e34b0 David Hildenbrand 2022-05-09 4439 * map it writable without further checks and modify it 1493a1913e34b0 David Hildenbrand 2022-05-09 4440 * while still under writeback. 1493a1913e34b0 David Hildenbrand 2022-05-09 4441 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4442 * For these problematic swap backends, simply drop the 1493a1913e34b0 David Hildenbrand 2022-05-09 4443 * exclusive marker: this is perfectly fine as we start 1493a1913e34b0 David Hildenbrand 2022-05-09 4444 * writeback only if we fully unmapped the page and 1493a1913e34b0 David Hildenbrand 2022-05-09 4445 * there are no unexpected references on the page after 1493a1913e34b0 David Hildenbrand 2022-05-09 4446 * unmapping succeeded. After fully unmapped, no 1493a1913e34b0 David Hildenbrand 2022-05-09 4447 * further GUP references (FOLL_GET and FOLL_PIN) can 1493a1913e34b0 David Hildenbrand 2022-05-09 4448 * appear, so dropping the exclusive marker and mapping 1493a1913e34b0 David Hildenbrand 2022-05-09 4449 * it only R/O is fine. 1493a1913e34b0 David Hildenbrand 2022-05-09 4450 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4451 exclusive = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4452 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4453 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4454 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4455 /* 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4456 * Some architectures may have to restore extra metadata to the page 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4457 * when reading from swap. This metadata may be indexed by swap entry 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4458 * so this must be called before swap_free(). 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4459 */ f238b8c33c6738 Barry Song 2024-03-23 4460 arch_swap_restore(folio_swap(entry, folio), folio); 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4461 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4462 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4463 * Remove the swap entry and conditionally try to free up the swapcache. c145e0b47c77eb David Hildenbrand 2022-03-24 4464 * We're already holding a reference on the page but haven't mapped it c145e0b47c77eb David Hildenbrand 2022-03-24 4465 * yet. 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4466 */ 508758960b8d89 Chuanhua Han 2024-05-29 @4467 swap_free_nr(entry, nr_pages); ^^^^^^^^ Smatch warning. The code is a bit complicated so it could be a false positive. 
a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4468) if (should_try_to_free_swap(folio, vma, vmf->flags)) a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4469) folio_free_swap(folio); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4470 508758960b8d89 Chuanhua Han 2024-05-29 4471 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4472 add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4473 pte = mk_pte(page, vma->vm_page_prot); c18160dba5ff63 Barry Song 2024-06-02 4474 if (pte_swp_soft_dirty(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4475 pte = pte_mksoft_dirty(pte); c18160dba5ff63 Barry Song 2024-06-02 4476 if (pte_swp_uffd_wp(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4477 pte = pte_mkuffd_wp(pte); c145e0b47c77eb David Hildenbrand 2022-03-24 4478 c145e0b47c77eb David Hildenbrand 2022-03-24 4479 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4480 * Same logic as in do_wp_page(); however, optimize for pages that are 1493a1913e34b0 David Hildenbrand 2022-05-09 4481 * certainly not shared either because we just allocated them without 1493a1913e34b0 David Hildenbrand 2022-05-09 4482 * exposing them to the swapcache or because the swap entry indicates 1493a1913e34b0 David Hildenbrand 2022-05-09 4483 * exclusivity. c145e0b47c77eb David Hildenbrand 2022-03-24 4484 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4485) if (!folio_test_ksm(folio) && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4486) (exclusive || folio_ref_count(folio) == 1)) { c18160dba5ff63 Barry Song 2024-06-02 4487 if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && 20dfa5b7adc5a1 Barry Song 2024-06-08 4488 !pte_needs_soft_dirty_wp(vma, pte)) { c18160dba5ff63 Barry Song 2024-06-02 4489 pte = pte_mkwrite(pte, vma); 6c287605fd5646 David Hildenbrand 2022-05-09 4490 if (vmf->flags & FAULT_FLAG_WRITE) { c18160dba5ff63 Barry Song 2024-06-02 4491 pte = pte_mkdirty(pte); 82b0f8c39a3869 Jan Kara 2016-12-14 4492 vmf->flags &= ~FAULT_FLAG_WRITE; 6c287605fd5646 David Hildenbrand 2022-05-09 4493 } c18160dba5ff63 Barry Song 2024-06-02 4494 } 14f9135d547060 David Hildenbrand 2022-05-09 4495 rmap_flags |= RMAP_EXCLUSIVE; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4496 } 508758960b8d89 Chuanhua Han 2024-05-29 4497 folio_ref_add(folio, nr_pages - 1); 508758960b8d89 Chuanhua Han 2024-05-29 4498 flush_icache_pages(vma, page, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4499 vmf->orig_pte = pte_advance_pfn(pte, page_idx); 0bcac06f27d752 Minchan Kim 2017-11-15 4500 0bcac06f27d752 Minchan Kim 2017-11-15 4501 /* ksm created a completely new copy */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4502) if (unlikely(folio != swapcache && swapcache)) { 15bde4abab734c Barry Song 2024-06-18 4503 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4504) folio_add_lru_vma(folio, vma); 9ae2feacedde16 Barry Song 2024-06-18 4505 } else if (!folio_test_anon(folio)) { 9ae2feacedde16 Barry Song 2024-06-18 4506 /* 684d098daf0b3a Chuanhua Han 2024-07-26 4507 * We currently only expect small !anon folios which are either 684d098daf0b3a Chuanhua Han 2024-07-26 4508 * fully exclusive or fully shared, or new allocated large folios 684d098daf0b3a Chuanhua Han 2024-07-26 4509 * which are fully exclusive. If we ever get large folios within 684d098daf0b3a Chuanhua Han 2024-07-26 4510 * swapcache here, we have to be careful. 
9ae2feacedde16 Barry Song 2024-06-18 4511 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4512 VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); 9ae2feacedde16 Barry Song 2024-06-18 4513 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 9ae2feacedde16 Barry Song 2024-06-18 4514 folio_add_new_anon_rmap(folio, vma, address, rmap_flags); 0bcac06f27d752 Minchan Kim 2017-11-15 4515 } else { 508758960b8d89 Chuanhua Han 2024-05-29 4516 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, b832a354d787bf David Hildenbrand 2023-12-20 4517 rmap_flags); 00501b531c4723 Johannes Weiner 2014-08-08 4518 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4519 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4520) VM_BUG_ON(!folio_test_anon(folio) || 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4521) (pte_write(pte) && !PageAnonExclusive(page))); 508758960b8d89 Chuanhua Han 2024-05-29 4522 set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4523 arch_do_swap_page_nr(vma->vm_mm, vma, address, 508758960b8d89 Chuanhua Han 2024-05-29 4524 pte, pte, nr_pages); 1eba86c096e35e Pasha Tatashin 2022-01-14 4525 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4526) folio_unlock(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4527) if (folio != swapcache && swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4528 /* 4969c1192d15af Andrea Arcangeli 2010-09-09 4529 * Hold the lock to avoid the swap entry to be reused 4969c1192d15af Andrea Arcangeli 2010-09-09 4530 * until we take the PT lock for the pte_same() check 4969c1192d15af Andrea Arcangeli 2010-09-09 4531 * (to avoid false positives from pte_same). For 4969c1192d15af Andrea Arcangeli 2010-09-09 4532 * further safety release the lock after the swap_free 4969c1192d15af Andrea Arcangeli 2010-09-09 4533 * so that the swap count won't change under a 4969c1192d15af Andrea Arcangeli 2010-09-09 4534 * parallel locked swapcache. 
4969c1192d15af Andrea Arcangeli 2010-09-09 4535 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4536) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4537) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4538 } c475a8ab625d56 Hugh Dickins 2005-06-21 4539 82b0f8c39a3869 Jan Kara 2016-12-14 4540 if (vmf->flags & FAULT_FLAG_WRITE) { 2994302bc8a171 Jan Kara 2016-12-14 4541 ret |= do_wp_page(vmf); 61469f1d51777f Hugh Dickins 2008-03-04 4542 if (ret & VM_FAULT_ERROR) 61469f1d51777f Hugh Dickins 2008-03-04 4543 ret &= VM_FAULT_ERROR; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4544 goto out; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4545 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4546 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4547 /* No need to invalidate - it was non-present before */ 508758960b8d89 Chuanhua Han 2024-05-29 4548 update_mmu_cache_range(vmf, vma, address, ptep, nr_pages); 65500d234e74fc Hugh Dickins 2005-10-29 4549 unlock: 3db82b9374ca92 Hugh Dickins 2023-06-08 4550 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4551 pte_unmap_unlock(vmf->pte, vmf->ptl); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4552 out: 13ddaf26be324a Kairui Song 2024-02-07 4553 /* Clear the swap cache pin for direct swapin after PTL unlock */ 13ddaf26be324a Kairui Song 2024-02-07 4554 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4555 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4556 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4557 put_swap_device(si); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4558 return ret; b81074800b98ac Kirill Korotaev 2005-05-16 4559 out_nomap: 3db82b9374ca92 Hugh Dickins 2023-06-08 4560 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4561 pte_unmap_unlock(vmf->pte, vmf->ptl); bc43f75cd98158 Johannes Weiner 2009-04-30 4562 out_page: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4563) folio_unlock(folio); 4779cb31c0ee3b Andi Kleen 2009-10-14 4564 out_release: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4565) folio_put(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4566) if (folio != swapcache && swapcache) { d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4567) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4568) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4569 } 13ddaf26be324a Kairui Song 2024-02-07 4570 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4571 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4572 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4573 put_swap_device(si); 65500d234e74fc Hugh Dickins 2005-10-29 4574 return ret; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4575 } -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 59+ messages in thread
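Stripped of the surrounding detail, the control-flow shape smatch is complaining about looks like the fragment below: nr_pages is assigned inside a conditional branch, a later goto can jump over the unconditional nr_pages = 1, and the tool cannot prove that the two conditions always pair up. This is only a minimal reproduction of the warning pattern with invented names; it makes no claim about whether the flagged kernel path is actually reachable.

#include <stdio.h>

/* Minimal shape of the do_swap_page() flow that trips the checker. */
static int model(int skip_swapcache, int large)
{
	int nr_pages;			/* deliberately not initialised */

	if (skip_swapcache)
		nr_pages = 4;		/* like folio_nr_pages() in the alloc path */

	if (large)
		goto check;		/* jumps over the nr_pages = 1 below */

	nr_pages = 1;
check:
	/*
	 * A static checker that cannot prove "large implies skip_swapcache"
	 * reports nr_pages as possibly uninitialised here.
	 */
	return nr_pages;
}

int main(int argc, char **argv)
{
	(void)argv;
	printf("%d\n", model(argc > 1, argc > 2));
	return 0;
}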
* [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song ` (2 preceding siblings ...) 2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song @ 2024-07-26 9:46 ` Barry Song 2024-07-27 5:58 ` kernel test robot 2024-07-29 3:52 ` Matthew Wilcox 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-07-26 9:46 UTC (permalink / raw) To: akpm, linux-mm Cc: ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed From: Barry Song <v-songbaohua@oppo.com> Quote Ying's comment: A user space interface can be implemented to select different swap-in order policies, similar to the mTHP allocation order policy. We need a distinct policy because the performance characteristics of memory allocation differ significantly from those of swap-in. For example, SSD read speeds can be much slower than memory allocation. With policy selection, I believe we can implement mTHP swap-in for non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand the implications of their choices. I think that it's better to start with at least always never. I believe that we will add auto in the future to tune automatically, which can be used as default finally. Suggested-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- Documentation/admin-guide/mm/transhuge.rst | 6 +++ include/linux/huge_mm.h | 1 + mm/huge_memory.c | 44 ++++++++++++++++++++++ mm/memory.c | 3 +- 4 files changed, 53 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 058485daf186..2e94e956ee12 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -144,6 +144,12 @@ hugepage sizes have enabled="never". If enabling multiple hugepage sizes, the kernel will select the most appropriate enabled size for a given allocation. 
+Transparent Hugepage Swap-in for anonymous memory can be disabled or enabled +by per-supported-THP-size with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled + echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled + It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise regions or to never try to defrag memory and simply fallback to regular diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e25d9ebfdf89..25174305b17f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -92,6 +92,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define TVA_SMAPS (1 << 0) /* Will be used for procfs */ #define TVA_IN_PF (1 << 1) /* Page fault handler */ #define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */ +#define TVA_IN_SWAPIN (1 << 3) /* Do swap-in */ #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0167dc27e365..41460847988c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -80,6 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; +unsigned long huge_anon_orders_swapin_always __read_mostly; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, @@ -88,6 +89,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, { bool smaps = tva_flags & TVA_SMAPS; bool in_pf = tva_flags & TVA_IN_PF; + bool in_swapin = tva_flags & TVA_IN_SWAPIN; bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS; unsigned long supported_orders; @@ -100,6 +102,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, supported_orders = THP_ORDERS_ALL_FILE_DEFAULT; orders &= supported_orders; + if (in_swapin) + orders &= READ_ONCE(huge_anon_orders_swapin_always); if (!orders) return 0; @@ -523,8 +527,48 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj, static struct kobj_attribute thpsize_enabled_attr = __ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store); +static DEFINE_SPINLOCK(huge_anon_orders_swapin_lock); + +static ssize_t thpsize_swapin_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + int order = to_thpsize(kobj)->order; + const char *output; + + if (test_bit(order, &huge_anon_orders_swapin_always)) + output = "[always] never"; + else + output = "always [never]"; + + return sysfs_emit(buf, "%s\n", output); +} + +static ssize_t thpsize_swapin_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int order = to_thpsize(kobj)->order; + ssize_t ret = count; + + if (sysfs_streq(buf, "always")) { + spin_lock(&huge_anon_orders_swapin_lock); + set_bit(order, &huge_anon_orders_swapin_always); + spin_unlock(&huge_anon_orders_swapin_lock); + } else if (sysfs_streq(buf, "never")) { + spin_lock(&huge_anon_orders_swapin_lock); + clear_bit(order, &huge_anon_orders_swapin_always); + spin_unlock(&huge_anon_orders_swapin_lock); + } else + ret = -EINVAL; + + return ret; +} +static struct kobj_attribute thpsize_swapin_enabled_attr = + __ATTR(swapin_enabled, 0644, thpsize_swapin_enabled_show, thpsize_swapin_enabled_store); + static struct attribute 
*thpsize_attrs[] = { &thpsize_enabled_attr.attr, + &thpsize_swapin_enabled_attr.attr, #ifdef CONFIG_SHMEM &thpsize_shmem_enabled_attr.attr, #endif diff --git a/mm/memory.c b/mm/memory.c index 14048e9285d4..27c77f739a2c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4091,7 +4091,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * and suitable for swapping THP. */ orders = thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS, + BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
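[Editorial note] For readers unfamiliar with how the per-order sysfs toggles in the patch above combine into a single mask, the following standalone C sketch models the logic: each supported order is one bit in an unsigned long, writing "always"/"never" sets or clears that bit, and the swap-in path simply ANDs the caller's candidate orders with the mask (mirroring orders &= READ_ONCE(huge_anon_orders_swapin_always)). This is a userspace model for illustration only, not kernel code; all names in it are invented for the example.

#include <stdio.h>
#include <string.h>

/* Userspace model of the per-order "swapin_enabled" mask from the patch above. */
static unsigned long swapin_always_mask;   /* stands in for huge_anon_orders_swapin_always */

/* Model of the sysfs store: "always" sets the order's bit, "never" clears it. */
static int swapin_enabled_store(int order, const char *buf)
{
	if (strcmp(buf, "always") == 0)
		swapin_always_mask |= 1UL << order;
	else if (strcmp(buf, "never") == 0)
		swapin_always_mask &= ~(1UL << order);
	else
		return -1;		/* -EINVAL in the kernel version */
	return 0;
}

/* Model of the TVA_IN_SWAPIN filtering in __thp_vma_allowable_orders(). */
static unsigned long allowable_swapin_orders(unsigned long candidate_orders)
{
	return candidate_orders & swapin_always_mask;
}

int main(void)
{
	/* Pretend the fault handler considered orders 2..8 (16KiB..1MiB with 4KiB pages). */
	unsigned long candidates = 0;
	for (int order = 2; order <= 8; order++)
		candidates |= 1UL << order;

	swapin_enabled_store(4, "always");	/* echo always > .../hugepages-64kB/swapin_enabled */
	swapin_enabled_store(9, "always");	/* 2MiB, not among the candidates here */

	printf("allowed swap-in orders mask: %#lx\n",
	       allowable_swapin_orders(candidates));	/* only bit 4 survives */
	return 0;
}

In the actual patch the mask is updated under huge_anon_orders_swapin_lock and read locklessly with READ_ONCE() in the fault path, which the toy model above does not attempt to reproduce.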
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song @ 2024-07-27 5:58 ` kernel test robot 2024-07-29 1:37 ` Barry Song 2024-07-29 3:52 ` Matthew Wilcox 1 sibling, 1 reply; 59+ messages in thread From: kernel test robot @ 2024-07-27 5:58 UTC (permalink / raw) To: Barry Song, akpm, linux-mm Cc: oe-kbuild-all, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, yosryahmed Hi Barry, kernel test robot noticed the following build warnings: [auto build test WARNING on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config) compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/ sparse warnings: (new ones prefixed by >>) >> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static? 
mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c 51 52 /* 53 * By default, transparent hugepage support is disabled in order to avoid 54 * risking an increased memory footprint for applications that are not 55 * 
guaranteed to benefit from it. When transparent hugepage support is 56 * enabled, it is for all mappings, and khugepaged scans all mappings. 57 * Defrag is invoked by khugepaged hugepage allocations and by page faults 58 * for all hugepage allocations. 59 */ 60 unsigned long transparent_hugepage_flags __read_mostly = 61 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS 62 (1<<TRANSPARENT_HUGEPAGE_FLAG)| 63 #endif 64 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE 65 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)| 66 #endif 67 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)| 68 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)| 69 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); 70 71 static struct shrinker *deferred_split_shrinker; 72 static unsigned long deferred_split_count(struct shrinker *shrink, 73 struct shrink_control *sc); 74 static unsigned long deferred_split_scan(struct shrinker *shrink, 75 struct shrink_control *sc); 76 77 static atomic_t huge_zero_refcount; 78 struct folio *huge_zero_folio __read_mostly; 79 unsigned long huge_zero_pfn __read_mostly = ~0UL; 80 unsigned long huge_anon_orders_always __read_mostly; 81 unsigned long huge_anon_orders_madvise __read_mostly; 82 unsigned long huge_anon_orders_inherit __read_mostly; > 83 unsigned long huge_anon_orders_swapin_always __read_mostly; 84 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-27 5:58 ` kernel test robot @ 2024-07-29 1:37 ` Barry Song 0 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-07-29 1:37 UTC (permalink / raw) To: lkp Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko, minchan, nphamcs, oe-kbuild-all, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Sat, Jul 27, 2024 at 5:58 PM kernel test robot <lkp@intel.com> wrote: > > Hi Barry, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on akpm-mm/mm-everything] Hi Thanks! Would you check if the below patch fixes the problem? diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 41460847988c..06984a325af7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -80,7 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; -unsigned long huge_anon_orders_swapin_always __read_mostly; +static unsigned long huge_anon_orders_swapin_always __read_mostly; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, > > url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything > patch link: https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com > patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy > config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config) > compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0 > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/ > > sparse warnings: (new ones prefixed by >>) > >> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static? 
> mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock > mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock > mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock > mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit > mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit > mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock > mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock > mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock > mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...): > include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true > mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h): > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false > > vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c > > 51 > 52 /* > 53 * By default, transparent hugepage support is disabled in order to avoid > 54 
* risking an increased memory footprint for applications that are not > 55 * guaranteed to benefit from it. When transparent hugepage support is > 56 * enabled, it is for all mappings, and khugepaged scans all mappings. > 57 * Defrag is invoked by khugepaged hugepage allocations and by page faults > 58 * for all hugepage allocations. > 59 */ > 60 unsigned long transparent_hugepage_flags __read_mostly = > 61 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS > 62 (1<<TRANSPARENT_HUGEPAGE_FLAG)| > 63 #endif > 64 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE > 65 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)| > 66 #endif > 67 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)| > 68 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)| > 69 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG); > 70 > 71 static struct shrinker *deferred_split_shrinker; > 72 static unsigned long deferred_split_count(struct shrinker *shrink, > 73 struct shrink_control *sc); > 74 static unsigned long deferred_split_scan(struct shrinker *shrink, > 75 struct shrink_control *sc); > 76 > 77 static atomic_t huge_zero_refcount; > 78 struct folio *huge_zero_folio __read_mostly; > 79 unsigned long huge_zero_pfn __read_mostly = ~0UL; > 80 unsigned long huge_anon_orders_always __read_mostly; > 81 unsigned long huge_anon_orders_madvise __read_mostly; > 82 unsigned long huge_anon_orders_inherit __read_mostly; > > 83 unsigned long huge_anon_orders_swapin_always __read_mostly; > 84 > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki Thanks Barry ^ permalink raw reply related [flat|nested] 59+ messages in thread
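[Editorial note] The sparse warning is purely a linkage issue: a file-scope symbol that is used only inside mm/huge_memory.c and has no extern declaration in any header should have internal linkage, which is what the one-line fix above provides. A minimal standalone illustration of the difference follows; the variable names are made up for the example.

/* one_file.c -- compile with: gcc -Wall one_file.c */
#include <stdio.h>

/* External linkage: visible to every translation unit. sparse expects such a
 * symbol to be declared in a shared header, otherwise it suggests static. */
unsigned long exported_mask;

/* Internal linkage: private to this file, which is what the one-line fix
 * turns huge_anon_orders_swapin_always into. */
static unsigned long private_mask;

int main(void)
{
	private_mask |= 1UL << 4;
	exported_mask |= private_mask;
	printf("%#lx %#lx\n", exported_mask, private_mask);
	return 0;
}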
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song 2024-07-27 5:58 ` kernel test robot @ 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song ` (3 more replies) 1 sibling, 4 replies; 59+ messages in thread From: Matthew Wilcox @ 2024-07-29 3:52 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: > A user space interface can be implemented to select different swap-in > order policies, similar to the mTHP allocation order policy. We need > a distinct policy because the performance characteristics of memory > allocation differ significantly from those of swap-in. For example, > SSD read speeds can be much slower than memory allocation. With > policy selection, I believe we can implement mTHP swap-in for > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand > the implications of their choices. I think that it's better to start > with at least always never. I believe that we will add auto in the > future to tune automatically, which can be used as default finally. I strongly disagree. Use the same sysctl as the other anonymous memory allocations. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox @ 2024-07-29 4:49 ` Barry Song 2024-07-29 16:11 ` Christoph Hellwig ` (2 subsequent siblings) 3 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-07-29 4:49 UTC (permalink / raw) To: Matthew Wilcox Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Jul 29, 2024 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: > > A user space interface can be implemented to select different swap-in > > order policies, similar to the mTHP allocation order policy. We need > > a distinct policy because the performance characteristics of memory > > allocation differ significantly from those of swap-in. For example, > > SSD read speeds can be much slower than memory allocation. With > > policy selection, I believe we can implement mTHP swap-in for > > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand > > the implications of their choices. I think that it's better to start > > with at least always never. I believe that we will add auto in the > > future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. In versions v1-v4, we used the same controls as anonymous memory allocations. Ying expressed concerns that this approach isn't always ideal, especially for non-zRAM devices, as SSD read speeds can be much slower than memory allocation. I think his concern is reasonable to some extent. However, this patchset only addresses scenarios involving zRAM-like devices and will not impact SSDs. I would like to get Ying's feedback on whether it's acceptable to drop this one in v6. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song @ 2024-07-29 16:11 ` Christoph Hellwig 2024-07-29 20:11 ` Barry Song 2024-07-30 2:27 ` Chuanhua Han 2024-07-30 8:36 ` Ryan Roberts 2024-08-05 6:10 ` Huang, Ying 3 siblings, 2 replies; 59+ messages in thread From: Christoph Hellwig @ 2024-07-29 16:11 UTC (permalink / raw) To: Matthew Wilcox Cc: Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I agree with Matthew here. We also really need to stop optimizing for this weird zram case and move people to zswap instead after fixing the various issues. A special block device that isn't really a block device and needs various special hooks isn't the right abstraction for different zwap strategies. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 16:11 ` Christoph Hellwig @ 2024-07-29 20:11 ` Barry Song 2024-07-30 16:30 ` Christoph Hellwig 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-29 20:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 4:11 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > > I strongly disagree. Use the same sysctl as the other anonymous memory > > allocations. > > I agree with Matthew here. The existing anonymous memory allocation control is still honoured here; this knob is only an additional filter on top of it (allocation control & swap-in policy), added primarily to address the SSD concern raised in the comments on v4, not for zRAM. > > We also really need to stop optimizing for this weird zram case and move > people to zswap instead after fixing the various issues. A special > block device that isn't really a block device and needs various special > hooks isn't the right abstraction for different zwap strategies. My understanding is that zRAM is far more widely used in embedded systems than zswap. I seldom (if ever) hear of anyone using zswap on Android. It seems pointless to force people to move to zswap; on embedded systems there is no real backing block device behind zswap. > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 20:11 ` Barry Song @ 2024-07-30 16:30 ` Christoph Hellwig 2024-07-30 19:28 ` Nhat Pham 2024-08-01 20:55 ` Chris Li 0 siblings, 2 replies; 59+ messages in thread From: Christoph Hellwig @ 2024-07-30 16:30 UTC (permalink / raw) To: Barry Song Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote: > > We also really need to stop optimizing for this weird zram case and move > > people to zswap instead after fixing the various issues. A special > > block device that isn't really a block device and needs various special > > hooks isn't the right abstraction for different zwap strategies. > > My understanding is zRAM is much more popularly used in embedded > systems than zswap. I seldomly(or never) hear who is using zswap > in Android. it seems pointless to force people to move to zswap, in > embedded systems we don't have a backend real block disk device > after zswap. Well, that is the point. zram is a horrible hack that abuses a block device to implement a feature missing the VM layer. Right now people have a reason for it because zswap requires a "real" backing device and that's fine for them and for now. But instead of building VM infrastructure around these kinds of hacks we need to fix the VM infrastructure. Chris Li has been talking about and working towards a proper swap abstraction and that needs to happen. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 16:30 ` Christoph Hellwig @ 2024-07-30 19:28 ` Nhat Pham 2024-07-30 21:06 ` Barry Song 2024-08-01 20:55 ` Chris Li 1 sibling, 1 reply; 59+ messages in thread From: Nhat Pham @ 2024-07-30 19:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > > Well, that is the point. zram is a horrible hack that abuses a block > device to implement a feature missing the VM layer. Right now people > have a reason for it because zswap requires a "real" backing device > and that's fine for them and for now. But instead of building VM I completely agree with this assessment. > infrastructure around these kinds of hacks we need to fix the VM > infrastructure. Chris Li has been talking about and working towards > a proper swap abstraction and that needs to happen. I'm also working towards something along this line. My design would add a "virtual" swap ID that will be stored in the page table, and can refer to either a real, storage-backed swap entry, or a zswap entry. zswap can then exist without any backing swap device. There are several additional benefits of this approach: 1. We can optimize swapoff as well - the page table can still refer to the swap ID, but the ID now points to a physical page frame. swapoff code just needs to sever the link from the swap ID to the physical swap entry (which either just requires a swap ID mapping walk, or even faster if we have a reverse mapping mechanism), and update the link to the page frame instead. 2. We can take this opportunity to clean up the swap count code. I'd be happy to collaborate/compare notes :) ^ permalink raw reply [flat|nested] 59+ messages in thread
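[Editorial note] Nhat's "virtual swap ID" idea is easiest to picture as an extra level of indirection between what the page table stores and where the data actually lives. Below is a rough standalone C sketch of such a descriptor; the type names, fields, and layout are invented for illustration and do not come from any posted patch.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative model of a "virtual swap ID": the page table would store only
 * the ID, and a descriptor records where the data currently lives. */
enum vswap_backing {
	VSWAP_NONE,		/* unused slot */
	VSWAP_SWAPDEV,		/* a real slot on a swap device */
	VSWAP_ZSWAP,		/* a compressed in-memory object */
	VSWAP_PAGE,		/* an in-memory page, e.g. after swapoff */
};

struct vswap_desc {
	enum vswap_backing backing;
	union {
		struct { int type; long offset; } slot;	/* swap device slot */
		void *zswap_entry;			/* opaque zswap handle */
		void *page;				/* page frame after swapoff */
	};
	int swap_count;		/* one place to keep the count, per point 2 above */
};

/* Swapoff in this model: retarget the descriptor from the device slot to the
 * page that was read back in -- page tables holding the ID need no change. */
static void vswap_swapoff_one(struct vswap_desc *d, void *in_memory_page)
{
	if (d->backing != VSWAP_SWAPDEV)
		return;
	d->backing = VSWAP_PAGE;
	d->page = in_memory_page;
}

int main(void)
{
	struct vswap_desc d = {
		.backing = VSWAP_SWAPDEV,
		.slot = { .type = 0, .offset = 12345 },
		.swap_count = 1,
	};
	void *fake_page = malloc(4096);

	vswap_swapoff_one(&d, fake_page);
	printf("backing=%d page=%p count=%d\n", d.backing, d.page, d.swap_count);
	free(fake_page);
	return 0;
}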
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 19:28 ` Nhat Pham @ 2024-07-30 21:06 ` Barry Song 2024-07-31 18:35 ` Nhat Pham 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-07-30 21:06 UTC (permalink / raw) To: Nhat Pham Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Wed, Jul 31, 2024 at 7:28 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > Well, that is the point. zram is a horrible hack that abuses a block > > device to implement a feature missing the VM layer. Right now people > > have a reason for it because zswap requires a "real" backing device > > and that's fine for them and for now. But instead of building VM > > I completely agree with this assessment. > > > infrastructure around these kinds of hacks we need to fix the VM > > infrastructure. Chris Li has been talking about and working towards > > a proper swap abstraction and that needs to happen. > > I'm also working towards something along this line. My design would > add a "virtual" swap ID that will be stored in the page table, and can > refer to either a real, storage-backed swap entry, or a zswap entry. > zswap can then exist without any backing swap device. > > There are several additional benefits of this approach: > > 1. We can optimize swapoff as well - the page table can still refer to > the swap ID, but the ID now points to a physical page frame. swapoff > code just needs to sever the link from the swap ID to the physical > swap entry (which either just requires a swap ID mapping walk, or even > faster if we have a reverse mapping mechanism), and update the link to > the page frame instead. > > 2. We can take this opportunity to clean up the swap count code. > > I'd be happy to collaborate/compare notes :) I appreciate that you have a good plan, and I welcome the improvements in zswap. However, we need to face reality. Having a good plan doesn't mean we should wait for you to proceed. In my experience, I've never heard of anyone using zswap in an embedded system, especially among the billions of Android devices.(Correct me if you know one.) How soon do you expect embedded systems and Android to adopt zswap? In one year, two years, five years, or ten years? Have you asked if Google plans to use zswap in Android? Currently, zswap does not support large folios, which is why Yosry has introduced an API like zswap_never_enabled() to allow others to explore parallel options like mTHP swap. Meanwhile, If zswap encounters large folios, it will trigger a SIGBUS error. I believe you were involved in those discussions: mm: zswap: add zswap_never_enabled() https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6 mm: zswap: handle incorrect attempts to load large folios https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1 Should everyone around the world hold off on working on mTHP swap until zswap has addressed the issue to support large folios? Not to mention whether people are ready and happy to switch to zswap. I don't see any reason why we should wait and not start implementing something that could benefit billions of devices worldwide. Parallel exploration leads to human progress in different fields. 
That's why I believe Yosry's patch, which allows others to move forward, is a more considerate approach. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 21:06 ` Barry Song @ 2024-07-31 18:35 ` Nhat Pham 2024-08-01 3:00 ` Sergey Senozhatsky 0 siblings, 1 reply; 59+ messages in thread From: Nhat Pham @ 2024-07-31 18:35 UTC (permalink / raw) To: Barry Song Cc: Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 2:06 PM Barry Song <21cnbao@gmail.com> wrote: > > > I'd be happy to collaborate/compare notes :) > > I appreciate that you have a good plan, and I welcome the improvements in zswap. > However, we need to face reality. Having a good plan doesn't mean we should > wait for you to proceed. > > In my experience, I've never heard of anyone using zswap in an embedded > system, especially among the billions of Android devices.(Correct me if you > know one.) How soon do you expect embedded systems and Android to adopt > zswap? In one year, two years, five years, or ten years? Have you asked if > Google plans to use zswap in Android? Well, no one uses zswap in an embedded environment precisely because of the aforementioned issues, which we are working to resolve :) > > Currently, zswap does not support large folios, which is why Yosry has > introduced > an API like zswap_never_enabled() to allow others to explore parallel > options like > mTHP swap. Meanwhile, If zswap encounters large folios, it will trigger a SIGBUS > error. I believe you were involved in those discussions: > > mm: zswap: add zswap_never_enabled() > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6 > mm: zswap: handle incorrect attempts to load large folios > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1 > I am, and for the record I reviewed and/or ack-ed all of these patches, and provided my inputs on how to move forward with zswap's support for large folios. I do not want zswap to prevent the development of the rest of the swap ecosystem. > Should everyone around the world hold off on working on mTHP swap until > zswap has addressed the issue to support large folios? Not to mention whether > people are ready and happy to switch to zswap. > I think you misunderstood my intention. For the record, I'm not trying to stop you from improving zram, and I'm not proposing that we kill zram right away. Well, at least not until zswap reaches feature parity with zram, which, as you point out, will take awhile. Both support for large folios and swap/zswap decoupling are on our agenda, and you're welcome to participate in the discussion - for what it's worth, your attempt with zram (+zstd) is the basis/proof-of-concept for our future efforts :) That said, I believe that there is a fundamental redundancy here, which we (zram and zswap developers) should resolve at some point by unifying the two memory compression systems. The sooner we can unify these two, the less effort we will have to spend on developing and maintaining two separate mechanisms for the same (or very similar) purpose. For instance, large folio support has to be done twice. Same goes with writeback/offloading to backend storage, etc. And I (admittedly with a bias), agree with Christoph that zswap is the way to go moving forwards. 
I will not address the rest - it seems there isn't anything down there to disagree with or discuss :) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-31 18:35 ` Nhat Pham @ 2024-08-01 3:00 ` Sergey Senozhatsky 0 siblings, 0 replies; 59+ messages in thread From: Sergey Senozhatsky @ 2024-08-01 3:00 UTC (permalink / raw) To: Nhat Pham Cc: Barry Song, Christoph Hellwig, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On (24/07/31 11:35), Nhat Pham wrote: > > I'm not proposing that we kill zram right away. > Just for the record, zram is a generic block device and has use-cases outside of swap. Just mkfs on /dev/zram0, mount it and do whatever. The "kill zram" thing is not going to fly. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 16:30 ` Christoph Hellwig 2024-07-30 19:28 ` Nhat Pham @ 2024-08-01 20:55 ` Chris Li 2024-08-12 8:27 ` Christoph Hellwig 1 sibling, 1 reply; 59+ messages in thread From: Chris Li @ 2024-08-01 20:55 UTC (permalink / raw) To: Christoph Hellwig Cc: Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote: > > > We also really need to stop optimizing for this weird zram case and move > > > people to zswap instead after fixing the various issues. A special > > > block device that isn't really a block device and needs various special > > > hooks isn't the right abstraction for different zwap strategies. > > > > My understanding is zRAM is much more popularly used in embedded > > systems than zswap. I seldomly(or never) hear who is using zswap > > in Android. it seems pointless to force people to move to zswap, in > > embedded systems we don't have a backend real block disk device > > after zswap. > > Well, that is the point. zram is a horrible hack that abuses a block > device to implement a feature missing the VM layer. Right now people > have a reason for it because zswap requires a "real" backing device > and that's fine for them and for now. But instead of building VM > infrastructure around these kinds of hacks we need to fix the VM > infrastructure. Chris Li has been talking about and working towards > a proper swap abstraction and that needs to happen. Yes, I have been working on the swap allocator for the mTHP usage case. Haven't got to the zswap vs zram yet. Currently there is a feature gap between zswap and zram, so zswap doesn't do all the stuff zram does. For the zswap "real" backend issue, Google has been using the ghost swapfile for many years. That can be one way to get around that. The patch is much smaller than overhauling the swap back end abstraction. Currently Android uses zram and it needs to be the Android team's decision to move from zram to something else. I don't see that happening any time soon. There are practical limitations. Personally I have been using zram as some way to provide a block like device as my goto route for testing the swap stack. I still do an SSD drive swap test, but at the same time I want to reduce the SSD swap usage to avoid the wear on my SSD drive. I already destroyed two of my old HDD drives during the swap testing. The swap random seek is very unfriendly to HDD, not sure who is still using HDD for swap any more. Anyway, removing zram is never a goal of the swap abstraction because I am still using it. We can start with reducing the feature gap between zswap and ZRAM. The end of the day, it is the Android team's call using zram or not. Chris ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-08-01 20:55 ` Chris Li @ 2024-08-12 8:27 ` Christoph Hellwig 2024-08-12 8:44 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Christoph Hellwig @ 2024-08-12 8:27 UTC (permalink / raw) To: Chris Li Cc: Christoph Hellwig, Barry Song, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote: > Currently Android uses zram and it needs to be the Android team's > decision to move from zram to something else. I don't see that > happening any time soon. There are practical limitations. No one can tell anyone to stop using things. But we can stop adding new hacks for this, and especially user facing controls. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-08-12 8:27 ` Christoph Hellwig @ 2024-08-12 8:44 ` Barry Song 0 siblings, 0 replies; 59+ messages in thread From: Barry Song @ 2024-08-12 8:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Li, Matthew Wilcox, akpm, linux-mm, ying.huang, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote: > > Currently Android uses zram and it needs to be the Android team's > > decision to move from zram to something else. I don't see that > > happening any time soon. There are practical limitations. > > No one can tell anyone to stop using things. But we can stop adding > new hacks for this, and especially user facing controls. Well, this user-facing control has absolutely nothing to do with zram-related hacks. It's meant to address a general issue, mainly concerning slow-speed swap devices like SSDs, as suggested in Ying's comment on v4. Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 16:11 ` Christoph Hellwig 2024-07-29 20:11 ` Barry Song @ 2024-07-30 2:27 ` Chuanhua Han 1 sibling, 0 replies; 59+ messages in thread From: Chuanhua Han @ 2024-07-30 2:27 UTC (permalink / raw) To: Christoph Hellwig Cc: Matthew Wilcox, Barry Song, akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed Christoph Hellwig <hch@infradead.org> 于2024年7月30日周二 00:11写道: > > On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote: > > I strongly disagree. Use the same sysctl as the other anonymous memory > > allocations. > > I agree with Matthew here. > > We also really need to stop optimizing for this weird zram case and move > people to zswap instead after fixing the various issues. A special > block device that isn't really a block device and needs various special > hooks isn't the right abstraction for different zwap strategies. I disagree, zram is most popular in embedded systems (like Android). > > -- Thanks, Chuanhua ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox 2024-07-29 4:49 ` Barry Song 2024-07-29 16:11 ` Christoph Hellwig @ 2024-07-30 8:36 ` Ryan Roberts 2024-07-30 8:47 ` David Hildenbrand 2024-08-05 6:10 ` Huang, Ying 3 siblings, 1 reply; 59+ messages in thread From: Ryan Roberts @ 2024-07-30 8:36 UTC (permalink / raw) To: Matthew Wilcox, Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On 29/07/2024 04:52, Matthew Wilcox wrote: > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >> A user space interface can be implemented to select different swap-in >> order policies, similar to the mTHP allocation order policy. We need >> a distinct policy because the performance characteristics of memory >> allocation differ significantly from those of swap-in. For example, >> SSD read speeds can be much slower than memory allocation. With >> policy selection, I believe we can implement mTHP swap-in for >> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >> the implications of their choices. I think that it's better to start >> with at least always never. I believe that we will add auto in the >> future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I vaguely recall arguing in the past that just because the user has requested 2M THP that doesn't mean its the right thing to do for performance to swap-in the whole 2M in one go. That's potentially a pretty huge latency, depending on where the backend is, and it could be a waste of IO if the application never touches most of the 2M. Although the fact that the application hinted for a 2M THP in the first place hopefully means that they are storing objects that need to be accessed at similar times. Today it will be swapped in page-by-page then eventually collapsed by khugepaged. But I think those arguments become weaker as the THP size gets smaller. 16K/64K swap-in will likely yield significant performance improvements, and I think Barry has numbers for this? So I guess we have a few options: - Just use the same sysfs interface as for anon allocation, And see if anyone reports performance regressions. Investigate one of the options below if an issue is raised. That's the simplest and cleanest approach, I think. - New sysfs interface as Barry has implemented; nobody really wants more controls if it can be helped. - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts and never got any traction. - Secret option 4: Can we allocate a full-size folio but only choose to swap-in to it bit-by-bit? You would need a way to mark which pages of the folio are valid (e.g. per-page flag) but guess that's a non-starter given the strategy to remove per-page flags? Thanks, Ryan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-30 8:36 ` Ryan Roberts @ 2024-07-30 8:47 ` David Hildenbrand 0 siblings, 0 replies; 59+ messages in thread From: David Hildenbrand @ 2024-07-30 8:47 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox, Barry Song Cc: akpm, linux-mm, ying.huang, baolin.wang, chrisl, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed On 30.07.24 10:36, Ryan Roberts wrote: > On 29/07/2024 04:52, Matthew Wilcox wrote: >> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >>> A user space interface can be implemented to select different swap-in >>> order policies, similar to the mTHP allocation order policy. We need >>> a distinct policy because the performance characteristics of memory >>> allocation differ significantly from those of swap-in. For example, >>> SSD read speeds can be much slower than memory allocation. With >>> policy selection, I believe we can implement mTHP swap-in for >>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >>> the implications of their choices. I think that it's better to start >>> with at least always never. I believe that we will add auto in the >>> future to tune automatically, which can be used as default finally. >> >> I strongly disagree. Use the same sysctl as the other anonymous memory >> allocations. > > I vaguely recall arguing in the past that just because the user has requested 2M > THP that doesn't mean its the right thing to do for performance to swap-in the > whole 2M in one go. That's potentially a pretty huge latency, depending on where > the backend is, and it could be a waste of IO if the application never touches > most of the 2M. Although the fact that the application hinted for a 2M THP in > the first place hopefully means that they are storing objects that need to be > accessed at similar times. Today it will be swapped in page-by-page then > eventually collapsed by khugepaged. > > But I think those arguments become weaker as the THP size gets smaller. 16K/64K > swap-in will likely yield significant performance improvements, and I think > Barry has numbers for this? > > So I guess we have a few options: > > - Just use the same sysfs interface as for anon allocation, And see if anyone > reports performance regressions. Investigate one of the options below if an > issue is raised. That's the simplest and cleanest approach, I think. > > - New sysfs interface as Barry has implemented; nobody really wants more > controls if it can be helped. > > - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts > and never got any traction. > > - Secret option 4: Can we allocate a full-size folio but only choose to swap-in > to it bit-by-bit? You would need a way to mark which pages of the folio are > valid (e.g. per-page flag) but guess that's a non-starter given the strategy to > remove per-page flags? Maybe we could allocate for folios in the swapcache a bitmap to store that information (folio->private). But I am not convinced that is the right thing to do. If we know some basic properties of the backend, can't we automatically make a pretty good decision regarding the folio size to use? E.g., slow disk, avoid 2M ... Avoiding sysctls if possible here would really be preferable... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 59+ messages in thread
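[Editorial note] Ryan's "option 4" and David's folio->private suggestion both boil down to tracking which subpages of a large folio have actually been read back from the backend. The standalone sketch below models that bookkeeping with a plain bitmap; the names, and keeping the bitmap in a separate struct rather than in folio->private, are illustrative assumptions, not a proposed implementation.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NR_PAGES 64		/* enough for a 256KiB folio with 4KiB pages */

/* Model of a large folio swapped in piecemeal: one bit per subpage records
 * whether that subpage's contents have been read from the backend yet. */
struct partial_folio {
	int nr_pages;
	unsigned long valid;	/* bit i set => subpage i is populated */
};

static void swapin_one_subpage(struct partial_folio *f, int i)
{
	/* Real code would issue the read for subpage i here. */
	f->valid |= 1UL << i;
}

static bool subpage_uptodate(const struct partial_folio *f, int i)
{
	return f->valid & (1UL << i);
}

static bool folio_fully_populated(const struct partial_folio *f)
{
	unsigned long all = (f->nr_pages == MAX_NR_PAGES) ? ~0UL
							  : (1UL << f->nr_pages) - 1;
	return (f->valid & all) == all;
}

int main(void)
{
	struct partial_folio f = { .nr_pages = 16, .valid = 0 }; /* 64KiB folio */

	swapin_one_subpage(&f, 3);	/* fault touched subpage 3 first */
	printf("subpage 3 uptodate: %d, whole folio: %d\n",
	       subpage_uptodate(&f, 3), folio_fully_populated(&f));
	return 0;
}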
* Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy 2024-07-29 3:52 ` Matthew Wilcox ` (2 preceding siblings ...) 2024-07-30 8:36 ` Ryan Roberts @ 2024-08-05 6:10 ` Huang, Ying 3 siblings, 0 replies; 59+ messages in thread From: Huang, Ying @ 2024-08-05 6:10 UTC (permalink / raw) To: Matthew Wilcox, Christoph Hellwig Cc: Barry Song, akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, xiang, yosryahmed Matthew Wilcox <willy@infradead.org> writes: > On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote: >> A user space interface can be implemented to select different swap-in >> order policies, similar to the mTHP allocation order policy. We need >> a distinct policy because the performance characteristics of memory >> allocation differ significantly from those of swap-in. For example, >> SSD read speeds can be much slower than memory allocation. With >> policy selection, I believe we can implement mTHP swap-in for >> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand >> the implications of their choices. I think that it's better to start >> with at least always never. I believe that we will add auto in the >> future to tune automatically, which can be used as default finally. > > I strongly disagree. Use the same sysctl as the other anonymous memory > allocations. I still believe we have some reasons for this tunable. 1. As Ryan pointed out in [1], swap-in with large mTHP orders may cause long latency, which some users might want to avoid. [1] https://lore.kernel.org/lkml/f0c7f061-6284-4fe5-8cbf-93281070895b@arm.com/ 2. We have readahead information available for swap-in, which is unavailable for anonymous page allocation. This enables us to build an automatic swap-in order policy similar to that for page cache order based on readahead. 3. Swap-out/swap-in cycles present an opportunity to identify hot pages. In many use cases, we can utilize mTHP for hot pages and order-0 page for cold pages, especially under memory pressure. When an mTHP has been swapped out, it indicates that it could be a cold page. Converting it to order-0 pages might be a beneficial policy. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 59+ messages in thread
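[Editorial note] Ying's second point, about using readahead information to choose the swap-in order automatically, can be sketched as a simple heuristic: the more recent readahead hits, the larger the order we are willing to try, capped by what the VMA and sysfs configuration already allow. Everything below, including the thresholds, is a hypothetical illustration of such an "auto" mode, not code from this series.

#include <stdio.h>

/* Hypothetical "auto" swap-in order policy: scale the attempted order with
 * recent readahead success, never exceeding what the caller already allows. */
static int auto_swapin_order(unsigned long allowed_orders, int readahead_hits)
{
	int want;

	if (readahead_hits >= 8)
		want = 6;		/* 256KiB worth of 4KiB pages */
	else if (readahead_hits >= 4)
		want = 4;		/* 64KiB */
	else if (readahead_hits >= 2)
		want = 2;		/* 16KiB */
	else
		want = 0;		/* cold: fall back to a single page */

	/* Pick the largest allowed order that does not exceed the target. */
	for (int order = want; order > 0; order--)
		if (allowed_orders & (1UL << order))
			return order;
	return 0;
}

int main(void)
{
	unsigned long allowed = (1UL << 2) | (1UL << 4);	/* 16KiB and 64KiB enabled */

	printf("hits=1  -> order %d\n", auto_swapin_order(allowed, 1));
	printf("hits=5  -> order %d\n", auto_swapin_order(allowed, 5));
	printf("hits=10 -> order %d\n", auto_swapin_order(allowed, 10));
	return 0;
}

This also illustrates Ying's third point in spirit: a low hit count (a likely cold region) degrades gracefully to order-0, while hot, contiguously accessed regions get larger folios.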
* [PATCH v6 0/2] mm: Ignite large folios swap-in support 2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song ` (3 preceding siblings ...) 2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 4 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch From: Barry Song <v-songbaohua@oppo.com> Currently, we support mTHP swapout but not swapin. This means that once mTHP is swapped out, it will come back as small folios when swapped in. This is particularly detrimental for devices like Android, where more than half of the memory is in swap. The lack of mTHP swapin functionality makes mTHP a showstopper in scenarios that heavily rely on swap. This patchset introduces mTHP swap-in support. It starts with synchronous devices similar to zRAM, aiming to benefit as many users as possible with minimal changes. -v6: * remove the swapin control added in v5, per Willy, Christoph; The original reason for adding the swpin_enabled control was primarily to address concerns for slower devices. Currently, since we only support fast sync devices, swap-in size is less of a concern. We’ll gain a clearer understanding of the next steps while more devices begin to support mTHP swap-in. * add nr argument in mem_cgroup_swapin_uncharge_swap() instead of adding new API, Willy; * swapcache_prepare() and swapcache_clear() large folios support is also removed as it has been separated per Baolin's request, right now has been in mm-unstable. * provide more data in changelog. -v5: https://lore.kernel.org/linux-mm/20240726094618.401593-1-21cnbao@gmail.com/ * Add swap-in control policy according to Ying's proposal. Right now only "always" and "never" are supported, later we can extend to "auto"; * Fix the comment regarding zswap_never_enabled() according to Yosry; * Filter out unaligned swp entries earlier; * add mem_cgroup_swapin_uncharge_swap_nr() helper -v4: https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/ Many parts of v3 have been merged into the mm tree with the help on reviewing from Ryan, David, Ying and Chris etc. Thank you very much! This is the final part to allocate large folios and map them. * Use Yosry's zswap_never_enabled(), notice there is a bug. I put the bug fix in this v4 RFC though it should be fixed in Yosry's patch * lots of code improvement (drop large stack, hold ptl etc) according to Yosry's and Ryan's feedback * rebased on top of the latest mm-unstable and utilized some new helpers introduced recently. -v3: https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ * avoid over-writing err in __swap_duplicate_nr, pointed out by Yosry, thanks! 
* fix the issue folio is charged twice for do_swap_page, separating alloc_anon_folio and alloc_swap_folio as they have many differences now on * memcg charing * clearing allocated folio or not -v2: https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/ * lots of code cleanup according to Chris's comments, thanks! * collect Chris's ack tags, thanks! * address David's comment on moving to use folio_add_new_anon_rmap for !folio_test_anon in do_swap_page, thanks! * remove the MADV_PAGEOUT patch from this series as Ryan will intergrate it into swap-out series * Apply Kairui's work of "mm/swap: fix race when skipping swapcache" on large folios swap-in as well * fixed corrupted data(zero-filled data) in two races: zswap and a part of entries are in swapcache while some others are not in by checking SWAP_HAS_CACHE while swapping in a large folio -v1: https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t Barry Song (1): mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Chuanhua Han (1): mm: support large folios swap-in for zRAM-like devices include/linux/memcontrol.h | 5 +- mm/memcontrol.c | 7 +- mm/memory.c | 211 +++++++++++++++++++++++++++++++++---- mm/swap_state.c | 2 +- 4 files changed, 196 insertions(+), 29 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-02 17:29 ` Chris Li 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 1 sibling, 1 reply; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch From: Barry Song <v-songbaohua@oppo.com> With large folios swap-in, we might need to uncharge multiple entries all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). For the existing two users, just pass nr=1. Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- include/linux/memcontrol.h | 5 +++-- mm/memcontrol.c | 7 ++++--- mm/memory.c | 2 +- mm/swap_state.c | 2 +- 4 files changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..44f7fb7dc0c8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); + +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); void __mem_cgroup_uncharge(struct folio *folio); @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, return 0; } -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b889a7fbf382..5d763c234c44 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, /* * mem_cgroup_swapin_uncharge_swap - uncharge swap slot - * @entry: swap entry for which the page is charged + * @entry: the first swap entry for which the pages are charged + * @nr_pages: number of pages which will be uncharged * * Call this function after successfully adding the charged page to swapcache. * * Note: This function assumes the page for which swap slot is being uncharged * is order 0 page. */ -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) { /* * Cgroup1's unified memory+swap counter has been charged with the @@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) * let's not wait for it. The page already received a * memory+swap charge, drop the swap entry duplicate. 
*/ - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); } } diff --git a/mm/memory.c b/mm/memory.c index 4c8716cb306c..4cf4902db1ec 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); shadow = get_shadow_from_swap_cache(entry); if (shadow) diff --git a/mm/swap_state.c b/mm/swap_state.c index 293ff1afdca4..1159e3225754 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) goto fail_unlock; - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); if (shadow) workingset_refault(new_folio, shadow); -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song @ 2024-08-02 17:29 ` Chris Li 0 siblings, 0 replies; 59+ messages in thread From: Chris Li @ 2024-08-02 17:29 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, baolin.wang, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch Acked-by: Chris Li <chrisl@kernel.org> Chris On Fri, Aug 2, 2024 at 5:21 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > With large folios swap-in, we might need to uncharge multiple entries > all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). > > For the existing two users, just pass nr=1. > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/memcontrol.h | 5 +++-- > mm/memcontrol.c | 7 ++++--- > mm/memory.c | 2 +- > mm/swap_state.c | 2 +- > 4 files changed, 9 insertions(+), 7 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 1b79760af685..44f7fb7dc0c8 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -682,7 +682,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > > int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > + > +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); > > void __mem_cgroup_uncharge(struct folio *folio); > > @@ -1181,7 +1182,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > return 0; > } > > -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr) > { > } > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b889a7fbf382..5d763c234c44 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4572,14 +4572,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > /* > * mem_cgroup_swapin_uncharge_swap - uncharge swap slot > - * @entry: swap entry for which the page is charged > + * @entry: the first swap entry for which the pages are charged > + * @nr_pages: number of pages which will be uncharged > * > * Call this function after successfully adding the charged page to swapcache. > * > * Note: This function assumes the page for which swap slot is being uncharged > * is order 0 page. > */ > -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) > { > /* > * Cgroup1's unified memory+swap counter has been charged with the > @@ -4599,7 +4600,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) > * let's not wait for it. The page already received a > * memory+swap charge, drop the swap entry duplicate. 
> */ > - mem_cgroup_uncharge_swap(entry, 1); > + mem_cgroup_uncharge_swap(entry, nr_pages); > } > } > > diff --git a/mm/memory.c b/mm/memory.c > index 4c8716cb306c..4cf4902db1ec 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4102,7 +4102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > ret = VM_FAULT_OOM; > goto out_page; > } > - mem_cgroup_swapin_uncharge_swap(entry); > + mem_cgroup_swapin_uncharge_swap(entry, 1); > > shadow = get_shadow_from_swap_cache(entry); > if (shadow) > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 293ff1afdca4..1159e3225754 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) > goto fail_unlock; > > - mem_cgroup_swapin_uncharge_swap(entry); > + mem_cgroup_swapin_uncharge_swap(entry, 1); > > if (shadow) > workingset_refault(new_folio, shadow); > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song 2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song @ 2024-08-02 12:20 ` Barry Song 2024-08-03 19:08 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 59+ messages in thread From: Barry Song @ 2024-08-02 12:20 UTC (permalink / raw) To: akpm, linux-mm Cc: baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han From: Chuanhua Han <hanchuanhua@oppo.com> Currently, we have mTHP features, but unfortunately, without support for large folio swap-ins, once these large folios are swapped out, they are lost because mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents mTHP from being used on devices like Android that heavily rely on swap. This patch introduces mTHP swap-in support. It starts from sync devices such as zRAM. This is probably the simplest and most common use case, benefiting billions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues. It offers several benefits: 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in. Large folios in the buddy system are also preserved as much as possible, rather than being fragmented due to swap-in. 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT. w/o this patch (Refer to the data from Chris's and Kairui's latest swap allocator optimization while running ./thp_swap_allocator_test w/o "-a" option [1]): ./thp_swap_allocator_test Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% w/ this patch (always 0%): Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback 
percentage: 0.00% Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% ... 3. With both mTHP swap-out and swap-in supported, we offer the option to enable zsmalloc compression/decompression with larger granularity[2]. The upcoming optimization in zsmalloc will significantly increase swap speed and improve compression efficiency. Tested by running 100 iterations of swapping 100MiB of anon memory, the swap speed improved dramatically: time consumption of swapin(ms) time consumption of swapout(ms) lz4 4k 45274 90540 lz4 64k 22942 55667 zstdn 4k 85035 186585 zstdn 64k 46558 118533 The compression ratio also improved, as evaluated with 1 GiB of data: granularity orig_data_size compr_data_size 4KiB-zstd 1048576000 246876055 64KiB-zstd 1048576000 199763892 Without mTHP swap-in, the potential optimizations in zsmalloc cannot be realized. 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor of nr_pages. Swapping in content filled with the same data 0x11, w/o and w/ the patch for five rounds (Since the content is the same, decompression will be very fast. This primarily assesses the impact of reduced page faults): swp in bandwidth(bytes/ms) w/o w/ round1 624152 1127501 round2 631672 1127501 round3 620459 1139756 round4 606113 1139756 round5 624152 1152281 avg 621310 1137359 +83% [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 188 insertions(+), 23 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4cf4902db1ec..07029532469a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +/* + * check a range of PTEs are completely swap entries with + * contiguous swap offsets and the same SWAP_HAS_CACHE. 
+ * ptep must be first one in the range + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + char has_cache; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + si = swp_swap_info(entry); + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; + for (i = 1; i < nr_pages; i++) { + /* + * while allocating a large folio and doing swap_read_folio for the + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte + * doesn't have swapcache. We need to ensure all PTEs have no cache + * as well, otherwise, we might go to swap devices while the content + * is in swapcache + */ + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) + return false; + } + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap-in a THP with nr pages, we require its first swap_offset + * is aligned with nr. This can filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} +#else +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} +#endif + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. + */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. 
*/ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + return folio; + order = next_order(&orders, order); + } + +fallback: +#endif + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); +} + + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry, 1)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. + */ + if (swapcache_prepare(entry, nr_pages)) { + /* Relax a bit to prevent rapid repeated page faults */ + schedule_timeout_uninterruptible(1); + goto out_page; + } + need_clear_cache = true; + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry, 1); + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large folios + * which are fully exclusive. If we ever get large folios within + * swapcache here, we have to be careful. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 59+ messages in thread
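A small standalone illustration (userspace C, not kernel code, with
made-up numbers) of the alignment filter used by thp_swap_suitable_orders()
in the patch above: an order of nr pages is only kept when the faulting
page index and its swap offset agree modulo nr, so that the folio-aligned
virtual range maps to swap slots starting at an nr-aligned offset.

#include <stdio.h>

int main(void)
{
	unsigned long page_idx = 0x1003; /* vmf->address >> PAGE_SHIFT (hypothetical) */
	unsigned long swp_off  = 0x203;  /* swp_offset(entry) of the faulting PTE (hypothetical) */
	unsigned int nr = 16;            /* 64KiB folio with 4KiB base pages */

	if (page_idx % nr == swp_off % nr) {
		unsigned long idx = page_idx % nr;

		/* Prints 0x200 here: the folio's 16 swap slots start nr-aligned. */
		printf("order kept, first swap offset = 0x%lx\n", swp_off - idx);
	} else {
		printf("order filtered out, try a smaller order\n");
	}
	return 0;
}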
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song @ 2024-08-03 19:08 ` Andrew Morton 2024-08-12 8:26 ` Christoph Hellwig 2024-08-15 9:47 ` Kairui Song 2 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2024-08-03 19:08 UTC (permalink / raw) To: Barry Song Cc: linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han On Sat, 3 Aug 2024 00:20:31 +1200 Barry Song <21cnbao@gmail.com> wrote: > From: Chuanhua Han <hanchuanhua@oppo.com> > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > mTHP from being used on devices like Android that heavily rely on swap. > > This patch introduces mTHP swap-in support. It starts from sync devices such > as zRAM. This is probably the simplest and most common use case, benefiting > billions of Android phones and similar devices with minimal implementation > cost. In this straightforward scenario, large folios are always exclusive, > eliminating the need to handle complex rmap and swapcache issues. > > It offers several benefits: > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after > swap-out and swap-in. Large folios in the buddy system are also > preserved as much as possible, rather than being fragmented due > to swap-in. > > 2. Eliminates fragmentation in swap slots and supports successful > THP_SWPOUT. > > w/o this patch (Refer to the data from Chris's and Kairui's latest > swap allocator optimization while running ./thp_swap_allocator_test > w/o "-a" option [1]): > > ... > > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > ... > > +#endif > + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > +} Generates an unused-variable warning with allnoconfig. Because vma_alloc_folio_noprof() was implemented as a macro instead of an inlined C function. Why do we keep doing this. 
Please check: From: Andrew Morton <akpm@linux-foundation.org> Subject: mm-support-large-folios-swap-in-for-zram-like-devices-fix Date: Sat Aug 3 11:59:00 AM PDT 2024 fix unused var warning mm/memory.c: In function 'alloc_swap_folio': mm/memory.c:4062:32: warning: unused variable 'vma' [-Wunused-variable] 4062 | struct vm_area_struct *vma = vmf->vma; | ^~~ Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/memory.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) --- a/mm/memory.c~mm-support-large-folios-swap-in-for-zram-like-devices-fix +++ a/mm/memory.c @@ -4059,8 +4059,8 @@ static inline bool can_swapin_thp(struct static struct folio *alloc_swap_folio(struct vm_fault *vmf) { - struct vm_area_struct *vma = vmf->vma; #ifdef CONFIG_TRANSPARENT_HUGEPAGE + struct vm_area_struct *vma = vmf->vma; unsigned long orders; struct folio *folio; unsigned long addr; @@ -4128,7 +4128,8 @@ static struct folio *alloc_swap_folio(st fallback: #endif - return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, + vmf->address, false); } _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 2024-08-03 19:08 ` Andrew Morton @ 2024-08-12 8:26 ` Christoph Hellwig 2024-08-12 8:53 ` Barry Song 2024-08-15 9:47 ` Kairui Song 2 siblings, 1 reply; 59+ messages in thread From: Christoph Hellwig @ 2024-08-12 8:26 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch, Chuanhua Han The subject feels wrong. Nothing particular about zram, it is all about SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that. On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote: > From: Chuanhua Han <hanchuanhua@oppo.com> > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents Please wrap your commit logs after 73 characters to make them readable. > +/* > + * check a range of PTEs are completely swap entries with > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > + * ptep must be first one in the range > + */ Please capitalize the first character of block comments, make them full sentences and use up all 80 characters. > + for (i = 1; i < nr_pages; i++) { > + /* > + * while allocating a large folio and doing swap_read_folio for the And also do not go over 80 characters for them, which renders them really hard to read. > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE Please stub out the entire function. ^ permalink raw reply [flat|nested] 59+ messages in thread
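As a concrete illustration of the block-comment feedback, one possible
rewording of the first comment quoted above (capitalized, full sentences,
kept within 80 columns); the wording is only a suggestion, not taken from
the posted series.

/*
 * Check that a range of PTEs contains only swap entries with contiguous
 * swap offsets and the same SWAP_HAS_CACHE state. ptep must point to the
 * first PTE in the range.
 */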
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-12 8:26 ` Christoph Hellwig @ 2024-08-12 8:53 ` Barry Song 2024-08-12 11:38 ` Christoph Hellwig 0 siblings, 1 reply; 59+ messages in thread From: Barry Song @ 2024-08-12 8:53 UTC (permalink / raw) To: Christoph Hellwig Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, Chuanhua Han On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote: > > The subject feels wrong. Nothing particular about zram, it is all about > SWP_SYNCHRONOUS_IO, so the Subject and commit log should state that. right. This is absolutely for sync io, zram is the most typical one which is widely used in Android and embedded systems. Others could be nvdimm, brd. > > On Sat, Aug 03, 2024 at 12:20:31AM +1200, Barry Song wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> > > > > Currently, we have mTHP features, but unfortunately, without support for large > > folio swap-ins, once these large folios are swapped out, they are lost because > > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > > Please wrap your commit logs after 73 characters to make them readable. ack. > > > +/* > > + * check a range of PTEs are completely swap entries with > > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > > + * ptep must be first one in the range > > + */ > > Please capitalize the first character of block comments, make them full > sentences and use up all 80 characters. ack. > > > + for (i = 1; i < nr_pages; i++) { > > + /* > > + * while allocating a large folio and doing swap_read_folio for the > > And also do not go over 80 characters for them, which renders them > really hard to read. > > > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > +{ > > + struct vm_area_struct *vma = vmf->vma; > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > Please stub out the entire function. I assume you mean the below? #ifdef CONFIG_TRANSPARENT_HUGEPAGE static struct folio *alloc_swap_folio(struct vm_fault *vmf) { } #else static struct folio *alloc_swap_folio(struct vm_fault *vmf) { } #endif If so, this is fine to me. the only reason I am using the current pattern is that i am trying to follow the same pattern with static struct folio *alloc_anon_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; #ifdef CONFIG_TRANSPARENT_HUGEPAGE #endif ... } Likely we also want to change that one? Thanks Barry ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-12  8:53       ` Barry Song
@ 2024-08-12 11:38         ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2024-08-12 11:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Christoph Hellwig, akpm, linux-mm, baolin.wang, chrisl, david,
	hannes, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan,
	nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301,
	surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
	Chuanhua Han

On Mon, Aug 12, 2024 at 08:53:06PM +1200, Barry Song wrote:
> On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> I assume you mean the below?
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #else
> static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> {
> }
> #endif

Yes.

> If so, this is fine to me. the only reason I am using the current
> pattern is that i am trying to follow the same pattern with
>
> static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> {
>         struct vm_area_struct *vma = vmf->vma;
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #endif
> ...
> }
>
> Likely we also want to change that one?

It would be nice to fix that as well, probably not in this series,
though.

^ permalink raw reply	[flat|nested] 59+ messages in thread
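For concreteness, a minimal sketch of the stubbed-out layout being agreed
on here, assuming the helpers used by the posted patch; the THP order
selection is elided, so this is only an illustration of the #ifdef split,
not the merged code. It also keeps vma out of the !THP build, which is
the unused-variable warning Andrew's fix-up above addresses.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* ... uffd/zswap checks, order selection, large-folio allocation ... */

	/* No suitable large order: fall back to an order-0 folio. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
}
#else
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma,
			       vmf->address, false);
}
#endif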
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song 2024-08-03 19:08 ` Andrew Morton 2024-08-12 8:26 ` Christoph Hellwig @ 2024-08-15 9:47 ` Kairui Song 2024-08-15 13:27 ` Kefeng Wang 2 siblings, 1 reply; 59+ messages in thread From: Kairui Song @ 2024-08-15 9:47 UTC (permalink / raw) To: Chuanhua Han, Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > > From: Chuanhua Han <hanchuanhua@oppo.com> Hi Chuanhua, > > Currently, we have mTHP features, but unfortunately, without support for large > folio swap-ins, once these large folios are swapped out, they are lost because > mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents > mTHP from being used on devices like Android that heavily rely on swap. > > This patch introduces mTHP swap-in support. It starts from sync devices such > as zRAM. This is probably the simplest and most common use case, benefiting > billions of Android phones and similar devices with minimal implementation > cost. In this straightforward scenario, large folios are always exclusive, > eliminating the need to handle complex rmap and swapcache issues. > > It offers several benefits: > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after > swap-out and swap-in. Large folios in the buddy system are also > preserved as much as possible, rather than being fragmented due > to swap-in. > > 2. Eliminates fragmentation in swap slots and supports successful > THP_SWPOUT. 
> > w/o this patch (Refer to the data from Chris's and Kairui's latest > swap allocator optimization while running ./thp_swap_allocator_test > w/o "-a" option [1]): > > ./thp_swap_allocator_test > Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% > Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% > Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% > Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% > Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% > Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% > Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% > Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% > Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% > Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% > Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% > Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% > Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% > Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% > Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% > Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% > > w/ this patch (always 0%): > Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% > Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% > ... > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable > zsmalloc compression/decompression with larger granularity[2]. The upcoming > optimization in zsmalloc will significantly increase swap speed and improve > compression efficiency. Tested by running 100 iterations of swapping 100MiB > of anon memory, the swap speed improved dramatically: > time consumption of swapin(ms) time consumption of swapout(ms) > lz4 4k 45274 90540 > lz4 64k 22942 55667 > zstdn 4k 85035 186585 > zstdn 64k 46558 118533 > > The compression ratio also improved, as evaluated with 1 GiB of data: > granularity orig_data_size compr_data_size > 4KiB-zstd 1048576000 246876055 > 64KiB-zstd 1048576000 199763892 > > Without mTHP swap-in, the potential optimizations in zsmalloc cannot be > realized. > > 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor > of nr_pages. 
Swapping in content filled with the same data 0x11, w/o > and w/ the patch for five rounds (Since the content is the same, > decompression will be very fast. This primarily assesses the impact of > reduced page faults): > > swp in bandwidth(bytes/ms) w/o w/ > round1 624152 1127501 > round2 631672 1127501 > round3 620459 1139756 > round4 606113 1139756 > round5 624152 1152281 > avg 621310 1137359 +83% > > [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ > [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> > Co-developed-by: Barry Song <v-songbaohua@oppo.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 188 insertions(+), 23 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 4cf4902db1ec..07029532469a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > +/* > + * check a range of PTEs are completely swap entries with > + * contiguous swap offsets and the same SWAP_HAS_CACHE. > + * ptep must be first one in the range > + */ > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > +{ > + struct swap_info_struct *si; > + unsigned long addr; > + swp_entry_t entry; > + pgoff_t offset; > + char has_cache; > + int idx, i; > + pte_t pte; > + > + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); > + idx = (vmf->address - addr) / PAGE_SIZE; > + pte = ptep_get(ptep); > + > + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) > + return false; > + entry = pte_to_swp_entry(pte); > + offset = swp_offset(entry); > + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) > + return false; > + > + si = swp_swap_info(entry); > + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; > + for (i = 1; i < nr_pages; i++) { > + /* > + * while allocating a large folio and doing swap_read_folio for the > + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte > + * doesn't have swapcache. We need to ensure all PTEs have no cache > + * as well, otherwise, we might go to swap devices while the content > + * is in swapcache > + */ > + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) > + return false; > + } > + > + return true; > +} > + > +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > + unsigned long addr, unsigned long orders) > +{ > + int order, nr; > + > + order = highest_order(orders); > + > + /* > + * To swap-in a THP with nr pages, we require its first swap_offset > + * is aligned with nr. This can filter out most invalid entries. 
> + */ > + while (orders) { > + nr = 1 << order; > + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) > + break; > + order = next_order(&orders, order); > + } > + > + return orders; > +} > +#else > +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > +{ > + return false; > +} > +#endif > + > +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +{ > + struct vm_area_struct *vma = vmf->vma; > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + unsigned long orders; > + struct folio *folio; > + unsigned long addr; > + swp_entry_t entry; > + spinlock_t *ptl; > + pte_t *pte; > + gfp_t gfp; > + int order; > + > + /* > + * If uffd is active for the vma we need per-page fault fidelity to > + * maintain the uffd semantics. > + */ > + if (unlikely(userfaultfd_armed(vma))) > + goto fallback; > + > + /* > + * A large swapped out folio could be partially or fully in zswap. We > + * lack handling for such cases, so fallback to swapping in order-0 > + * folio. > + */ > + if (!zswap_never_enabled()) > + goto fallback; > + > + entry = pte_to_swp_entry(vmf->orig_pte); > + /* > + * Get a list of all the (large) orders below PMD_ORDER that are enabled > + * and suitable for swapping THP. > + */ > + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > + > + if (!orders) > + goto fallback; > + > + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > + if (unlikely(!pte)) > + goto fallback; > + > + /* > + * For do_swap_page, find the highest order where the aligned range is > + * completely swap entries with contiguous swap offsets. > + */ > + order = highest_order(orders); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > + break; > + order = next_order(&orders, order); > + } > + > + pte_unmap_unlock(pte, ptl); > + > + /* Try allocating the highest of the remaining orders. */ > + gfp = vma_thp_gfp_mask(vma); > + while (orders) { > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + return folio; > + order = next_order(&orders, order); > + } > + > +fallback: > +#endif > + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > +} > + > + > /* > * We enter with non-exclusive mmap_lock (to exclude vma changes, > * but allow concurrent faults), and pte mapped but not yet locked. > @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > __swap_count(entry) == 1) { > - /* > - * Prevent parallel swapin from proceeding with > - * the cache flag. Otherwise, another thread may > - * finish swapin first, free the entry, and swapout > - * reusing the same entry. It's undetectable as > - * pte_same() returns true due to entry reuse. 
> - */ > - if (swapcache_prepare(entry, 1)) { > - /* Relax a bit to prevent rapid repeated page faults */ > - schedule_timeout_uninterruptible(1); > - goto out; > - } > - need_clear_cache = true; > - > /* skip swapcache */ > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > - vma, vmf->address, false); > + folio = alloc_swap_folio(vmf); > page = &folio->page; > if (folio) { > __folio_set_locked(folio); > __folio_set_swapbacked(folio); > > + nr_pages = folio_nr_pages(folio); > + if (folio_test_large(folio)) > + entry.val = ALIGN_DOWN(entry.val, nr_pages); > + /* > + * Prevent parallel swapin from proceeding with > + * the cache flag. Otherwise, another thread may > + * finish swapin first, free the entry, and swapout > + * reusing the same entry. It's undetectable as > + * pte_same() returns true due to entry reuse. > + */ > + if (swapcache_prepare(entry, nr_pages)) { > + /* Relax a bit to prevent rapid repeated page faults */ > + schedule_timeout_uninterruptible(1); > + goto out_page; > + } > + need_clear_cache = true; > + > if (mem_cgroup_swapin_charge_folio(folio, > vma->vm_mm, GFP_KERNEL, > entry)) { > ret = VM_FAULT_OOM; > goto out_page; > } After your patch, with build kernel test, I'm seeing kernel log spamming like this: [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF ............ And heavy performance loss with workloads limited by memcg, mTHP enabled. After some debugging, the problematic part is the mem_cgroup_swapin_charge_folio call above. When under pressure, cgroup charge fails easily for mTHP. One 64k swapin will require a much more aggressive reclaim to success. If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is gone and mTHP swapin should have a much higher swapin success rate. But this might not be the right way. For this particular issue, maybe you can change the charge order, try charging first, if successful, use mTHP. if failed, fallback to 4k? 
> - mem_cgroup_swapin_uncharge_swap(entry, 1); > + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); > > shadow = get_shadow_from_swap_cache(entry); > if (shadow) > @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > goto out_nomap; > } > > + /* allocated large folios for SWP_SYNCHRONOUS_IO */ > + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { > + unsigned long nr = folio_nr_pages(folio); > + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); > + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; > + pte_t *folio_ptep = vmf->pte - idx; > + > + if (!can_swapin_thp(vmf, folio_ptep, nr)) > + goto out_nomap; > + > + page_idx = idx; > + address = folio_start; > + ptep = folio_ptep; > + goto check_folio; > + } > + > nr_pages = 1; > page_idx = 0; > address = vmf->address; > @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_add_lru_vma(folio, vma); > } else if (!folio_test_anon(folio)) { > /* > - * We currently only expect small !anon folios, which are either > - * fully exclusive or fully shared. If we ever get large folios > - * here, we have to be careful. > + * We currently only expect small !anon folios which are either > + * fully exclusive or fully shared, or new allocated large folios > + * which are fully exclusive. If we ever get large folios within > + * swapcache here, we have to be careful. > */ > - VM_WARN_ON_ONCE(folio_test_large(folio)); > + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); > VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); > folio_add_new_anon_rmap(folio, vma, address, rmap_flags); > } else { > @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > out: > /* Clear the swap cache pin for direct swapin after PTL unlock */ > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > folio_put(swapcache); > } > if (need_clear_cache) > - swapcache_clear(si, entry, 1); > + swapcache_clear(si, entry, nr_pages); > if (si) > put_swap_device(si); > return ret; > -- > 2.34.1 > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 9:47 ` Kairui Song @ 2024-08-15 13:27 ` Kefeng Wang 2024-08-15 23:06 ` Barry Song 0 siblings, 1 reply; 59+ messages in thread From: Kefeng Wang @ 2024-08-15 13:27 UTC (permalink / raw) To: Kairui Song, Chuanhua Han, Barry Song Cc: akpm, linux-mm, baolin.wang, chrisl, david, hannes, hughd, kaleshsingh, linux-kernel, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed, hch On 2024/8/15 17:47, Kairui Song wrote: > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: >> >> From: Chuanhua Han <hanchuanhua@oppo.com> > > Hi Chuanhua, > >> ... >> + >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) >> +{ >> + struct vm_area_struct *vma = vmf->vma; >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + unsigned long orders; >> + struct folio *folio; >> + unsigned long addr; >> + swp_entry_t entry; >> + spinlock_t *ptl; >> + pte_t *pte; >> + gfp_t gfp; >> + int order; >> + >> + /* >> + * If uffd is active for the vma we need per-page fault fidelity to >> + * maintain the uffd semantics. >> + */ >> + if (unlikely(userfaultfd_armed(vma))) >> + goto fallback; >> + >> + /* >> + * A large swapped out folio could be partially or fully in zswap. We >> + * lack handling for such cases, so fallback to swapping in order-0 >> + * folio. >> + */ >> + if (!zswap_never_enabled()) >> + goto fallback; >> + >> + entry = pte_to_swp_entry(vmf->orig_pte); >> + /* >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled >> + * and suitable for swapping THP. >> + */ >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); >> + >> + if (!orders) >> + goto fallback; >> + >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); >> + if (unlikely(!pte)) >> + goto fallback; >> + >> + /* >> + * For do_swap_page, find the highest order where the aligned range is >> + * completely swap entries with contiguous swap offsets. >> + */ >> + order = highest_order(orders); >> + while (orders) { >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) >> + break; >> + order = next_order(&orders, order); >> + } >> + >> + pte_unmap_unlock(pte, ptl); >> + >> + /* Try allocating the highest of the remaining orders. */ >> + gfp = vma_thp_gfp_mask(vma); >> + while (orders) { >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + return folio; >> + order = next_order(&orders, order); >> + } >> + >> +fallback: >> +#endif >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); >> +} >> + >> + >> /* >> * We enter with non-exclusive mmap_lock (to exclude vma changes, >> * but allow concurrent faults), and pte mapped but not yet locked. >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> if (!folio) { >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && >> __swap_count(entry) == 1) { >> - /* >> - * Prevent parallel swapin from proceeding with >> - * the cache flag. Otherwise, another thread may >> - * finish swapin first, free the entry, and swapout >> - * reusing the same entry. 
It's undetectable as >> - * pte_same() returns true due to entry reuse. >> - */ >> - if (swapcache_prepare(entry, 1)) { >> - /* Relax a bit to prevent rapid repeated page faults */ >> - schedule_timeout_uninterruptible(1); >> - goto out; >> - } >> - need_clear_cache = true; >> - >> /* skip swapcache */ >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, >> - vma, vmf->address, false); >> + folio = alloc_swap_folio(vmf); >> page = &folio->page; >> if (folio) { >> __folio_set_locked(folio); >> __folio_set_swapbacked(folio); >> >> + nr_pages = folio_nr_pages(folio); >> + if (folio_test_large(folio)) >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); >> + /* >> + * Prevent parallel swapin from proceeding with >> + * the cache flag. Otherwise, another thread may >> + * finish swapin first, free the entry, and swapout >> + * reusing the same entry. It's undetectable as >> + * pte_same() returns true due to entry reuse. >> + */ >> + if (swapcache_prepare(entry, nr_pages)) { >> + /* Relax a bit to prevent rapid repeated page faults */ >> + schedule_timeout_uninterruptible(1); >> + goto out_page; >> + } >> + need_clear_cache = true; >> + >> if (mem_cgroup_swapin_charge_folio(folio, >> vma->vm_mm, GFP_KERNEL, >> entry)) { >> ret = VM_FAULT_OOM; >> goto out_page; >> } > > After your patch, with build kernel test, I'm seeing kernel log > spamming like this: > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > ............ > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > After some debugging, the problematic part is the > mem_cgroup_swapin_charge_folio call above. > When under pressure, cgroup charge fails easily for mTHP. One 64k > swapin will require a much more aggressive reclaim to success. > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > gone and mTHP swapin should have a much higher swapin success rate. > But this might not be the right way. > > For this particular issue, maybe you can change the charge order, try > charging first, if successful, use mTHP. if failed, fallback to 4k? 
This is what we did in alloc_anon_folio(), see 085ff35e7636 ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), 1) fallback earlier 2) using same GFP flags for allocation and charge but it seems that there is a little complicated for swapin charge > >> - mem_cgroup_swapin_uncharge_swap(entry, 1); >> + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); >> >> shadow = get_shadow_from_swap_cache(entry); >> if (shadow) >> @@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> goto out_nomap; >> } >> >> + /* allocated large folios for SWP_SYNCHRONOUS_IO */ >> + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { >> + unsigned long nr = folio_nr_pages(folio); >> + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); >> + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; >> + pte_t *folio_ptep = vmf->pte - idx; >> + >> + if (!can_swapin_thp(vmf, folio_ptep, nr)) >> + goto out_nomap; >> + >> + page_idx = idx; >> + address = folio_start; >> + ptep = folio_ptep; >> + goto check_folio; >> + } >> + >> nr_pages = 1; >> page_idx = 0; >> address = vmf->address; >> @@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> folio_add_lru_vma(folio, vma); >> } else if (!folio_test_anon(folio)) { >> /* >> - * We currently only expect small !anon folios, which are either >> - * fully exclusive or fully shared. If we ever get large folios >> - * here, we have to be careful. >> + * We currently only expect small !anon folios which are either >> + * fully exclusive or fully shared, or new allocated large folios >> + * which are fully exclusive. If we ever get large folios within >> + * swapcache here, we have to be careful. >> */ >> - VM_WARN_ON_ONCE(folio_test_large(folio)); >> + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); >> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); >> folio_add_new_anon_rmap(folio, vma, address, rmap_flags); >> } else { >> @@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> out: >> /* Clear the swap cache pin for direct swapin after PTL unlock */ >> if (need_clear_cache) >> - swapcache_clear(si, entry, 1); >> + swapcache_clear(si, entry, nr_pages); >> if (si) >> put_swap_device(si); >> return ret; >> @@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> folio_put(swapcache); >> } >> if (need_clear_cache) >> - swapcache_clear(si, entry, 1); >> + swapcache_clear(si, entry, nr_pages); >> if (si) >> put_swap_device(si); >> return ret; >> -- >> 2.34.1 >> >> > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 13:27 ` Kefeng Wang @ 2024-08-15 23:06 ` Barry Song 2024-08-16 16:50 ` Kairui Song 2024-08-16 21:16 ` Matthew Wilcox 0 siblings, 2 replies; 59+ messages in thread From: Barry Song @ 2024-08-15 23:06 UTC (permalink / raw) To: wangkefeng.wang Cc: akpm, baolin.wang, chrisl, david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > On 2024/8/15 17:47, Kairui Song wrote: > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > >> > >> From: Chuanhua Han <hanchuanhua@oppo.com> > > > > Hi Chuanhua, > > > >> > ... > > >> + > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > >> +{ > >> + struct vm_area_struct *vma = vmf->vma; > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >> + unsigned long orders; > >> + struct folio *folio; > >> + unsigned long addr; > >> + swp_entry_t entry; > >> + spinlock_t *ptl; > >> + pte_t *pte; > >> + gfp_t gfp; > >> + int order; > >> + > >> + /* > >> + * If uffd is active for the vma we need per-page fault fidelity to > >> + * maintain the uffd semantics. > >> + */ > >> + if (unlikely(userfaultfd_armed(vma))) > >> + goto fallback; > >> + > >> + /* > >> + * A large swapped out folio could be partially or fully in zswap. We > >> + * lack handling for such cases, so fallback to swapping in order-0 > >> + * folio. > >> + */ > >> + if (!zswap_never_enabled()) > >> + goto fallback; > >> + > >> + entry = pte_to_swp_entry(vmf->orig_pte); > >> + /* > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled > >> + * and suitable for swapping THP. > >> + */ > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > >> + > >> + if (!orders) > >> + goto fallback; > >> + > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > >> + if (unlikely(!pte)) > >> + goto fallback; > >> + > >> + /* > >> + * For do_swap_page, find the highest order where the aligned range is > >> + * completely swap entries with contiguous swap offsets. > >> + */ > >> + order = highest_order(orders); > >> + while (orders) { > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > >> + break; > >> + order = next_order(&orders, order); > >> + } > >> + > >> + pte_unmap_unlock(pte, ptl); > >> + > >> + /* Try allocating the highest of the remaining orders. */ > >> + gfp = vma_thp_gfp_mask(vma); > >> + while (orders) { > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > >> + if (folio) > >> + return folio; > >> + order = next_order(&orders, order); > >> + } > >> + > >> +fallback: > >> +#endif > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > >> +} > >> + > >> + > >> /* > >> * We enter with non-exclusive mmap_lock (to exclude vma changes, > >> * but allow concurrent faults), and pte mapped but not yet locked. 
> >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> if (!folio) { > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > >> __swap_count(entry) == 1) { > >> - /* > >> - * Prevent parallel swapin from proceeding with > >> - * the cache flag. Otherwise, another thread may > >> - * finish swapin first, free the entry, and swapout > >> - * reusing the same entry. It's undetectable as > >> - * pte_same() returns true due to entry reuse. > >> - */ > >> - if (swapcache_prepare(entry, 1)) { > >> - /* Relax a bit to prevent rapid repeated page faults */ > >> - schedule_timeout_uninterruptible(1); > >> - goto out; > >> - } > >> - need_clear_cache = true; > >> - > >> /* skip swapcache */ > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > >> - vma, vmf->address, false); > >> + folio = alloc_swap_folio(vmf); > >> page = &folio->page; > >> if (folio) { > >> __folio_set_locked(folio); > >> __folio_set_swapbacked(folio); > >> > >> + nr_pages = folio_nr_pages(folio); > >> + if (folio_test_large(folio)) > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); > >> + /* > >> + * Prevent parallel swapin from proceeding with > >> + * the cache flag. Otherwise, another thread may > >> + * finish swapin first, free the entry, and swapout > >> + * reusing the same entry. It's undetectable as > >> + * pte_same() returns true due to entry reuse. > >> + */ > >> + if (swapcache_prepare(entry, nr_pages)) { > >> + /* Relax a bit to prevent rapid repeated page faults */ > >> + schedule_timeout_uninterruptible(1); > >> + goto out_page; > >> + } > >> + need_clear_cache = true; > >> + > >> if (mem_cgroup_swapin_charge_folio(folio, > >> vma->vm_mm, GFP_KERNEL, > >> entry)) { > >> ret = VM_FAULT_OOM; > >> goto out_page; > >> } > > > > After your patch, with build kernel test, I'm seeing kernel log > > spamming like this: > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > ............ > > > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > > > After some debugging, the problematic part is the > > mem_cgroup_swapin_charge_folio call above. > > When under pressure, cgroup charge fails easily for mTHP. One 64k > > swapin will require a much more aggressive reclaim to success. > > > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > > gone and mTHP swapin should have a much higher swapin success rate. > > But this might not be the right way. > > > > For this particular issue, maybe you can change the charge order, try > > charging first, if successful, use mTHP. if failed, fallback to 4k? > > This is what we did in alloc_anon_folio(), see 085ff35e7636 > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), > 1) fallback earlier > 2) using same GFP flags for allocation and charge > > but it seems that there is a little complicated for swapin charge Kefeng, thanks! 
I guess we can continue using the same approach and it's not too
complicated.

Kairui, sorry for the trouble and thanks for the report! Could you
check if the solution below resolves the issue? On phones, we don't
encounter the scenarios you're facing.

From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Fri, 16 Aug 2024 10:51:48 +1200
Subject: [PATCH] mm: fallback to next_order if charging mTHP fails

When memcg approaches its limit, charging mTHP becomes difficult.
At this point, when the charge fails, we fall back to the next order
to avoid repeatedly retrying larger orders.

Reported-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ed3603aaf31..6cba28ef91e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
-		if (folio)
-			return folio;
+		if (folio) {
+			if (!mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, gfp, entry))
+				return folio;
+			folio_put(folio);
+		}
 		order = next_order(&orders, order);
 	}
 
@@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		}
 		need_clear_cache = true;
 
-		if (mem_cgroup_swapin_charge_folio(folio,
+		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
 					vma->vm_mm, GFP_KERNEL,
 					entry)) {
 			ret = VM_FAULT_OOM;
-- 
2.34.1

Thanks
Barry
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices 2024-08-15 23:06 ` Barry Song @ 2024-08-16 16:50 ` Kairui Song 2024-08-16 20:34 ` Andrew Morton 2024-08-16 21:16 ` Matthew Wilcox 1 sibling, 1 reply; 59+ messages in thread From: Kairui Song @ 2024-08-16 16:50 UTC (permalink / raw) To: Barry Song Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky, shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed On Fri, Aug 16, 2024 at 7:06 AM Barry Song <21cnbao@gmail.com> wrote: > > On Fri, Aug 16, 2024 at 1:27 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > > > > > On 2024/8/15 17:47, Kairui Song wrote: > > > On Fri, Aug 2, 2024 at 8:21 PM Barry Song <21cnbao@gmail.com> wrote: > > >> > > >> From: Chuanhua Han <hanchuanhua@oppo.com> > > > > > > Hi Chuanhua, > > > > > >> > > ... > > > > >> + > > >> +static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > >> +{ > > >> + struct vm_area_struct *vma = vmf->vma; > > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > >> + unsigned long orders; > > >> + struct folio *folio; > > >> + unsigned long addr; > > >> + swp_entry_t entry; > > >> + spinlock_t *ptl; > > >> + pte_t *pte; > > >> + gfp_t gfp; > > >> + int order; > > >> + > > >> + /* > > >> + * If uffd is active for the vma we need per-page fault fidelity to > > >> + * maintain the uffd semantics. > > >> + */ > > >> + if (unlikely(userfaultfd_armed(vma))) > > >> + goto fallback; > > >> + > > >> + /* > > >> + * A large swapped out folio could be partially or fully in zswap. We > > >> + * lack handling for such cases, so fallback to swapping in order-0 > > >> + * folio. > > >> + */ > > >> + if (!zswap_never_enabled()) > > >> + goto fallback; > > >> + > > >> + entry = pte_to_swp_entry(vmf->orig_pte); > > >> + /* > > >> + * Get a list of all the (large) orders below PMD_ORDER that are enabled > > >> + * and suitable for swapping THP. > > >> + */ > > >> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, > > >> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); > > >> + orders = thp_vma_suitable_orders(vma, vmf->address, orders); > > >> + orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); > > >> + > > >> + if (!orders) > > >> + goto fallback; > > >> + > > >> + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); > > >> + if (unlikely(!pte)) > > >> + goto fallback; > > >> + > > >> + /* > > >> + * For do_swap_page, find the highest order where the aligned range is > > >> + * completely swap entries with contiguous swap offsets. > > >> + */ > > >> + order = highest_order(orders); > > >> + while (orders) { > > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > >> + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) > > >> + break; > > >> + order = next_order(&orders, order); > > >> + } > > >> + > > >> + pte_unmap_unlock(pte, ptl); > > >> + > > >> + /* Try allocating the highest of the remaining orders. 
*/ > > >> + gfp = vma_thp_gfp_mask(vma); > > >> + while (orders) { > > >> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > > >> + if (folio) > > >> + return folio; > > >> + order = next_order(&orders, order); > > >> + } > > >> + > > >> +fallback: > > >> +#endif > > >> + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); > > >> +} > > >> + > > >> + > > >> /* > > >> * We enter with non-exclusive mmap_lock (to exclude vma changes, > > >> * but allow concurrent faults), and pte mapped but not yet locked. > > >> @@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > >> if (!folio) { > > >> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && > > >> __swap_count(entry) == 1) { > > >> - /* > > >> - * Prevent parallel swapin from proceeding with > > >> - * the cache flag. Otherwise, another thread may > > >> - * finish swapin first, free the entry, and swapout > > >> - * reusing the same entry. It's undetectable as > > >> - * pte_same() returns true due to entry reuse. > > >> - */ > > >> - if (swapcache_prepare(entry, 1)) { > > >> - /* Relax a bit to prevent rapid repeated page faults */ > > >> - schedule_timeout_uninterruptible(1); > > >> - goto out; > > >> - } > > >> - need_clear_cache = true; > > >> - > > >> /* skip swapcache */ > > >> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, > > >> - vma, vmf->address, false); > > >> + folio = alloc_swap_folio(vmf); > > >> page = &folio->page; > > >> if (folio) { > > >> __folio_set_locked(folio); > > >> __folio_set_swapbacked(folio); > > >> > > >> + nr_pages = folio_nr_pages(folio); > > >> + if (folio_test_large(folio)) > > >> + entry.val = ALIGN_DOWN(entry.val, nr_pages); > > >> + /* > > >> + * Prevent parallel swapin from proceeding with > > >> + * the cache flag. Otherwise, another thread may > > >> + * finish swapin first, free the entry, and swapout > > >> + * reusing the same entry. It's undetectable as > > >> + * pte_same() returns true due to entry reuse. > > >> + */ > > >> + if (swapcache_prepare(entry, nr_pages)) { > > >> + /* Relax a bit to prevent rapid repeated page faults */ > > >> + schedule_timeout_uninterruptible(1); > > >> + goto out_page; > > >> + } > > >> + need_clear_cache = true; > > >> + > > >> if (mem_cgroup_swapin_charge_folio(folio, > > >> vma->vm_mm, GFP_KERNEL, > > >> entry)) { > > >> ret = VM_FAULT_OOM; > > >> goto out_page; > > >> } > > > > > > After your patch, with build kernel test, I'm seeing kernel log > > > spamming like this: > > > [ 101.048594] pagefault_out_of_memory: 95 callbacks suppressed > > > [ 101.048599] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.059416] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.118575] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.125585] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.182501] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.215351] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.272822] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > [ 101.403195] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF > > > ............ > > > > > > And heavy performance loss with workloads limited by memcg, mTHP enabled. > > > > > > After some debugging, the problematic part is the > > > mem_cgroup_swapin_charge_folio call above. > > > When under pressure, cgroup charge fails easily for mTHP. 
One 64k > > > swapin will require a much more aggressive reclaim to success. > > > > > > If I change MAX_RECLAIM_RETRIES from 16 to 512, the spamming log is > > > gone and mTHP swapin should have a much higher swapin success rate. > > > But this might not be the right way. > > > > > > For this particular issue, maybe you can change the charge order, try > > > charging first, if successful, use mTHP. if failed, fallback to 4k? > > > > This is what we did in alloc_anon_folio(), see 085ff35e7636 > > ("mm: memory: move mem_cgroup_charge() into alloc_anon_folio()"), > > 1) fallback earlier > > 2) using same GFP flags for allocation and charge > > > > but it seems that there is a little complicated for swapin charge > > Kefeng, thanks! I guess we can continue using the same approach and > it's not too complicated. > > Kairui, sorry for the trouble and thanks for the report! could you > check if the solution below resolves the issue? On phones, we don't > encounter the scenarios you’re facing. > > From 2daaf91077705a8fa26a3a428117f158f05375b0 Mon Sep 17 00:00:00 2001 > From: Barry Song <v-songbaohua@oppo.com> > Date: Fri, 16 Aug 2024 10:51:48 +1200 > Subject: [PATCH] mm: fallback to next_order if charing mTHP fails > > When memcg approaches its limit, charging mTHP becomes difficult. > At this point, when the charge fails, we fallback to the next order > to avoid repeatedly retrying larger orders. > > Reported-by: Kairui Song <ryncsn@gmail.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 0ed3603aaf31..6cba28ef91e7 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4121,8 +4121,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > while (orders) { > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > folio = vma_alloc_folio(gfp, order, vma, addr, true); > - if (folio) > - return folio; > + if (folio) { > + if (!mem_cgroup_swapin_charge_folio(folio, > + vma->vm_mm, gfp, entry)) > + return folio; > + folio_put(folio); > + } > order = next_order(&orders, order); > } > > @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > } > need_clear_cache = true; > > - if (mem_cgroup_swapin_charge_folio(folio, > + if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio, > vma->vm_mm, GFP_KERNEL, > entry)) { > ret = VM_FAULT_OOM; > -- > 2.34.1 > Hi Barry After the fix the spamming log is gone, thanks for the fix. > > Thanks > Barry > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 16:50 ` Kairui Song
@ 2024-08-16 20:34 ` Andrew Morton
  2024-08-27  3:41 ` Chuanhua Han
  0 siblings, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2024-08-16 20:34 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, wangkefeng.wang, baolin.wang, chrisl, david,
    hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel,
    linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang,
    ying.huang, yosryahmed

On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> > --
> > 2.34.1
> >
>
> Hi Barry
>
> After the fix the spamming log is gone, thanks for the fix.
>

Thanks, I'll drop the v6 series.
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 20:34 ` Andrew Morton
@ 2024-08-27  3:41 ` Chuanhua Han
  0 siblings, 0 replies; 59+ messages in thread
From: Chuanhua Han @ 2024-08-27 3:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Barry Song, wangkefeng.wang, baolin.wang, chrisl,
    david, hanchuanhua, hannes, hch, hughd, kaleshsingh, linux-kernel,
    linux-mm, mhocko, minchan, nphamcs, ryan.roberts, senozhatsky,
    shakeel.butt, shy828301, surenb, v-songbaohua, willy, xiang,
    ying.huang, yosryahmed

On Sat, Aug 17, 2024 at 04:35, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 17 Aug 2024 00:50:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > > --
> > > 2.34.1
> > >
> >
> > Hi Barry
> >
> > After the fix the spamming log is gone, thanks for the fix.
> >
>
> Thanks, I'll drop the v6 series.

Hi, Andrew

Can you please queue v7 for testing:
https://lore.kernel.org/linux-mm/20240821074541.516249-1-hanchuanhua@oppo.com/

V7 has addressed all comments regarding the changelog, the subject and
the order-0 charge from Christoph, Kairui and Willy.

>
--
Thanks,
Chuanhua
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-15 23:06 ` Barry Song
  2024-08-16 16:50 ` Kairui Song
@ 2024-08-16 21:16 ` Matthew Wilcox
  2024-08-16 21:39 ` Barry Song
  1 sibling, 1 reply; 59+ messages in thread
From: Matthew Wilcox @ 2024-08-16 21:16 UTC (permalink / raw)
To: Barry Song
Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua,
    hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko,
    minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
    shy828301, surenb, v-songbaohua, xiang, ying.huang, yosryahmed

On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> When memcg approaches its limit, charging mTHP becomes difficult.
> At this point, when the charge fails, we fallback to the next order
> to avoid repeatedly retrying larger orders.

Why do you always find the ugliest possible solution to a problem?

> @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> 		}
> 		need_clear_cache = true;
>
> -		if (mem_cgroup_swapin_charge_folio(folio,
> +		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> 					vma->vm_mm, GFP_KERNEL,
> 					entry)) {
> 			ret = VM_FAULT_OOM;

Just make alloc_swap_folio() always charge the folio, even for order-0.

And you'll have to uncharge it in the swapcache_prepare() failure case.
* Re: [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices
  2024-08-16 21:16 ` Matthew Wilcox
@ 2024-08-16 21:39 ` Barry Song
  0 siblings, 0 replies; 59+ messages in thread
From: Barry Song @ 2024-08-16 21:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: wangkefeng.wang, akpm, baolin.wang, chrisl, david, hanchuanhua,
    hannes, hch, hughd, kaleshsingh, linux-kernel, linux-mm, mhocko,
    minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
    shy828301, surenb, v-songbaohua, xiang, ying.huang, yosryahmed

On Sat, Aug 17, 2024 at 9:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Aug 16, 2024 at 11:06:12AM +1200, Barry Song wrote:
> > When memcg approaches its limit, charging mTHP becomes difficult.
> > At this point, when the charge fails, we fallback to the next order
> > to avoid repeatedly retrying larger orders.
>
> Why do you always find the ugliest possible solution to a problem?
>

I had definitely thought about charging order-0 as well in
alloc_swap_folio() when sending this quick fix, which was mainly for
quick verification that it can fix the problem. v7 will definitely
charge order-0 in alloc_swap_folio().

> > @@ -4244,7 +4248,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > 		}
> > 		need_clear_cache = true;
> >
> > -		if (mem_cgroup_swapin_charge_folio(folio,
> > +		if (nr_pages == 1 && mem_cgroup_swapin_charge_folio(folio,
> > 					vma->vm_mm, GFP_KERNEL,
> > 					entry)) {
> > 			ret = VM_FAULT_OOM;
>
> Just make alloc_swap_folio() always charge the folio, even for order-0.
>
> And you'll have to uncharge it in the swapcache_prepare() failure case.

I suppose this is done by folio_put() automatically.

Thanks
Barry
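[Editor's note: a hedged sketch of the direction agreed above for v7 — alloc_swap_folio() charges every folio it returns, including the order-0 fallback, so do_swap_page() no longer needs its own mem_cgroup_swapin_charge_folio() call. The structure follows the v6 code quoted earlier in this thread, not the final upstream patch, and the mTHP branch is elided.]

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
	struct folio *folio;

	/*
	 * mTHP path: pick an order, allocate and charge with the same gfp,
	 * falling back to smaller orders on failure (see the earlier sketch).
	 */

	/* the order-0 fallback is charged here as well */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
	if (folio && mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
						    GFP_KERNEL, entry)) {
		folio_put(folio);
		return NULL;
	}
	return folio;
}

[With this shape, the swapcache_prepare() failure path needs no explicit uncharge: folio_put() drops the last reference and the memcg charge is released when the folio is freed, which is the point Barry makes above.]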
Thread overview: 59+ messages (newest: 2024-08-27 3:41 UTC)
2024-07-26 9:46 [PATCH v5 0/4] mm: support mTHP swap-in for zRAM-like swapfile Barry Song
2024-07-26 9:46 ` [PATCH v5 1/4] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
2024-07-30 3:00 ` Baolin Wang
2024-07-30 3:11 ` Matthew Wilcox
2024-07-30 3:15 ` Barry Song
2024-07-26 9:46 ` [PATCH v5 2/4] mm: Introduce mem_cgroup_swapin_uncharge_swap_nr() helper " Barry Song
2024-07-26 16:30 ` Yosry Ahmed
2024-07-29 2:02 ` Barry Song
2024-07-29 3:43 ` Matthew Wilcox
2024-07-29 4:52 ` Barry Song
2024-07-26 9:46 ` [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song
2024-07-29 3:51 ` Matthew Wilcox
2024-07-29 4:41 ` Barry Song
[not found] ` <CAGsJ_4wxUZAysyg3cCVnHhOFt5SbyAMUfq3tJcX-Wb6D4BiBhA@mail.gmail.com>
2024-07-29 12:49 ` Matthew Wilcox
2024-07-29 13:11 ` Barry Song
2024-07-29 15:13 ` Matthew Wilcox
2024-07-29 20:03 ` Barry Song
2024-07-29 21:56 ` Barry Song
2024-07-30 8:12 ` Ryan Roberts
2024-07-29 6:36 ` Chuanhua Han
2024-07-29 12:55 ` Matthew Wilcox
2024-07-29 13:18 ` Barry Song
2024-07-29 13:32 ` Chuanhua Han
2024-07-29 14:16 ` Dan Carpenter
2024-07-26 9:46 ` [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy Barry Song
2024-07-27 5:58 ` kernel test robot
2024-07-29 1:37 ` Barry Song
2024-07-29 3:52 ` Matthew Wilcox
2024-07-29 4:49 ` Barry Song
2024-07-29 16:11 ` Christoph Hellwig
2024-07-29 20:11 ` Barry Song
2024-07-30 16:30 ` Christoph Hellwig
2024-07-30 19:28 ` Nhat Pham
2024-07-30 21:06 ` Barry Song
2024-07-31 18:35 ` Nhat Pham
2024-08-01 3:00 ` Sergey Senozhatsky
2024-08-01 20:55 ` Chris Li
2024-08-12 8:27 ` Christoph Hellwig
2024-08-12 8:44 ` Barry Song
2024-07-30 2:27 ` Chuanhua Han
2024-07-30 8:36 ` Ryan Roberts
2024-07-30 8:47 ` David Hildenbrand
2024-08-05 6:10 ` Huang, Ying
2024-08-02 12:20 ` [PATCH v6 0/2] mm: Ignite large folios swap-in support Barry Song
2024-08-02 12:20 ` [PATCH v6 1/2] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios Barry Song
2024-08-02 17:29 ` Chris Li
2024-08-02 12:20 ` [PATCH v6 2/2] mm: support large folios swap-in for zRAM-like devices Barry Song
2024-08-03 19:08 ` Andrew Morton
2024-08-12 8:26 ` Christoph Hellwig
2024-08-12 8:53 ` Barry Song
2024-08-12 11:38 ` Christoph Hellwig
2024-08-15 9:47 ` Kairui Song
2024-08-15 13:27 ` Kefeng Wang
2024-08-15 23:06 ` Barry Song
2024-08-16 16:50 ` Kairui Song
2024-08-16 20:34 ` Andrew Morton
2024-08-27 3:41 ` Chuanhua Han
2024-08-16 21:16 ` Matthew Wilcox
2024-08-16 21:39 ` Barry Song