* [PATCH RFC v4 1/2] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
2024-06-29 11:10 [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile Barry Song
@ 2024-06-29 11:10 ` Barry Song
2024-06-29 11:10 ` [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song
2024-07-03 6:31 ` [PATCH RFC v4 0/2] mm: support mTHP swap-in " Huang, Ying
2 siblings, 0 replies; 10+ messages in thread
From: Barry Song @ 2024-06-29 11:10 UTC (permalink / raw)
To: akpm, linux-mm
Cc: chrisl, david, hannes, kasong, linux-kernel, mhocko, nphamcs,
ryan.roberts, shy828301, surenb, kaleshsingh, hughd, v-songbaohua,
willy, xiang, ying.huang, yosryahmed, baolin.wang, shakeel.butt,
senozhatsky, minchan
From: Barry Song <v-songbaohua@oppo.com>
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports
one entry only. To support large folio swap-in, we need to handle multiple
swap entries at once, so introduce swapcache_prepare_nr() and
swapcache_clear_nr(), which operate on nr contiguous swap entries.
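For context, here is a rough sketch (not part of this patch) of how a
SWP_SYNCHRONOUS_IO swap-in path is expected to use the batched helpers.
example_swapin_large() is a hypothetical caller; locking, the actual read
from the swap device and the PTE mapping are elided:

	/*
	 * Illustrative only: pin nr contiguous swap entries before reading
	 * a large folio outside the swapcache, then drop the pins when done.
	 */
	static int example_swapin_large(struct swap_info_struct *si,
					struct folio *folio, swp_entry_t entry)
	{
		int nr = folio_nr_pages(folio);

		/* entry must be the first of nr contiguous, aligned swap slots */
		if (swapcache_prepare_nr(entry, nr))
			return -EEXIST;	/* another thread is swapping these in */

		/* ... read/decompress the folio, map the PTEs ... */

		swapcache_clear_nr(si, entry, nr);	/* release SWAP_HAS_CACHE */
		return 0;
	}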
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
include/linux/swap.h | 4 +-
mm/swap.h | 4 +-
mm/swapfile.c | 114 +++++++++++++++++++++++++------------------
3 files changed, 70 insertions(+), 52 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e473fe6cfb7a..c0f4f2073ca6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,7 +481,7 @@ extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t);
+extern int swapcache_prepare_nr(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -555,7 +555,7 @@ static inline int swap_duplicate(swp_entry_t swp)
return 0;
}
-static inline int swapcache_prepare(swp_entry_t swp)
+static inline int swapcache_prepare_nr(swp_entry_t swp, int nr)
{
return 0;
}
diff --git a/mm/swap.h b/mm/swap.h
index baa1fa946b34..b96b1157441f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -59,7 +59,7 @@ void __delete_from_swap_cache(struct folio *folio,
void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *swap_cache_get_folio(swp_entry_t entry,
struct vm_area_struct *vma, unsigned long addr);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -120,7 +120,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
return 0;
}
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+static inline void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f7224bc1320c..8f60dd10fdef 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1352,7 +1352,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
}
static void cluster_swap_free_nr(struct swap_info_struct *sis,
- unsigned long offset, int nr_pages)
+ unsigned long offset, int nr_pages,
+ unsigned char usage)
{
struct swap_cluster_info *ci;
DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
@@ -1362,7 +1363,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis,
while (nr_pages) {
nr = min(BITS_PER_LONG, nr_pages);
for (i = 0; i < nr; i++) {
- if (!__swap_entry_free_locked(sis, offset + i, 1))
+ if (!__swap_entry_free_locked(sis, offset + i, usage))
bitmap_set(to_free, i, 1);
}
if (!bitmap_empty(to_free, BITS_PER_LONG)) {
@@ -1396,7 +1397,7 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
while (nr_pages) {
nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- cluster_swap_free_nr(sis, offset, nr);
+ cluster_swap_free_nr(sis, offset, nr, 1);
offset += nr;
nr_pages -= nr;
}
@@ -3382,7 +3383,7 @@ void si_swapinfo(struct sysinfo *val)
}
/*
- * Verify that a swap entry is valid and increment its swap map count.
+ * Verify that nr swap entries are valid and increment their swap map counts.
*
* Returns error code in following case.
* - success -> 0
@@ -3392,66 +3393,88 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_nr(swp_entry_t entry, unsigned char usage, int nr)
{
struct swap_info_struct *p;
struct swap_cluster_info *ci;
unsigned long offset;
unsigned char count;
unsigned char has_cache;
- int err;
+ int err, i;
p = swp_swap_info(entry);
offset = swp_offset(entry);
+ VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
ci = lock_cluster_or_swap_info(p, offset);
- count = p->swap_map[offset];
+ err = 0;
+ for (i = 0; i < nr; i++) {
+ count = p->swap_map[offset + i];
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
+ /*
+ * swapin_readahead() doesn't check if a swap entry is valid, so the
+ * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ */
+ if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
+ err = -ENOENT;
+ goto unlock_out;
+ }
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
- err = 0;
+ has_cache = count & SWAP_HAS_CACHE;
+ count &= ~SWAP_HAS_CACHE;
- if (usage == SWAP_HAS_CACHE) {
+ if (usage == SWAP_HAS_CACHE) {
+ /* set SWAP_HAS_CACHE if there is no cache and entry is used */
+ if (!has_cache && count)
+ continue;
+ else if (has_cache) /* someone else added cache */
+ err = -EEXIST;
+ else /* no users remaining */
+ err = -ENOENT;
- /* set SWAP_HAS_CACHE if there is no cache and entry is used */
- if (!has_cache && count)
- has_cache = SWAP_HAS_CACHE;
- else if (has_cache) /* someone else added cache */
- err = -EEXIST;
- else /* no users remaining */
- err = -ENOENT;
+ } else if (count || has_cache) {
- } else if (count || has_cache) {
+ if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ continue;
+ else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
+ err = -EINVAL;
+ else if (swap_count_continued(p, offset + i, count))
+ continue;
+ else
+ err = -ENOMEM;
+ } else
+ err = -ENOENT; /* unused swap entry */
- if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ if (err)
+ goto unlock_out;
+ }
+
+ for (i = 0; i < nr; i++) {
+ count = p->swap_map[offset + i];
+ has_cache = count & SWAP_HAS_CACHE;
+ count &= ~SWAP_HAS_CACHE;
+
+ if (usage == SWAP_HAS_CACHE)
+ has_cache = SWAP_HAS_CACHE;
+ else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
count += usage;
- else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
- err = -EINVAL;
- else if (swap_count_continued(p, offset, count))
- count = COUNT_CONTINUED;
else
- err = -ENOMEM;
- } else
- err = -ENOENT; /* unused swap entry */
+ count = COUNT_CONTINUED;
- if (!err)
- WRITE_ONCE(p->swap_map[offset], count | has_cache);
+ WRITE_ONCE(p->swap_map[offset + i], count | has_cache);
+ }
unlock_out:
unlock_cluster_or_swap_info(p, ci);
return err;
}
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+ return __swap_duplicate_nr(entry, usage, 1);
+}
+
/*
* Help swapoff by noting that swap entry belongs to shmem/tmpfs
* (in which case its reference count is never incremented).
@@ -3485,22 +3508,17 @@ int swap_duplicate(swp_entry_t entry)
* -EEXIST means there is a swap cache.
* Note: return code is different from swap_duplicate().
*/
-int swapcache_prepare(swp_entry_t entry)
+int swapcache_prepare_nr(swp_entry_t entry, int nr)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE);
+ return __swap_duplicate_nr(entry, SWAP_HAS_CACHE, nr);
}
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
- unsigned char usage;
+ pgoff_t offset = swp_offset(entry);
- ci = lock_cluster_or_swap_info(si, offset);
- usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
- unlock_cluster_or_swap_info(si, ci);
- if (!usage)
- free_swap_slot(entry);
+ VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE);
}
struct swap_info_struct *swp_swap_info(swp_entry_t entry)
--
2.34.1
* [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-06-29 11:10 [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile Barry Song
2024-06-29 11:10 ` [PATCH RFC v4 1/2] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
@ 2024-06-29 11:10 ` Barry Song
2024-07-01 13:52 ` Yosry Ahmed
2024-07-03 6:31 ` [PATCH RFC v4 0/2] mm: support mTHP swap-in " Huang, Ying
2 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2024-06-29 11:10 UTC (permalink / raw)
To: akpm, linux-mm
Cc: chrisl, david, hannes, kasong, linux-kernel, mhocko, nphamcs,
ryan.roberts, shy828301, surenb, kaleshsingh, hughd, v-songbaohua,
willy, xiang, ying.huang, yosryahmed, baolin.wang, shakeel.butt,
senozhatsky, minchan, Chuanhua Han
From: Chuanhua Han <hanchuanhua@oppo.com>
In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app
is switched to the background, most of its memory might be swapped out.
Currently, we have mTHP features, but unfortunately, without support
for large folio swap-ins, once those large folios are swapped out,
we lose them immediately because mTHP is a one-way ticket.
This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swaps that were likely swapped out from mTHP as
a whole.
Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions
of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.
It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in.
2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
without fragmentation.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
and enhancing compression ratios significantly.
Having deployed this on millions of actual products, we haven't observed
any noticeable increase in memory footprint for 64KiB mTHP based on
CONT-PTE on ARM64.
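In outline, the new SWP_SYNCHRONOUS_IO path in do_swap_page() looks
roughly like the sketch below. This is only a simplified illustration
(outline_large_swapin() is not a real function); the actual locking,
PTE recheck and error handling are in the diff that follows:

	static vm_fault_t outline_large_swapin(struct vm_fault *vmf,
					       struct swap_info_struct *si,
					       swp_entry_t entry)
	{
		/* try the enabled mTHP orders first, else fall back to order-0 */
		struct folio *folio = alloc_swap_folio(vmf);
		int nr_pages;

		if (!folio)
			return VM_FAULT_OOM;

		nr_pages = folio_nr_pages(folio);
		if (folio_test_large(folio))
			/* start from the first swap entry backing the folio */
			entry.val = ALIGN_DOWN(entry.val, nr_pages);

		/* pin all nr_pages entries so a parallel swapin cannot race */
		if (swapcache_prepare_nr(entry, nr_pages)) {
			schedule_timeout_uninterruptible(1);
			return 0;	/* back off; the fault will be retried */
		}

		/* read/decompress the whole folio, recheck the PTEs under the
		 * PTL with can_swapin_thp(), then map nr_pages PTEs at once */

		swapcache_clear_nr(si, entry, nr_pages);	/* drop the pins */
		return 0;
	}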
Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
include/linux/zswap.h | 2 +-
mm/memory.c | 210 +++++++++++++++++++++++++++++++++++-------
mm/swap_state.c | 2 +-
3 files changed, 181 insertions(+), 33 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index bf83ae5e285d..6cecb4a4f68b 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -68,7 +68,7 @@ static inline bool zswap_is_enabled(void)
static inline bool zswap_never_enabled(void)
{
- return false;
+ return true;
}
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 0a769f34bbb2..41ec7b919c2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3987,6 +3987,141 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}
+/*
+ * Check whether a range of PTEs consists entirely of swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE state.
+ * ptep must be the first PTE in the range.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+ struct swap_info_struct *si;
+ unsigned long addr;
+ swp_entry_t entry;
+ pgoff_t offset;
+ char has_cache;
+ int idx, i;
+ pte_t pte;
+
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+ idx = (vmf->address - addr) / PAGE_SIZE;
+ pte = ptep_get(ptep);
+
+ if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+ return false;
+ entry = pte_to_swp_entry(pte);
+ offset = swp_offset(entry);
+ if (!IS_ALIGNED(offset, nr_pages))
+ return false;
+ if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+ return false;
+
+ si = swp_swap_info(entry);
+ has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
+ for (i = 1; i < nr_pages; i++) {
+ /*
+ * While allocating a large folio and doing swap_read_folio() for the
+ * SWP_SYNCHRONOUS_IO path, the faulting pte has no swapcache entry.
+ * We need to ensure none of the other PTEs has one either; otherwise
+ * we might read from the swap device while the content is actually
+ * in the swapcache.
+ */
+ if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long orders;
+
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ return orders;
+}
+#else
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+ return false;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long orders;
+ struct folio *folio;
+ unsigned long addr;
+ spinlock_t *ptl;
+ pte_t *pte;
+ gfp_t gfp;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (unlikely(userfaultfd_armed(vma)))
+ goto fallback;
+
+ /*
+ * a large folio being swapped-in could be partially in
+ * zswap and partially in swap devices, zswap doesn't
+ * support large folios yet, we might get corrupted
+ * zero-filled data by reading all subpages from swap
+ * devices while some of them are actually in zswap
+ */
+ if (!zswap_never_enabled())
+ goto fallback;
+
+ orders = get_alloc_folio_orders(vmf);
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
+ if (unlikely(!pte))
+ goto fallback;
+
+ /*
+ * For do_swap_page, find the highest order where the aligned range is
+ * completely swap entries with contiguous swap offsets.
+ */
+ order = highest_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ pte_unmap_unlock(pte, ptl);
+
+ /* Try allocating the highest of the remaining orders. */
+ gfp = vma_thp_gfp_mask(vma);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio)
+ return folio;
+ order = next_order(&orders, order);
+ }
+
+fallback:
+#endif
+ return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4075,35 +4210,38 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread may
- * finish swapin first, free the entry, and swapout
- * reusing the same entry. It's undetectable as
- * pte_same() returns true due to entry reuse.
- */
- if (swapcache_prepare(entry)) {
- /* Relax a bit to prevent rapid repeated page faults */
- schedule_timeout_uninterruptible(1);
- goto out;
- }
- need_clear_cache = true;
-
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_swap_folio(vmf);
page = &folio->page;
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
+ nr_pages = folio_nr_pages(folio);
+ if (folio_test_large(folio))
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ /*
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread may
+ * finish swapin first, free the entry, and swapout
+ * reusing the same entry. It's undetectable as
+ * pte_same() returns true due to entry reuse.
+ */
+ if (swapcache_prepare_nr(entry, nr_pages)) {
+ /* Relax a bit to prevent rapid repeated page faults */
+ schedule_timeout_uninterruptible(1);
+ goto out_page;
+ }
+ need_clear_cache = true;
+
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
ret = VM_FAULT_OOM;
goto out_page;
}
- mem_cgroup_swapin_uncharge_swap(entry);
+ for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
+ mem_cgroup_swapin_uncharge_swap(e);
shadow = get_shadow_from_swap_cache(entry);
if (shadow)
@@ -4210,6 +4348,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
+ /* allocated large folios for SWP_SYNCHRONOUS_IO */
+ if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+ pte_t *folio_ptep = vmf->pte - idx;
+
+ if (!can_swapin_thp(vmf, folio_ptep, nr))
+ goto out_nomap;
+
+ page_idx = idx;
+ address = folio_start;
+ ptep = folio_ptep;
+ goto check_folio;
+ }
+
nr_pages = 1;
page_idx = 0;
address = vmf->address;
@@ -4341,11 +4495,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
/*
- * We currently only expect small !anon folios, which are either
- * fully exclusive or fully shared. If we ever get large folios
- * here, we have to be careful.
+ * We currently only expect small !anon folios which are either
+ * fully exclusive or fully shared, or new allocated large folios
+ * which are fully exclusive. If we ever get large folios within
+ * swapcache here, we have to be careful.
*/
- VM_WARN_ON_ONCE(folio_test_large(folio));
+ VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
@@ -4388,7 +4543,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4404,7 +4559,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
if (need_clear_cache)
- swapcache_clear(si, entry);
+ swapcache_clear_nr(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4440,14 +4595,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
if (unlikely(userfaultfd_armed(vma)))
goto fallback;
- /*
- * Get a list of all the (large) orders below PMD_ORDER that are enabled
- * for this vma. Then filter out the orders that can't be allocated over
- * the faulting address and still be fully contained in the vma.
- */
- orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
- orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ orders = get_alloc_folio_orders(vmf);
if (!orders)
goto fallback;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 994723cef821..7e20de975350 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* Swap entry may have been freed since our caller observed it.
*/
- err = swapcache_prepare(entry);
+ err = swapcache_prepare_nr(entry, 1);
if (!err)
break;
--
2.34.1
* Re: [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-06-29 11:10 ` [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song
@ 2024-07-01 13:52 ` Yosry Ahmed
2024-07-01 21:27 ` Barry Song
0 siblings, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2024-07-01 13:52 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, ying.huang, baolin.wang,
shakeel.butt, senozhatsky, minchan, Chuanhua Han
[..]
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + unsigned long orders;
> + struct folio *folio;
> + unsigned long addr;
> + spinlock_t *ptl;
> + pte_t *pte;
> + gfp_t gfp;
> + int order;
> +
> + /*
> + * If uffd is active for the vma we need per-page fault fidelity to
> + * maintain the uffd semantics.
> + */
> + if (unlikely(userfaultfd_armed(vma)))
> + goto fallback;
> +
> + /*
> + * a large folio being swapped-in could be partially in
> + * zswap and partially in swap devices, zswap doesn't
> + * support large folios yet, we might get corrupted
> + * zero-filled data by reading all subpages from swap
> + * devices while some of them are actually in zswap
> + */
If we read all subpages from swap devices while some of them are
actually in zswap, the corrupted data won't be zero-filled AFAICT; it
could be anything (old swapped-out data). There are also more ways
this can go wrong: if the first page is in zswap, we will only fill
the first page and leave the rest of the folio uninitialized.
How about a more generic comment? Perhaps something like:
A large swapped-out folio could be partially or fully in zswap. We
lack handling for such cases, so fall back to swapping in an order-0
folio.
> + if (!zswap_never_enabled())
> + goto fallback;
> +
* Re: [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile
2024-07-01 13:52 ` Yosry Ahmed
@ 2024-07-01 21:27 ` Barry Song
0 siblings, 0 replies; 10+ messages in thread
From: Barry Song @ 2024-07-01 21:27 UTC (permalink / raw)
To: Yosry Ahmed
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, ying.huang, baolin.wang,
shakeel.butt, senozhatsky, minchan, Chuanhua Han
On Tue, Jul 2, 2024 at 1:53 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > + unsigned long orders;
> > + struct folio *folio;
> > + unsigned long addr;
> > + spinlock_t *ptl;
> > + pte_t *pte;
> > + gfp_t gfp;
> > + int order;
> > +
> > + /*
> > + * If uffd is active for the vma we need per-page fault fidelity to
> > + * maintain the uffd semantics.
> > + */
> > + if (unlikely(userfaultfd_armed(vma)))
> > + goto fallback;
> > +
> > + /*
> > + * a large folio being swapped-in could be partially in
> > + * zswap and partially in swap devices, zswap doesn't
> > + * support large folios yet, we might get corrupted
> > + * zero-filled data by reading all subpages from swap
> > + * devices while some of them are actually in zswap
> > + */
>
> If we read all subpages from swap devices while some of them are
> actually in zswap, the corrupted data won't be zero-filled AFAICT, it
> could be anything (old swapped out data). There are also more ways
> this can go wrong: if the first page is in zswap, we will only fill
> the first page and leave the rest of the folio uninitialized.
>
> How about a more generic comment? Perhaps something like:
>
> A large swapped out folio could be partially or fully in zswap. We
> lack handling for such cases, so fallback to swapping in order-0
> folio.
looks good to me, thanks!
>
> > + if (!zswap_never_enabled())
> > + goto fallback;
> > +
* Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile
2024-06-29 11:10 [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile Barry Song
2024-06-29 11:10 ` [PATCH RFC v4 1/2] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in Barry Song
2024-06-29 11:10 ` [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile Barry Song
@ 2024-07-03 6:31 ` Huang, Ying
2024-07-03 7:58 ` Barry Song
2 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2024-07-03 6:31 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, yosryahmed, baolin.wang,
shakeel.butt, senozhatsky, minchan
Barry Song <21cnbao@gmail.com> writes:
> From: Barry Song <v-songbaohua@oppo.com>
>
> In an embedded system like Android, more than half of anonymous memory is
> actually stored in swap devices such as zRAM. For instance, when an app
> is switched to the background, most of its memory might be swapped out.
>
> Currently, we have mTHP features, but unfortunately, without support
> for large folio swap-ins, once those large folios are swapped out,
> we lose them immediately because mTHP is a one-way ticket.
Not exactly a one-way ticket; we have (or will have) khugepaged. But I
admit that it may not be good enough for you.
> This is unacceptable and reduces mTHP to merely a toy on systems
> with significant swap utilization.
That may be true in your systems, but maybe not in some other systems.
> This patch introduces mTHP swap-in support. For now, we limit mTHP
> swap-ins to contiguous swaps that were likely swapped out from mTHP as
> a whole.
>
> Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> case. This is the simplest and most common use case, benefiting millions
I admit that Android is an important target platform of the Linux kernel.
But I will not advocate that it's the MOST common ...
> of Android phones and similar devices with minimal implementation
> cost. In this straightforward scenario, large folios are always exclusive,
> eliminating the need to handle complex rmap and swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> swap-out and swap-in.
> 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> THP swap allocation optimization, aligned swap-in plays a crucial role
> in the success of THP_SWPOUT.
> 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> and enhancing compression ratios significantly. We have another patchset
> to enable mTHP compression and decompression in zsmalloc/zRAM[2].
>
> Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> to be an optimal approach. There's a critical distinction between pagecache
> and anonymous pages: pagecache can be evicted and later retrieved from disk,
> potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> always reside in memory or swapfile. If we swap in small folios and identify
> adjacent memory suitable for swapping in as mTHP, those pages that have been
> converted to small folios may never transition to mTHP. The process of
> converting mTHP into small folios remains irreversible. This introduces
> the risk of losing all mTHP through several swap-out and swap-in cycles,
> let alone losing the benefits of defragmentation, improved compression
> ratios, and reduced CPU usage based on mTHP compression/decompression.
I understand that the optimal policy in your use cases may be to
always swap in mTHP at the highest order. But it may not be in some
other use cases: for example, relatively slow swap devices, or
non-fault sub-pages that are swapped out again before being used.
So, IMO, the default policy should be one that can adapt to the
requirements automatically. For example, if most non-fault sub-pages
will be read/written before being swapped out again, we should swap in
at a larger order, otherwise at a smaller order. Swap readahead is one
possible way to do that. But, I admit that this may not work perfectly
in your use cases.
Previously I had hoped that we could start with this automatic policy
that helps everyone, then check whether it satisfies your requirements
before implementing the optimal policy for you. But it appears that you
don't agree with this.
Based on the above, IMO, we should not use your policy as the default,
at least for now. A user-space interface can be implemented to select
a swap-in order policy, similar to the mTHP allocation order policy.
We need a separate policy because the performance characteristics of
memory allocation are quite different from those of swap-in. For
example, SSD reads can be much slower than memory allocation. With
policy selection, I think that we can implement mTHP swap-in for the
non-SWAP_SYNCHRONOUS case too. Users need to know what they are doing.
> Conversely, in deploying mTHP on millions of real-world products with this
> feature in OPPO's out-of-tree code[3], we haven't observed any significant
> increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
>
> [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> [3] OnePlusOSS / android_kernel_oneplus_sm8550
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
[snip]
--
Best Regards,
Huang, Ying
* Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile
2024-07-03 6:31 ` [PATCH RFC v4 0/2] mm: support mTHP swap-in " Huang, Ying
@ 2024-07-03 7:58 ` Barry Song
2024-07-03 8:32 ` Barry Song
2024-07-04 1:40 ` Huang, Ying
0 siblings, 2 replies; 10+ messages in thread
From: Barry Song @ 2024-07-03 7:58 UTC (permalink / raw)
To: Huang, Ying
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, yosryahmed, baolin.wang,
shakeel.butt, senozhatsky, minchan
On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>
Ying, thanks!
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > In an embedded system like Android, more than half of anonymous memory is
> > actually stored in swap devices such as zRAM. For instance, when an app
> > is switched to the background, most of its memory might be swapped out.
> >
> > Currently, we have mTHP features, but unfortunately, without support
> > for large folio swap-ins, once those large folios are swapped out,
> > we lose them immediately because mTHP is a one-way ticket.
>
> No exactly one-way ticket, we have (or will have) khugepaged. But I
> admit that it may be not good enough for you.
That's right. From what I understand, khugepaged only supports
PMD-mapped THP so far.
Moreover, I have concerns that khugepaged might not be suitable for
all mTHPs, for the following reasons:
1. The lifecycle of an mTHP might not be that long. We pay the cost of
the collapse, but the folio could be swapped out right afterwards. We
expect THP to be durable and not become obsolete quickly, given the
significant cost we paid for it.
2. An mTHP's size might not be substantial enough to justify a
collapse. For example, if we can find an effective method, such as
Yu's TAO or others, we can achieve a high success rate for mTHP
allocations at minimal cost rather than depending on
compaction/collapse.
3. Managing the collapse - unmap and remap - could be a significant
challenge for the power consumption of phones, considering the number
of mTHPs could be much larger than that of PMD-mapped THPs, and this
could happen quite often.
>
> > This is unacceptable and reduces mTHP to merely a toy on systems
> > with significant swap utilization.
>
> May be true in your systems. May be not in some other systems.
I agree that this isn't a concern for systems without significant
swapout and swapin activity.
However, on Android, where we frequently switch between applications
like YouTube, Chrome, Zoom, WeChat, Alipay, TikTok, and others,
swapping could occur throughout the day :-)
>
> > This patch introduces mTHP swap-in support. For now, we limit mTHP
> > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> > a whole.
> >
> > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> > case. This is the simplest and most common use case, benefiting millions
>
> I admit that Android is an important target platform of Linux kernel.
> But I will not advocate that it's MOST common ...
Okay, I understand that there are still many embedded systems similar
to Android, even if
they are not Android :-)
>
> > of Android phones and similar devices with minimal implementation
> > cost. In this straightforward scenario, large folios are always exclusive,
> > eliminating the need to handle complex rmap and swapcache issues.
> >
> > It offers several benefits:
> > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> > swap-out and swap-in.
> > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> > THP swap allocation optimization, aligned swap-in plays a crucial role
> > in the success of THP_SWPOUT.
> > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> > and enhancing compression ratios significantly. We have another patchset
> > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> >
> > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> > to be an optimal approach. There's a critical distinction between pagecache
> > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> > always reside in memory or swapfile. If we swap in small folios and identify
> > adjacent memory suitable for swapping in as mTHP, those pages that have been
> > converted to small folios may never transition to mTHP. The process of
> > converting mTHP into small folios remains irreversible. This introduces
> > the risk of losing all mTHP through several swap-out and swap-in cycles,
> > let alone losing the benefits of defragmentation, improved compression
> > ratios, and reduced CPU usage based on mTHP compression/decompression.
>
> I understand that the most optimal policy in your use cases may be
> always swapping-in mTHP in highest order. But, it may be not in some
> other use cases. For example, relative slow swap devices, non-fault
> sub-pages swapped out again before usage, etc.
>
> So, IMO, the default policy should be the one that can adapt to the
> requirements automatically. For example, if most non-fault sub-pages
> will be read/written before being swapped out again, we should swap-in
> in larger order, otherwise in smaller order. Swap readahead is one
> possible way to do that. But, I admit that this may not work perfectly
> in your use cases.
>
> Previously I hope that we can start with this automatic policy that
> helps everyone, then check whether it can satisfy your requirements
> before implementing the optimal policy for you. But it appears that you
> don't agree with this.
>
> Based on the above, IMO, we should not use your policy as default at
> least for now. A user space interface can be implemented to select
> different swap-in order policy similar as that of mTHP allocation order
> policy. We need a different policy because the performance characters
> of the memory allocation is quite different from that of swap-in. For
> example, the SSD reading could be much slower than the memory
> allocation. With the policy selection, I think that we can implement
> mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
> are doing.
Agreed. Ryan also suggested something similar before.
Could we add this user policy via:
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
which could be 0 or 1? I assume we don't need the full set of "always
inherit madvise never".
Do you have any suggestions regarding the user interface?
>
> > Conversely, in deploying mTHP on millions of real-world products with this
> > feature in OPPO's out-of-tree code[3], we haven't observed any significant
> > increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
> >
> > [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
> > [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > [3] OnePlusOSS / android_kernel_oneplus_sm8550
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
Thanks
Barry
* Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile
2024-07-03 7:58 ` Barry Song
@ 2024-07-03 8:32 ` Barry Song
2024-07-04 1:40 ` Huang, Ying
1 sibling, 0 replies; 10+ messages in thread
From: Barry Song @ 2024-07-03 8:32 UTC (permalink / raw)
To: Huang, Ying
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, yosryahmed, baolin.wang,
shakeel.butt, senozhatsky, minchan
On Wed, Jul 3, 2024 at 7:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
>
> Ying, thanks!
>
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > In an embedded system like Android, more than half of anonymous memory is
> > > actually stored in swap devices such as zRAM. For instance, when an app
> > > is switched to the background, most of its memory might be swapped out.
> > >
> > > Currently, we have mTHP features, but unfortunately, without support
> > > for large folio swap-ins, once those large folios are swapped out,
> > > we lose them immediately because mTHP is a one-way ticket.
> >
> > No exactly one-way ticket, we have (or will have) khugepaged. But I
> > admit that it may be not good enough for you.
>
> That's right. From what I understand, khugepaged currently only supports PMD THP
> till now?
> Moreover, I have concerns that khugepaged might not be suitable for
> all mTHPs for
> the following reasons:
>
> 1. The lifecycle of mTHP might not be that long. We paid the cost for
> the collapse,
> but it could swap-out just after that. We expect THP to be durable and
> not become
> obsolete quickly, given the significant amount of money we spent on it.
>
> 2. mTHP's size might not be substantial enough for a collapse. For
> example, if we can
> find an effective method, such as Yu's TAO or others, we can achieve a
> high success
> rate in mTHP allocations at a minimal cost rather than depending on
> compaction/collapse.
>
> 3. It could be a significant challenge to manage the collapse - unmap,
> and map processes
> in relation to the power consumption of phones considering the number
> of mTHP could
> be much larger than PMD-mapped THP. This behavior could be quite often.
>
> >
> > > This is unacceptable and reduces mTHP to merely a toy on systems
> > > with significant swap utilization.
> >
> > May be true in your systems. May be not in some other systems.
>
> I agree that this isn't a concern for systems without significant
> swapout and swapin activity.
> However, on Android, where we frequently switch between applications
> like YouTube,
> Chrome, Zoom, WeChat, Alipay, TikTok, and others, swapping could occur
> throughout the
> day :-)
>
> >
> > > This patch introduces mTHP swap-in support. For now, we limit mTHP
> > > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> > > a whole.
> > >
> > > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> > > case. This is the simplest and most common use case, benefiting millions
> >
> > I admit that Android is an important target platform of Linux kernel.
> > But I will not advocate that it's MOST common ...
>
> Okay, I understand that there are still many embedded systems similar
> to Android, even if
> they are not Android :-)
>
> >
> > > of Android phones and similar devices with minimal implementation
> > > cost. In this straightforward scenario, large folios are always exclusive,
> > > eliminating the need to handle complex rmap and swapcache issues.
> > >
> > > It offers several benefits:
> > > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> > > swap-out and swap-in.
> > > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> > > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> > > THP swap allocation optimization, aligned swap-in plays a crucial role
> > > in the success of THP_SWPOUT.
> > > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> > > and enhancing compression ratios significantly. We have another patchset
> > > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> > >
> > > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> > > to be an optimal approach. There's a critical distinction between pagecache
> > > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> > > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> > > always reside in memory or swapfile. If we swap in small folios and identify
> > > adjacent memory suitable for swapping in as mTHP, those pages that have been
> > > converted to small folios may never transition to mTHP. The process of
> > > converting mTHP into small folios remains irreversible. This introduces
> > > the risk of losing all mTHP through several swap-out and swap-in cycles,
> > > let alone losing the benefits of defragmentation, improved compression
> > > ratios, and reduced CPU usage based on mTHP compression/decompression.
> >
> > I understand that the most optimal policy in your use cases may be
> > always swapping-in mTHP in highest order. But, it may be not in some
> > other use cases. For example, relative slow swap devices, non-fault
> > sub-pages swapped out again before usage, etc.
> >
> > So, IMO, the default policy should be the one that can adapt to the
> > requirements automatically. For example, if most non-fault sub-pages
> > will be read/written before being swapped out again, we should swap-in
> > in larger order, otherwise in smaller order. Swap readahead is one
> > possible way to do that. But, I admit that this may not work perfectly
> > in your use cases.
> >
> > Previously I hope that we can start with this automatic policy that
> > helps everyone, then check whether it can satisfy your requirements
> > before implementing the optimal policy for you. But it appears that you
> > don't agree with this.
> >
> > Based on the above, IMO, we should not use your policy as default at
> > least for now. A user space interface can be implemented to select
> > different swap-in order policy similar as that of mTHP allocation order
> > policy. We need a different policy because the performance characters
> > of the memory allocation is quite different from that of swap-in. For
> > example, the SSD reading could be much slower than the memory
> > allocation. With the policy selection, I think that we can implement
> > mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
> > are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy by:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> which could be 0 or 1, I assume we don't need so many "always inherit
> madvise never"?
I actually meant:
First, we respect the existing THP policy, and then we incorporate
swapin_enabled after checking both allowable and suitable; pseudo code
like this:

orders = thp_vma_allowable_orders(vma, vma->vm_flags,
				  TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
orders = thp_swapin_allowable_order(orders);
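Just to illustrate the idea, a hypothetical sketch of what
thp_swapin_allowable_order() could look like. huge_swapin_orders below
is an assumed per-size bitmask that the sysfs handler for
hugepages-<size>/swapin_enabled would maintain; it is not an existing
symbol:

	/* assumed: one bit per swap-in-enabled order, updated from sysfs */
	static unsigned long huge_swapin_orders __read_mostly;

	static inline unsigned long thp_swapin_allowable_order(unsigned long orders)
	{
		/* order-0 swap-in always remains allowed */
		return orders & (huge_swapin_orders | BIT(0));
	}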
>
> Do you have any suggestions regarding the user interface?
>
> >
> > > Conversely, in deploying mTHP on millions of real-world products with this
> > > feature in OPPO's out-of-tree code[3], we haven't observed any significant
> > > increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
> > > [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > [3] OnePlusOSS / android_kernel_oneplus_sm8550
> > > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> > >
> >
> > [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry
* Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile
2024-07-03 7:58 ` Barry Song
2024-07-03 8:32 ` Barry Song
@ 2024-07-04 1:40 ` Huang, Ying
2024-07-04 10:23 ` Barry Song
1 sibling, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2024-07-04 1:40 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, yosryahmed, baolin.wang,
shakeel.butt, senozhatsky, minchan
Barry Song <21cnbao@gmail.com> writes:
> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>
> Ying, thanks!
>
>> Barry Song <21cnbao@gmail.com> writes:
[snip]
>> > This patch introduces mTHP swap-in support. For now, we limit mTHP
>> > swap-ins to contiguous swaps that were likely swapped out from mTHP as
>> > a whole.
>> >
>> > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
>> > case. This is the simplest and most common use case, benefiting millions
>>
>> I admit that Android is an important target platform of Linux kernel.
>> But I will not advocate that it's MOST common ...
>
> Okay, I understand that there are still many embedded systems similar
> to Android, even if
> they are not Android :-)
>
>>
>> > of Android phones and similar devices with minimal implementation
>> > cost. In this straightforward scenario, large folios are always exclusive,
>> > eliminating the need to handle complex rmap and swapcache issues.
>> >
>> > It offers several benefits:
>> > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>> > swap-out and swap-in.
>> > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
>> > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
>> > THP swap allocation optimization, aligned swap-in plays a crucial role
>> > in the success of THP_SWPOUT.
>> > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
>> > and enhancing compression ratios significantly. We have another patchset
>> > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
>> >
>> > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
>> > to be an optimal approach. There's a critical distinction between pagecache
>> > and anonymous pages: pagecache can be evicted and later retrieved from disk,
>> > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
>> > always reside in memory or swapfile. If we swap in small folios and identify
>> > adjacent memory suitable for swapping in as mTHP, those pages that have been
>> > converted to small folios may never transition to mTHP. The process of
>> > converting mTHP into small folios remains irreversible. This introduces
>> > the risk of losing all mTHP through several swap-out and swap-in cycles,
>> > let alone losing the benefits of defragmentation, improved compression
>> > ratios, and reduced CPU usage based on mTHP compression/decompression.
>>
>> I understand that the most optimal policy in your use cases may be
>> always swapping-in mTHP in highest order. But, it may be not in some
>> other use cases. For example, relative slow swap devices, non-fault
>> sub-pages swapped out again before usage, etc.
>>
>> So, IMO, the default policy should be the one that can adapt to the
>> requirements automatically. For example, if most non-fault sub-pages
>> will be read/written before being swapped out again, we should swap-in
>> in larger order, otherwise in smaller order. Swap readahead is one
>> possible way to do that. But, I admit that this may not work perfectly
>> in your use cases.
>>
>> Previously I hope that we can start with this automatic policy that
>> helps everyone, then check whether it can satisfy your requirements
>> before implementing the optimal policy for you. But it appears that you
>> don't agree with this.
>>
>> Based on the above, IMO, we should not use your policy as default at
>> least for now. A user space interface can be implemented to select
>> different swap-in order policy similar as that of mTHP allocation order
>> policy. We need a different policy because the performance characters
>> of the memory allocation is quite different from that of swap-in. For
>> example, the SSD reading could be much slower than the memory
>> allocation. With the policy selection, I think that we can implement
>> mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
>> are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy by:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> which could be 0 or 1, I assume we don't need so many "always inherit
> madvise never"?
>
> Do you have any suggestions regarding the user interface?
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
looks good to me. To be consistent with "enabled" in the same
directory, and more importantly to be extensible, I think it's better
to start with at least "always never". I believe that we will add
"auto" in the future to tune automatically, which could eventually
become the default.
--
Best Regards,
Huang, Ying
* Re: [PATCH RFC v4 0/2] mm: support mTHP swap-in for zRAM-like swapfile
2024-07-04 1:40 ` Huang, Ying
@ 2024-07-04 10:23 ` Barry Song
0 siblings, 0 replies; 10+ messages in thread
From: Barry Song @ 2024-07-04 10:23 UTC (permalink / raw)
To: Huang, Ying
Cc: akpm, linux-mm, chrisl, david, hannes, kasong, linux-kernel,
mhocko, nphamcs, ryan.roberts, shy828301, surenb, kaleshsingh,
hughd, v-songbaohua, willy, xiang, yosryahmed, baolin.wang,
shakeel.butt, senozhatsky, minchan
On Thu, Jul 4, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >
> > Ying, thanks!
> >
> >> Barry Song <21cnbao@gmail.com> writes:
>
> [snip]
>
> >> > This patch introduces mTHP swap-in support. For now, we limit mTHP
> >> > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> >> > a whole.
> >> >
> >> > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> >> > case. This is the simplest and most common use case, benefiting millions
> >>
> >> I admit that Android is an important target platform of Linux kernel.
> >> But I will not advocate that it's MOST common ...
> >
> > Okay, I understand that there are still many embedded systems similar
> > to Android, even if
> > they are not Android :-)
> >
> >>
> >> > of Android phones and similar devices with minimal implementation
> >> > cost. In this straightforward scenario, large folios are always exclusive,
> >> > eliminating the need to handle complex rmap and swapcache issues.
> >> >
> >> > It offers several benefits:
> >> > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> >> > swap-out and swap-in.
> >> > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> >> > without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> >> > THP swap allocation optimization, aligned swap-in plays a crucial role
> >> > in the success of THP_SWPOUT.
> >> > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> >> > and enhancing compression ratios significantly. We have another patchset
> >> > to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> >> >
> >> > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> >> > to be an optimal approach. There's a critical distinction between pagecache
> >> > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> >> > potentially becoming a mTHP upon retrieval, whereas anonymous pages must
> >> > always reside in memory or swapfile. If we swap in small folios and identify
> >> > adjacent memory suitable for swapping in as mTHP, those pages that have been
> >> > converted to small folios may never transition to mTHP. The process of
> >> > converting mTHP into small folios remains irreversible. This introduces
> >> > the risk of losing all mTHP through several swap-out and swap-in cycles,
> >> > let alone losing the benefits of defragmentation, improved compression
> >> > ratios, and reduced CPU usage based on mTHP compression/decompression.
> >>
> >> I understand that the most optimal policy in your use cases may be
> >> always swapping-in mTHP in highest order. But, it may be not in some
> >> other use cases. For example, relative slow swap devices, non-fault
> >> sub-pages swapped out again before usage, etc.
> >>
> >> So, IMO, the default policy should be the one that can adapt to the
> >> requirements automatically. For example, if most non-fault sub-pages
> >> will be read/written before being swapped out again, we should swap-in
> >> in larger order, otherwise in smaller order. Swap readahead is one
> >> possible way to do that. But, I admit that this may not work perfectly
> >> in your use cases.
> >>
> >> Previously I hope that we can start with this automatic policy that
> >> helps everyone, then check whether it can satisfy your requirements
> >> before implementing the optimal policy for you. But it appears that you
> >> don't agree with this.
> >>
> >> Based on the above, IMO, we should not use your policy as default at
> >> least for now. A user space interface can be implemented to select
> >> different swap-in order policy similar as that of mTHP allocation order
> >> policy. We need a different policy because the performance characters
> >> of the memory allocation is quite different from that of swap-in. For
> >> example, the SSD reading could be much slower than the memory
> >> allocation. With the policy selection, I think that we can implement
> >> mTHP swap-in for non-SWAP_SYNCHRONOUS too. Users need to know what they
> >> are doing.
> >
> > Agreed. Ryan also suggested something similar before.
> > Could we add this user policy by:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
> > which could be 0 or 1, I assume we don't need so many "always inherit
> > madvise never"?
> >
> > Do you have any suggestions regarding the user interface?
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
>
> looks good to me. To be consistent with "enabled" in the same
> directory, and more importantly, to be extensible, I think that it's
> better to start with at least "always never". I believe that we will
> add "auto" in the future to tune automatically. Which can be used as
> default finally.
Sounds good to me. Thanks!
>
> --
> Best Regards,
> Huang, Ying
Barry