* [PATCH v5 0/9] support large folio swap-out and swap-in for shmem
@ 2024-08-12 7:42 Baolin Wang
2024-08-12 7:42 ` [PATCH v5 1/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
` (8 more replies)
0 siblings, 9 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
Shmem will support large folio allocation [1] [2] to get better performance;
however, memory reclaim still splits the precious large folios when trying to
swap out shmem, which may lead to memory fragmentation and loses the benefit
of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting them, and the large folio swap-in series [3] is queued in the
mm-unstable branch. Hence this patch set adds large folio swap-out and
swap-in support for shmem.
Please help to review. Thanks.
Functional testing
==================
Machine environment: 32 Arm cores, 120G memory and 50G swap device.
1. Ran the xfstests suite to test the tmpfs filesystem with the configuration
below, and I did not catch any regressions with this patch set.
FSTYP=tmpfs
export TEST_DIR=/mnt/tempfs_mnt
export TEST_DEV=/mnt/tempfs_mnt
export SCRATCH_MNT=/mnt/scratchdir
export SCRATCH_DEV=/mnt/scratchdir
2. Ran all mm selftests in tools/testing/selftests/mm/; no regressions were
found.
3. I also wrote several shmem swap test cases, covering shmem splitting,
shmem swapout, shmem swapin, swapoff during shmem swapout, shmem reclaim,
shmem swapin replacement, etc. I tested these cases with 4K and 64K shmem
folio sizes on a swap device, and shmem swap functionality works well on my
machine. A rough sketch of this kind of swap-out/swap-in check is shown below.
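For illustration only, here is a minimal sketch (not one of the actual test
cases above): it maps a shmem region via memfd, dirties it, asks reclaim to
swap it out with MADV_PAGEOUT, then faults it back in and verifies the data.
The name, size and the use of MADV_PAGEOUT are illustrative assumptions.

/* Minimal shmem swap-out/swap-in check (illustrative sketch only) */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

int main(void)
{
	size_t size = 64UL << 20;	/* 64M of shmem, illustrative */
	int fd = memfd_create("shmem-swap-test", 0);
	unsigned char *p;
	size_t i;

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	for (i = 0; i < size; i++)	/* dirty every page */
		p[i] = i & 0xff;

	/* Hint reclaim; with a swap device this should swap the folios out */
	if (madvise(p, size, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	for (i = 0; i < size; i++) {	/* fault back in and verify */
		if (p[i] != (unsigned char)(i & 0xff)) {
			fprintf(stderr, "mismatch at offset %zu\n", i);
			return 1;
		}
	}
	printf("pattern intact after swap-out and swap-in\n");
	return 0;
}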
[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
[3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
[4] https://lore.kernel.org/all/8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com/
Changes from v4:
- Add reviewed tag from Barry. Thanks.
- Drop patch 1 and move the shmem split to shmem_writepage(), which avoids
other unnecessary splits, per David.
Changes from v3:
- Rebase to the latest mm-unstable.
- Simplify patch 2 based on Barry's patch:
https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com
Changes from v2:
- Add a new patch to split the large swap entry if the swapin folio is an
order-0 folio.
- Update some commit messages.
Changes from v1:
- Remove useless 'order' variable in shmem_partial_swap_usage(), per Daniel.
- Add a new patch to return the number of pages being freed in
shmem_free_swap(), per Daniel.
- Drop 'orders' parameter for find_get_entries() and find_lock_entries().
- Round down the index when adding the swapin folio into the pagecache,
suggested by Hugh.
- Fix the reference issue when removing folio from pagecache in patch 8.
- Fix replacing old folio in swap cache in patch 7.
Changes from RFC:
- Rebased to the latest mm-unstable.
- Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
branch.
Baolin Wang (8):
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
flag setting
mm: shmem: extend shmem_partial_swap_usage() to support large folio
swap
mm: filemap: use xa_get_order() to get the swap entry order
mm: shmem: use swap_free_nr() to free shmem swap entries
mm: shmem: support large folio allocation for shmem_replace_folio()
mm: shmem: drop folio reference count using 'nr_pages' in
shmem_delete_from_page_cache()
mm: shmem: split large entry if the swapin folio is not large
mm: shmem: support large folio swap out
Daniel Gomez (1):
mm: shmem: return number of pages beeing freed in shmem_free_swap
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
include/linux/swap.h | 4 +-
include/linux/writeback.h | 4 +
mm/filemap.c | 4 +
mm/shmem.c | 217 +++++++++++++++++-----
mm/swapfile.c | 4 +-
mm/vmscan.c | 32 +++-
7 files changed, 209 insertions(+), 57 deletions(-)
--
2.39.3
* [PATCH v5 1/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 2/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
` (7 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for
shmem swap entries.
While we are at it, use folio_nr_pages() to get the number of pages in the
folio, as a preparation.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
include/linux/swap.h | 4 ++--
mm/shmem.c | 6 ++++--
mm/swapfile.c | 4 ++--
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1c8f844a9f0f..248db1dd7812 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,7 +481,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t);
+extern void swap_shmem_alloc(swp_entry_t, int);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
@@ -548,7 +548,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp)
+static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 4a5254bfd610..22cdc10f27ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1452,6 +1452,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
swp_entry_t swap;
pgoff_t index;
+ int nr_pages;
/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -1484,6 +1485,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
}
index = folio->index;
+ nr_pages = folio_nr_pages(folio);
/*
* This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
@@ -1536,8 +1538,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if (add_to_swap_cache(folio, swap,
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
NULL) == 0) {
- shmem_recalc_inode(inode, 0, 1);
- swap_shmem_alloc(swap);
+ shmem_recalc_inode(inode, 0, nr_pages);
+ swap_shmem_alloc(swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6de12d712c7e..1caeee676696 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3604,9 +3604,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Help swapoff by noting that swap entry belongs to shmem/tmpfs
* (in which case its reference count is never incremented).
*/
-void swap_shmem_alloc(swp_entry_t entry)
+void swap_shmem_alloc(swp_entry_t entry, int nr)
{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, 1);
+ __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
}
/*
--
2.39.3
* [PATCH v5 2/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
2024-08-12 7:42 ` [PATCH v5 1/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 3/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
` (6 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
To support shmem large folio swapout in the following patches, use
xa_get_order() to get the order of the swap entry when calculating the swap
usage of shmem.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 22cdc10f27ea..02fb188d627f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -890,7 +890,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
if (xas_retry(&xas, page))
continue;
if (xa_is_value(page))
- swapped++;
+ swapped += 1 << xa_get_order(xas.xa, xas.xa_index);
if (xas.xa_index == max)
break;
if (need_resched()) {
--
2.39.3
* [PATCH v5 3/9] mm: shmem: return number of pages beeing freed in shmem_free_swap
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
2024-08-12 7:42 ` [PATCH v5 1/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
2024-08-12 7:42 ` [PATCH v5 2/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
` (5 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
From: Daniel Gomez <da.gomez@samsung.com>
Both shmem_free_swap() callers expect the number of pages being freed. In the
large folio context, this needs to support values larger than 0 (currently
meaning 1 page freed) and -ENOENT (meaning 0 pages freed). In preparation for
large folio adoption, make the shmem_free_swap() routine return the number of
pages being freed, so that returning 0 in this context means 0 pages were
freed.
While we are at it, change it to use free_swap_and_cache_nr() to free
large-order swap entries (change by Baolin Wang).
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 02fb188d627f..d0d54939da48 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -856,18 +856,22 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
}
/*
- * Remove swap entry from page cache, free the swap and its page cache.
+ * Remove swap entry from page cache, free the swap and its page cache. Returns
+ * the number of pages being freed. 0 means entry not found in XArray (0 pages
+ * being freed).
*/
-static int shmem_free_swap(struct address_space *mapping,
- pgoff_t index, void *radswap)
+static long shmem_free_swap(struct address_space *mapping,
+ pgoff_t index, void *radswap)
{
+ int order = xa_get_order(&mapping->i_pages, index);
void *old;
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
- return -ENOENT;
- free_swap_and_cache(radix_to_swp_entry(radswap));
- return 0;
+ return 0;
+ free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+
+ return 1 << order;
}
/*
@@ -1019,7 +1023,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (xa_is_value(folio)) {
if (unfalloc)
continue;
- nr_swaps_freed += !shmem_free_swap(mapping,
+ nr_swaps_freed += shmem_free_swap(mapping,
indices[i], folio);
continue;
}
@@ -1086,14 +1090,17 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
folio = fbatch.folios[i];
if (xa_is_value(folio)) {
+ long swaps_freed;
+
if (unfalloc)
continue;
- if (shmem_free_swap(mapping, indices[i], folio)) {
+ swaps_freed = shmem_free_swap(mapping, indices[i], folio);
+ if (!swaps_freed) {
/* Swap was replaced by page: retry */
index = indices[i];
break;
}
- nr_swaps_freed++;
+ nr_swaps_freed += swaps_freed;
continue;
}
--
2.39.3
* [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (2 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 3/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-25 21:55 ` Hugh Dickins
2024-08-12 7:42 ` [PATCH v5 5/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
` (4 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
In the following patches, shmem will support swapping out large folios, which
means the shmem mappings may contain large-order swap entries, so use
xa_get_order() to get the folio order of the shmem swap entry in order to
update '*start' correctly.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4130be74f6fd..4c312aab8b1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else
+ nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
@@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else
+ nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
--
2.39.3
* [PATCH v5 5/9] mm: shmem: use swap_free_nr() to free shmem swap entries
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (3 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
` (3 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
As a preparation for supporting shmem large folio swapout, use swap_free_nr()
to free the contiguous swap entries of a shmem large folio when the large
folio is swapped in from the swap cache. In addition, the index should also
be rounded down to the number of pages when adding the swapin folio into the
pagecache.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d0d54939da48..f6bab42180ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1961,6 +1961,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
swp_entry_t swapin_error;
void *old;
+ int nr_pages;
swapin_error = make_poisoned_swp_entry();
old = xa_cmpxchg_irq(&mapping->i_pages, index,
@@ -1969,6 +1970,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
if (old != swp_to_radix_entry(swap))
return;
+ nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
delete_from_swap_cache(folio);
/*
@@ -1976,8 +1978,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
* in shmem_evict_inode().
*/
- shmem_recalc_inode(inode, -1, -1);
- swap_free(swap);
+ shmem_recalc_inode(inode, -nr_pages, -nr_pages);
+ swap_free_nr(swap, nr_pages);
}
/*
@@ -1996,7 +1998,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct swap_info_struct *si;
struct folio *folio = NULL;
swp_entry_t swap;
- int error;
+ int error, nr_pages;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2043,6 +2045,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
+ nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
@@ -2056,19 +2059,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping, index,
+ error = shmem_add_to_page_cache(folio, mapping,
+ round_down(index, nr_pages),
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
- shmem_recalc_inode(inode, 0, -1);
+ shmem_recalc_inode(inode, 0, -nr_pages);
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
delete_from_swap_cache(folio);
folio_mark_dirty(folio);
- swap_free(swap);
+ swap_free_nr(swap, nr_pages);
put_swap_device(si);
*foliop = folio;
--
2.39.3
* [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio()
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (4 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 5/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-25 22:05 ` Hugh Dickins
2024-08-12 7:42 ` [PATCH v5 7/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
` (2 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
To support large folio swapin for shmem in the following patches, add large
folio allocation for the new replacement folio in shmem_replace_folio().
Moreover, large folios occupy N consecutive entries in the swap cache instead
of using multi-index entries like the page cache, therefore we should replace
each of the consecutive entries in the swap cache instead of using
shmem_replace_entry().
Also update the statistics and the folio reference count using the number of
pages in the folio.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 54 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 31 insertions(+), 23 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index f6bab42180ea..d94f02ad7bd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1889,28 +1889,24 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
- struct folio *old, *new;
- struct address_space *swap_mapping;
- swp_entry_t entry;
- pgoff_t swap_index;
- int error;
-
- old = *foliop;
- entry = old->swap;
- swap_index = swap_cache_index(entry);
- swap_mapping = swap_address_space(entry);
+ struct folio *new, *old = *foliop;
+ swp_entry_t entry = old->swap;
+ struct address_space *swap_mapping = swap_address_space(entry);
+ pgoff_t swap_index = swap_cache_index(entry);
+ XA_STATE(xas, &swap_mapping->i_pages, swap_index);
+ int nr_pages = folio_nr_pages(old);
+ int error = 0, i;
/*
* We have arrived here because our zones are constrained, so don't
* limit chance of success by further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
- VM_BUG_ON_FOLIO(folio_test_large(old), old);
- new = shmem_alloc_folio(gfp, 0, info, index);
+ new = shmem_alloc_folio(gfp, folio_order(old), info, index);
if (!new)
return -ENOMEM;
- folio_get(new);
+ folio_ref_add(new, nr_pages);
folio_copy(new, old);
flush_dcache_folio(new);
@@ -1920,18 +1916,25 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
new->swap = entry;
folio_set_swapcache(new);
- /*
- * Our caller will very soon move newpage out of swapcache, but it's
- * a nice clean interface for us to replace oldpage by newpage there.
- */
+ /* Swap cache still stores N entries instead of a high-order entry */
xa_lock_irq(&swap_mapping->i_pages);
- error = shmem_replace_entry(swap_mapping, swap_index, old, new);
+ for (i = 0; i < nr_pages; i++) {
+ void *item = xas_load(&xas);
+
+ if (item != old) {
+ error = -ENOENT;
+ break;
+ }
+
+ xas_store(&xas, new);
+ xas_next(&xas);
+ }
if (!error) {
mem_cgroup_replace_folio(old, new);
- __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
- __lruvec_stat_mod_folio(new, NR_SHMEM, 1);
- __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
- __lruvec_stat_mod_folio(old, NR_SHMEM, -1);
+ __lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
+ __lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
+ __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
+ __lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
}
xa_unlock_irq(&swap_mapping->i_pages);
@@ -1951,7 +1954,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
old->private = NULL;
folio_unlock(old);
- folio_put_refs(old, 2);
+ /*
+ * The old folio are removed from swap cache, drop the 'nr_pages'
+ * reference, as well as one temporary reference getting from swap
+ * cache.
+ */
+ folio_put_refs(old, nr_pages + 1);
return error;
}
--
2.39.3
* [PATCH v5 7/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (5 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
2024-08-12 7:42 ` [PATCH v5 9/9] mm: shmem: support large folio swap out Baolin Wang
8 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from the shmem pagecache after being
added into the swap cache.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d94f02ad7bd1..345e25425e37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -851,7 +851,7 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
xa_unlock_irq(&mapping->i_pages);
- folio_put(folio);
+ folio_put_refs(folio, nr);
BUG_ON(error);
}
--
2.39.3
* [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (6 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 7/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-25 22:31 ` Hugh Dickins
2024-08-12 7:42 ` [PATCH v5 9/9] mm: shmem: support large folio swap out Baolin Wang
8 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
For now, the swap device can only swap in order-0 folios, even though a large
folio may have been swapped out. This requires us to split the large entry
previously saved in the shmem pagecache to support swapping in small folios.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)
diff --git a/mm/shmem.c b/mm/shmem.c
index 345e25425e37..996062dc196b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1990,6 +1990,81 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
swap_free_nr(swap, nr_pages);
}
+static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
+ swp_entry_t swap, int new_order, gfp_t gfp)
+{
+ struct address_space *mapping = inode->i_mapping;
+ XA_STATE_ORDER(xas, &mapping->i_pages, index, new_order);
+ void *alloced_shadow = NULL;
+ int alloced_order = 0, i;
+
+ for (;;) {
+ int order = -1, split_order = 0;
+ void *old = NULL;
+
+ xas_lock_irq(&xas);
+ old = xas_load(&xas);
+ if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+
+ order = xas_get_order(&xas);
+
+ /* Swap entry may have changed before we re-acquire the lock */
+ if (alloced_order &&
+ (old != alloced_shadow || order != alloced_order)) {
+ xas_destroy(&xas);
+ alloced_order = 0;
+ }
+
+ /* Try to split large swap entry in pagecache */
+ if (order > 0 && order > new_order) {
+ if (!alloced_order) {
+ split_order = order;
+ goto unlock;
+ }
+ xas_split(&xas, old, order);
+
+ /*
+ * Re-set the swap entry after splitting, and the swap
+ * offset of the original large entry must be continuous.
+ */
+ for (i = 0; i < 1 << order; i += (1 << new_order)) {
+ pgoff_t aligned_index = round_down(index, 1 << order);
+ swp_entry_t tmp;
+
+ tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
+ __xa_store(&mapping->i_pages, aligned_index + i,
+ swp_to_radix_entry(tmp), 0);
+ }
+ }
+
+unlock:
+ xas_unlock_irq(&xas);
+
+ /* split needed, alloc here and retry. */
+ if (split_order) {
+ xas_split_alloc(&xas, old, split_order, gfp);
+ if (xas_error(&xas))
+ goto error;
+ alloced_shadow = old;
+ alloced_order = split_order;
+ xas_reset(&xas);
+ continue;
+ }
+
+ if (!xas_nomem(&xas, gfp))
+ break;
+ }
+
+error:
+ if (xas_error(&xas))
+ return xas_error(&xas);
+
+ return alloced_order;
+}
+
/*
* Swap in the folio pointed to by *foliop.
* Caller has to make sure that *foliop contains a valid swapped folio.
@@ -2026,12 +2101,37 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
+ int split_order;
+
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
+
+ /*
+ * Now swap device can only swap in order 0 folio, then we
+ * should split the large swap entry stored in the pagecache
+ * if necessary.
+ */
+ split_order = shmem_split_large_entry(inode, index, swap, 0, gfp);
+ if (split_order < 0) {
+ error = split_order;
+ goto failed;
+ }
+
+ /*
+ * If the large swap entry has already been split, it is
+ * necessary to recalculate the new swap entry based on
+ * the old order alignment.
+ */
+ if (split_order > 0) {
+ pgoff_t offset = index - round_down(index, 1 << split_order);
+
+ swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+ }
+
/* Here we actually start the io */
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
--
2.39.3
* [PATCH v5 9/9] mm: shmem: support large folio swap out
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
` (7 preceding siblings ...)
2024-08-12 7:42 ` [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
@ 2024-08-12 7:42 ` Baolin Wang
2024-08-25 23:14 ` Hugh Dickins
8 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-12 7:42 UTC (permalink / raw)
To: akpm, hughd
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
baolin.wang, linux-mm, linux-kernel
Shmem will support large folio allocation [1] [2] to get better performance;
however, memory reclaim still splits the precious large folios when trying to
swap out shmem, which may lead to memory fragmentation and loses the benefit
of large folios for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting them, hence this patch set supports large folio swap-out for shmem.
Note the i915_gem_shmem driver still needs its large folios to be split when
swapping, thus add a new flag 'split_large_folio' to writeback_control to
indicate splitting the large folio.
[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
include/linux/writeback.h | 4 +++
mm/shmem.c | 12 ++++++---
mm/vmscan.c | 32 ++++++++++++++++++-----
4 files changed, 38 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index c5e1c718a6d2..c66cb9c585e1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
.range_start = 0,
.range_end = LLONG_MAX,
.for_reclaim = 1,
+ .split_large_folio = 1,
};
unsigned long i;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 1a54676d843a..10100e22d5c6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,6 +63,7 @@ struct writeback_control {
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned unpinned_netfs_wb:1; /* Cleared I_PINNING_NETFS_WB */
+ unsigned split_large_folio:1; /* Split large folio for shmem writeback */
/*
* When writeback IOs are bounced through async layers, only the
@@ -79,6 +80,9 @@ struct writeback_control {
*/
struct swap_iocb **swap_plug;
+ /* Target list for splitting a large folio */
+ struct list_head *list;
+
/* internal fields used by the ->writepages implementation: */
struct folio_batch fbatch;
pgoff_t index;
diff --git a/mm/shmem.c b/mm/shmem.c
index 996062dc196b..50aeb03c4d34 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -795,7 +795,6 @@ static int shmem_add_to_page_cache(struct folio *folio,
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
- VM_BUG_ON(expected && folio_test_large(folio));
folio_ref_add(folio, nr);
folio->mapping = mapping;
@@ -1482,10 +1481,11 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
* "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
* and its shmem_writeback() needs them to be split when swapping.
*/
- if (folio_test_large(folio)) {
+ if (wbc->split_large_folio && folio_test_large(folio)) {
+try_split:
/* Ensure the subpages are still dirty */
folio_test_set_dirty(folio);
- if (split_huge_page(page) < 0)
+ if (split_huge_page_to_list_to_order(page, wbc->list, 0))
goto redirty;
folio = page_folio(page);
folio_clear_dirty(folio);
@@ -1527,8 +1527,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
}
swap = folio_alloc_swap(folio);
- if (!swap.val)
+ if (!swap.val) {
+ if (nr_pages > 1)
+ goto try_split;
+
goto redirty;
+ }
/*
* Add inode to shmem_unuse()'s list of swapped-out inodes,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 96ce889ea3d0..ba7b67218caf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -628,7 +628,7 @@ typedef enum {
* Calls ->writepage().
*/
static pageout_t pageout(struct folio *folio, struct address_space *mapping,
- struct swap_iocb **plug)
+ struct swap_iocb **plug, struct list_head *folio_list)
{
/*
* If the folio is dirty, only perform writeback if that write
@@ -676,6 +676,16 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
.swap_plug = plug,
};
+ /*
+ * The large shmem folio can be split if CONFIG_THP_SWAP is
+ * not enabled or contiguous swap entries are failed to
+ * allocate.
+ */
+ if (shmem_mapping(mapping) && folio_test_large(folio)) {
+ wbc.list = folio_list;
+ wbc.split_large_folio = !IS_ENABLED(CONFIG_THP_SWAP);
+ }
+
folio_set_reclaim(folio);
res = mapping->a_ops->writepage(&folio->page, &wbc);
if (res < 0)
@@ -1257,11 +1267,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
goto activate_locked_split;
}
}
- } else if (folio_test_swapbacked(folio) &&
- folio_test_large(folio)) {
- /* Split shmem folio */
- if (split_folio_to_list(folio, folio_list))
- goto keep_locked;
}
/*
@@ -1362,12 +1367,25 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
- switch (pageout(folio, mapping, &plug)) {
+ switch (pageout(folio, mapping, &plug, folio_list)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
+ /*
+ * If shmem folio is split when writeback to swap,
+ * the tail pages will make their own pass through
+ * this function and be accounted then.
+ */
+ if (nr_pages > 1 && !folio_test_large(folio)) {
+ sc->nr_scanned -= (nr_pages - 1);
+ nr_pages = 1;
+ }
goto activate_locked;
case PAGE_SUCCESS:
+ if (nr_pages > 1 && !folio_test_large(folio)) {
+ sc->nr_scanned -= (nr_pages - 1);
+ nr_pages = 1;
+ }
stat->nr_pageout += nr_pages;
if (folio_test_writeback(folio))
--
2.39.3
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-12 7:42 ` [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
@ 2024-08-25 21:55 ` Hugh Dickins
2024-08-25 23:28 ` Matthew Wilcox
2024-08-27 10:10 ` Baolin Wang
0 siblings, 2 replies; 23+ messages in thread
From: Hugh Dickins @ 2024-08-25 21:55 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
p.raghav, linux-mm, linux-kernel
On Mon, 12 Aug 2024, Baolin Wang wrote:
> In the following patches, shmem will support the swap out of large folios,
> which means the shmem mappings may contain large order swap entries, so
> using xa_get_order() to get the folio order of the shmem swap entry to
> update the '*start' correctly.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/filemap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4130be74f6fd..4c312aab8b1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> folio = fbatch->folios[idx];
> if (!xa_is_value(folio))
> nr = folio_nr_pages(folio);
> + else
> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> *start = indices[idx] + nr;
> }
> return folio_batch_count(fbatch);
> @@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> folio = fbatch->folios[idx];
> if (!xa_is_value(folio))
> nr = folio_nr_pages(folio);
> + else
> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> *start = indices[idx] + nr;
> }
> return folio_batch_count(fbatch);
> --
Here we have a problem, but I'm not suggesting a fix for it yet: I
need to get other fixes out first, then turn to thinking about this -
it's not easy.
That code is almost always right, so it works well enough for most
people not to have noticed, but there are two issues with it.
The first issue is that it's assuming indices[idx] is already aligned
to nr: not necessarily so. I believe it was already wrong in the
folio_nr_pages() case, but the time I caught it wrong with a printk
was in the swap (value) case. (I may not be stating this correctly:
again more thought needed but I can't spend so long writing.)
And immediately afterwards that kernel build failed with a corrupted
(all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
swapping, and happen to be using the "-o discard" option to ext4 mount.
I've been chasing these failures (maybe a few minutes in, maybe half an
hour) for days, then had the idea of trying without the "-o discard":
whereupon these builds can be repeated successfully for many hours.
IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.
The alignment issue can easily be corrected, but that might not help.
(And those two functions would look better with the rcu_read_unlock()
moved down to just before returning: but again, wouldn't really help.)
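For illustration, a minimal sketch of that alignment correction in
find_get_entries()/find_lock_entries() (assuming a large entry's indices are
naturally aligned to its order; this is not the actual fix):

	nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
	/* step past the whole entry even if indices[idx] is unaligned */
	*start = round_down(indices[idx], nr) + nr;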
The second issue is that swap is more slippery to work with than
folios or pages: in the folio_nr_pages() case, that number is stable
because we hold a refcount (which stops a THP from being split), and
later we'll be taking folio lock too. None of that in the swap case,
so (depending on how a large entry gets split) the xa_get_order() result
is not reliable. Likewise for other uses of xa_get_order() in this series.
There needs to be some kind of locking or retry to make the order usable,
and to avoid shmem_free_swap() occasionally freeing more than it ought.
I'll give it thought after.
Hugh
* Re: [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio()
2024-08-12 7:42 ` [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
@ 2024-08-25 22:05 ` Hugh Dickins
2024-08-27 3:06 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2024-08-25 22:05 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
p.raghav, linux-mm, linux-kernel
On Mon, 12 Aug 2024, Baolin Wang wrote:
> To support large folio swapin for shmem in the following patches, add
> large folio allocation for the new replacement folio in shmem_replace_folio().
> Moreover large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache, therefore
> we should replace each consecutive entries in the swap cache instead
> of using the shmem_replace_entry().
>
> As well as updating statistics and folio reference count using the number
> of pages in the folio.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 54 +++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 31 insertions(+), 23 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f6bab42180ea..d94f02ad7bd1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1889,28 +1889,24 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
> static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index)
> {
> - struct folio *old, *new;
> - struct address_space *swap_mapping;
> - swp_entry_t entry;
> - pgoff_t swap_index;
> - int error;
> -
> - old = *foliop;
> - entry = old->swap;
> - swap_index = swap_cache_index(entry);
> - swap_mapping = swap_address_space(entry);
> + struct folio *new, *old = *foliop;
> + swp_entry_t entry = old->swap;
> + struct address_space *swap_mapping = swap_address_space(entry);
> + pgoff_t swap_index = swap_cache_index(entry);
> + XA_STATE(xas, &swap_mapping->i_pages, swap_index);
> + int nr_pages = folio_nr_pages(old);
> + int error = 0, i;
>
> /*
> * We have arrived here because our zones are constrained, so don't
> * limit chance of success by further cpuset and node constraints.
> */
> gfp &= ~GFP_CONSTRAINT_MASK;
> - VM_BUG_ON_FOLIO(folio_test_large(old), old);
> - new = shmem_alloc_folio(gfp, 0, info, index);
> + new = shmem_alloc_folio(gfp, folio_order(old), info, index);
It is not clear to me whether folio_order(old) will ever be more than 0
here: but if it can be, then care will need to be taken over the gfp flags,
that they are suited to allocating the large folio; and there will need to
be (could be awkward!) fallback to order 0 when that allocation fails.
My own testing never comes to shmem_replace_folio(): it was originally for
one lowend graphics driver; but IIRC there's now a more common case for it.
Hugh
* Re: [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large
2024-08-12 7:42 ` [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
@ 2024-08-25 22:31 ` Hugh Dickins
2024-08-27 6:46 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2024-08-25 22:31 UTC (permalink / raw)
To: Andrew Morton, Baolin Wang
Cc: hughd, willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On Mon, 12 Aug 2024, Baolin Wang wrote:
> Now the swap device can only swap-in order 0 folio, even though a large
> folio is swapped out. This requires us to split the large entry previously
> saved in the shmem pagecache to support the swap in of small folios.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 100 insertions(+)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 345e25425e37..996062dc196b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1990,6 +1990,81 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> swap_free_nr(swap, nr_pages);
> }
>
> +static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> + swp_entry_t swap, int new_order, gfp_t gfp)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + XA_STATE_ORDER(xas, &mapping->i_pages, index, new_order);
> + void *alloced_shadow = NULL;
> + int alloced_order = 0, i;
gfp needs to be adjusted: see fix patch below.
> +
> + for (;;) {
> + int order = -1, split_order = 0;
> + void *old = NULL;
> +
> + xas_lock_irq(&xas);
> + old = xas_load(&xas);
> + if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) {
> + xas_set_err(&xas, -EEXIST);
> + goto unlock;
> + }
> +
> + order = xas_get_order(&xas);
> +
> + /* Swap entry may have changed before we re-acquire the lock */
> + if (alloced_order &&
> + (old != alloced_shadow || order != alloced_order)) {
> + xas_destroy(&xas);
> + alloced_order = 0;
> + }
> +
> + /* Try to split large swap entry in pagecache */
> + if (order > 0 && order > new_order) {
I have not even attempted to understand all the manipulations of order and
new_order and alloced_order and split_order. And further down it turns out
that this is only ever called with new_order 0.
You may be wanting to cater for more generality in future, but for now
please cut this down to the new_order 0 case, and omit that parameter.
It will be easier for us to think about the xa_get_order() races if
the possibilities are more limited.
> + if (!alloced_order) {
> + split_order = order;
> + goto unlock;
> + }
> + xas_split(&xas, old, order);
> +
> + /*
> + * Re-set the swap entry after splitting, and the swap
> + * offset of the original large entry must be continuous.
> + */
> + for (i = 0; i < 1 << order; i += (1 << new_order)) {
> + pgoff_t aligned_index = round_down(index, 1 << order);
> + swp_entry_t tmp;
> +
> + tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> + __xa_store(&mapping->i_pages, aligned_index + i,
> + swp_to_radix_entry(tmp), 0);
> + }
So that is done under xas lock: good. But is the intermediate state
visible to RCU readers, and could that be a problem?
> + }
> +
> +unlock:
> + xas_unlock_irq(&xas);
> +
> + /* split needed, alloc here and retry. */
> + if (split_order) {
> + xas_split_alloc(&xas, old, split_order, gfp);
> + if (xas_error(&xas))
> + goto error;
> + alloced_shadow = old;
> + alloced_order = split_order;
> + xas_reset(&xas);
> + continue;
> + }
> +
> + if (!xas_nomem(&xas, gfp))
> + break;
> + }
> +
> +error:
> + if (xas_error(&xas))
> + return xas_error(&xas);
> +
> + return alloced_order;
> +}
> +
> /*
> * Swap in the folio pointed to by *foliop.
> * Caller has to make sure that *foliop contains a valid swapped folio.
> @@ -2026,12 +2101,37 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> + int split_order;
> +
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> *fault_type |= VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> +
> + /*
> + * Now swap device can only swap in order 0 folio, then we
> + * should split the large swap entry stored in the pagecache
> + * if necessary.
> + */
> + split_order = shmem_split_large_entry(inode, index, swap, 0, gfp);
> + if (split_order < 0) {
> + error = split_order;
> + goto failed;
> + }
> +
> + /*
> + * If the large swap entry has already been split, it is
> + * necessary to recalculate the new swap entry based on
> + * the old order alignment.
> + */
> + if (split_order > 0) {
> + pgoff_t offset = index - round_down(index, 1 << split_order);
> +
> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + }
> +
> /* Here we actually start the io */
> folio = shmem_swapin_cluster(swap, gfp, info, index);
> if (!folio) {
> --
[PATCH] mm: shmem: split large entry if the swapin folio is not large fix
Fix all the
Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x1120d0
(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_RECLAIMABLE).
Fix your code!
warnings from kmalloc_fix_flags() from xas_split_alloc() from
shmem_split_large_entry().
Fixes: a960844d5ac9 ("mm: shmem: split large entry if the swapin folio is not large")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/shmem.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/shmem.c b/mm/shmem.c
index ae2245dce8ae..85e3bd3e709e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1999,6 +1999,9 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
void *alloced_shadow = NULL;
int alloced_order = 0, i;
+ /* Convert user data gfp flags to xarray node gfp flags */
+ gfp &= GFP_RECLAIM_MASK;
+
for (;;) {
int order = -1, split_order = 0;
void *old = NULL;
--
2.35.3
* Re: [PATCH v5 9/9] mm: shmem: support large folio swap out
2024-08-12 7:42 ` [PATCH v5 9/9] mm: shmem: support large folio swap out Baolin Wang
@ 2024-08-25 23:14 ` Hugh Dickins
2024-08-27 6:58 ` Baolin Wang
2024-08-28 8:28 ` [PATCH] mm: shmem: support large folio swap out fix 2 Baolin Wang
0 siblings, 2 replies; 23+ messages in thread
From: Hugh Dickins @ 2024-08-25 23:14 UTC (permalink / raw)
To: Andrew Morton, Baolin Wang
Cc: hughd, willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On Mon, 12 Aug 2024, Baolin Wang wrote:
> Shmem will support large folio allocation [1] [2] to get a better performance,
> however, the memory reclaim still splits the precious large folios when trying
> to swap out shmem, which may lead to the memory fragmentation issue and can not
> take advantage of the large folio for shmeme.
>
> Moreover, the swap code already supports for swapping out large folio without
> split, hence this patch set supports the large folio swap out for shmem.
>
> Note the i915_gem_shmem driver still need to be split when swapping, thus
> add a new flag 'split_large_folio' for writeback_control to indicate spliting
> the large folio.
Is that last paragraph a misunderstanding? The i915 THP splitting in
shmem_writepage() was to avoid mm VM_BUG_ONs and crashes when shmem.c
did not support huge page swapout: but now you are enabling that support,
and such VM_BUG_ONs and crashes are gone (so far as I can see: and this
is written on a laptop using the i915 driver).
I cannot think of why i915 itself would care how mm implements swapout
(beyond enjoying faster): I think all the wbc->split_large_folio you
introduce here should be reverted. But you may know better!
I do need a further change to shmem_writepage() here: see fixup patch
below: that's written to apply on top of this 9/9, but I'd be glad to
see a replacement with wbc->split_large_folio gone, and just one
!IS_ENABLED(CONFIG_THP_SWAP) instead.
>
> [1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
> [2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
I get "Not found" for that [2] link.
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
> include/linux/writeback.h | 4 +++
> mm/shmem.c | 12 ++++++---
> mm/vmscan.c | 32 ++++++++++++++++++-----
> 4 files changed, 38 insertions(+), 11 deletions(-)
[PATCH] mm: shmem: shmem_writepage() split folio at EOF before swapout
Working in a constrained (size= or nr_blocks=) huge=always tmpfs relies
on swapout to split a large folio at EOF, to trim off its excess before
hitting premature ENOSPC: shmem_unused_huge_shrink() contains no code to
handle splitting huge swap blocks, and nobody would want that to be added.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/shmem.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 37c300f69baf..4dd0570962fa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1459,6 +1459,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
swp_entry_t swap;
pgoff_t index;
int nr_pages;
+ bool split = false;
/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -1480,8 +1481,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
* If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
* "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
* and its shmem_writeback() needs them to be split when swapping.
+ *
+ * And shrinkage of pages beyond i_size does not split swap, so
+ * swapout of a large folio crossing i_size needs to split too
+ * (unless fallocate has been used to preallocate beyond EOF).
*/
- if (wbc->split_large_folio && folio_test_large(folio)) {
+ if (folio_test_large(folio)) {
+ split = wbc->split_large_folio;
+ index = shmem_fallocend(inode,
+ DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
+ if (index > folio->index && index < folio_next_index(folio))
+ split = true;
+ }
+
+ if (split) {
try_split:
/* Ensure the subpages are still dirty */
folio_test_set_dirty(folio);
--
2.35.3
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-25 21:55 ` Hugh Dickins
@ 2024-08-25 23:28 ` Matthew Wilcox
2024-08-27 10:10 ` Baolin Wang
1 sibling, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2024-08-25 23:28 UTC (permalink / raw)
To: Hugh Dickins
Cc: Baolin Wang, akpm, david, wangkefeng.wang, chrisl, ying.huang,
21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
p.raghav, linux-mm, linux-kernel
On Sun, Aug 25, 2024 at 02:55:41PM -0700, Hugh Dickins wrote:
> The second issue is that swap is more slippery to work with than
> folios or pages: in the folio_nr_pages() case, that number is stable
> because we hold a refcount (which stops a THP from being split), and
> later we'll be taking folio lock too. None of that in the swap case,
> so (depending on how a large entry gets split) the xa_get_order() result
> is not reliable. Likewise for other uses of xa_get_order() in this series.
>
> There needs to be some kind of locking or retry to make the order usable,
> and to avoid shmem_free_swap() occasionally freeing more than it ought.
> I'll give it thought after.
My original thought was that we'd take a bit from the swap entry in
order to indicate the order of the entry. I was surprised to see the
xa_get_order() implementation, but didn't remember why it wouldn't work.
Sorry.
Anyway, that's how I think it should be fixed. Is that enough? Holding
a reference on the folio prevents truncation, splitting, and so on.
There's no reference to be held on a swap entry, so could we have some
moderately implausible series of operations while holding only the RCU
read lock that would cause us to go wrong?
* Re: [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio()
2024-08-25 22:05 ` Hugh Dickins
@ 2024-08-27 3:06 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-27 3:06 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On 2024/8/26 06:05, Hugh Dickins wrote:
> On Mon, 12 Aug 2024, Baolin Wang wrote:
>
>> To support large folio swapin for shmem in the following patches, add
>> large folio allocation for the new replacement folio in shmem_replace_folio().
>> Moreover large folios occupy N consecutive entries in the swap cache
>> instead of using multi-index entries like the page cache, therefore
>> we should replace each consecutive entries in the swap cache instead
>> of using the shmem_replace_entry().
>>
>> As well as updating statistics and folio reference count using the number
>> of pages in the folio.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/shmem.c | 54 +++++++++++++++++++++++++++++++-----------------------
>> 1 file changed, 31 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f6bab42180ea..d94f02ad7bd1 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1889,28 +1889,24 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
>> static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>> struct shmem_inode_info *info, pgoff_t index)
>> {
>> - struct folio *old, *new;
>> - struct address_space *swap_mapping;
>> - swp_entry_t entry;
>> - pgoff_t swap_index;
>> - int error;
>> -
>> - old = *foliop;
>> - entry = old->swap;
>> - swap_index = swap_cache_index(entry);
>> - swap_mapping = swap_address_space(entry);
>> + struct folio *new, *old = *foliop;
>> + swp_entry_t entry = old->swap;
>> + struct address_space *swap_mapping = swap_address_space(entry);
>> + pgoff_t swap_index = swap_cache_index(entry);
>> + XA_STATE(xas, &swap_mapping->i_pages, swap_index);
>> + int nr_pages = folio_nr_pages(old);
>> + int error = 0, i;
>>
>> /*
>> * We have arrived here because our zones are constrained, so don't
>> * limit chance of success by further cpuset and node constraints.
>> */
>> gfp &= ~GFP_CONSTRAINT_MASK;
>> - VM_BUG_ON_FOLIO(folio_test_large(old), old);
>> - new = shmem_alloc_folio(gfp, 0, info, index);
>> + new = shmem_alloc_folio(gfp, folio_order(old), info, index);
>
> It is not clear to me whether folio_order(old) will ever be more than 0
> here: but if it can be, then care will need to be taken over the gfp flags,
With this patch set, it can be a large folio. If a large folio still
exists in the swap cache, we will get a large folio during swap in.
And yes, the gfp flags should be updated. How about the following fix?
> that they are suited to allocating the large folio; and there will need to
> be (could be awkward!) fallback to order 0 when that allocation fails.
I do not think we should fall back to order 0 for a large folio, which would
introduce more complex logic; for example, we would have to split the
original large swap entries in the shmem mapping, and it is tricky to free
large swap entries, etc. So I want to keep it simple for now.
> My own testing never comes to shmem_replace_folio(): it was originally for
> one lowend graphics driver; but IIRC there's now a more common case for it.
Good to know. Thank you very much for your valuable input.
[PATCH] mm: shmem: fix the gfp flag for large folio allocation
In shmem_replace_folio(), it may be necessary to allocate a large folio,
so update the gfp flags to ensure they are suitable for allocating the
large folio.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index dd384d4ab035..d8038a66b110 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -155,7 +155,7 @@ static unsigned long shmem_default_max_inodes(void)
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
- struct mm_struct *fault_mm, vm_fault_t *fault_type);
+ struct vm_area_struct *vma, vm_fault_t *fault_type);
static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
{
@@ -1887,7 +1887,8 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
}
static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+ struct shmem_inode_info *info, pgoff_t index,
+ struct vm_area_struct *vma)
{
struct folio *new, *old = *foliop;
swp_entry_t entry = old->swap;
@@ -1902,6 +1903,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
* limit chance of success by further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
+ if (nr_pages > 1) {
+ gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+
+ gfp = limit_gfp_mask(huge_gfp, gfp);
+ }
+
new = shmem_alloc_folio(gfp, folio_order(old), info, index);
if (!new)
return -ENOMEM;
@@ -2073,10 +2080,11 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp,
- gfp_t gfp, struct mm_struct *fault_mm,
+ gfp_t gfp, struct vm_area_struct *vma,
vm_fault_t *fault_type)
{
struct address_space *mapping = inode->i_mapping;
+ struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
struct swap_info_struct *si;
struct folio *folio = NULL;
@@ -2162,7 +2170,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
arch_swap_restore(folio_swap(swap, folio), folio);
if (shmem_should_replace_folio(folio, gfp)) {
- error = shmem_replace_folio(&folio, gfp, info, index);
+ error = shmem_replace_folio(&folio, gfp, info, index, vma);
if (error)
goto failed;
}
@@ -2243,7 +2251,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (xa_is_value(folio)) {
error = shmem_swapin_folio(inode, index, &folio,
- sgp, gfp, fault_mm, fault_type);
+ sgp, gfp, vma, fault_type);
if (error == -EEXIST)
goto repeat;
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large
2024-08-25 22:31 ` Hugh Dickins
@ 2024-08-27 6:46 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-27 6:46 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On 2024/8/26 06:31, Hugh Dickins wrote:
> On Mon, 12 Aug 2024, Baolin Wang wrote:
>
>> Now the swap device can only swap in order-0 folios, even though a large
>> folio may have been swapped out. This requires us to split the large entry
>> previously saved in the shmem pagecache to support swap-in of small folios.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/shmem.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 100 insertions(+)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 345e25425e37..996062dc196b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1990,6 +1990,81 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>> swap_free_nr(swap, nr_pages);
>> }
>>
>> +static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>> + swp_entry_t swap, int new_order, gfp_t gfp)
>> +{
>> + struct address_space *mapping = inode->i_mapping;
>> + XA_STATE_ORDER(xas, &mapping->i_pages, index, new_order);
>> + void *alloced_shadow = NULL;
>> + int alloced_order = 0, i;
>
> gfp needs to be adjusted: see fix patch below.
Ah, good catch. Thank you Hugh.
>> +
>> + for (;;) {
>> + int order = -1, split_order = 0;
>> + void *old = NULL;
>> +
>> + xas_lock_irq(&xas);
>> + old = xas_load(&xas);
>> + if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) {
>> + xas_set_err(&xas, -EEXIST);
>> + goto unlock;
>> + }
>> +
>> + order = xas_get_order(&xas);
>> +
>> + /* Swap entry may have changed before we re-acquire the lock */
>> + if (alloced_order &&
>> + (old != alloced_shadow || order != alloced_order)) {
>> + xas_destroy(&xas);
>> + alloced_order = 0;
>> + }
>> +
>> + /* Try to split large swap entry in pagecache */
>> + if (order > 0 && order > new_order) {
>
> I have not even attempted to understand all the manipulations of order and
> new_order and alloced_order and split_order. And further down it turns out
> that this is only ever called with new_order 0.
>
> You may be wanting to cater for more generality in future, but for now
> please cut this down to the new_order 0 case, and omit that parameter.
> It will be easier for us to think about the xa_get_order() races if
> the possibilities are more limited.
Sure. I will drop the 'new_order' parameter with the following fix.
>
>> + if (!alloced_order) {
>> + split_order = order;
>> + goto unlock;
>> + }
>> + xas_split(&xas, old, order);
>> +
>> + /*
>> + * Re-set the swap entry after splitting, and the swap
>> + * offset of the original large entry must be continuous.
>> + */
>> + for (i = 0; i < 1 << order; i += (1 << new_order)) {
>> + pgoff_t aligned_index = round_down(index, 1 << order);
>> + swp_entry_t tmp;
>> +
>> + tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> + __xa_store(&mapping->i_pages, aligned_index + i,
>> + swp_to_radix_entry(tmp), 0);
>> + }
>
> So that is done under xas lock: good. But is the intermediate state
> visible to RCU readers, and could that be a problem?
In xas_split(), the multi-index entry is split into smaller entries, and
each of these smaller entries is initially set to the old swap value.
The __xa_store() loop then re-sets them to the new swap values. Although
RCU readers might observe the old swap value in the meantime, I have not
seen any problems so far (maybe I missed something).
For concurrent shmem swap-in, there are checks in shmem_swapin_folio()
(including folio->swap.val and shmem_confirm_swap() validation) to ensure
the correctness of the swap values.
For shmem_partial_swap_usage(), we may get racy swap usage counts, but
that is not a problem according to its comments:
" * This is safe to call without i_rwsem or the i_pages lock thanks to RCU,
 * as long as the inode doesn't go away and racy results are not a problem."
For shmem truncation, when removing a racy swap entry from the shmem page
cache, it uses xa_cmpxchg_irq() to synchronize on the correct swap state.
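To make that reader-side re-check concrete, here is a minimal sketch of the
kind of validation described above (this is not the exact mm/shmem.c code,
and the helper name is only illustrative):

/*
 * Hypothetical helper, for illustration only: after the swap-in path has
 * looked up the folio, it re-checks both folio->swap and the mapping slot,
 * so a reader that raced with the xas_split()/__xa_store() above retries.
 */
static bool shmem_swap_entry_still_valid(struct address_space *mapping,
					 pgoff_t index, swp_entry_t swap,
					 struct folio *folio)
{
	/* The folio must still be the swap cache folio for this entry */
	if (!folio_test_swapcache(folio) || folio->swap.val != swap.val)
		return false;
	/* And the mapping slot must still hold the same value entry */
	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
}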
[PATCH] mm: shmem: split large entry if the swapin folio is not large
fix 2
Now we only split the large swap entry to order 0, so drop the 'new_order'
parameter.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index d8038a66b110..f00b7b99ad09 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1998,10 +1998,10 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
}
static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
- swp_entry_t swap, int new_order, gfp_t gfp)
+ swp_entry_t swap, gfp_t gfp)
{
struct address_space *mapping = inode->i_mapping;
- XA_STATE_ORDER(xas, &mapping->i_pages, index, new_order);
+ XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
void *alloced_shadow = NULL;
int alloced_order = 0, i;
@@ -2026,7 +2026,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
}
/* Try to split large swap entry in pagecache */
- if (order > 0 && order > new_order) {
+ if (order > 0) {
if (!alloced_order) {
split_order = order;
goto unlock;
@@ -2037,7 +2037,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
* Re-set the swap entry after splitting, and the swap
* offset of the original large entry must be continuous.
*/
- for (i = 0; i < 1 << order; i += (1 << new_order)) {
+ for (i = 0; i < 1 << order; i++) {
pgoff_t aligned_index = round_down(index, 1 << order);
swp_entry_t tmp;
@@ -2123,7 +2123,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* should split the large swap entry stored in the pagecache
* if necessary.
*/
- split_order = shmem_split_large_entry(inode, index, swap, 0, gfp);
+ split_order = shmem_split_large_entry(inode, index, swap, gfp);
if (split_order < 0) {
error = split_order;
goto failed;
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v5 9/9] mm: shmem: support large folio swap out
2024-08-25 23:14 ` Hugh Dickins
@ 2024-08-27 6:58 ` Baolin Wang
2024-08-28 8:28 ` [PATCH] mm: shmem: support large folio swap out fix 2 Baolin Wang
1 sibling, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-27 6:58 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On 2024/8/26 07:14, Hugh Dickins wrote:
> On Mon, 12 Aug 2024, Baolin Wang wrote:
>
>> Shmem will support large folio allocation [1] [2] to get a better performance,
>> however, the memory reclaim still splits the precious large folios when trying
>> to swap out shmem, which may lead to the memory fragmentation issue and can not
>> take advantage of the large folio for shmeme.
>>
>> Moreover, the swap code already supports for swapping out large folio without
>> split, hence this patch set supports the large folio swap out for shmem.
>>
>> Note the i915_gem_shmem driver still needs its folios to be split when
>> swapping, thus add a new flag 'split_large_folio' for writeback_control to
>> indicate splitting the large folio.
>
> Is that last paragraph a misunderstanding? The i915 THP splitting in
> shmem_writepage() was to avoid mm VM_BUG_ONs and crashes when shmem.c
> did not support huge page swapout: but now you are enabling that support,
> and such VM_BUG_ONs and crashes are gone (so far as I can see: and this
> is written on a laptop using the i915 driver).
Thanks for the history, now I understand.
> I cannot think of why i915 itself would care how mm implements swapout
> (beyond enjoying faster): I think all the wbc->split_large_folio you
> introduce here should be reverted. But you may know better!
>
> I do need a further change to shmem_writepage() here: see fixup patch
> below: that's written to apply on top of this 9/9, but I'd be glad to
> see a replacement with wbc->split_large_folio gone, and just one
> !IS_ENABLED(CONFIG_THP_SWAP) instead.
Sure. After Andrew queues your fixes, I can send a proper fixup patch to
remove 'wbc->split_large_folio'.
>> [1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
>> [2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>
> I get "Not found" for that [2] link.
Weird, I can access the link, not sure why.
>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
>> include/linux/writeback.h | 4 +++
>> mm/shmem.c | 12 ++++++---
>> mm/vmscan.c | 32 ++++++++++++++++++-----
>> 4 files changed, 38 insertions(+), 11 deletions(-)
>
> [PATCH] mm: shmem: shmem_writepage() split folio at EOF before swapout
>
> Working in a constrained (size= or nr_blocks=) huge=always tmpfs relies
> on swapout to split a large folio at EOF, to trim off its excess before
> hitting premature ENOSPC: shmem_unused_huge_shrink() contains no code to
> handle splitting huge swap blocks, and nobody would want that to be added.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> mm/shmem.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 37c300f69baf..4dd0570962fa 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1459,6 +1459,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> swp_entry_t swap;
> pgoff_t index;
> int nr_pages;
> + bool split = false;
>
> /*
> * Our capabilities prevent regular writeback or sync from ever calling
> @@ -1480,8 +1481,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
> * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> * and its shmem_writeback() needs them to be split when swapping.
> + *
> + * And shrinkage of pages beyond i_size does not split swap, so
> + * swapout of a large folio crossing i_size needs to split too
> + * (unless fallocate has been used to preallocate beyond EOF).
> */
> - if (wbc->split_large_folio && folio_test_large(folio)) {
> + if (folio_test_large(folio)) {
> + split = wbc->split_large_folio;
> + index = shmem_fallocend(inode,
> + DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> + if (index > folio->index && index < folio_next_index(folio))
> + split = true;
> + }
> +
> + if (split) {
> try_split:
> /* Ensure the subpages are still dirty */
> folio_test_set_dirty(folio);
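To illustrate the i_size check above with hypothetical numbers (not taken
from the patch itself):

/*
 * Example: i_size ends within page index 100, and the folio is 16 pages
 * covering indices 96..111. With no fallocate beyond EOF, shmem_fallocend()
 * returns 100, which lies strictly inside the folio, so the folio crosses
 * EOF and must be split before swapout.
 */
pgoff_t eof = 100;		/* DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) */
pgoff_t start = 96;		/* folio->index */
pgoff_t next = 112;		/* folio_next_index(folio) */
bool crosses_eof = eof > start && eof < next;	/* true: split before swapout */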
Thanks for the fix, Hugh. Your review is much appreciated.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-25 21:55 ` Hugh Dickins
2024-08-25 23:28 ` Matthew Wilcox
@ 2024-08-27 10:10 ` Baolin Wang
2024-08-29 8:07 ` Hugh Dickins
1 sibling, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-27 10:10 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On 2024/8/26 05:55, Hugh Dickins wrote:
> On Mon, 12 Aug 2024, Baolin Wang wrote:
>
>> In the following patches, shmem will support the swap out of large folios,
>> which means the shmem mappings may contain large order swap entries, so
>> using xa_get_order() to get the folio order of the shmem swap entry to
>> update the '*start' correctly.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/filemap.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 4130be74f6fd..4c312aab8b1f 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>> folio = fbatch->folios[idx];
>> if (!xa_is_value(folio))
>> nr = folio_nr_pages(folio);
>> + else
>> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>> *start = indices[idx] + nr;
>> }
>> return folio_batch_count(fbatch);
>> @@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>> folio = fbatch->folios[idx];
>> if (!xa_is_value(folio))
>> nr = folio_nr_pages(folio);
>> + else
>> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>> *start = indices[idx] + nr;
>> }
>> return folio_batch_count(fbatch);
>> --
>
> Here we have a problem, but I'm not suggesting a fix for it yet: I
> need to get other fixes out first, then turn to thinking about this -
> it's not easy.
Thanks for raising the issues.
>
> That code is almost always right, so it works well enough for most
> people not to have noticed, but there are two issues with it.
>
> The first issue is that it's assuming indices[idx] is already aligned
> to nr: not necessarily so. I believe it was already wrong in the
> folio_nr_pages() case, but the time I caught it wrong with a printk
> was in the swap (value) case. (I may not be stating this correctly:
> again more thought needed but I can't spend so long writing.)
>
> And immediately afterwards that kernel build failed with a corrupted
> (all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
> swapping, and happen to be using the "-o discard" option to ext4 mount.
>
> I've been chasing these failures (maybe a few minutes in, maybe half an
> hour) for days, then had the idea of trying without the "-o discard":
> whereupon these builds can be repeated successfully for many hours.
> IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.
>
> The alignment issue can easily be corrected, but that might not help.
> (And those two functions would look better with the rcu_read_unlock()
> moved down to just before returning: but again, wouldn't really help.)
>
> The second issue is that swap is more slippery to work with than
> folios or pages: in the folio_nr_pages() case, that number is stable
> because we hold a refcount (which stops a THP from being split), and
> later we'll be taking folio lock too. None of that in the swap case,
> so (depending on how a large entry gets split) the xa_get_order() result
> is not reliable. Likewise for other uses of xa_get_order() in this series.
Now we have 2 users of xa_get_order() in this series:
1) shmem_partial_swap_usage(): this is acceptable, since racy results
are not a problem for the swap statistics.
2) shmem_undo_range(): when freeing a large swap entry, it uses
xa_cmpxchg_irq() to make sure the swap value has not changed (in case the
large swap entry was split). If the cmpxchg fails, shmem_undo_range() will
retry at the current index. So it seems not too bad?
static long shmem_free_swap(struct address_space *mapping,
pgoff_t index, void *radswap)
{
int order = xa_get_order(&mapping->i_pages, index);
void *old;
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return 0;
free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
return 1 << order;
}
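For illustration, a simplified sketch of that retry idea (this is
hypothetical code, not the real shmem_undo_range() loop):

	/*
	 * Hypothetical retry loop: if shmem_free_swap() returns 0, the entry
	 * changed under us (for example the large entry was split or swapped
	 * back in), so re-read the slot and try again at the same index.
	 */
	for (;;) {
		void *entry = xa_load(&mapping->i_pages, index);

		if (!xa_is_value(entry))
			break;		/* no longer a swap entry here */
		if (shmem_free_swap(mapping, index, entry))
			break;		/* freed 1 << order pages of swap */
		/* lost the race: retry with the freshly loaded value */
	}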
> There needs to be some kind of locking or retry to make the order usable,
> and to avoid shmem_free_swap() occasionally freeing more than it ought.
> I'll give it thought after.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] mm: shmem: support large folio swap out fix 2
2024-08-25 23:14 ` Hugh Dickins
2024-08-27 6:58 ` Baolin Wang
@ 2024-08-28 8:28 ` Baolin Wang
1 sibling, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2024-08-28 8:28 UTC (permalink / raw)
To: hughd
Cc: 21cnbao, akpm, baolin.wang, chrisl, da.gomez, david, ioworker0,
linux-kernel, linux-mm, p.raghav, ryan.roberts, shy828301,
wangkefeng.wang, willy, ying.huang, ziy
As Hugh said:
"
The i915 THP splitting in shmem_writepage() was to avoid mm VM_BUG_ONs and
crashes when shmem.c did not support huge page swapout: but now you are
enabling that support, and such VM_BUG_ONs and crashes are gone (so far
as I can see: and this is written on a laptop using the i915 driver).
I cannot think of why i915 itself would care how mm implements swapout
(beyond enjoying faster): I think all the wbc->split_large_folio you
introduce here should be reverted.
"
So this fixup patch removes wbc->split_large_folio, as suggested by Hugh.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
This fixup patch is based on the 2024-08-28 mm-unstable branch.
Andrew, please help squash this fixup patch. Thanks.
---
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 -
include/linux/writeback.h | 1 -
mm/shmem.c | 9 ++++-----
mm/vmscan.c | 4 +---
4 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index c66cb9c585e1..c5e1c718a6d2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,7 +308,6 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
.range_start = 0,
.range_end = LLONG_MAX,
.for_reclaim = 1,
- .split_large_folio = 1,
};
unsigned long i;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 10100e22d5c6..51278327b3c6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,7 +63,6 @@ struct writeback_control {
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned unpinned_netfs_wb:1; /* Cleared I_PINNING_NETFS_WB */
- unsigned split_large_folio:1; /* Split large folio for shmem writeback */
/*
* When writeback IOs are bounced through async layers, only the
diff --git a/mm/shmem.c b/mm/shmem.c
index 16099340ca1d..2b0209d6ac9c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1473,19 +1473,18 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
/*
- * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or
- * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
- * and its shmem_writeback() needs them to be split when swapping.
+ * If CONFIG_THP_SWAP is not enabled, the large folio should be
+ * split when swapping.
*
* And shrinkage of pages beyond i_size does not split swap, so
* swapout of a large folio crossing i_size needs to split too
* (unless fallocate has been used to preallocate beyond EOF).
*/
if (folio_test_large(folio)) {
- split = wbc->split_large_folio;
index = shmem_fallocend(inode,
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
- if (index > folio->index && index < folio_next_index(folio))
+ if ((index > folio->index && index < folio_next_index(folio)) ||
+ !IS_ENABLED(CONFIG_THP_SWAP))
split = true;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 283e3f9d652b..f27792e77a0f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -681,10 +681,8 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
* not enabled or contiguous swap entries are failed to
* allocate.
*/
- if (shmem_mapping(mapping) && folio_test_large(folio)) {
+ if (shmem_mapping(mapping) && folio_test_large(folio))
wbc.list = folio_list;
- wbc.split_large_folio = !IS_ENABLED(CONFIG_THP_SWAP);
- }
folio_set_reclaim(folio);
res = mapping->a_ops->writepage(&folio->page, &wbc);
--
2.39.3
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-27 10:10 ` Baolin Wang
@ 2024-08-29 8:07 ` Hugh Dickins
2024-08-29 12:40 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Hugh Dickins @ 2024-08-29 8:07 UTC (permalink / raw)
To: Baolin Wang, Andrew Morton
Cc: Hugh Dickins, willy, david, wangkefeng.wang, chrisl, ying.huang,
21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
p.raghav, linux-mm, linux-kernel
On Tue, 27 Aug 2024, Baolin Wang wrote:
> On 2024/8/26 05:55, Hugh Dickins wrote:
> > On Mon, 12 Aug 2024, Baolin Wang wrote:
> >
> >> In the following patches, shmem will support the swap out of large folios,
> >> which means the shmem mappings may contain large order swap entries, so
> >> using xa_get_order() to get the folio order of the shmem swap entry to
> >> update the '*start' correctly.
> >>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> ---
> >> mm/filemap.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index 4130be74f6fd..4c312aab8b1f 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> >> folio = fbatch->folios[idx];
> >> if (!xa_is_value(folio))
> >> nr = folio_nr_pages(folio);
> >> + else
> >> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> >> *start = indices[idx] + nr;
> >> }
> >> return folio_batch_count(fbatch);
> >> @@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> >> folio = fbatch->folios[idx];
> >> if (!xa_is_value(folio))
> >> nr = folio_nr_pages(folio);
> >> + else
> >> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> >> *start = indices[idx] + nr;
> >> }
> >> return folio_batch_count(fbatch);
> >> --
> >
> > Here we have a problem, but I'm not suggesting a fix for it yet: I
> > need to get other fixes out first, then turn to thinking about this -
> > it's not easy.
>
> Thanks for raising the issues.
>
> >
> > That code is almost always right, so it works well enough for most
> > people not to have noticed, but there are two issues with it.
> >
> > The first issue is that it's assuming indices[idx] is already aligned
> > to nr: not necessarily so. I believe it was already wrong in the
> > folio_nr_pages() case, but the time I caught it wrong with a printk
> > was in the swap (value) case. (I may not be stating this correctly:
> > again more thought needed but I can't spend so long writing.)
> >
> > And immediately afterwards that kernel build failed with a corrupted
> > (all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
> > swapping, and happen to be using the "-o discard" option to ext4 mount.
> >
> > I've been chasing these failures (maybe a few minutes in, maybe half an
> > hour) for days, then had the idea of trying without the "-o discard":
> > whereupon these builds can be repeated successfully for many hours.
> > IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.
> >
> > The alignment issue can easily be corrected, but that might not help.
> > (And those two functions would look better with the rcu_read_unlock()
> > moved down to just before returning: but again, wouldn't really help.)
> >
> > The second issue is that swap is more slippery to work with than
> > folios or pages: in the folio_nr_pages() case, that number is stable
> > because we hold a refcount (which stops a THP from being split), and
> > later we'll be taking folio lock too. None of that in the swap case,
> > so (depending on how a large entry gets split) the xa_get_order() result
> > is not reliable. Likewise for other uses of xa_get_order() in this series.
>
> Now we have 2 users of xa_get_order() in this series:
>
> 1) shmem_partial_swap_usage(): this is acceptable, since racy results are not
> a problem for the swap statistics.
Yes: there might be room for improvement, but no big deal there.
>
> 2) shmem_undo_range(): when freeing a large swap entry, it will use
> xa_cmpxchg_irq() to make sure the swap value is not changed (in case the large
> swap entry is split). If failed to cmpxchg, then it will use current index to
> retry in shmem_undo_range(). So seems not too bad?
Right, I was missing the cmpxchg aspect. I am not entirely convinced of
the safety in proceeding in this way, but I shouldn't spread FUD without
justification. Today, no yesterday, I realized what might be the actual
problem, and it's not at all these entry splitting races I had suspected.
Fix below. Successful testing on mm-everything-2024-08-24-07-21 (well,
that minus the commit which spewed warnings from bootup) confirmed it.
But testing on mm-everything-2024-08-28-21-38 very quickly failed:
unrelated to this series, presumably caused by patch or patches added
since 08-24, one kind of crash on one machine (some memcg thing called
from isolate_migratepages_block), another kind of crash on another (some
memcg thing called from __read_swap_cache_async), I'm exhausted by now
but will investigate later in the day (or hope someone else has).
[PATCH] mm: filemap: use xa_get_order() to get the swap entry order: fix
find_lock_entries(), used in the first pass of shmem_undo_range() and
truncate_inode_pages_range() before partial folios are dealt with, has
to be careful to avoid those partial folios: as its doc helpfully says,
"Folios which are partially outside the range are not returned". Of
course, the same must be true of any value entries returned, otherwise
truncation and hole-punch risk erasing swapped areas - as has been seen.
Rewrite find_lock_entries() to emphasize that, following the same pattern
for folios and for value entries.
Adjust find_get_entries() slightly, to get order while still holding
rcu_read_lock(), and to round down the updated start: good changes, like
find_lock_entries() now does, but it's unclear if either is ever important.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/filemap.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 885a8ed9d00d..88a2ed008474 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2047,10 +2047,9 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
if (!folio_batch_add(fbatch, folio))
break;
}
- rcu_read_unlock();
if (folio_batch_count(fbatch)) {
- unsigned long nr = 1;
+ unsigned long nr;
int idx = folio_batch_count(fbatch) - 1;
folio = fbatch->folios[idx];
@@ -2058,8 +2057,10 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
nr = folio_nr_pages(folio);
else
nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
- *start = indices[idx] + nr;
+ *start = round_down(indices[idx] + nr, nr);
}
+ rcu_read_unlock();
+
return folio_batch_count(fbatch);
}
@@ -2091,10 +2092,17 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
rcu_read_lock();
while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
+ unsigned long base;
+ unsigned long nr;
+
if (!xa_is_value(folio)) {
- if (folio->index < *start)
+ nr = folio_nr_pages(folio);
+ base = folio->index;
+ /* Omit large folio which begins before the start */
+ if (base < *start)
goto put;
- if (folio_next_index(folio) - 1 > end)
+ /* Omit large folio which extends beyond the end */
+ if (base + nr - 1 > end)
goto put;
if (!folio_trylock(folio))
goto put;
@@ -2103,7 +2111,19 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
goto unlock;
VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
folio);
+ } else {
+ nr = 1 << xa_get_order(&mapping->i_pages, xas.xa_index);
+ base = xas.xa_index & ~(nr - 1);
+ /* Omit order>0 value which begins before the start */
+ if (base < *start)
+ continue;
+ /* Omit order>0 value which extends beyond the end */
+ if (base + nr - 1 > end)
+ break;
}
+
+ /* Update start now so that last update is correct on return */
+ *start = base + nr;
indices[fbatch->nr] = xas.xa_index;
if (!folio_batch_add(fbatch, folio))
break;
@@ -2115,17 +2135,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
}
rcu_read_unlock();
- if (folio_batch_count(fbatch)) {
- unsigned long nr = 1;
- int idx = folio_batch_count(fbatch) - 1;
-
- folio = fbatch->folios[idx];
- if (!xa_is_value(folio))
- nr = folio_nr_pages(folio);
- else
- nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
- *start = indices[idx] + nr;
- }
return folio_batch_count(fbatch);
}
--
2.35.3
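As a worked example of the alignment point above (hypothetical numbers, not
taken from the patch):

/*
 * An order-4 (16-page) value entry starts at index 32. The xarray walk may
 * report an xa_index that is not aligned to the entry (say 37), so the base
 * must be derived by masking rather than taken from the reported index.
 */
unsigned long nr = 1 << 4;			/* 1 << xa_get_order(...) */
unsigned long xa_index = 37;			/* unaligned index from the walk */
unsigned long base = xa_index & ~(nr - 1);	/* 32: start of the large entry */
unsigned long next_start = base + nr;		/* 48: where *start advances to */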
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-29 8:07 ` Hugh Dickins
@ 2024-08-29 12:40 ` Baolin Wang
2024-08-30 10:18 ` Hugh Dickins
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2024-08-29 12:40 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
linux-mm, linux-kernel
On 2024/8/29 16:07, Hugh Dickins wrote:
> On Tue, 27 Aug 2024, Baolin Wang wrote:
>> On 2024/8/26 05:55, Hugh Dickins wrote:
>>> On Mon, 12 Aug 2024, Baolin Wang wrote:
>>>
>>>> In the following patches, shmem will support the swap out of large folios,
>>>> which means the shmem mappings may contain large order swap entries, so
>>>> using xa_get_order() to get the folio order of the shmem swap entry to
>>>> update the '*start' correctly.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> mm/filemap.c | 4 ++++
>>>> 1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index 4130be74f6fd..4c312aab8b1f 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>>>> folio = fbatch->folios[idx];
>>>> if (!xa_is_value(folio))
>>>> nr = folio_nr_pages(folio);
>>>> + else
>>>> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>>>> *start = indices[idx] + nr;
>>>> }
>>>> return folio_batch_count(fbatch);
>>>> @@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>>>> folio = fbatch->folios[idx];
>>>> if (!xa_is_value(folio))
>>>> nr = folio_nr_pages(folio);
>>>> + else
>>>> + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>>>> *start = indices[idx] + nr;
>>>> }
>>>> return folio_batch_count(fbatch);
>>>> --
>>>
>>> Here we have a problem, but I'm not suggesting a fix for it yet: I
>>> need to get other fixes out first, then turn to thinking about this -
>>> it's not easy.
>>
>> Thanks for raising the issues.
>>
>>>
>>> That code is almost always right, so it works well enough for most
>>> people not to have noticed, but there are two issues with it.
>>>
>>> The first issue is that it's assuming indices[idx] is already aligned
>>> to nr: not necessarily so. I believe it was already wrong in the
>>> folio_nr_pages() case, but the time I caught it wrong with a printk
>>> was in the swap (value) case. (I may not be stating this correctly:
>>> again more thought needed but I can't spend so long writing.)
>>>
>>> And immediately afterwards that kernel build failed with a corrupted
>>> (all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
>>> swapping, and happen to be using the "-o discard" option to ext4 mount.
>>>
>>> I've been chasing these failures (maybe a few minutes in, maybe half an
>>> hour) for days, then had the idea of trying without the "-o discard":
>>> whereupon these builds can be repeated successfully for many hours.
>>> IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.
>>>
>>> The alignment issue can easily be corrected, but that might not help.
>>> (And those two functions would look better with the rcu_read_unlock()
>>> moved down to just before returning: but again, wouldn't really help.)
>>>
>>> The second issue is that swap is more slippery to work with than
>>> folios or pages: in the folio_nr_pages() case, that number is stable
>>> because we hold a refcount (which stops a THP from being split), and
>>> later we'll be taking folio lock too. None of that in the swap case,
>>> so (depending on how a large entry gets split) the xa_get_order() result
>>> is not reliable. Likewise for other uses of xa_get_order() in this series.
>>
>> Now we have 2 users of xa_get_order() in this series:
>>
>> 1) shmem_partial_swap_usage(): this is acceptable, since racy results are not
>> a problem for the swap statistics.
>
> Yes: there might be room for improvement, but no big deal there.
>
>>
>> 2) shmem_undo_range(): when freeing a large swap entry, it will use
>> xa_cmpxchg_irq() to make sure the swap value is not changed (in case the large
>> swap entry is split). If failed to cmpxchg, then it will use current index to
>> retry in shmem_undo_range(). So seems not too bad?
>
> Right, I was missing the cmpxchg aspect. I am not entirely convinced of
> the safety in proceeding in this way, but I shouldn't spread FUD without
> justification. Today, no yesterday, I realized what might be the actual
> problem, and it's not at all these entry splitting races I had suspected.
>
> Fix below. Successful testing on mm-everything-2024-08-24-07-21 (well,
> that minus the commit which spewed warnings from bootup) confirmed it.
> But testing on mm-everything-2024-08-28-21-38 very quickly failed:
> unrelated to this series, presumably caused by patch or patches added
> since 08-24, one kind of crash on one machine (some memcg thing called
> from isolate_migratepages_block), another kind of crash on another (some
> memcg thing called from __read_swap_cache_async), I'm exhausted by now
> but will investigate later in the day (or hope someone else has).
I saw the isolate_migratepages_block crash issue on
mm-everything-2024-08-28-09-32, and after I reverted Kefeng's series
"[PATCH 0/4] mm: convert to folio_isolate_movable()", the
isolate_migratepages_block issue seemed to be resolved (at least I
cannot reproduce it).
I have already pointed out some potential issues in Kefeng's series [1],
and Andrew has dropped that series from mm-everything-2024-08-28-21-38.
However, you can still hit the isolate_migratepages_block issue on
mm-everything-2024-08-28-21-38 while I cannot, which is weird.
[1]
https://lore.kernel.org/all/3f8300d9-1c21-46ad-a311-e97dc94eda08@linux.alibaba.com/
[ 337.999054] page: refcount:0 mapcount:0 mapping:0000000000000000
index:0x3bbda pfn:0xf09041
[ 337.999065] memcg:ffff0000c642f000
[ 337.999066] anon flags:
0x17fffe0000020808(uptodate|owner_2|swapbacked|node=0|zone=2|lastcpupid=0x3ffff)
[ 337.999071] raw: 17fffe0000020808 dead000000000100 dead000000000122
ffff00047c6537b9
[ 337.999073] raw: 000000000003bbda 0000000000000000 00000000ffffffff
ffff0000c642f000
[ 337.999074] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page)
== 0)
[ 337.999082] ------------[ cut here ]------------
[ 337.999083] kernel BUG at include/linux/mm.h:1126!
[ 337.999384] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 338.002828] CPU: 31 UID: 0 PID: 87531 Comm: transhuge-stres Kdump:
loaded Tainted: G E 6.11.0-rc4+ #830
[ 338.003372] Tainted: [E]=UNSIGNED_MODULE
[ 338.003570] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
1.0.0 01/01/2017
[ 338.003939] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS
BTYPE=--)
[ 338.004282] pc : isolate_migratepages_block+0xb84/0x1000
[ 338.004553] lr : isolate_migratepages_block+0xb84/0x1000
[ 338.004817] sp : ffff8000a730b5d0
[......]
[ 338.008538] Call trace:
[ 338.008661] isolate_migratepages_block+0xb84/0x1000
[ 338.008910] isolate_migratepages+0x118/0x330
[ 338.009127] compact_zone+0x2c8/0x640
[ 338.009311] compact_zone_order+0xbc/0x110
[ 338.009516] try_to_compact_pages+0xf8/0x368
[ 338.009730] __alloc_pages_direct_compact+0x8c/0x260
[ 338.010002] __alloc_pages_slowpath.constprop.0+0x388/0x900
[ 338.010279] __alloc_pages_noprof+0x1f8/0x270
[ 338.010497] alloc_pages_mpol_noprof+0x8c/0x210
[ 338.010724] folio_alloc_mpol_noprof+0x18/0x68
[ 338.010945] vma_alloc_folio_noprof+0x7c/0xd0
[ 338.011162] do_huge_pmd_anonymous_page+0xe0/0x3b0
[ 338.011401] __handle_mm_fault+0x428/0x440
[ 338.011606] handle_mm_fault+0x68/0x210
> [PATCH] mm: filemap: use xa_get_order() to get the swap entry order: fix
>
> find_lock_entries(), used in the first pass of shmem_undo_range() and
> truncate_inode_pages_range() before partial folios are dealt with, has
> to be careful to avoid those partial folios: as its doc helpfully says,
> "Folios which are partially outside the range are not returned". Of
> course, the same must be true of any value entries returned, otherwise
> truncation and hole-punch risk erasing swapped areas - as has been seen.
>
> Rewrite find_lock_entries() to emphasize that, following the same pattern
> for folios and for value entries.
>
> Adjust find_get_entries() slightly, to get order while still holding
> rcu_read_lock(), and to round down the updated start: good changes, like
> find_lock_entries() now does, but it's unclear if either is ever important.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Thanks Hugh. The changes make sense to me.
> ---
> mm/filemap.c | 41 +++++++++++++++++++++++++----------------
> 1 file changed, 25 insertions(+), 16 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 885a8ed9d00d..88a2ed008474 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2047,10 +2047,9 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> if (!folio_batch_add(fbatch, folio))
> break;
> }
> - rcu_read_unlock();
>
> if (folio_batch_count(fbatch)) {
> - unsigned long nr = 1;
> + unsigned long nr;
> int idx = folio_batch_count(fbatch) - 1;
>
> folio = fbatch->folios[idx];
> @@ -2058,8 +2057,10 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
> nr = folio_nr_pages(folio);
> else
> nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> - *start = indices[idx] + nr;
> + *start = round_down(indices[idx] + nr, nr);
> }
> + rcu_read_unlock();
> +
> return folio_batch_count(fbatch);
> }
>
> @@ -2091,10 +2092,17 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>
> rcu_read_lock();
> while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
> + unsigned long base;
> + unsigned long nr;
> +
> if (!xa_is_value(folio)) {
> - if (folio->index < *start)
> + nr = folio_nr_pages(folio);
> + base = folio->index;
> + /* Omit large folio which begins before the start */
> + if (base < *start)
> goto put;
> - if (folio_next_index(folio) - 1 > end)
> + /* Omit large folio which extends beyond the end */
> + if (base + nr - 1 > end)
> goto put;
> if (!folio_trylock(folio))
> goto put;
> @@ -2103,7 +2111,19 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> goto unlock;
> VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
> folio);
> + } else {
> + nr = 1 << xa_get_order(&mapping->i_pages, xas.xa_index);
> + base = xas.xa_index & ~(nr - 1);
> + /* Omit order>0 value which begins before the start */
> + if (base < *start)
> + continue;
> + /* Omit order>0 value which extends beyond the end */
> + if (base + nr - 1 > end)
> + break;
> }
> +
> + /* Update start now so that last update is correct on return */
> + *start = base + nr;
> indices[fbatch->nr] = xas.xa_index;
> if (!folio_batch_add(fbatch, folio))
> break;
> @@ -2115,17 +2135,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
> }
> rcu_read_unlock();
>
> - if (folio_batch_count(fbatch)) {
> - unsigned long nr = 1;
> - int idx = folio_batch_count(fbatch) - 1;
> -
> - folio = fbatch->folios[idx];
> - if (!xa_is_value(folio))
> - nr = folio_nr_pages(folio);
> - else
> - nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
> - *start = indices[idx] + nr;
> - }
> return folio_batch_count(fbatch);
> }
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order
2024-08-29 12:40 ` Baolin Wang
@ 2024-08-30 10:18 ` Hugh Dickins
0 siblings, 0 replies; 23+ messages in thread
From: Hugh Dickins @ 2024-08-30 10:18 UTC (permalink / raw)
To: Baolin Wang
Cc: Hugh Dickins, Andrew Morton, willy, david, wangkefeng.wang,
chrisl, ying.huang, 21cnbao, ryan.roberts, shy828301, ziy,
ioworker0, da.gomez, p.raghav, linux-mm, linux-kernel
On Thu, 29 Aug 2024, Baolin Wang wrote:
> On 2024/8/29 16:07, Hugh Dickins wrote:
...
> >
> > Fix below. Successful testing on mm-everything-2024-08-24-07-21 (well,
> > that minus the commit which spewed warnings from bootup) confirmed it.
> > But testing on mm-everything-2024-08-28-21-38 very quickly failed:
> > unrelated to this series, presumably caused by patch or patches added
> > since 08-24, one kind of crash on one machine (some memcg thing called
> > from isolate_migratepages_block), another kind of crash on another (some
> > memcg thing called from __read_swap_cache_async), I'm exhausted by now
> > but will investigate later in the day (or hope someone else has).
>
> I saw the isolate_migratepages_block crash issue on
> mm-everything-2024-08-28-09-32, and after I reverted Kefeng's series "[PATCH 0/4]
> mm: convert to folio_isolate_movable()", the isolate_migratepages_block issue
> seemed to be resolved (at least I cannot reproduce it).
>
> I have already pointed out some potential issues in Kefeng's series [1], and
> Andrew has dropped that series from mm-everything-2024-08-28-21-38. However,
> you can still hit the isolate_migratepages_block issue on
> mm-everything-2024-08-28-21-38 while I cannot, which is weird.
It was not that issue: isolate_migratepages_block() turned out to be an
innocent bystander in my case: and I didn't see it crash there again,
but in a variety of other memcg places, many of them stat updates.
The error came from a different series, fix now posted:
https://lore.kernel.org/linux-mm/56d42242-37fe-b94f-d3cb-00673f1e5efb@google.com/T/#u
>
> > [PATCH] mm: filemap: use xa_get_order() to get the swap entry order: fix
> >
> > find_lock_entries(), used in the first pass of shmem_undo_range() and
> > truncate_inode_pages_range() before partial folios are dealt with, has
> > to be careful to avoid those partial folios: as its doc helpfully says,
> > "Folios which are partially outside the range are not returned". Of
> > course, the same must be true of any value entries returned, otherwise
> > truncation and hole-punch risk erasing swapped areas - as has been seen.
> >
> > Rewrite find_lock_entries() to emphasize that, following the same pattern
> > for folios and for value entries.
> >
> > Adjust find_get_entries() slightly, to get order while still holding
> > rcu_read_lock(), and to round down the updated start: good changes, like
> > find_lock_entries() now does, but it's unclear if either is ever important.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
>
> Thanks Hugh. The changes make sense to me.
Thanks!
Hugh
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2024-08-30 10:18 UTC | newest]
Thread overview: 23+ messages
2024-08-12 7:42 [PATCH v5 0/9] support large folio swap-out and swap-in for shmem Baolin Wang
2024-08-12 7:42 ` [PATCH v5 1/9] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
2024-08-12 7:42 ` [PATCH v5 2/9] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
2024-08-12 7:42 ` [PATCH v5 3/9] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
2024-08-12 7:42 ` [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
2024-08-25 21:55 ` Hugh Dickins
2024-08-25 23:28 ` Matthew Wilcox
2024-08-27 10:10 ` Baolin Wang
2024-08-29 8:07 ` Hugh Dickins
2024-08-29 12:40 ` Baolin Wang
2024-08-30 10:18 ` Hugh Dickins
2024-08-12 7:42 ` [PATCH v5 5/9] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
2024-08-12 7:42 ` [PATCH v5 6/9] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
2024-08-25 22:05 ` Hugh Dickins
2024-08-27 3:06 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 7/9] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
2024-08-12 7:42 ` [PATCH v5 8/9] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
2024-08-25 22:31 ` Hugh Dickins
2024-08-27 6:46 ` Baolin Wang
2024-08-12 7:42 ` [PATCH v5 9/9] mm: shmem: support large folio swap out Baolin Wang
2024-08-25 23:14 ` Hugh Dickins
2024-08-27 6:58 ` Baolin Wang
2024-08-28 8:28 ` [PATCH] mm: shmem: support large folio swap out fix 2 Baolin Wang