linux-fsdevel.vger.kernel.org archive mirror
* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-05 22:10 ` [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON() Kemeng Shi
@ 2025-06-05 19:57   ` Andrew Morton
  2025-06-06  1:11     ` Kemeng Shi
  2025-06-07  6:11   ` Baolin Wang
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2025-06-05 19:57 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel

On Fri,  6 Jun 2025 06:10:31 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> As noted in the comments, we need to release block usage for the swap entry
> which was replaced with a poisoned swap entry. However, no block usage is
> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
> Instead, calling shmem_recalc_inode(inode, 0, -nr_pages) correctly releases
> the block usage.
> 
> ...
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>  	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>  	 * in shmem_evict_inode().
>  	 */
> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
> +	shmem_recalc_inode(inode, 0, -nr_pages);
>  	swap_free_nr(swap, nr_pages);
>  }

Huh, three years ago.  What do we think might be the userspace-visible
runtime effects of this?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/7] Some random fixes and cleanups to shmem
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
@ 2025-06-05 20:02 ` Andrew Morton
  2025-06-05 22:10 ` [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON() Kemeng Shi
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2025-06-05 20:02 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel

On Fri,  6 Jun 2025 06:10:30 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> This series contains more fixes and cleanups made while learning the shmem
> code. Patches 1-3 are some random fixes; patches 4-7 are some random
> cleanups. More details can be found in the respective patches. Thanks!

Thanks.  I'll skip v1, wait to see what reviewers have to say.  Please
poke me in a week or so if you think I should proceed with the v1 series.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/7] Some random fixes and cleanups to shmem
@ 2025-06-05 22:10 Kemeng Shi
  2025-06-05 20:02 ` Andrew Morton
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

This series contains more fixes and cleanups made while learning the shmem
code. Patches 1-3 are some random fixes; patches 4-7 are some random
cleanups. More details can be found in the respective patches. Thanks!

Kemeng Shi (7):
  mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to
    avoid WARN_ON()
  mm: shmem: avoid setting error on splited entries in
    shmem_set_folio_swapin_error()
  mm: shmem: avoid missing entries in shmem_undo_range() when entries
    was splited concurrently
  mm: shmem: handle special case of shmem_recalc_inode() in it's caller
  mm: shmem: wrap additional shmem quota related code with
    CONFIG_TMPFS_QUOTA
  mm: shmem: simplify error flow in thpsize_shmem_enabled_store()
  mm: shmem: eliminate unneeded page counting in
    shmem_unuse_swap_entries()

 include/linux/shmem_fs.h |   4 +
 mm/filemap.c             |   2 +-
 mm/internal.h            |   2 +
 mm/shmem.c               | 169 ++++++++++++++++++++++++++-------------
 4 files changed, 121 insertions(+), 56 deletions(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
  2025-06-05 20:02 ` Andrew Morton
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-05 19:57   ` Andrew Morton
  2025-06-07  6:11   ` Baolin Wang
  2025-06-05 22:10 ` [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error() Kemeng Shi
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

As noted in the comments, we need to release block usage for the swap entry
which was replaced with a poisoned swap entry. However, no block usage is
actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
Instead, calling shmem_recalc_inode(inode, 0, -nr_pages) correctly releases
the block usage.

Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4b42419ce6b2..e27d19867e03 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
 	 * in shmem_evict_inode().
 	 */
-	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
+	shmem_recalc_inode(inode, 0, -nr_pages);
 	swap_free_nr(swap, nr_pages);
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
  2025-06-05 20:02 ` Andrew Morton
  2025-06-05 22:10 ` [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON() Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-07  6:20   ` Baolin Wang
  2025-06-05 22:10 ` [PATCH 3/7] mm: shmem: avoid missing entries in shmem_undo_range() when entries was splited concurrently Kemeng Shi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

When a large entry is split, the first entry split from the large entry
retains the same entry value and index as the original large entry, but its
order is reduced. In shmem_set_folio_swapin_error(), if the large entry is
split before xa_cmpxchg_irq(), we may replace the first split entry with an
error entry while using the size of the original large entry for release
operations. This could lead to a WARN_ON(i_blocks) due to an incorrect
nr_pages used by shmem_recalc_inode(), and could lead to a use-after-free
due to an incorrect nr_pages used by swap_free_nr().
Skip setting the error if entry splitting is detected to fix the issue. The
bad entry will be replaced with an error entry anyway, as we will still get
an IO error when we swap in the bad entry next time.

Fixes: 12885cbe88ddf ("mm: shmem: split large entry if the swapin folio is not large")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/shmem.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e27d19867e03..f1062910a4de 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2127,16 +2127,25 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	swp_entry_t swapin_error;
 	void *old;
-	int nr_pages;
+	int nr_pages = folio_nr_pages(folio);
+	int order;
 
 	swapin_error = make_poisoned_swp_entry();
-	old = xa_cmpxchg_irq(&mapping->i_pages, index,
-			     swp_to_radix_entry(swap),
-			     swp_to_radix_entry(swapin_error), 0);
-	if (old != swp_to_radix_entry(swap))
+	xa_lock_irq(&mapping->i_pages);
+	order = xa_get_order(&mapping->i_pages, index);
+	if (nr_pages != (1 << order)) {
+		xa_unlock_irq(&mapping->i_pages);
 		return;
+	}
+	old = __xa_cmpxchg(&mapping->i_pages, index,
+			   swp_to_radix_entry(swap),
+			   swp_to_radix_entry(swapin_error), 0);
+	if (old != swp_to_radix_entry(swap)) {
+		xa_unlock_irq(&mapping->i_pages);
+		return;
+	}
+	xa_unlock_irq(&mapping->i_pages);
 
-	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
 	if (!skip_swapcache)
 		delete_from_swap_cache(folio);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/7] mm: shmem: avoid missing entries in shmem_undo_range() when entries was splited concurrently
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
                   ` (2 preceding siblings ...)
  2025-06-05 22:10 ` [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error() Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-05 22:10 ` [PATCH 4/7] mm: shmem: handle special case of shmem_recalc_inode() in it's caller Kemeng Shi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

If a large swap entry or large folio entry returned from find_get_entries()
is split before it is truncated, only the first split entry will be
truncated, leaving the remaining split entries un-truncated.
To address this issue, introduce a new helper function
shmem_find_get_entries() which is similar to find_get_entries() but also
returns the order of the found entries. Then we can detect entry splitting
after the initial search by comparing the current entry order with the order
returned from shmem_find_get_entries(), and retry finding entries if a split
is detected, to fix the issue.
The large swap entry related race was introduced in 12885cbe88dd ("mm:
shmem: split large entry if the swapin folio is not large"). The large
folio related race seems to be a long-standing issue which may be related to
the conversion to xarray, the conversion to folios and other changes. As a
result, it's hard to track down the specific commit that directly introduced
this issue.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/filemap.c  |  2 +-
 mm/internal.h |  2 ++
 mm/shmem.c    | 81 ++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 7b90cbeb4a1a..672844b94d3a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2015,7 +2015,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 }
 EXPORT_SYMBOL(__filemap_get_folio);
 
-static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
+struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)
 {
 	struct folio *folio;
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..9573b3a9e8c0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -446,6 +446,8 @@ static inline void force_page_cache_readahead(struct address_space *mapping,
 	force_page_cache_ra(&ractl, nr_to_read);
 }
 
+struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
+		xa_mark_t mark);
 unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
diff --git a/mm/shmem.c b/mm/shmem.c
index f1062910a4de..2349673b239b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -949,18 +949,29 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
  * the number of pages being freed. 0 means entry not found in XArray (0 pages
  * being freed).
  */
-static long shmem_free_swap(struct address_space *mapping,
-			    pgoff_t index, void *radswap)
+static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
+			    int order, void *radswap)
 {
-	int order = xa_get_order(&mapping->i_pages, index);
+	int old_order;
 	void *old;
 
-	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
-	if (old != radswap)
+	xa_lock_irq(&mapping->i_pages);
+	old_order = xa_get_order(&mapping->i_pages, index);
+	/* free swap anyway if input order is -1 */
+	if (order != -1 && old_order != order) {
+		xa_unlock_irq(&mapping->i_pages);
+		return 0;
+	}
+
+	old = __xa_cmpxchg(&mapping->i_pages, index, radswap, NULL, 0);
+	if (old != radswap) {
+		xa_unlock_irq(&mapping->i_pages);
 		return 0;
-	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+	}
+	xa_unlock_irq(&mapping->i_pages);
 
-	return 1 << order;
+	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << old_order);
+	return 1 << old_order;
 }
 
 /*
@@ -1077,6 +1088,39 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return folio;
 }
 
+/*
+ * Similar to find_get_entries(), but will return order of found entries
+ */
+static unsigned shmem_find_get_entries(struct address_space *mapping,
+		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch,
+		pgoff_t *indices, int *orders)
+{
+	XA_STATE(xas, &mapping->i_pages, *start);
+	struct folio *folio;
+
+	rcu_read_lock();
+	while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
+		indices[fbatch->nr] = xas.xa_index;
+		if (!xa_is_value(folio))
+			orders[fbatch->nr] = folio_order(folio);
+		else
+			orders[fbatch->nr] = xas_get_order(&xas);
+		if (!folio_batch_add(fbatch, folio))
+			break;
+	}
+
+	if (folio_batch_count(fbatch)) {
+		unsigned long nr;
+		int idx = folio_batch_count(fbatch) - 1;
+
+		nr = 1 << orders[idx];
+		*start = round_down(indices[idx] + nr, nr);
+	}
+	rcu_read_unlock();
+
+	return folio_batch_count(fbatch);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -1090,6 +1134,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pgoff_t end = (lend + 1) >> PAGE_SHIFT;
 	struct folio_batch fbatch;
 	pgoff_t indices[PAGEVEC_SIZE];
+	int orders[PAGEVEC_SIZE];
 	struct folio *folio;
 	bool same_folio;
 	long nr_swaps_freed = 0;
@@ -1113,7 +1158,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 				if (unfalloc)
 					continue;
 				nr_swaps_freed += shmem_free_swap(mapping,
-							indices[i], folio);
+						indices[i], -1, folio);
 				continue;
 			}
 
@@ -1166,8 +1211,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	while (index < end) {
 		cond_resched();
 
-		if (!find_get_entries(mapping, &index, end - 1, &fbatch,
-				indices)) {
+		if (!shmem_find_get_entries(mapping, &index, end - 1, &fbatch,
+				indices, orders)) {
 			/* If all gone or hole-punch or unfalloc, we're done */
 			if (index == start || end != -1)
 				break;
@@ -1183,9 +1228,13 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 				if (unfalloc)
 					continue;
-				swaps_freed = shmem_free_swap(mapping, indices[i], folio);
+				swaps_freed = shmem_free_swap(mapping,
+					indices[i], orders[i], folio);
+				/*
+				 * Swap was replaced by page or was
+				 * splited: retry
+				 */
 				if (!swaps_freed) {
-					/* Swap was replaced by page: retry */
 					index = indices[i];
 					break;
 				}
@@ -1196,8 +1245,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio_lock(folio);
 
 			if (!unfalloc || !folio_test_uptodate(folio)) {
-				if (folio_mapping(folio) != mapping) {
-					/* Page was replaced by swap: retry */
+				/*
+				 * Page was replaced by swap or was
+				 * splited: retry
+				 */
+				if (folio_mapping(folio) != mapping ||
+				    folio_order(folio) != orders[i]) {
 					folio_unlock(folio);
 					index = indices[i];
 					break;
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/7] mm: shmem: handle special case of shmem_recalc_inode() in it's caller
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
                   ` (3 preceding siblings ...)
  2025-06-05 22:10 ` [PATCH 3/7] mm: shmem: avoid missing entries in shmem_undo_range() when entries was splited concurrently Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-05 22:10 ` [PATCH 5/7] mm: shmem: wrap additional shmem quota related code with CONFIG_TMPFS_QUOTA Kemeng Shi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

The special case in shmem_recalc_inode() is tailored for shmem_writepage().
By raising swapped before nrpages is lowered directly within
shmem_writepage(), we can simplify the code of shmem_recalc_inode() and
eliminate the need to execute the special-case code for all callers of
shmem_recalc_inode().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/shmem.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2349673b239b..9f5e1eccaacb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -443,15 +443,6 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
 	info->swapped += swapped;
 	freed = info->alloced - info->swapped -
 		READ_ONCE(inode->i_mapping->nrpages);
-	/*
-	 * Special case: whereas normally shmem_recalc_inode() is called
-	 * after i_mapping->nrpages has already been adjusted (up or down),
-	 * shmem_writepage() has to raise swapped before nrpages is lowered -
-	 * to stop a racing shmem_recalc_inode() from thinking that a page has
-	 * been freed.  Compensate here, to avoid the need for a followup call.
-	 */
-	if (swapped > 0)
-		freed += swapped;
 	if (freed > 0)
 		info->alloced -= freed;
 	spin_unlock(&info->lock);
@@ -1694,9 +1685,16 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		list_add(&info->swaplist, &shmem_swaplist);
 
 	if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
-		shmem_recalc_inode(inode, 0, nr_pages);
+		/*
+		 * Raise swapped before nrpages is lowered to stop racing
+		 * shmem_recalc_inode() from thinking that a page has been freed.
+		 */
+		spin_lock(&info->lock);
+		info->swapped += nr_pages;
+		spin_unlock(&info->lock);
 		swap_shmem_alloc(folio->swap, nr_pages);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
+		shmem_recalc_inode(inode, 0, 0);
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/7] mm: shmem: wrap additional shmem quota related code with CONFIG_TMPFS_QUOTA
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
                   ` (4 preceding siblings ...)
  2025-06-05 22:10 ` [PATCH 4/7] mm: shmem: handle special case of shmem_recalc_inode() in it's caller Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-05 22:10 ` [PATCH 6/7] mm: shmem: simplify error flow in thpsize_shmem_enabled_store() Kemeng Shi
  2025-06-05 22:10 ` [PATCH 7/7] mm: shmem: eliminate unneeded page counting in shmem_unuse_swap_entries() Kemeng Shi
  7 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

Some code routines and structure members are unreachable when
CONFIG_TMPFS_QUOTA is off. Wrap the additional shmem quota related code to
eliminate these unreachable sections.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 include/linux/shmem_fs.h |  4 ++++
 mm/shmem.c               | 10 +++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0b273a7b9f01..4873359a5442 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -47,12 +47,14 @@ struct shmem_inode_info {
 	(FS_IMMUTABLE_FL | FS_APPEND_FL | FS_NODUMP_FL | FS_NOATIME_FL | FS_CASEFOLD_FL)
 #define SHMEM_FL_INHERITED		(FS_NODUMP_FL | FS_NOATIME_FL | FS_CASEFOLD_FL)
 
+#ifdef CONFIG_TMPFS_QUOTA
 struct shmem_quota_limits {
 	qsize_t usrquota_bhardlimit; /* Default user quota block hard limit */
 	qsize_t usrquota_ihardlimit; /* Default user quota inode hard limit */
 	qsize_t grpquota_bhardlimit; /* Default group quota block hard limit */
 	qsize_t grpquota_ihardlimit; /* Default group quota inode hard limit */
 };
+#endif
 
 struct shmem_sb_info {
 	unsigned long max_blocks;   /* How many blocks are allowed */
@@ -72,7 +74,9 @@ struct shmem_sb_info {
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
 	unsigned long shrinklist_len; /* Length of shrinklist */
+#ifdef CONFIG_TMPFS_QUOTA
 	struct shmem_quota_limits qlimits; /* Default quota limits */
+#endif
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f5e1eccaacb..e3e05bbb6db2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -121,8 +121,10 @@ struct shmem_options {
 	int huge;
 	int seen;
 	bool noswap;
+#ifdef CONFIG_TMPFS_QUOTA
 	unsigned short quota_types;
 	struct shmem_quota_limits qlimits;
+#endif
 #if IS_ENABLED(CONFIG_UNICODE)
 	struct unicode_map *encoding;
 	bool strict_encoding;
@@ -132,7 +134,9 @@ struct shmem_options {
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
 #define SHMEM_SEEN_NOSWAP 16
+#ifdef CONFIG_TMPFS_QUOTA
 #define SHMEM_SEEN_QUOTA 32
+#endif
 };
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4549,6 +4553,7 @@ enum shmem_param {
 	Opt_inode32,
 	Opt_inode64,
 	Opt_noswap,
+#ifdef CONFIG_TMPFS_QUOTA
 	Opt_quota,
 	Opt_usrquota,
 	Opt_grpquota,
@@ -4556,6 +4561,7 @@ enum shmem_param {
 	Opt_usrquota_inode_hardlimit,
 	Opt_grpquota_block_hardlimit,
 	Opt_grpquota_inode_hardlimit,
+#endif
 	Opt_casefold_version,
 	Opt_casefold,
 	Opt_strict_encoding,
@@ -4742,6 +4748,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->noswap = true;
 		ctx->seen |= SHMEM_SEEN_NOSWAP;
 		break;
+#ifdef CONFIG_TMPFS_QUOTA
 	case Opt_quota:
 		if (fc->user_ns != &init_user_ns)
 			return invalfc(fc, "Quotas in unprivileged tmpfs mounts are unsupported");
@@ -4796,6 +4803,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 				       "Group quota inode hardlimit too large.");
 		ctx->qlimits.grpquota_ihardlimit = size;
 		break;
+#endif
 	case Opt_casefold_version:
 		return shmem_parse_opt_casefold(fc, param, false);
 	case Opt_casefold:
@@ -4899,13 +4907,13 @@ static int shmem_reconfigure(struct fs_context *fc)
 		goto out;
 	}
 
+#ifdef CONFIG_TMPFS_QUOTA
 	if (ctx->seen & SHMEM_SEEN_QUOTA &&
 	    !sb_any_quota_loaded(fc->root->d_sb)) {
 		err = "Cannot enable quota on remount";
 		goto out;
 	}
 
-#ifdef CONFIG_TMPFS_QUOTA
 #define CHANGED_LIMIT(name)						\
 	(ctx->qlimits.name## hardlimit &&				\
 	(ctx->qlimits.name## hardlimit != sbinfo->qlimits.name## hardlimit))
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/7] mm: shmem: simplify error flow in thpsize_shmem_enabled_store()
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
                   ` (5 preceding siblings ...)
  2025-06-05 22:10 ` [PATCH 5/7] mm: shmem: wrap additional shmem quota related code with CONFIG_TMPFS_QUOTA Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  2025-06-05 22:10 ` [PATCH 7/7] mm: shmem: eliminate unneeded page counting in shmem_unuse_swap_entries() Kemeng Shi
  7 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

Simplify error flow in thpsize_shmem_enabled_store().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/shmem.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e3e05bbb6db2..c6ea45d542d2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5601,7 +5601,7 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
 					   const char *buf, size_t count)
 {
 	int order = to_thpsize(kobj)->order;
-	ssize_t ret = count;
+	int err;
 
 	if (sysfs_streq(buf, "always")) {
 		spin_lock(&huge_shmem_orders_lock);
@@ -5644,16 +5644,14 @@ static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
 		clear_bit(order, &huge_shmem_orders_madvise);
 		spin_unlock(&huge_shmem_orders_lock);
 	} else {
-		ret = -EINVAL;
+		return -EINVAL;
 	}
 
-	if (ret > 0) {
-		int err = start_stop_khugepaged();
+	err = start_stop_khugepaged();
+	if (err)
+		return err;
 
-		if (err)
-			ret = err;
-	}
-	return ret;
+	return count;
 }
 
 struct kobj_attribute thpsize_shmem_enabled_attr =
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 7/7] mm: shmem: eliminate unneeded page counting in shmem_unuse_swap_entries()
  2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
                   ` (6 preceding siblings ...)
  2025-06-05 22:10 ` [PATCH 6/7] mm: shmem: simplify error flow in thpsize_shmem_enabled_store() Kemeng Shi
@ 2025-06-05 22:10 ` Kemeng Shi
  7 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-05 22:10 UTC (permalink / raw)
  To: hughd, baolin.wang, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel

The caller of shmem_unuse_swap_entries() does not use the count of pages
swapped in, so eliminate the unneeded page counting in
shmem_unuse_swap_entries().

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/shmem.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c6ea45d542d2..c83baabc169d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1480,14 +1480,13 @@ static unsigned int shmem_find_swap_entries(struct address_space *mapping,
 }
 
 /*
- * Move the swapped pages for an inode to page cache. Returns the count
- * of pages swapped in, or the error in case of failure.
+ * Move the swapped pages for an inode to page cache. Returns 0 if success,
+ * or returns error in case of failure.
  */
 static int shmem_unuse_swap_entries(struct inode *inode,
 		struct folio_batch *fbatch, pgoff_t *indices)
 {
 	int i = 0;
-	int ret = 0;
 	int error = 0;
 	struct address_space *mapping = inode->i_mapping;
 
@@ -1499,13 +1498,11 @@ static int shmem_unuse_swap_entries(struct inode *inode,
 		if (error == 0) {
 			folio_unlock(folio);
 			folio_put(folio);
-			ret++;
 		}
 		if (error == -ENOMEM)
-			break;
-		error = 0;
+			return error;
 	}
-	return error ? error : ret;
+	return 0;
 }
 
 /*
@@ -1517,24 +1514,20 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type)
 	pgoff_t start = 0;
 	struct folio_batch fbatch;
 	pgoff_t indices[PAGEVEC_SIZE];
-	int ret = 0;
+	int ret;
 
 	do {
 		folio_batch_init(&fbatch);
 		if (!shmem_find_swap_entries(mapping, start, &fbatch,
-					     indices, type)) {
-			ret = 0;
-			break;
-		}
+					     indices, type))
+			return 0;
 
 		ret = shmem_unuse_swap_entries(inode, &fbatch, indices);
 		if (ret < 0)
-			break;
+			return ret;
 
 		start = indices[folio_batch_count(&fbatch) - 1];
 	} while (true);
-
-	return ret;
 }
 
 /*
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-05 19:57   ` Andrew Morton
@ 2025-06-06  1:11     ` Kemeng Shi
  2025-06-06  1:28       ` Andrew Morton
  2025-06-06  1:31       ` Kemeng Shi
  0 siblings, 2 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-06  1:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel



on 6/6/2025 3:57 AM, Andrew Morton wrote:
> On Fri,  6 Jun 2025 06:10:31 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
>> As noted in the comments, we need to release block usage for swap entry
>> which was replaced with poisoned swap entry. However, no block usage is
>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>> the block usage.
>>
>> ...
>>
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>  	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>  	 * in shmem_evict_inode().
>>  	 */
>> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>> +	shmem_recalc_inode(inode, 0, -nr_pages);
>>  	swap_free_nr(swap, nr_pages);
>>  }
> 
> Huh, three years ago.  What do we think might be the userspace-visible
> runtime effects of this?
This could trigger the WARN_ON(i_blocks) in shmem_evict_inode(), as i_blocks
is supposed to be dropped by the quota free routine.
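
For reference, here is a minimal user-space model of the accounting done by
shmem_recalc_inode() (a sketch only, with made-up numbers; the real function
also releases the quota/i_blocks charge by the computed "freed" amount):

#include <stdio.h>

/* Model: the inode has 16 pages swapped out and none left in the page cache. */
static long alloced = 16, swapped = 16, nrpages = 0;

static long recalc(long d_alloced, long d_swapped)
{
	long freed;

	alloced += d_alloced;
	swapped += d_swapped;
	freed = alloced - swapped - nrpages;
	if (d_swapped > 0)		/* special case for shmem_writepage() */
		freed += d_swapped;
	if (freed > 0)
		alloced -= freed;	/* the kernel would unaccount blocks here */
	return freed > 0 ? freed : 0;
}

int main(void)
{
	/* Old call: both counters drop together, so nothing is released. */
	printf("(-nr, -nr) releases %ld blocks\n", recalc(-16, -16));

	alloced = 16;
	swapped = 16;			/* reset the model */
	/* Fixed call: the 16 blocks of usage are released, so i_blocks can reach 0. */
	printf("(0, -nr)  releases %ld blocks\n", recalc(0, -16));
	return 0;
}
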
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-06  1:11     ` Kemeng Shi
@ 2025-06-06  1:28       ` Andrew Morton
  2025-06-06  2:29         ` Kemeng Shi
  2025-06-06  1:31       ` Kemeng Shi
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2025-06-06  1:28 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel

On Fri, 6 Jun 2025 09:11:37 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >>  	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
> >>  	 * in shmem_evict_inode().
> >>  	 */
> >> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
> >> +	shmem_recalc_inode(inode, 0, -nr_pages);
> >>  	swap_free_nr(swap, nr_pages);
> >>  }
> > 
> > Huh, three years ago.  What do we think might be the userspace-visible
> > runtime effects of this?
> This could trigger WARN_ON(i_blocks) in shmem_evict_inode() as i_blocks
> is supposed to be dropped in the quota free routine.

I don't believe we've seen such a report in those three years so perhaps
no need to backport.  But it's a one-liner so let's backport ;) And
possibly [2/7] and [3/7] should receive the same treatment.

I don't think any of these need to be fast-tracked into mm-hotfixes so
please resend after a suitable review period and include the cc:stable
on those which -stable needs.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-06  1:11     ` Kemeng Shi
  2025-06-06  1:28       ` Andrew Morton
@ 2025-06-06  1:31       ` Kemeng Shi
  1 sibling, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-06  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel



on 6/6/2025 9:11 AM, Kemeng Shi wrote:
> 
> 
> on 6/6/2025 3:57 AM, Andrew Morton wrote:
>> On Fri,  6 Jun 2025 06:10:31 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>>> As noted in the comments, we need to release block usage for swap entry
>>> which was replaced with poisoned swap entry. However, no block usage is
>>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>>> the block usage.
>>>
>>> ...
>>>
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>  	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>>  	 * in shmem_evict_inode().
>>>  	 */
>>> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>>> +	shmem_recalc_inode(inode, 0, -nr_pages);
>>>  	swap_free_nr(swap, nr_pages);
>>>  }
>>
>> Huh, three years ago.  What do we think might be the userspace-visible
>> runtime effects of this?
> This could trigger WARN_ON(i_blocks) in shmem_evict_inode() as i_blocks
> is supposed to be dropped in the quota free routine.
Besides, the leak of block usage will reduce the space available to the user.
As the amount of leakage accumulates over time, the available space may
eventually be exhausted.
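
If it helps, the effect would show up to userspace as shrinking free space on
the tmpfs mount. A trivial way to watch it (a hypothetical check, nothing
shmem-specific; /dev/shm is just a commonly available tmpfs mount):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
	struct statvfs vfs;

	if (statvfs("/dev/shm", &vfs) != 0)
		return 1;
	/* f_bavail shrinks as leaked block usage accumulates on the mount. */
	printf("tmpfs free: %llu blocks of %lu bytes\n",
	       (unsigned long long)vfs.f_bavail, vfs.f_frsize);
	return 0;
}
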
>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-06  1:28       ` Andrew Morton
@ 2025-06-06  2:29         ` Kemeng Shi
  0 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-06  2:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hughd, baolin.wang, willy, linux-mm, linux-kernel, linux-fsdevel



on 6/6/2025 9:28 AM, Andrew Morton wrote:
> On Fri, 6 Jun 2025 09:11:37 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>  	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>>>  	 * in shmem_evict_inode().
>>>>  	 */
>>>> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>>>> +	shmem_recalc_inode(inode, 0, -nr_pages);
>>>>  	swap_free_nr(swap, nr_pages);
>>>>  }
>>>
>>> Huh, three years ago.  What do we think might be the userspace-visible
>>> runtime effects of this?
>> This could trigger WARN_ON(i_blocks) in shmem_evict_inode() as i_blocks
>> is supposed to be dropped in the quota free routine.
> 
> I don't believe we've seen such a report in those three years so perhaps
> no need to backport.  But it's a one-liner so let's backport ;) And
> possibly [2/7] and [3/7] should receive the same treatment.
> 
> I don't think any of these need to be fast-tracked into mm-hotfixes so
> please resend after a suitable review period and include the cc:stable
> on those which -stable needs.
> 
Sure, all issues are hard to trigger. I will resend this series later.
Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-05 22:10 ` [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON() Kemeng Shi
  2025-06-05 19:57   ` Andrew Morton
@ 2025-06-07  6:11   ` Baolin Wang
  2025-06-09  0:46     ` Kemeng Shi
  1 sibling, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2025-06-07  6:11 UTC (permalink / raw)
  To: Kemeng Shi, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



On 2025/6/6 06:10, Kemeng Shi wrote:
> As noted in the comments, we need to release block usage for the swap entry
> which was replaced with a poisoned swap entry. However, no block usage is
> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
> Instead, calling shmem_recalc_inode(inode, 0, -nr_pages) correctly releases
> the block usage.
> 
> Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>   mm/shmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4b42419ce6b2..e27d19867e03 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>   	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>   	 * in shmem_evict_inode().
>   	 */
> -	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
> +	shmem_recalc_inode(inode, 0, -nr_pages);
>   	swap_free_nr(swap, nr_pages);
>   }

Have you tested your patch? When I inject an error to test your patch, 
the following issue will be triggered:

[  127.173330] ------------[ cut here ]------------
[  127.173331] WARNING: CPU: 13 PID: 6860 at mm/shmem.c:1388 shmem_evict_inode+0xf0/0x348
[  127.173920] CPU: 13 UID: 0 PID: 6860 Comm: shmem_swapin_er Kdump: loaded Tainted: G            E       6.15.0-rc6+ #54 VOLUNTARY
[  127.173925] pstate: 63401005 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[  127.173927] pc : shmem_evict_inode+0xf0/0x348
[  127.173929] lr : shmem_evict_inode+0x68/0x348
[  127.173931] sp : ffff8000895639e0
[  127.173932] x29: ffff8000895639e0 x28: 0000000000000006 x27: ffff00013754bfc0
[  127.173935] x26: ffff800080d8f160 x25: 0000000000000006 x24: ffff0000c0aab440
[  127.173937] x23: ffff00013754b780 x22: ffff00013754b780 x21: ffff0000cbc9c6b0
[  127.173940] x20: ffff0000c0aab440 x19: ffff0000cbc9c700 x18: 0000000000000030
[  127.173942] x17: 0000ffffa1f4cfff x16: 0000000000000003 x15: 0000000000001000
[  127.173945] x14: 00000000ffffffff x13: 0000000000000004 x12: ffff800089563108
[  127.173947] x11: 0000000000000000 x10: 0000000000000002 x9 : ffff800080352080
[  127.173949] x8 : fffffffffffffffe x7 : ffff800089563700 x6 : 0000000000000001
[  127.173952] x5 : 0000000000000004 x4 : 0000000000000002 x3 : 0000000000000002
[  127.173954] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffffffffff80
[  127.173957] Call trace:
[  127.173958]  shmem_evict_inode+0xf0/0x348 (P)
[  127.173961]  evict+0x1c8/0x2c8
[  127.173964]  iput_final+0x84/0x1a0
[  127.173966]  iput.part.0+0xd0/0xf0
[  127.173968]  iput+0x20/0x38
[  127.173971]  dentry_unlink_inode+0xc0/0x158
[  127.173973]  __dentry_kill+0x80/0x248
[  127.173974]  dput+0xf0/0x240
[  127.173976]  __fput+0x120/0x2f0
[  127.173978]  ____fput+0x18/0x28
[  127.173980]  task_work_run+0x88/0x120
[  127.173983]  do_exit+0x198/0x3c0
[  127.173986]  do_group_exit+0x38/0xa0
[  127.173987]  get_signal+0x6ac/0x6b8
[  127.173990]  do_signal+0x100/0x208
[  127.173991]  do_notify_resume+0xc8/0x158
[  127.173994]  el0_da+0xbc/0xc0
[  127.173997]  el0t_64_sync_handler+0x70/0xc8
[  127.173999]  el0t_64_sync+0x154/0x158
[  127.174001] ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-05 22:10 ` [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error() Kemeng Shi
@ 2025-06-07  6:20   ` Baolin Wang
  2025-06-09  1:19     ` Kemeng Shi
  0 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2025-06-07  6:20 UTC (permalink / raw)
  To: Kemeng Shi, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



On 2025/6/6 06:10, Kemeng Shi wrote:
> When a large entry is split, the first entry split from the large entry
> retains the same entry value and index as the original large entry, but its
> order is reduced. In shmem_set_folio_swapin_error(), if the large entry is
> split before xa_cmpxchg_irq(), we may replace the first split entry with an
> error entry while using the size of the original large entry for release
> operations. This could lead to a WARN_ON(i_blocks) due to an incorrect
> nr_pages used by shmem_recalc_inode(), and could lead to a use-after-free
> due to an incorrect nr_pages used by swap_free_nr().

I wonder if you have actually triggered this issue? When a large swap 
entry is split, it means the folio is already at order 0, so why would 
the size of the original large entry be used for release operations? Or 
is there another race condition?

> Skip setting the error if entry splitting is detected to fix the issue. The
> bad entry will be replaced with an error entry anyway, as we will still get
> an IO error when we swap in the bad entry next time.
> 
> Fixes: 12885cbe88ddf ("mm: shmem: split large entry if the swapin folio is not large")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>   mm/shmem.c | 21 +++++++++++++++------
>   1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e27d19867e03..f1062910a4de 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2127,16 +2127,25 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>   	struct address_space *mapping = inode->i_mapping;
>   	swp_entry_t swapin_error;
>   	void *old;
> -	int nr_pages;
> +	int nr_pages = folio_nr_pages(folio);
> +	int order;
>   
>   	swapin_error = make_poisoned_swp_entry();
> -	old = xa_cmpxchg_irq(&mapping->i_pages, index,
> -			     swp_to_radix_entry(swap),
> -			     swp_to_radix_entry(swapin_error), 0);
> -	if (old != swp_to_radix_entry(swap))
> +	xa_lock_irq(&mapping->i_pages);
> +	order = xa_get_order(&mapping->i_pages, index);
> +	if (nr_pages != (1 << order)) {
> +		xa_unlock_irq(&mapping->i_pages);
>   		return;
> +	}
> +	old = __xa_cmpxchg(&mapping->i_pages, index,
> +			   swp_to_radix_entry(swap),
> +			   swp_to_radix_entry(swapin_error), 0);
> +	if (old != swp_to_radix_entry(swap)) {
> +		xa_unlock_irq(&mapping->i_pages);
> +		return;
> +	}
> +	xa_unlock_irq(&mapping->i_pages);
>   
> -	nr_pages = folio_nr_pages(folio);
>   	folio_wait_writeback(folio);
>   	if (!skip_swapcache)
>   		delete_from_swap_cache(folio);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-07  6:11   ` Baolin Wang
@ 2025-06-09  0:46     ` Kemeng Shi
  2025-06-10  1:02       ` Kemeng Shi
  0 siblings, 1 reply; 24+ messages in thread
From: Kemeng Shi @ 2025-06-09  0:46 UTC (permalink / raw)
  To: Baolin Wang, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



on 6/7/2025 2:11 PM, Baolin Wang wrote:
> 
> 
> On 2025/6/6 06:10, Kemeng Shi wrote:
>> As noted in the comments, we need to release block usage for swap entry
>> which was replaced with poisoned swap entry. However, no block usage is
>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>> the block usage.
>>
>> Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>   mm/shmem.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 4b42419ce6b2..e27d19867e03 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>        * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>        * in shmem_evict_inode().
>>        */
>> -    shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>> +    shmem_recalc_inode(inode, 0, -nr_pages);
>>       swap_free_nr(swap, nr_pages);
>>   }
> 
> Have you tested your patch? When I inject an error to test your patch, the following issue will be triggered:
As all issues are hard to trigger, I only ran some simple tests to ensure the
normal process is fine. Could you share how to inject the error to trigger the
following issue? I will have a deep look. Thanks
> 
> [  127.173330] ------------[ cut here ]------------
> [  127.173331] WARNING: CPU: 13 PID: 6860 at mm/shmem.c:1388 shmem_evict_inode+0xf0/0x348
> [  127.173920] CPU: 13 UID: 0 PID: 6860 Comm: shmem_swapin_er Kdump: loaded Tainted: G            E       6.15.0-rc6+ #54 VOLUNTARY
> [  127.173925] pstate: 63401005 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
> [  127.173927] pc : shmem_evict_inode+0xf0/0x348
> [  127.173929] lr : shmem_evict_inode+0x68/0x348
> [  127.173931] sp : ffff8000895639e0
> [  127.173932] x29: ffff8000895639e0 x28: 0000000000000006 x27: ffff00013754bfc0
> [  127.173935] x26: ffff800080d8f160 x25: 0000000000000006 x24: ffff0000c0aab440
> [  127.173937] x23: ffff00013754b780 x22: ffff00013754b780 x21: ffff0000cbc9c6b0
> [  127.173940] x20: ffff0000c0aab440 x19: ffff0000cbc9c700 x18: 0000000000000030
> [  127.173942] x17: 0000ffffa1f4cfff x16: 0000000000000003 x15: 0000000000001000
> [  127.173945] x14: 00000000ffffffff x13: 0000000000000004 x12: ffff800089563108
> [  127.173947] x11: 0000000000000000 x10: 0000000000000002 x9 : ffff800080352080
> [  127.173949] x8 : fffffffffffffffe x7 : ffff800089563700 x6 : 0000000000000001
> [  127.173952] x5 : 0000000000000004 x4 : 0000000000000002 x3 : 0000000000000002
> [  127.173954] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffffffffff80
> [  127.173957] Call trace:
> [  127.173958]  shmem_evict_inode+0xf0/0x348 (P)
> [  127.173961]  evict+0x1c8/0x2c8
> [  127.173964]  iput_final+0x84/0x1a0
> [  127.173966]  iput.part.0+0xd0/0xf0
> [  127.173968]  iput+0x20/0x38
> [  127.173971]  dentry_unlink_inode+0xc0/0x158
> [  127.173973]  __dentry_kill+0x80/0x248
> [  127.173974]  dput+0xf0/0x240
> [  127.173976]  __fput+0x120/0x2f0
> [  127.173978]  ____fput+0x18/0x28
> [  127.173980]  task_work_run+0x88/0x120
> [  127.173983]  do_exit+0x198/0x3c0
> [  127.173986]  do_group_exit+0x38/0xa0
> [  127.173987]  get_signal+0x6ac/0x6b8
> [  127.173990]  do_signal+0x100/0x208
> [  127.173991]  do_notify_resume+0xc8/0x158
> [  127.173994]  el0_da+0xbc/0xc0
> [  127.173997]  el0t_64_sync_handler+0x70/0xc8
> [  127.173999]  el0t_64_sync+0x154/0x158
> [  127.174001] ---[ end trace 0000000000000000 ]---
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-07  6:20   ` Baolin Wang
@ 2025-06-09  1:19     ` Kemeng Shi
  2025-06-11  7:41       ` Baolin Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Kemeng Shi @ 2025-06-09  1:19 UTC (permalink / raw)
  To: Baolin Wang, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



on 6/7/2025 2:20 PM, Baolin Wang wrote:
> 
> 
> On 2025/6/6 06:10, Kemeng Shi wrote:
>> When large entry is splited, the first entry splited from large entry
>> retains the same entry value and index as original large entry but it's
>> order is reduced. In shmem_set_folio_swapin_error(), if large entry is
>> splited before xa_cmpxchg_irq(), we may replace the first splited entry
>> with error entry while using the size of original large entry for release
>> operations. This could lead to a WARN_ON(i_blocks) due to incorrect
>> nr_pages used by shmem_recalc_inode() and could lead to used after free
>> due to incorrect nr_pages used by swap_free_nr().
> 
> I wonder if you have actually triggered this issue? When a large swap entry is split, it means the folio is already at order 0, so why would the size of the original large entry be used for release operations? Or is there another race condition?
All of the issues were found while reviewing the shmem code, as I mentioned in
the cover letter.
The folio could be allocated from shmem_swap_alloc_folio(), and the folio
order will stay unchanged when the swap entry is split.

> 
>> Skip setting error if entry spliiting is detected to fix the issue. The
>> bad entry will be replaced with error entry anyway as we will still get
>> IO error when we swap in the bad entry at next time.
>>
>> Fixes: 12885cbe88ddf ("mm: shmem: split large entry if the swapin folio is not large")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>   mm/shmem.c | 21 +++++++++++++++------
>>   1 file changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index e27d19867e03..f1062910a4de 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2127,16 +2127,25 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>       struct address_space *mapping = inode->i_mapping;
>>       swp_entry_t swapin_error;
>>       void *old;
>> -    int nr_pages;
>> +    int nr_pages = folio_nr_pages(folio);
>> +    int order;
>>         swapin_error = make_poisoned_swp_entry();
>> -    old = xa_cmpxchg_irq(&mapping->i_pages, index,
>> -                 swp_to_radix_entry(swap),
>> -                 swp_to_radix_entry(swapin_error), 0);
>> -    if (old != swp_to_radix_entry(swap))
>> +    xa_lock_irq(&mapping->i_pages);
>> +    order = xa_get_order(&mapping->i_pages, index);
>> +    if (nr_pages != (1 << order)) {
>> +        xa_unlock_irq(&mapping->i_pages);
>>           return;
>> +    }
>> +    old = __xa_cmpxchg(&mapping->i_pages, index,
>> +               swp_to_radix_entry(swap),
>> +               swp_to_radix_entry(swapin_error), 0);
>> +    if (old != swp_to_radix_entry(swap)) {
>> +        xa_unlock_irq(&mapping->i_pages);
>> +        return;
>> +    }
>> +    xa_unlock_irq(&mapping->i_pages);
>>   -    nr_pages = folio_nr_pages(folio);
>>       folio_wait_writeback(folio);
>>       if (!skip_swapcache)
>>           delete_from_swap_cache(folio);
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-09  0:46     ` Kemeng Shi
@ 2025-06-10  1:02       ` Kemeng Shi
  2025-06-11  7:29         ` Baolin Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Kemeng Shi @ 2025-06-10  1:02 UTC (permalink / raw)
  To: Baolin Wang, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



on 6/9/2025 8:46 AM, Kemeng Shi wrote:
> 
> 
> on 6/7/2025 2:11 PM, Baolin Wang wrote:
>>
>>
>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>> As noted in the comments, we need to release block usage for swap entry
>>> which was replaced with poisoned swap entry. However, no block usage is
>>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>>> the block usage.
>>>
>>> Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>> ---
>>>   mm/shmem.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 4b42419ce6b2..e27d19867e03 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>        * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>>        * in shmem_evict_inode().
>>>        */
>>> -    shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>>> +    shmem_recalc_inode(inode, 0, -nr_pages);
>>>       swap_free_nr(swap, nr_pages);
>>>   }
>>
>> Have you tested your patch? When I inject an error to test your patch, the following issue will be triggered:As all issues are hard to trigger, I only run some simple test to ensure normal
> process is fine. Could you share how to inject the error to trigger following
> issue. I will have a deep look. Thanks
Sorry that the message was truncated. I meant that I only tested that the
normal process is fine.
Besides, I think there is another long-standing issue which could trigger the
warning below. Here is the issue that is likely to blame:
when a swap entry is replaced with an error entry in shmem_set_folio_swapin_error(),
we reduce info->swapped. Afterwards, the error entry can be deleted in
shmem_undo_range() and info->swapped is reduced again. As a result, we
reduce info->swapped twice for a single swap entry.
A simple way to confirm this is to inject the error into the original code. Could
you share how you trigger the issue, or could you run the same test against the
original code? Thanks.

>>
>> [  127.173330] ------------[ cut here ]------------
>> [  127.173331] WARNING: CPU: 13 PID: 6860 at mm/shmem.c:1388 shmem_evict_inode+0xf0/0x348
>> [  127.173920] CPU: 13 UID: 0 PID: 6860 Comm: shmem_swapin_er Kdump: loaded Tainted: G            E       6.15.0-rc6+ #54 VOLUNTARY
>> [  127.173925] pstate: 63401005 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>> [  127.173927] pc : shmem_evict_inode+0xf0/0x348
>> [  127.173929] lr : shmem_evict_inode+0x68/0x348
>> [  127.173931] sp : ffff8000895639e0
>> [  127.173932] x29: ffff8000895639e0 x28: 0000000000000006 x27: ffff00013754bfc0
>> [  127.173935] x26: ffff800080d8f160 x25: 0000000000000006 x24: ffff0000c0aab440
>> [  127.173937] x23: ffff00013754b780 x22: ffff00013754b780 x21: ffff0000cbc9c6b0
>> [  127.173940] x20: ffff0000c0aab440 x19: ffff0000cbc9c700 x18: 0000000000000030
>> [  127.173942] x17: 0000ffffa1f4cfff x16: 0000000000000003 x15: 0000000000001000
>> [  127.173945] x14: 00000000ffffffff x13: 0000000000000004 x12: ffff800089563108
>> [  127.173947] x11: 0000000000000000 x10: 0000000000000002 x9 : ffff800080352080
>> [  127.173949] x8 : fffffffffffffffe x7 : ffff800089563700 x6 : 0000000000000001
>> [  127.173952] x5 : 0000000000000004 x4 : 0000000000000002 x3 : 0000000000000002
>> [  127.173954] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffffffffff80
>> [  127.173957] Call trace:
>> [  127.173958]  shmem_evict_inode+0xf0/0x348 (P)
>> [  127.173961]  evict+0x1c8/0x2c8
>> [  127.173964]  iput_final+0x84/0x1a0
>> [  127.173966]  iput.part.0+0xd0/0xf0
>> [  127.173968]  iput+0x20/0x38
>> [  127.173971]  dentry_unlink_inode+0xc0/0x158
>> [  127.173973]  __dentry_kill+0x80/0x248
>> [  127.173974]  dput+0xf0/0x240
>> [  127.173976]  __fput+0x120/0x2f0
>> [  127.173978]  ____fput+0x18/0x28
>> [  127.173980]  task_work_run+0x88/0x120
>> [  127.173983]  do_exit+0x198/0x3c0
>> [  127.173986]  do_group_exit+0x38/0xa0
>> [  127.173987]  get_signal+0x6ac/0x6b8
>> [  127.173990]  do_signal+0x100/0x208
>> [  127.173991]  do_notify_resume+0xc8/0x158
>> [  127.173994]  el0_da+0xbc/0xc0
>> [  127.173997]  el0t_64_sync_handler+0x70/0xc8
>> [  127.173999]  el0t_64_sync+0x154/0x158
>> [  127.174001] ---[ end trace 0000000000000000 ]---
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-10  1:02       ` Kemeng Shi
@ 2025-06-11  7:29         ` Baolin Wang
  2025-06-11  8:38           ` Kemeng Shi
  0 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2025-06-11  7:29 UTC (permalink / raw)
  To: Kemeng Shi, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



On 2025/6/10 09:02, Kemeng Shi wrote:
> 
> 
> on 6/9/2025 8:46 AM, Kemeng Shi wrote:
>>
>>
>> on 6/7/2025 2:11 PM, Baolin Wang wrote:
>>>
>>>
>>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>>> As noted in the comments, we need to release block usage for swap entry
>>>> which was replaced with poisoned swap entry. However, no block usage is
>>>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>>>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>>>> the block usage.
>>>>
>>>> Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>> ---
>>>>    mm/shmem.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 4b42419ce6b2..e27d19867e03 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>         * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>>>         * in shmem_evict_inode().
>>>>         */
>>>> -    shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>>>> +    shmem_recalc_inode(inode, 0, -nr_pages);
>>>>        swap_free_nr(swap, nr_pages);
>>>>    }
>>>
>>> Have you tested your patch? When I inject an error to test your patch, the following issue will be triggered:As all issues are hard to trigger, I only run some simple test to ensure normal
>> process is fine. Could you share how to inject the error to trigger following
>> issue. I will have a deep look. Thanks
> Sorry that the message is truncated. I mean I only test normal process is fine.

Please also test the swapin error case you try to fix. Obviously your 
current patch is incorrect.

> Besides, I think there is another long-standing issue which could trigger the
> following issue. Here is the issue which is possible to blame:
> When swap entry is replaced with error entry in shmem_set_folio_swapin_error(),
> we will reduce info->swapped. Afterwards, error entry could be deleted in
> shmem_undo_range() and the info->swapped is reduced again. As a result, we
> reduce info->swapped twice for a single swap entry.

OK. So you should do something like in shmem_find_swap_entries() to 
avoid decreasing info->swapped again.

entry = radix_to_swp_entry(folio);
/*
 * swapin error entries can be found in the mapping. But they're
 * deliberately ignored here as we've done everything we can do.
 */
if (swp_type(entry) != type)
	continue;
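
For illustration, a minimal sketch of the kind of "skip already-unaccounted
entries" check being suggested here. This is hypothetical code, not the
actual mm/shmem.c truncate path; the helper name and the reliance on
is_poisoned_swp_entry() from <linux/swapops.h> are assumptions:

/*
 * Hypothetical helper: report whether a removed xarray value is a
 * swapin-error (poisoned) marker whose swap accounting was already
 * released in shmem_set_folio_swapin_error(), so that the truncate
 * path does not decrease info->swapped for it a second time.
 */
static bool shmem_skip_swapin_error_entry(void *xa_value)
{
	swp_entry_t entry = radix_to_swp_entry(xa_value);

	return is_poisoned_swp_entry(entry);
}

Whether such a check belongs in shmem_undo_range(), or whether
shmem_set_folio_swapin_error() should stop adjusting info->swapped at all,
is the open question in this sub-thread.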

> A simple way to confirm this is injecting error to original code. Could you
> share how to trigger the issue or could you do the same test to original code?

Yes, original code is good.

A simple test procedure is to allocate some shmem memory and swap it 
out, then swap the shmem back in while injecting an error to trigger the 
swap-in error case, and finally unmap the memory.
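
For reference, a minimal userspace sketch of those steps. The buffer size,
the use of memfd_create(), and MADV_PAGEOUT as the swap-out trigger are
illustrative assumptions; the kernel-side swap-in error injection is assumed
to be set up separately (e.g. via the kernel's fault-injection facilities)
and is not shown:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;		/* 64MB of shmem */
	int fd = memfd_create("shmem_swapin_err", 0);
	unsigned long sum = 0;
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("memfd_create/ftruncate");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0x5a, len);			/* allocate the shmem pages */

	if (madvise(p, len, MADV_PAGEOUT))	/* push them out to swap */
		perror("madvise(MADV_PAGEOUT)");

	for (size_t i = 0; i < len; i += 4096)	/* fault them back in */
		sum += p[i];

	printf("sum=%lu\n", sum);	/* keep the reads from being optimized away */

	munmap(p, len);			/* finally unmap and exit */
	close(fd);
	return 0;
}

With the injection active, the read-back loop is where a swap-in failure
would surface (typically as SIGBUS for the faulting access), and the
WARN_ON(i_blocks) under discussion would then show up when the inode is
evicted after the program exits, matching the iput_final()/do_exit() trace
quoted earlier in the thread.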

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-09  1:19     ` Kemeng Shi
@ 2025-06-11  7:41       ` Baolin Wang
  2025-06-11  9:11         ` Kemeng Shi
  0 siblings, 1 reply; 24+ messages in thread
From: Baolin Wang @ 2025-06-11  7:41 UTC (permalink / raw)
  To: Kemeng Shi, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



On 2025/6/9 09:19, Kemeng Shi wrote:
> 
> 
> on 6/7/2025 2:20 PM, Baolin Wang wrote:
>>
>>
>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>> When large entry is splited, the first entry splited from large entry
>>> retains the same entry value and index as original large entry but it's
>>> order is reduced. In shmem_set_folio_swapin_error(), if large entry is
>>> splited before xa_cmpxchg_irq(), we may replace the first splited entry
>>> with error entry while using the size of original large entry for release
>>> operations. This could lead to a WARN_ON(i_blocks) due to incorrect
>>> nr_pages used by shmem_recalc_inode() and could lead to used after free
>>> due to incorrect nr_pages used by swap_free_nr().
>>
>> I wonder if you have actually triggered this issue? When a large swap entry is split, it means the folio is already at order 0, so why would the size of the original large entry be used for release operations? Or is there another race condition?
> All issues are found during review the code of shmem as I menthioned in
> cover letter.
> The folio could be allocated from shmem_swap_alloc_folio() and the folio
> order will keep unchange when swap entry is split.

Sorry, I did not get your point. If a large swap entry is split, we must 
ensure that the corresponding folio is order 0.

However, I missed one potential case which was recently fixed by Kairui[1].

[1] https://lore.kernel.org/all/20250610181645.45922-1-ryncsn@gmail.com/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON()
  2025-06-11  7:29         ` Baolin Wang
@ 2025-06-11  8:38           ` Kemeng Shi
  0 siblings, 0 replies; 24+ messages in thread
From: Kemeng Shi @ 2025-06-11  8:38 UTC (permalink / raw)
  To: Baolin Wang, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



on 6/11/2025 3:29 PM, Baolin Wang wrote:
> 
> 
> On 2025/6/10 09:02, Kemeng Shi wrote:
>>
>>
>> on 6/9/2025 8:46 AM, Kemeng Shi wrote:
>>>
>>>
>>> on 6/7/2025 2:11 PM, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>>>> As noted in the comments, we need to release block usage for swap entry
>>>>> which was replaced with poisoned swap entry. However, no block usage is
>>>>> actually freed by calling shmem_recalc_inode(inode, -nr_pages, -nr_pages).
>>>>> Instead, call shmem_recalc_inode(inode, 0, -nr_pages) can correctly release
>>>>> the block usage.
>>>>>
>>>>> Fixes: 6cec2b95dadf7 ("mm/shmem: fix infinite loop when swap in shmem error at swapoff time")
>>>>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>>>> ---
>>>>>    mm/shmem.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index 4b42419ce6b2..e27d19867e03 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -2145,7 +2145,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>>         * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
>>>>>         * in shmem_evict_inode().
>>>>>         */
>>>>> -    shmem_recalc_inode(inode, -nr_pages, -nr_pages);
>>>>> +    shmem_recalc_inode(inode, 0, -nr_pages);
>>>>>        swap_free_nr(swap, nr_pages);
>>>>>    }
>>>>
>>>> Have you tested your patch? When I inject an error to test your patch, the following issue will be triggered:As all issues are hard to trigger, I only run some simple test to ensure normal
>>> process is fine. Could you share how to inject the error to trigger following
>>> issue. I will have a deep look. Thanks
>> Sorry that the message is truncated. I mean I only test normal process is fine.
> 
> Please also test the swapin error case you try to fix. Obviously your current patch is incorrect.
> 
>> Besides, I think there is another long-standing issue which could trigger the
>> following issue. Here is the issue which is possible to blame:
>> When swap entry is replaced with error entry in shmem_set_folio_swapin_error(),
>> we will reduce info->swapped. Afterwards, error entry could be deleted in
>> shmem_undo_range() and the info->swapped is reduced again. As a result, we
>> reduce info->swapped twice for a single swap entry.
> 
> OK. So you should do something like in shmem_find_swap_entries() to avoid decreasing info->swapped again.
> 
> entry = radix_to_swp_entry(folio);
> /*
> * swapin error entries can be found in the mapping. But they're
> * deliberately ignored here as we've done everything we can do.
> */
> if (swp_type(entry) != type)
>     continue;
> 
>> A simple way to confirm this is injecting error to original code. Could you
>> share how to trigger the issue or could you do the same test to original code?
> 
> Yes, original code is good.
I still suspect that it's another long-standing issue which is triggered
by this patch by accident.
> 
> A simple test procedure is to allocate some shmem memory and swap it out, then swap the shmem back in while injecting an error to trigger the swap-in error case, and finally unmap the memory.
> 
Sure, I will fix the mentioned long-standing issue first and try to run this
test.
I would appreciate it if you could share your test code, if it's convenient.

Thanks


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-11  7:41       ` Baolin Wang
@ 2025-06-11  9:11         ` Kemeng Shi
  2025-06-13  6:56           ` Baolin Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Kemeng Shi @ 2025-06-11  9:11 UTC (permalink / raw)
  To: Baolin Wang, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



on 6/11/2025 3:41 PM, Baolin Wang wrote:
> 
> 
> On 2025/6/9 09:19, Kemeng Shi wrote:
>>
>>
>> on 6/7/2025 2:20 PM, Baolin Wang wrote:
>>>
>>>
>>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>>> When large entry is splited, the first entry splited from large entry
>>>> retains the same entry value and index as original large entry but it's
>>>> order is reduced. In shmem_set_folio_swapin_error(), if large entry is
>>>> splited before xa_cmpxchg_irq(), we may replace the first splited entry
>>>> with error entry while using the size of original large entry for release
>>>> operations. This could lead to a WARN_ON(i_blocks) due to incorrect
>>>> nr_pages used by shmem_recalc_inode() and could lead to used after free
>>>> due to incorrect nr_pages used by swap_free_nr().
>>>
>>> I wonder if you have actually triggered this issue? When a large swap entry is split, it means the folio is already at order 0, so why would the size of the original large entry be used for release operations? Or is there another race condition?
>> All issues are found during review the code of shmem as I menthioned in
>> cover letter.
>> The folio could be allocated from shmem_swap_alloc_folio() and the folio
>> order will keep unchange when swap entry is split.
> 
> Sorry, I did not get your point. If a large swap entry is split, we must ensure that the corresponding folio is order 0.
> 
> However, I missed one potential case which was recently fixed by Kairui[1].
> 
> [1] https://lore.kernel.org/all/20250610181645.45922-1-ryncsn@gmail.com/
> 
Here is a possible code routine which I think could trigger the issue:
shmem_swapin_folio          shmem_swapin_folio
folio = swap_cache_get_folio()
order = xa_get_order(&mapping->i_pages, index);
if (!folio)
 ...
 /* suppose large folio allocation is failed, we will try to split large entry */
 folio = shmem_swap_alloc_folio(..., order, ...)

                            folio = swap_cache_get_folio()
                            order = xa_get_order(&mapping->i_pages, index);
                            if (!folio)
                             ...
                             /* suppose large folio allocation is successful this time */
                             folio = shmem_swap_alloc_folio(..., order, ...)
                            ...
                            /* suppose IO of large folio is failed, will set swapin error later */
                            if (!folio_test_uptodate(folio)) {
                             error = -EIO;
                              goto failed;
                            }

 ...
 shmem_split_large_entry()

                            ...
                            shmem_set_folio_swapin_error(..., folio, ...)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error()
  2025-06-11  9:11         ` Kemeng Shi
@ 2025-06-13  6:56           ` Baolin Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Baolin Wang @ 2025-06-13  6:56 UTC (permalink / raw)
  To: Kemeng Shi, hughd, willy, akpm; +Cc: linux-mm, linux-kernel, linux-fsdevel



On 2025/6/11 17:11, Kemeng Shi wrote:
> 
> 
> on 6/11/2025 3:41 PM, Baolin Wang wrote:
>>
>>
>> On 2025/6/9 09:19, Kemeng Shi wrote:
>>>
>>>
>>> on 6/7/2025 2:20 PM, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/6/6 06:10, Kemeng Shi wrote:
>>>>> When large entry is splited, the first entry splited from large entry
>>>>> retains the same entry value and index as original large entry but it's
>>>>> order is reduced. In shmem_set_folio_swapin_error(), if large entry is
>>>>> splited before xa_cmpxchg_irq(), we may replace the first splited entry
>>>>> with error entry while using the size of original large entry for release
>>>>> operations. This could lead to a WARN_ON(i_blocks) due to incorrect
>>>>> nr_pages used by shmem_recalc_inode() and could lead to used after free
>>>>> due to incorrect nr_pages used by swap_free_nr().
>>>>
>>>> I wonder if you have actually triggered this issue? When a large swap entry is split, it means the folio is already at order 0, so why would the size of the original large entry be used for release operations? Or is there another race condition?
>>> All issues are found during review the code of shmem as I menthioned in
>>> cover letter.
>>> The folio could be allocated from shmem_swap_alloc_folio() and the folio
>>> order will keep unchange when swap entry is split.
>>
>> Sorry, I did not get your point. If a large swap entry is split, we must ensure that the corresponding folio is order 0.
>>
>> However, I missed one potential case which was recently fixed by Kairui[1].
>>
>> [1] https://lore.kernel.org/all/20250610181645.45922-1-ryncsn@gmail.com/
>>
> Here is a possible code routine which I think could trigger the issue:
> shmem_swapin_folio          shmem_swapin_folio
> folio = swap_cache_get_folio()
> order = xa_get_order(&mapping->i_pages, index);
> if (!folio)
>   ...
>   /* suppose large folio allocation is failed, we will try to split large entry */
>   folio = shmem_swap_alloc_folio(..., order, ...)
> 
>                              folio = swap_cache_get_folio()
>                              order = xa_get_order(&mapping->i_pages, index);
>                              if (!folio)
>                               ...
>                               /* suppose large folio allocation is successful this time */
>                               folio = shmem_swap_alloc_folio(..., order, ...)
>                              ...
>                              /* suppose IO of large folio is failed, will set swapin error later */
>                              if (!folio_test_uptodate(folio)) {
>                               error = -EIO;
>                               goto failed;
>                              }
> 
>   ...
>   shmem_split_large_entry()
> 
>                              ...
>                              shmem_set_folio_swapin_error(..., folio, ...)

OK. I think this is a good example of a potential race condition. 
Please include this information in the commit message to make it easier 
for others to understand. Thanks.

I will find some time to test your patch.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-06-13  6:56 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-05 22:10 [PATCH 0/7] Some random fixes and cleanups to shmem Kemeng Shi
2025-06-05 20:02 ` Andrew Morton
2025-06-05 22:10 ` [PATCH 1/7] mm: shmem: correctly pass alloced parameter to shmem_recalc_inode() to avoid WARN_ON() Kemeng Shi
2025-06-05 19:57   ` Andrew Morton
2025-06-06  1:11     ` Kemeng Shi
2025-06-06  1:28       ` Andrew Morton
2025-06-06  2:29         ` Kemeng Shi
2025-06-06  1:31       ` Kemeng Shi
2025-06-07  6:11   ` Baolin Wang
2025-06-09  0:46     ` Kemeng Shi
2025-06-10  1:02       ` Kemeng Shi
2025-06-11  7:29         ` Baolin Wang
2025-06-11  8:38           ` Kemeng Shi
2025-06-05 22:10 ` [PATCH 2/7] mm: shmem: avoid setting error on splited entries in shmem_set_folio_swapin_error() Kemeng Shi
2025-06-07  6:20   ` Baolin Wang
2025-06-09  1:19     ` Kemeng Shi
2025-06-11  7:41       ` Baolin Wang
2025-06-11  9:11         ` Kemeng Shi
2025-06-13  6:56           ` Baolin Wang
2025-06-05 22:10 ` [PATCH 3/7] mm: shmem: avoid missing entries in shmem_undo_range() when entries was splited concurrently Kemeng Shi
2025-06-05 22:10 ` [PATCH 4/7] mm: shmem: handle special case of shmem_recalc_inode() in it's caller Kemeng Shi
2025-06-05 22:10 ` [PATCH 5/7] mm: shmem: wrap additional shmem quota related code with CONFIG_TMPFS_QUOTA Kemeng Shi
2025-06-05 22:10 ` [PATCH 6/7] mm: shmem: simplify error flow in thpsize_shmem_enabled_store() Kemeng Shi
2025-06-05 22:10 ` [PATCH 7/7] mm: shmem: eliminate unneeded page counting in shmem_unuse_swap_entries() Kemeng Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).