* Re: [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
@ 2025-05-22 3:55 ` Kairui Song
2025-05-30 1:31 ` Baoquan He
1 sibling, 0 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-22 3:55 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> When folio_alloc_swap() encounters a failure in either
> mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
> is not decremented for allocated entry. However, the following
> put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
> an imbalance.
>
> Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
> to pair the nr_swap_pages counting.
>
> Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 026090bf3efe..75b69213c2e7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
> if (vm_swap_full())
> schedule_work(&si->reclaim_work);
> }
> + atomic_long_sub(nr_entries, &nr_swap_pages);
> }
>
> static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> @@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> goto out_free;
>
> - atomic_long_sub(size, &nr_swap_pages);
> return 0;
>
> out_free:
> --
> 2.30.0
Good catch! Moving the counter update to swap_range_alloc makes the
logic much easier to follow.
Reviewed-by: Kairui Song <kasong@tencent.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 0/4] Some randome fixes and cleanups to swapfile
@ 2025-05-22 12:25 Kemeng Shi
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can
be found in respective patches. Thanks.
Kemeng Shi (4):
mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap()
to swap_range_alloc()
mm: swap: correctly use maxpages in swapon syscall to avoid potensial
deadloop
mm: swap: fix potensial buffer overflow in setup_clusters()
mm: swap: remove stale comment stale comment in
cluster_alloc_swap_entry()
mm/swapfile.c | 64 ++++++++++++++++++++++++---------------------------
1 file changed, 30 insertions(+), 34 deletions(-)
--
2.30.0
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
2025-05-22 3:55 ` Kairui Song
2025-05-30 1:31 ` Baoquan He
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
` (3 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
When folio_alloc_swap() encounters a failure in either
mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
is not decremented for allocated entry. However, the following
put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
an imbalance.
Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
to pair the nr_swap_pages counting.
Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 026090bf3efe..75b69213c2e7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
if (vm_swap_full())
schedule_work(&si->reclaim_work);
}
+ atomic_long_sub(nr_entries, &nr_swap_pages);
}
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
goto out_free;
- atomic_long_sub(size, &nr_swap_pages);
return 0;
out_free:
--
2.30.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
2025-05-25 17:08 ` Kairui Song
2025-05-30 2:50 ` Baoquan He
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
` (2 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
We use maxpages from read_swap_header() to initialize swap_info_struct,
however the maxpages might be reduced in setup_swap_extents() and the
si->max is assigned with the reduced maxpages from the
setup_swap_extents().
Obviously, this could lead to memory waste as we allocated memory based on
larger maxpages, besides, this could lead to a potensial deadloop as
following:
1) When calling setup_clusters() with larger maxpages, unavailable pages
within range [si->max, larger maxpages) are not accounted with
inc_cluster_info_page(). As a result, these pages are assumed available
but can not be allocated. The cluster contains these pages can be moved
to frag_clusters list after it's all available pages were allocated.
2) When the cluster mentioned in 1) is the only cluster in frag_clusters
list, cluster_alloc_swap_entry() assume order 0 allocation will never
failed and will enter a deadloop by keep trying to allocate page from the
only cluster in frag_clusters which contains no actually available page.
Call setup_swap_extents() to get the final maxpages before swap_info_struct
initialization to fix the issue.
Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 75b69213c2e7..a82f4ebefca3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
return maxpages;
}
-static int setup_swap_map_and_extents(struct swap_info_struct *si,
- union swap_header *swap_header,
- unsigned char *swap_map,
- unsigned long maxpages,
- sector_t *span)
+static int setup_swap_map(struct swap_info_struct *si,
+ union swap_header *swap_header,
+ unsigned char *swap_map,
+ unsigned long maxpages)
{
- unsigned int nr_good_pages;
unsigned long i;
- int nr_extents;
-
- nr_good_pages = maxpages - 1; /* omit header page */
+ swap_map[0] = SWAP_MAP_BAD; /* omit header page */
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
return -EINVAL;
if (page_nr < maxpages) {
swap_map[page_nr] = SWAP_MAP_BAD;
- nr_good_pages--;
+ si->pages--;
}
}
- if (nr_good_pages) {
- swap_map[0] = SWAP_MAP_BAD;
- si->max = maxpages;
- si->pages = nr_good_pages;
- nr_extents = setup_swap_extents(si, span);
- if (nr_extents < 0)
- return nr_extents;
- nr_good_pages = si->pages;
- }
- if (!nr_good_pages) {
+ if (!si->pages) {
pr_warn("Empty swap-file\n");
return -EINVAL;
}
- return nr_extents;
+ return 0;
}
#define SWAP_CLUSTER_INFO_COLS \
@@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
* Mark unusable pages as unavailable. The clusters aren't
* marked free yet, so no list operations are involved yet.
*
- * See setup_swap_map_and_extents(): header page, bad pages,
+ * See setup_swap_map(): header page, bad pages,
* and the EOF part of the last cluster.
*/
inc_cluster_info_page(si, cluster_info, 0);
@@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
+ si->max = maxpages;
+ si->pages = maxpages - 1;
+ nr_extents = setup_swap_extents(si, &span);
+ if (nr_extents < 0) {
+ error = nr_extents;
+ goto bad_swap_unlock_inode;
+ }
+ maxpages = si->max;
+
/* OK, set up the swap map and apply the bad block list */
swap_map = vzalloc(maxpages);
if (!swap_map) {
@@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
- maxpages, &span);
- if (unlikely(nr_extents < 0)) {
- error = nr_extents;
+ error = setup_swap_map(si, swap_header, swap_map, maxpages);
+ if (error)
goto bad_swap_unlock_inode;
- }
/*
* Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
--
2.30.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
2025-05-25 18:44 ` Kairui Song
2025-05-30 2:56 ` Baoquan He
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton
4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
In setup_swap_map(), we only ensure badpages are in range (0, last_page].
As maxpages might be < last_page, setup_clusters() will encounter a
buffer overflow when a badpage is >= maxpages.
Only call inc_cluster_info_page() for badpage which is < maxpages to
fix the issue.
Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
mm/swapfile.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a82f4ebefca3..63ab9f14b2c6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
* and the EOF part of the last cluster.
*/
inc_cluster_info_page(si, cluster_info, 0);
- for (i = 0; i < swap_header->info.nr_badpages; i++)
- inc_cluster_info_page(si, cluster_info,
- swap_header->info.badpages[i]);
+ for (i = 0; i < swap_header->info.nr_badpages; i++) {
+ unsigned int page_nr = swap_header->info.badpages[i];
+
+ if (page_nr >= maxpages)
+ continue;
+ inc_cluster_info_page(si, cluster_info, page_nr);
+ }
for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
inc_cluster_info_page(si, cluster_info, i);
--
2.30.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
` (2 preceding siblings ...)
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
2025-05-25 17:05 ` Kairui Song
2025-05-30 5:24 ` Baoquan He
2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton
4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
As cluster_next_cpu was already dropped, the associated comment is stale
now.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
mm/swapfile.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63ab9f14b2c6..8525515fb06c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
}
/*
- * We don't have free cluster but have some clusters in
- * discarding, do discard now and reclaim them, then
- * reread cluster_next_cpu since we dropped si->lock
+ * We don't have free cluster but have some clusters in discarding,
+ * do discard now and reclaim them.
*/
if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
goto new_cluster;
--
2.30.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Some randome fixes and cleanups to swapfile
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
` (3 preceding siblings ...)
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
@ 2025-05-22 21:41 ` Andrew Morton
4 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2025-05-22 21:41 UTC (permalink / raw)
To: Kemeng Shi; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel
On Thu, 22 May 2025 20:25:50 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can
> be found in respective patches. Thanks.
Cool. I'll add these to mm-new but I won't advance them to mm-unstable
until after the merge window, so we have 2+ weeks for review and test.
I added cc:stable to patches 1-3, so they'll eventually find their way
into earlier kernels.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
@ 2025-05-25 17:05 ` Kairui Song
2025-05-30 5:24 ` Baoquan He
1 sibling, 0 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-25 17:05 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> As cluster_next_cpu was already dropped, the associated comment is stale
> now.
>
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63ab9f14b2c6..8525515fb06c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> }
>
> /*
> - * We don't have free cluster but have some clusters in
> - * discarding, do discard now and reclaim them, then
> - * reread cluster_next_cpu since we dropped si->lock
> + * We don't have free cluster but have some clusters in discarding,
> + * do discard now and reclaim them.
> */
> if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> goto new_cluster;
> --
> 2.30.0
>
Nice.
Reviewed-by: Kairui Song <kasong@tencent.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
@ 2025-05-25 17:08 ` Kairui Song
2025-06-11 7:54 ` Kemeng Shi
2025-05-30 2:50 ` Baoquan He
1 sibling, 1 reply; 20+ messages in thread
From: Kairui Song @ 2025-05-25 17:08 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> We use maxpages from read_swap_header() to initialize swap_info_struct,
> however the maxpages might be reduced in setup_swap_extents() and the
> si->max is assigned with the reduced maxpages from the
> setup_swap_extents().
>
> Obviously, this could lead to memory waste as we allocated memory based on
> larger maxpages, besides, this could lead to a potensial deadloop as
> following:
> 1) When calling setup_clusters() with larger maxpages, unavailable pages
> within range [si->max, larger maxpages) are not accounted with
> inc_cluster_info_page(). As a result, these pages are assumed available
> but can not be allocated. The cluster contains these pages can be moved
> to frag_clusters list after it's all available pages were allocated.
> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
> list, cluster_alloc_swap_entry() assume order 0 allocation will never
> failed and will enter a deadloop by keep trying to allocate page from the
> only cluster in frag_clusters which contains no actually available page.
>
> Call setup_swap_extents() to get the final maxpages before swap_info_struct
> initialization to fix the issue.
>
> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
> 1 file changed, 20 insertions(+), 27 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 75b69213c2e7..a82f4ebefca3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
> return maxpages;
> }
>
> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
> - union swap_header *swap_header,
> - unsigned char *swap_map,
> - unsigned long maxpages,
> - sector_t *span)
> +static int setup_swap_map(struct swap_info_struct *si,
> + union swap_header *swap_header,
> + unsigned char *swap_map,
> + unsigned long maxpages)
> {
> - unsigned int nr_good_pages;
> unsigned long i;
> - int nr_extents;
> -
> - nr_good_pages = maxpages - 1; /* omit header page */
>
> + swap_map[0] = SWAP_MAP_BAD; /* omit header page */
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> if (page_nr == 0 || page_nr > swap_header->info.last_page)
> return -EINVAL;
> if (page_nr < maxpages) {
> swap_map[page_nr] = SWAP_MAP_BAD;
> - nr_good_pages--;
> + si->pages--;
> }
> }
>
> - if (nr_good_pages) {
> - swap_map[0] = SWAP_MAP_BAD;
> - si->max = maxpages;
> - si->pages = nr_good_pages;
> - nr_extents = setup_swap_extents(si, span);
> - if (nr_extents < 0)
> - return nr_extents;
> - nr_good_pages = si->pages;
> - }
> - if (!nr_good_pages) {
> + if (!si->pages) {
> pr_warn("Empty swap-file\n");
> return -EINVAL;
> }
>
>
> - return nr_extents;
> + return 0;
> }
>
> #define SWAP_CLUSTER_INFO_COLS \
> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> * Mark unusable pages as unavailable. The clusters aren't
> * marked free yet, so no list operations are involved yet.
> *
> - * See setup_swap_map_and_extents(): header page, bad pages,
> + * See setup_swap_map(): header page, bad pages,
> * and the EOF part of the last cluster.
> */
> inc_cluster_info_page(si, cluster_info, 0);
> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto bad_swap_unlock_inode;
> }
>
> + si->max = maxpages;
> + si->pages = maxpages - 1;
> + nr_extents = setup_swap_extents(si, &span);
> + if (nr_extents < 0) {
> + error = nr_extents;
> + goto bad_swap_unlock_inode;
> + }
> + maxpages = si->max;
There seems to be a trivial problem here, previously the si->pages
will be seen by swap_activate after bad blocks have been counted and
si->pages means the actual available slots. But now si->pages will be
seen by swap_active as `maxpages - 1`.
One current side effect now is the span value will not be updated
properly so the pr_info in swap on may print a larger value, if the
swap header contains badblocks and swapfile is on nfs/cifs.
This should not be a problem but it's better to mention or add
comments about it.
And I think it's better to add a sanity check here to check if
si->pages still equal to si->max - 1, setup_swap_map_and_extents /
setup_swap_map assumes the header section was already counted. This
also helps indicate the setup_swap_extents may shrink and modify these
two values.
BTW, I was thinking that we should get rid of the whole extents design
after the swap table series is ready, so mTHP allocation will be
usable for swap over fs too.
> /* OK, set up the swap map and apply the bad block list */
> swap_map = vzalloc(maxpages);
> if (!swap_map) {
> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> if (error)
> goto bad_swap_unlock_inode;
>
> - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
> - maxpages, &span);
> - if (unlikely(nr_extents < 0)) {
> - error = nr_extents;
> + error = setup_swap_map(si, swap_header, swap_map, maxpages);
> + if (error)
> goto bad_swap_unlock_inode;
> - }
>
> /*
> * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
> --
> 2.30.0
>
Other than that:
Reviewed-by: Kairui Song <kasong@tencent.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
@ 2025-05-25 18:44 ` Kairui Song
2025-05-30 2:55 ` Baoquan He
2025-06-11 8:27 ` Kemeng Shi
2025-05-30 2:56 ` Baoquan He
1 sibling, 2 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-25 18:44 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> As maxpages might be < last_page, setup_clusters() will encounter a
> buffer overflow when a badpage is >= maxpages.
> Only call inc_cluster_info_page() for badpage which is < maxpages to
> fix the issue.
>
> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a82f4ebefca3..63ab9f14b2c6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> * and the EOF part of the last cluster.
> */
> inc_cluster_info_page(si, cluster_info, 0);
> - for (i = 0; i < swap_header->info.nr_badpages; i++)
> - inc_cluster_info_page(si, cluster_info,
> - swap_header->info.badpages[i]);
> + for (i = 0; i < swap_header->info.nr_badpages; i++) {
> + unsigned int page_nr = swap_header->info.badpages[i];
> +
> + if (page_nr >= maxpages)
> + continue;
> + inc_cluster_info_page(si, cluster_info, page_nr);
I think we might need a pr_err or pr_warn here, this means mkswap
marked the wrong region as a bad block? Or some fs side things went
wrong.
> + }
> for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> inc_cluster_info_page(si, cluster_info, i);
>
> --
> 2.30.0
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
2025-05-22 3:55 ` Kairui Song
@ 2025-05-30 1:31 ` Baoquan He
1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30 1:31 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel
On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> When folio_alloc_swap() encounters a failure in either
> mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
> is not decremented for allocated entry. However, the following
> put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
> an imbalance.
>
> Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
> to pair the nr_swap_pages counting.
>
> Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
LGTM,
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 026090bf3efe..75b69213c2e7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
> if (vm_swap_full())
> schedule_work(&si->reclaim_work);
> }
> + atomic_long_sub(nr_entries, &nr_swap_pages);
> }
>
> static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> @@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> goto out_free;
>
> - atomic_long_sub(size, &nr_swap_pages);
> return 0;
>
> out_free:
> --
> 2.30.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
2025-05-25 17:08 ` Kairui Song
@ 2025-05-30 2:50 ` Baoquan He
2025-06-11 8:27 ` Kemeng Shi
1 sibling, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-05-30 2:50 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel
On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> We use maxpages from read_swap_header() to initialize swap_info_struct,
> however the maxpages might be reduced in setup_swap_extents() and the
> si->max is assigned with the reduced maxpages from the
> setup_swap_extents().
> Obviously, this could lead to memory waste as we allocated memory based on
> larger maxpages, besides, this could lead to a potensial deadloop as
^ typo, potential
> following:
> 1) When calling setup_clusters() with larger maxpages, unavailable pages
> within range [si->max, larger maxpages) are not accounted with
> inc_cluster_info_page(). As a result, these pages are assumed available
> but can not be allocated. The cluster contains these pages can be moved
> to frag_clusters list after it's all available pages were allocated.
> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
> list, cluster_alloc_swap_entry() assume order 0 allocation will never
> failed and will enter a deadloop by keep trying to allocate page from the
> only cluster in frag_clusters which contains no actually available page.
>
> Call setup_swap_extents() to get the final maxpages before swap_info_struct
> initialization to fix the issue.
>
> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
> 1 file changed, 20 insertions(+), 27 deletions(-)
Reviedwed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 75b69213c2e7..a82f4ebefca3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
> return maxpages;
> }
>
> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
> - union swap_header *swap_header,
> - unsigned char *swap_map,
> - unsigned long maxpages,
> - sector_t *span)
> +static int setup_swap_map(struct swap_info_struct *si,
> + union swap_header *swap_header,
> + unsigned char *swap_map,
> + unsigned long maxpages)
> {
> - unsigned int nr_good_pages;
> unsigned long i;
> - int nr_extents;
> -
> - nr_good_pages = maxpages - 1; /* omit header page */
>
> + swap_map[0] = SWAP_MAP_BAD; /* omit header page */
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> if (page_nr == 0 || page_nr > swap_header->info.last_page)
> return -EINVAL;
> if (page_nr < maxpages) {
> swap_map[page_nr] = SWAP_MAP_BAD;
> - nr_good_pages--;
> + si->pages--;
> }
> }
>
> - if (nr_good_pages) {
> - swap_map[0] = SWAP_MAP_BAD;
> - si->max = maxpages;
> - si->pages = nr_good_pages;
> - nr_extents = setup_swap_extents(si, span);
> - if (nr_extents < 0)
> - return nr_extents;
> - nr_good_pages = si->pages;
> - }
> - if (!nr_good_pages) {
> + if (!si->pages) {
> pr_warn("Empty swap-file\n");
> return -EINVAL;
> }
>
> - return nr_extents;
> + return 0;
> }
>
> #define SWAP_CLUSTER_INFO_COLS \
> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> * Mark unusable pages as unavailable. The clusters aren't
> * marked free yet, so no list operations are involved yet.
> *
> - * See setup_swap_map_and_extents(): header page, bad pages,
> + * See setup_swap_map(): header page, bad pages,
> * and the EOF part of the last cluster.
> */
> inc_cluster_info_page(si, cluster_info, 0);
> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto bad_swap_unlock_inode;
> }
>
> + si->max = maxpages;
> + si->pages = maxpages - 1;
> + nr_extents = setup_swap_extents(si, &span);
> + if (nr_extents < 0) {
> + error = nr_extents;
> + goto bad_swap_unlock_inode;
> + }
> + maxpages = si->max;
> +
> /* OK, set up the swap map and apply the bad block list */
> swap_map = vzalloc(maxpages);
> if (!swap_map) {
> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> if (error)
> goto bad_swap_unlock_inode;
>
> - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
> - maxpages, &span);
> - if (unlikely(nr_extents < 0)) {
> - error = nr_extents;
> + error = setup_swap_map(si, swap_header, swap_map, maxpages);
> + if (error)
> goto bad_swap_unlock_inode;
> - }
>
> /*
> * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
> --
> 2.30.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
2025-05-25 18:44 ` Kairui Song
@ 2025-05-30 2:55 ` Baoquan He
2025-06-11 8:27 ` Kemeng Shi
1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30 2:55 UTC (permalink / raw)
To: Kairui Song; +Cc: Kemeng Shi, akpm, hannes, linux-mm, linux-kernel
On 05/26/25 at 02:44am, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> >
> > In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> > As maxpages might be < last_page, setup_clusters() will encounter a
> > buffer overflow when a badpage is >= maxpages.
> > Only call inc_cluster_info_page() for badpage which is < maxpages to
> > fix the issue.
> >
> > Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> > Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > ---
> > mm/swapfile.c | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index a82f4ebefca3..63ab9f14b2c6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> > * and the EOF part of the last cluster.
> > */
> > inc_cluster_info_page(si, cluster_info, 0);
> > - for (i = 0; i < swap_header->info.nr_badpages; i++)
> > - inc_cluster_info_page(si, cluster_info,
> > - swap_header->info.badpages[i]);
> > + for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > + unsigned int page_nr = swap_header->info.badpages[i];
> > +
> > + if (page_nr >= maxpages)
> > + continue;
> > + inc_cluster_info_page(si, cluster_info, page_nr);
>
> I think we might need a pr_err or pr_warn here, this means mkswap
> marked the wrong region as a bad block? Or some fs side things went
> wrong.
There's aready warning in read_swap_header():
static unsigned long read_swap_header(struct swap_info_struct *si,
union swap_header *swap_header,
struct inode *inode)
{
......
if (last_page > maxpages) {
pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
K(maxpages), K(last_page));
}
...
}
And if we add pr_err|warn here, we also need add it in setup_swap_map()
when filling swap_map.
>
>
> > + }
> > for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> > inc_cluster_info_page(si, cluster_info, i);
> >
> > --
> > 2.30.0
> >
> >
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
2025-05-25 18:44 ` Kairui Song
@ 2025-05-30 2:56 ` Baoquan He
1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30 2:56 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel
On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> As maxpages might be < last_page, setup_clusters() will encounter a
> buffer overflow when a badpage is >= maxpages.
> Only call inc_cluster_info_page() for badpage which is < maxpages to
> fix the issue.
>
> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a82f4ebefca3..63ab9f14b2c6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> * and the EOF part of the last cluster.
> */
> inc_cluster_info_page(si, cluster_info, 0);
> - for (i = 0; i < swap_header->info.nr_badpages; i++)
> - inc_cluster_info_page(si, cluster_info,
> - swap_header->info.badpages[i]);
> + for (i = 0; i < swap_header->info.nr_badpages; i++) {
> + unsigned int page_nr = swap_header->info.badpages[i];
> +
> + if (page_nr >= maxpages)
> + continue;
> + inc_cluster_info_page(si, cluster_info, page_nr);
> + }
> for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> inc_cluster_info_page(si, cluster_info, i);
>
> --
> 2.30.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
2025-05-25 17:05 ` Kairui Song
@ 2025-05-30 5:24 ` Baoquan He
1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30 5:24 UTC (permalink / raw)
To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel
On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> As cluster_next_cpu was already dropped, the associated comment is stale
> now.
>
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
> mm/swapfile.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63ab9f14b2c6..8525515fb06c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> }
>
> /*
> - * We don't have free cluster but have some clusters in
> - * discarding, do discard now and reclaim them, then
> - * reread cluster_next_cpu since we dropped si->lock
> + * We don't have free cluster but have some clusters in discarding,
> + * do discard now and reclaim them.
> */
> if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> goto new_cluster;
> --
> 2.30.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-05-25 17:08 ` Kairui Song
@ 2025-06-11 7:54 ` Kemeng Shi
2025-07-17 23:21 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11 7:54 UTC (permalink / raw)
To: Kairui Song; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
on 5/26/2025 1:08 AM, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>> We use maxpages from read_swap_header() to initialize swap_info_struct,
>> however the maxpages might be reduced in setup_swap_extents() and the
>> si->max is assigned with the reduced maxpages from the
>> setup_swap_extents().
>>
>> Obviously, this could lead to memory waste as we allocated memory based on
>> larger maxpages, besides, this could lead to a potensial deadloop as
>> following:
>> 1) When calling setup_clusters() with larger maxpages, unavailable pages
>> within range [si->max, larger maxpages) are not accounted with
>> inc_cluster_info_page(). As a result, these pages are assumed available
>> but can not be allocated. The cluster contains these pages can be moved
>> to frag_clusters list after it's all available pages were allocated.
>> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
>> list, cluster_alloc_swap_entry() assume order 0 allocation will never
>> failed and will enter a deadloop by keep trying to allocate page from the
>> only cluster in frag_clusters which contains no actually available page.
>>
>> Call setup_swap_extents() to get the final maxpages before swap_info_struct
>> initialization to fix the issue.
>>
>> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>> mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>> 1 file changed, 20 insertions(+), 27 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 75b69213c2e7..a82f4ebefca3 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>> return maxpages;
>> }
>>
>> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
>> - union swap_header *swap_header,
>> - unsigned char *swap_map,
>> - unsigned long maxpages,
>> - sector_t *span)
>> +static int setup_swap_map(struct swap_info_struct *si,
>> + union swap_header *swap_header,
>> + unsigned char *swap_map,
>> + unsigned long maxpages)
>> {
>> - unsigned int nr_good_pages;
>> unsigned long i;
>> - int nr_extents;
>> -
>> - nr_good_pages = maxpages - 1; /* omit header page */
>>
>> + swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>> for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> unsigned int page_nr = swap_header->info.badpages[i];
>> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>> return -EINVAL;
>> if (page_nr < maxpages) {
>> swap_map[page_nr] = SWAP_MAP_BAD;
>> - nr_good_pages--;
>> + si->pages--;
>> }
>> }
>>
>> - if (nr_good_pages) {
>> - swap_map[0] = SWAP_MAP_BAD;
>> - si->max = maxpages;
>> - si->pages = nr_good_pages;
>> - nr_extents = setup_swap_extents(si, span);
>> - if (nr_extents < 0)
>> - return nr_extents;
>> - nr_good_pages = si->pages;
>> - }
>> - if (!nr_good_pages) {
>> + if (!si->pages) {
>> pr_warn("Empty swap-file\n");
>> return -EINVAL;
>> }
>>
>>
>> - return nr_extents;
>> + return 0;
>> }
>>
>> #define SWAP_CLUSTER_INFO_COLS \
>> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>> * Mark unusable pages as unavailable. The clusters aren't
>> * marked free yet, so no list operations are involved yet.
>> *
>> - * See setup_swap_map_and_extents(): header page, bad pages,
>> + * See setup_swap_map(): header page, bad pages,
>> * and the EOF part of the last cluster.
>> */
>> inc_cluster_info_page(si, cluster_info, 0);
>> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> goto bad_swap_unlock_inode;
>> }
>>
>> + si->max = maxpages;
>> + si->pages = maxpages - 1;
>> + nr_extents = setup_swap_extents(si, &span);
>> + if (nr_extents < 0) {
>> + error = nr_extents;
>> + goto bad_swap_unlock_inode;
>> + }
>> + maxpages = si->max;
>
Hello,
> There seems to be a trivial problem here, previously the si->pages
> will be seen by swap_activate after bad blocks have been counted and
> si->pages means the actual available slots. But now si->pages will be
> seen by swap_active as `maxpages - 1`.
>
> One current side effect now is the span value will not be updated
> properly so the pr_info in swap on may print a larger value, if the
> swap header contains badblocks and swapfile is on nfs/cifs.
Thanks for point this out. But I think the larger value is actually
correct result.
In summary, there are two kinds of swapfile_activate operations.
1. Filesystem style: Treat all blocks logical continuity and find
useable physical extents in logical range. In this way, si->pages
will be actual useable physical blocks and span will be "1 +
highest_block - lowest_block".
2. Block device style: Treat all blocks physically continue and
only one single extent is added. In this way, si->pages will be
si->max and span will be "si->pages - 1".
Actually, si->pages and si->max is only used in block device style
and span value is set with si->pages. As a result, span value in
block device style will become a larger value as you mentioned.
I think larger value is correct based on:
1. Span value in filesystem style is "1 + highest_block -
lowest_block" which is the range cover all possible phisical blocks
including the badblocks.
2. For block device style, si->pages is the actual useable block
number and is already in pr_info. The orignal span value before
this patch is also refer to useable block number which is redundant
in pr_info.
>
> This should not be a problem but it's better to mention or add
> comments about it
I'd like to mention this change as a fix in changelog in next version.
>
> And I think it's better to add a sanity check here to check if
> si->pages still equal to si->max - 1, setup_swap_map_and_extents /
> setup_swap_map assumes the header section was already counted. This
> also helps indicate the setup_swap_extents may shrink and modify these
> two values.
Sure, will add this in next version.
>
> BTW, I was thinking that we should get rid of the whole extents design
> after the swap table series is ready, so mTHP allocation will be
> usable for swap over fs too.
I also noticed this limitation but have not taken a deep look. Look
forward to your solution in future.
>
>> /* OK, set up the swap map and apply the bad block list */
>> swap_map = vzalloc(maxpages);
>> if (!swap_map) {
>> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> if (error)
>> goto bad_swap_unlock_inode;
>>
>> - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
>> - maxpages, &span);
>> - if (unlikely(nr_extents < 0)) {
>> - error = nr_extents;
>> + error = setup_swap_map(si, swap_header, swap_map, maxpages);
>> + if (error)
>> goto bad_swap_unlock_inode;
>> - }
>>
>> /*
>> * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
>> --
>> 2.30.0
>>
>
> Other than that:
>
> Reviewed-by: Kairui Song <kasong@tencent.com>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
2025-05-25 18:44 ` Kairui Song
2025-05-30 2:55 ` Baoquan He
@ 2025-06-11 8:27 ` Kemeng Shi
1 sibling, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11 8:27 UTC (permalink / raw)
To: Kairui Song; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel
on 5/26/2025 2:44 AM, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
>> As maxpages might be < last_page, setup_clusters() will encounter a
>> buffer overflow when a badpage is >= maxpages.
>> Only call inc_cluster_info_page() for badpage which is < maxpages to
>> fix the issue.
>>
>> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>> mm/swapfile.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index a82f4ebefca3..63ab9f14b2c6 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>> * and the EOF part of the last cluster.
>> */
>> inc_cluster_info_page(si, cluster_info, 0);
>> - for (i = 0; i < swap_header->info.nr_badpages; i++)
>> - inc_cluster_info_page(si, cluster_info,
>> - swap_header->info.badpages[i]);
>> + for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> + unsigned int page_nr = swap_header->info.badpages[i];
>> +
>> + if (page_nr >= maxpages)
>> + continue;
>> + inc_cluster_info_page(si, cluster_info, page_nr);
>
> I think we might need a pr_err or pr_warn here, this means mkswap
> marked the wrong region as a bad block? Or some fs side things went
> wrong.
As Baoquan metioned that there is already warning in read_swap_header().
Besides, I think last_page in swap_header already indicates the range
which is acceptable to fs and we only need to explicitly handle page_nr
which is > last_page.
>
>
>> + }
>> for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
>> inc_cluster_info_page(si, cluster_info, i);
>>
>> --
>> 2.30.0
>>
>>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-05-30 2:50 ` Baoquan He
@ 2025-06-11 8:27 ` Kemeng Shi
0 siblings, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11 8:27 UTC (permalink / raw)
To: Baoquan He; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel
on 5/30/2025 10:50 AM, Baoquan He wrote:
> On 05/22/25 at 08:25pm, Kemeng Shi wrote:
>> We use maxpages from read_swap_header() to initialize swap_info_struct,
>> however the maxpages might be reduced in setup_swap_extents() and the
>> si->max is assigned with the reduced maxpages from the
>> setup_swap_extents().
>> Obviously, this could lead to memory waste as we allocated memory based on
>> larger maxpages, besides, this could lead to a potensial deadloop as
> ^ typo, potential
Thanks, will fix this in next version.
>> following:
>> 1) When calling setup_clusters() with larger maxpages, unavailable pages
>> within range [si->max, larger maxpages) are not accounted with
>> inc_cluster_info_page(). As a result, these pages are assumed available
>> but can not be allocated. The cluster contains these pages can be moved
>> to frag_clusters list after it's all available pages were allocated.
>> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
>> list, cluster_alloc_swap_entry() assume order 0 allocation will never
>> failed and will enter a deadloop by keep trying to allocate page from the
>> only cluster in frag_clusters which contains no actually available page.
>>
>> Call setup_swap_extents() to get the final maxpages before swap_info_struct
>> initialization to fix the issue.
>>
>> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>> mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>> 1 file changed, 20 insertions(+), 27 deletions(-)
>
> Reviedwed-by: Baoquan He <bhe@redhat.com>
>
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 75b69213c2e7..a82f4ebefca3 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>> return maxpages;
>> }
>>
>> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
>> - union swap_header *swap_header,
>> - unsigned char *swap_map,
>> - unsigned long maxpages,
>> - sector_t *span)
>> +static int setup_swap_map(struct swap_info_struct *si,
>> + union swap_header *swap_header,
>> + unsigned char *swap_map,
>> + unsigned long maxpages)
>> {
>> - unsigned int nr_good_pages;
>> unsigned long i;
>> - int nr_extents;
>> -
>> - nr_good_pages = maxpages - 1; /* omit header page */
>>
>> + swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>> for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> unsigned int page_nr = swap_header->info.badpages[i];
>> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>> return -EINVAL;
>> if (page_nr < maxpages) {
>> swap_map[page_nr] = SWAP_MAP_BAD;
>> - nr_good_pages--;
>> + si->pages--;
>> }
>> }
>>
>> - if (nr_good_pages) {
>> - swap_map[0] = SWAP_MAP_BAD;
>> - si->max = maxpages;
>> - si->pages = nr_good_pages;
>> - nr_extents = setup_swap_extents(si, span);
>> - if (nr_extents < 0)
>> - return nr_extents;
>> - nr_good_pages = si->pages;
>> - }
>> - if (!nr_good_pages) {
>> + if (!si->pages) {
>> pr_warn("Empty swap-file\n");
>> return -EINVAL;
>> }
>>
>> - return nr_extents;
>> + return 0;
>> }
>>
>> #define SWAP_CLUSTER_INFO_COLS \
>> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>> * Mark unusable pages as unavailable. The clusters aren't
>> * marked free yet, so no list operations are involved yet.
>> *
>> - * See setup_swap_map_and_extents(): header page, bad pages,
>> + * See setup_swap_map(): header page, bad pages,
>> * and the EOF part of the last cluster.
>> */
>> inc_cluster_info_page(si, cluster_info, 0);
>> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> goto bad_swap_unlock_inode;
>> }
>>
>> + si->max = maxpages;
>> + si->pages = maxpages - 1;
>> + nr_extents = setup_swap_extents(si, &span);
>> + if (nr_extents < 0) {
>> + error = nr_extents;
>> + goto bad_swap_unlock_inode;
>> + }
>> + maxpages = si->max;
>> +
>> /* OK, set up the swap map and apply the bad block list */
>> swap_map = vzalloc(maxpages);
>> if (!swap_map) {
>> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>> if (error)
>> goto bad_swap_unlock_inode;
>>
>> - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
>> - maxpages, &span);
>> - if (unlikely(nr_extents < 0)) {
>> - error = nr_extents;
>> + error = setup_swap_map(si, swap_header, swap_map, maxpages);
>> + if (error)
>> goto bad_swap_unlock_inode;
>> - }
>>
>> /*
>> * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
>> --
>> 2.30.0
>>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-06-11 7:54 ` Kemeng Shi
@ 2025-07-17 23:21 ` Andrew Morton
2025-07-18 6:12 ` Kemeng Shi
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2025-07-17 23:21 UTC (permalink / raw)
To: Kemeng Shi; +Cc: Kairui Song, bhe, hannes, linux-mm, linux-kernel
On Wed, 11 Jun 2025 15:54:21 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
>
> on 5/26/2025 1:08 AM, Kairui Song wrote:
Nearly two months!
> Sure, will add this in next version.
Do we actually need a new version? Having rescanned the v1 review I'm
inclined to merge this series as-is?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
2025-07-17 23:21 ` Andrew Morton
@ 2025-07-18 6:12 ` Kemeng Shi
0 siblings, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-07-18 6:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: Kairui Song, bhe, hannes, linux-mm, linux-kernel
on 7/18/2025 7:21 AM, Andrew Morton wrote:
> On Wed, 11 Jun 2025 15:54:21 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
>>
>>
>> on 5/26/2025 1:08 AM, Kairui Song wrote:
>
> Nearly two months!
>
>> Sure, will add this in next version.
>
> Do we actually need a new version? Having rescanned the v1 review I'm
> inclined to merge this series as-is?
>
>
So sorry for the late. I will write and send a v2 version soon.
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2025-07-18 6:13 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
2025-05-22 3:55 ` Kairui Song
2025-05-30 1:31 ` Baoquan He
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
2025-05-25 17:08 ` Kairui Song
2025-06-11 7:54 ` Kemeng Shi
2025-07-17 23:21 ` Andrew Morton
2025-07-18 6:12 ` Kemeng Shi
2025-05-30 2:50 ` Baoquan He
2025-06-11 8:27 ` Kemeng Shi
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
2025-05-25 18:44 ` Kairui Song
2025-05-30 2:55 ` Baoquan He
2025-06-11 8:27 ` Kemeng Shi
2025-05-30 2:56 ` Baoquan He
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
2025-05-25 17:05 ` Kairui Song
2025-05-30 5:24 ` Baoquan He
2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.