linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
  2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
@ 2025-05-22  3:55   ` Kairui Song
  2025-05-30  1:31   ` Baoquan He
  1 sibling, 0 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-22  3:55 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel

On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> When folio_alloc_swap() encounters a failure in either
> mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
> is not decremented for allocated entry. However, the following
> put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
> an imbalance.
>
> Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
> to pair the nr_swap_pages counting.
>
> Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 026090bf3efe..75b69213c2e7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
>                 if (vm_swap_full())
>                         schedule_work(&si->reclaim_work);
>         }
> +       atomic_long_sub(nr_entries, &nr_swap_pages);
>  }
>
>  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> @@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>         if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
>                 goto out_free;
>
> -       atomic_long_sub(size, &nr_swap_pages);
>         return 0;
>
>  out_free:
> --
> 2.30.0

Good catch! Moving the counter update to swap_range_alloc makes the
logic much easier to follow.

Reviewed-by: Kairui Song <kasong@tencent.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 0/4] Some randome fixes and cleanups to swapfile
@ 2025-05-22 12:25 Kemeng Shi
  2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
  To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can
be found in respective patches. Thanks.

Kemeng Shi (4):
  mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap()
    to swap_range_alloc()
  mm: swap: correctly use maxpages in swapon syscall to avoid potensial
    deadloop
  mm: swap: fix potensial buffer overflow in setup_clusters()
  mm: swap: remove stale comment stale comment in
    cluster_alloc_swap_entry()

 mm/swapfile.c | 64 ++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 34 deletions(-)

-- 
2.30.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
  2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
  2025-05-22  3:55   ` Kairui Song
  2025-05-30  1:31   ` Baoquan He
  2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
  To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

When folio_alloc_swap() encounters a failure in either
mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
is not decremented for allocated entry. However, the following
put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
an imbalance.

Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
to pair the nr_swap_pages counting.

Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 026090bf3efe..75b69213c2e7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
 		if (vm_swap_full())
 			schedule_work(&si->reclaim_work);
 	}
+	atomic_long_sub(nr_entries, &nr_swap_pages);
 }
 
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
 		goto out_free;
 
-	atomic_long_sub(size, &nr_swap_pages);
 	return 0;
 
 out_free:
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
  2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
  2025-05-25 17:08   ` Kairui Song
  2025-05-30  2:50   ` Baoquan He
  2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
  To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

We use maxpages from read_swap_header() to initialize swap_info_struct,
however the maxpages might be reduced in setup_swap_extents() and the
si->max is assigned with the reduced maxpages from the
setup_swap_extents().
Obviously, this could lead to memory waste as we allocated memory based on
larger maxpages, besides, this could lead to a potensial deadloop as
following:
1) When calling setup_clusters() with larger maxpages, unavailable pages
within range [si->max, larger maxpages) are not accounted with
inc_cluster_info_page(). As a result, these pages are assumed available
but can not be allocated. The cluster contains these pages can be moved
to frag_clusters list after it's all available pages were allocated.
2) When the cluster mentioned in 1) is the only cluster in frag_clusters
list, cluster_alloc_swap_entry() assume order 0 allocation will never
failed and will enter a deadloop by keep trying to allocate page from the
only cluster in frag_clusters which contains no actually available page.

Call setup_swap_extents() to get the final maxpages before swap_info_struct
initialization to fix the issue.

Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 75b69213c2e7..a82f4ebefca3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	return maxpages;
 }
 
-static int setup_swap_map_and_extents(struct swap_info_struct *si,
-					union swap_header *swap_header,
-					unsigned char *swap_map,
-					unsigned long maxpages,
-					sector_t *span)
+static int setup_swap_map(struct swap_info_struct *si,
+			  union swap_header *swap_header,
+			  unsigned char *swap_map,
+			  unsigned long maxpages)
 {
-	unsigned int nr_good_pages;
 	unsigned long i;
-	int nr_extents;
-
-	nr_good_pages = maxpages - 1;	/* omit header page */
 
+	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)
 			return -EINVAL;
 		if (page_nr < maxpages) {
 			swap_map[page_nr] = SWAP_MAP_BAD;
-			nr_good_pages--;
+			si->pages--;
 		}
 	}
 
-	if (nr_good_pages) {
-		swap_map[0] = SWAP_MAP_BAD;
-		si->max = maxpages;
-		si->pages = nr_good_pages;
-		nr_extents = setup_swap_extents(si, span);
-		if (nr_extents < 0)
-			return nr_extents;
-		nr_good_pages = si->pages;
-	}
-	if (!nr_good_pages) {
+	if (!si->pages) {
 		pr_warn("Empty swap-file\n");
 		return -EINVAL;
 	}
 
-	return nr_extents;
+	return 0;
 }
 
 #define SWAP_CLUSTER_INFO_COLS						\
@@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * Mark unusable pages as unavailable. The clusters aren't
 	 * marked free yet, so no list operations are involved yet.
 	 *
-	 * See setup_swap_map_and_extents(): header page, bad pages,
+	 * See setup_swap_map(): header page, bad pages,
 	 * and the EOF part of the last cluster.
 	 */
 	inc_cluster_info_page(si, cluster_info, 0);
@@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 
+	si->max = maxpages;
+	si->pages = maxpages - 1;
+	nr_extents = setup_swap_extents(si, &span);
+	if (nr_extents < 0) {
+		error = nr_extents;
+		goto bad_swap_unlock_inode;
+	}
+	maxpages = si->max;
+
 	/* OK, set up the swap map and apply the bad block list */
 	swap_map = vzalloc(maxpages);
 	if (!swap_map) {
@@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (error)
 		goto bad_swap_unlock_inode;
 
-	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
-						maxpages, &span);
-	if (unlikely(nr_extents < 0)) {
-		error = nr_extents;
+	error = setup_swap_map(si, swap_header, swap_map, maxpages);
+	if (error)
 		goto bad_swap_unlock_inode;
-	}
 
 	/*
 	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
  2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
  2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
  2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
  2025-05-25 18:44   ` Kairui Song
  2025-05-30  2:56   ` Baoquan He
  2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
  2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton
  4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
  To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

In setup_swap_map(), we only ensure badpages are in range (0, last_page].
As maxpages might be < last_page, setup_clusters() will encounter a
buffer overflow when a badpage is >= maxpages.
Only call inc_cluster_info_page() for badpage which is < maxpages to
fix the issue.

Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/swapfile.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a82f4ebefca3..63ab9f14b2c6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * and the EOF part of the last cluster.
 	 */
 	inc_cluster_info_page(si, cluster_info, 0);
-	for (i = 0; i < swap_header->info.nr_badpages; i++)
-		inc_cluster_info_page(si, cluster_info,
-				      swap_header->info.badpages[i]);
+	for (i = 0; i < swap_header->info.nr_badpages; i++) {
+		unsigned int page_nr = swap_header->info.badpages[i];
+
+		if (page_nr >= maxpages)
+			continue;
+		inc_cluster_info_page(si, cluster_info, page_nr);
+	}
 	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
 		inc_cluster_info_page(si, cluster_info, i);
 
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
  2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
                   ` (2 preceding siblings ...)
  2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
@ 2025-05-22 12:25 ` Kemeng Shi
  2025-05-25 17:05   ` Kairui Song
  2025-05-30  5:24   ` Baoquan He
  2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton
  4 siblings, 2 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-05-22 12:25 UTC (permalink / raw)
  To: akpm; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

As cluster_next_cpu was already dropped, the associated comment is stale
now.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/swapfile.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63ab9f14b2c6..8525515fb06c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 	/*
-	 * We don't have free cluster but have some clusters in
-	 * discarding, do discard now and reclaim them, then
-	 * reread cluster_next_cpu since we dropped si->lock
+	 * We don't have free cluster but have some clusters in discarding,
+	 * do discard now and reclaim them.
 	 */
 	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
 		goto new_cluster;
-- 
2.30.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/4] Some randome fixes and cleanups to swapfile
  2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
                   ` (3 preceding siblings ...)
  2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
@ 2025-05-22 21:41 ` Andrew Morton
  4 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2025-05-22 21:41 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: kasong, bhe, hannes, linux-mm, linux-kernel

On Thu, 22 May 2025 20:25:50 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can
> be found in respective patches. Thanks.

Cool.  I'll add these to mm-new but I won't advance them to mm-unstable
until after the merge window, so we have 2+ weeks for review and test.

I added cc:stable to patches 1-3, so they'll eventually find their way
into earlier kernels.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
  2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
@ 2025-05-25 17:05   ` Kairui Song
  2025-05-30  5:24   ` Baoquan He
  1 sibling, 0 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-25 17:05 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel

On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> As cluster_next_cpu was already dropped, the associated comment is stale
> now.
>
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63ab9f14b2c6..8525515fb06c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>         /*
> -        * We don't have free cluster but have some clusters in
> -        * discarding, do discard now and reclaim them, then
> -        * reread cluster_next_cpu since we dropped si->lock
> +        * We don't have free cluster but have some clusters in discarding,
> +        * do discard now and reclaim them.
>          */
>         if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
>                 goto new_cluster;
> --
> 2.30.0
>

Nice.

Reviewed-by: Kairui Song <kasong@tencent.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
@ 2025-05-25 17:08   ` Kairui Song
  2025-06-11  7:54     ` Kemeng Shi
  2025-05-30  2:50   ` Baoquan He
  1 sibling, 1 reply; 20+ messages in thread
From: Kairui Song @ 2025-05-25 17:08 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel

On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> We use maxpages from read_swap_header() to initialize swap_info_struct,
> however the maxpages might be reduced in setup_swap_extents() and the
> si->max is assigned with the reduced maxpages from the
> setup_swap_extents().
>
> Obviously, this could lead to memory waste as we allocated memory based on
> larger maxpages, besides, this could lead to a potensial deadloop as
> following:
> 1) When calling setup_clusters() with larger maxpages, unavailable pages
> within range [si->max, larger maxpages) are not accounted with
> inc_cluster_info_page(). As a result, these pages are assumed available
> but can not be allocated. The cluster contains these pages can be moved
> to frag_clusters list after it's all available pages were allocated.
> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
> list, cluster_alloc_swap_entry() assume order 0 allocation will never
> failed and will enter a deadloop by keep trying to allocate page from the
> only cluster in frag_clusters which contains no actually available page.
>
> Call setup_swap_extents() to get the final maxpages before swap_info_struct
> initialization to fix the issue.
>
> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 27 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 75b69213c2e7..a82f4ebefca3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>         return maxpages;
>  }
>
> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
> -                                       union swap_header *swap_header,
> -                                       unsigned char *swap_map,
> -                                       unsigned long maxpages,
> -                                       sector_t *span)
> +static int setup_swap_map(struct swap_info_struct *si,
> +                         union swap_header *swap_header,
> +                         unsigned char *swap_map,
> +                         unsigned long maxpages)
>  {
> -       unsigned int nr_good_pages;
>         unsigned long i;
> -       int nr_extents;
> -
> -       nr_good_pages = maxpages - 1;   /* omit header page */
>
> +       swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
>                 unsigned int page_nr = swap_header->info.badpages[i];
>                 if (page_nr == 0 || page_nr > swap_header->info.last_page)
>                         return -EINVAL;
>                 if (page_nr < maxpages) {
>                         swap_map[page_nr] = SWAP_MAP_BAD;
> -                       nr_good_pages--;
> +                       si->pages--;
>                 }
>         }
>
> -       if (nr_good_pages) {
> -               swap_map[0] = SWAP_MAP_BAD;
> -               si->max = maxpages;
> -               si->pages = nr_good_pages;
> -               nr_extents = setup_swap_extents(si, span);
> -               if (nr_extents < 0)
> -                       return nr_extents;
> -               nr_good_pages = si->pages;
> -       }
> -       if (!nr_good_pages) {
> +       if (!si->pages) {
>                 pr_warn("Empty swap-file\n");
>                 return -EINVAL;
>         }
>
>
> -       return nr_extents;
> +       return 0;
>  }
>
>  #define SWAP_CLUSTER_INFO_COLS                                         \
> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>          * Mark unusable pages as unavailable. The clusters aren't
>          * marked free yet, so no list operations are involved yet.
>          *
> -        * See setup_swap_map_and_extents(): header page, bad pages,
> +        * See setup_swap_map(): header page, bad pages,
>          * and the EOF part of the last cluster.
>          */
>         inc_cluster_info_page(si, cluster_info, 0);
> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 goto bad_swap_unlock_inode;
>         }
>
> +       si->max = maxpages;
> +       si->pages = maxpages - 1;
> +       nr_extents = setup_swap_extents(si, &span);
> +       if (nr_extents < 0) {
> +               error = nr_extents;
> +               goto bad_swap_unlock_inode;
> +       }
> +       maxpages = si->max;

There seems to be a trivial problem here, previously the si->pages
will be seen by swap_activate after bad blocks have been counted and
si->pages means the actual available slots. But now si->pages will be
seen by swap_active as `maxpages - 1`.

One current side effect now is the span value will not be updated
properly so the pr_info in swap on may print a larger value, if the
swap header contains badblocks and swapfile is on nfs/cifs.

This should not be a problem but it's better to mention or add
comments about it.

And I think it's better to add a sanity check here to check if
si->pages still equal to si->max - 1,  setup_swap_map_and_extents /
setup_swap_map assumes the header section was already counted. This
also helps indicate the setup_swap_extents may shrink and modify these
two values.

BTW, I was thinking that we should get rid of the whole extents design
after the swap table series is ready, so mTHP allocation will be
usable for swap over fs too.

>         /* OK, set up the swap map and apply the bad block list */
>         swap_map = vzalloc(maxpages);
>         if (!swap_map) {
> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (error)
>                 goto bad_swap_unlock_inode;
>
> -       nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
> -                                               maxpages, &span);
> -       if (unlikely(nr_extents < 0)) {
> -               error = nr_extents;
> +       error = setup_swap_map(si, swap_header, swap_map, maxpages);
> +       if (error)
>                 goto bad_swap_unlock_inode;
> -       }
>
>         /*
>          * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
> --
> 2.30.0
>

Other than that:

Reviewed-by: Kairui Song <kasong@tencent.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
  2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
@ 2025-05-25 18:44   ` Kairui Song
  2025-05-30  2:55     ` Baoquan He
  2025-06-11  8:27     ` Kemeng Shi
  2025-05-30  2:56   ` Baoquan He
  1 sibling, 2 replies; 20+ messages in thread
From: Kairui Song @ 2025-05-25 18:44 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel

On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> As maxpages might be < last_page, setup_clusters() will encounter a
> buffer overflow when a badpage is >= maxpages.
> Only call inc_cluster_info_page() for badpage which is < maxpages to
> fix the issue.
>
> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a82f4ebefca3..63ab9f14b2c6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>          * and the EOF part of the last cluster.
>          */
>         inc_cluster_info_page(si, cluster_info, 0);
> -       for (i = 0; i < swap_header->info.nr_badpages; i++)
> -               inc_cluster_info_page(si, cluster_info,
> -                                     swap_header->info.badpages[i]);
> +       for (i = 0; i < swap_header->info.nr_badpages; i++) {
> +               unsigned int page_nr = swap_header->info.badpages[i];
> +
> +               if (page_nr >= maxpages)
> +                       continue;
> +               inc_cluster_info_page(si, cluster_info, page_nr);

I think we might need a pr_err or pr_warn here, this means mkswap
marked the wrong region as a bad block? Or some fs side things went
wrong.


> +       }
>         for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
>                 inc_cluster_info_page(si, cluster_info, i);
>
> --
> 2.30.0
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
  2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
  2025-05-22  3:55   ` Kairui Song
@ 2025-05-30  1:31   ` Baoquan He
  1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30  1:31 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel

On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> When folio_alloc_swap() encounters a failure in either
> mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter
> is not decremented for allocated entry. However, the following
> put_swap_folio() will increase nr_swap_pages counter unpairly and lead to
> an imbalance.
> 
> Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc()
> to pair the nr_swap_pages counting.
> 
> Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 026090bf3efe..75b69213c2e7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
>  		if (vm_swap_full())
>  			schedule_work(&si->reclaim_work);
>  	}
> +	atomic_long_sub(nr_entries, &nr_swap_pages);
>  }
>  
>  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> @@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>  	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
>  		goto out_free;
>  
> -	atomic_long_sub(size, &nr_swap_pages);
>  	return 0;
>  
>  out_free:
> -- 
> 2.30.0
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
  2025-05-25 17:08   ` Kairui Song
@ 2025-05-30  2:50   ` Baoquan He
  2025-06-11  8:27     ` Kemeng Shi
  1 sibling, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-05-30  2:50 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel

On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> We use maxpages from read_swap_header() to initialize swap_info_struct,
> however the maxpages might be reduced in setup_swap_extents() and the
> si->max is assigned with the reduced maxpages from the
> setup_swap_extents().
> Obviously, this could lead to memory waste as we allocated memory based on
> larger maxpages, besides, this could lead to a potensial deadloop as
                                                 ^ typo, potential
> following:
> 1) When calling setup_clusters() with larger maxpages, unavailable pages
> within range [si->max, larger maxpages) are not accounted with
> inc_cluster_info_page(). As a result, these pages are assumed available
> but can not be allocated. The cluster contains these pages can be moved
> to frag_clusters list after it's all available pages were allocated.
> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
> list, cluster_alloc_swap_entry() assume order 0 allocation will never
> failed and will enter a deadloop by keep trying to allocate page from the
> only cluster in frag_clusters which contains no actually available page.
> 
> Call setup_swap_extents() to get the final maxpages before swap_info_struct
> initialization to fix the issue.
> 
> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 27 deletions(-)

Reviedwed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 75b69213c2e7..a82f4ebefca3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>  	return maxpages;
>  }
>  
> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
> -					union swap_header *swap_header,
> -					unsigned char *swap_map,
> -					unsigned long maxpages,
> -					sector_t *span)
> +static int setup_swap_map(struct swap_info_struct *si,
> +			  union swap_header *swap_header,
> +			  unsigned char *swap_map,
> +			  unsigned long maxpages)
>  {
> -	unsigned int nr_good_pages;
>  	unsigned long i;
> -	int nr_extents;
> -
> -	nr_good_pages = maxpages - 1;	/* omit header page */
>  
> +	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>  		unsigned int page_nr = swap_header->info.badpages[i];
>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>  			return -EINVAL;
>  		if (page_nr < maxpages) {
>  			swap_map[page_nr] = SWAP_MAP_BAD;
> -			nr_good_pages--;
> +			si->pages--;
>  		}
>  	}
>  
> -	if (nr_good_pages) {
> -		swap_map[0] = SWAP_MAP_BAD;
> -		si->max = maxpages;
> -		si->pages = nr_good_pages;
> -		nr_extents = setup_swap_extents(si, span);
> -		if (nr_extents < 0)
> -			return nr_extents;
> -		nr_good_pages = si->pages;
> -	}
> -	if (!nr_good_pages) {
> +	if (!si->pages) {
>  		pr_warn("Empty swap-file\n");
>  		return -EINVAL;
>  	}
>  
> -	return nr_extents;
> +	return 0;
>  }
>  
>  #define SWAP_CLUSTER_INFO_COLS						\
> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  	 * Mark unusable pages as unavailable. The clusters aren't
>  	 * marked free yet, so no list operations are involved yet.
>  	 *
> -	 * See setup_swap_map_and_extents(): header page, bad pages,
> +	 * See setup_swap_map(): header page, bad pages,
>  	 * and the EOF part of the last cluster.
>  	 */
>  	inc_cluster_info_page(si, cluster_info, 0);
> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		goto bad_swap_unlock_inode;
>  	}
>  
> +	si->max = maxpages;
> +	si->pages = maxpages - 1;
> +	nr_extents = setup_swap_extents(si, &span);
> +	if (nr_extents < 0) {
> +		error = nr_extents;
> +		goto bad_swap_unlock_inode;
> +	}
> +	maxpages = si->max;
> +
>  	/* OK, set up the swap map and apply the bad block list */
>  	swap_map = vzalloc(maxpages);
>  	if (!swap_map) {
> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  	if (error)
>  		goto bad_swap_unlock_inode;
>  
> -	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
> -						maxpages, &span);
> -	if (unlikely(nr_extents < 0)) {
> -		error = nr_extents;
> +	error = setup_swap_map(si, swap_header, swap_map, maxpages);
> +	if (error)
>  		goto bad_swap_unlock_inode;
> -	}
>  
>  	/*
>  	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
> -- 
> 2.30.0
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
  2025-05-25 18:44   ` Kairui Song
@ 2025-05-30  2:55     ` Baoquan He
  2025-06-11  8:27     ` Kemeng Shi
  1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30  2:55 UTC (permalink / raw)
  To: Kairui Song; +Cc: Kemeng Shi, akpm, hannes, linux-mm, linux-kernel

On 05/26/25 at 02:44am, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> >
> > In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> > As maxpages might be < last_page, setup_clusters() will encounter a
> > buffer overflow when a badpage is >= maxpages.
> > Only call inc_cluster_info_page() for badpage which is < maxpages to
> > fix the issue.
> >
> > Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> > Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > ---
> >  mm/swapfile.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index a82f4ebefca3..63ab9f14b2c6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >          * and the EOF part of the last cluster.
> >          */
> >         inc_cluster_info_page(si, cluster_info, 0);
> > -       for (i = 0; i < swap_header->info.nr_badpages; i++)
> > -               inc_cluster_info_page(si, cluster_info,
> > -                                     swap_header->info.badpages[i]);
> > +       for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > +               unsigned int page_nr = swap_header->info.badpages[i];
> > +
> > +               if (page_nr >= maxpages)
> > +                       continue;
> > +               inc_cluster_info_page(si, cluster_info, page_nr);
> 
> I think we might need a pr_err or pr_warn here, this means mkswap
> marked the wrong region as a bad block? Or some fs side things went
> wrong.

There's aready warning in read_swap_header():
static unsigned long read_swap_header(struct swap_info_struct *si,
                                        union swap_header *swap_header,
                                        struct inode *inode)
{
	......
	if (last_page > maxpages) {
                pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
                        K(maxpages), K(last_page));
        }
	...
}

And if we add pr_err|warn here, we also need add it in setup_swap_map()
when filling swap_map.

> 
> 
> > +       }
> >         for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> >                 inc_cluster_info_page(si, cluster_info, i);
> >
> > --
> > 2.30.0
> >
> >
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
  2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
  2025-05-25 18:44   ` Kairui Song
@ 2025-05-30  2:56   ` Baoquan He
  1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30  2:56 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel

On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
> As maxpages might be < last_page, setup_clusters() will encounter a
> buffer overflow when a badpage is >= maxpages.
> Only call inc_cluster_info_page() for badpage which is < maxpages to
> fix the issue.
> 
> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a82f4ebefca3..63ab9f14b2c6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  	 * and the EOF part of the last cluster.
>  	 */
>  	inc_cluster_info_page(si, cluster_info, 0);
> -	for (i = 0; i < swap_header->info.nr_badpages; i++)
> -		inc_cluster_info_page(si, cluster_info,
> -				      swap_header->info.badpages[i]);
> +	for (i = 0; i < swap_header->info.nr_badpages; i++) {
> +		unsigned int page_nr = swap_header->info.badpages[i];
> +
> +		if (page_nr >= maxpages)
> +			continue;
> +		inc_cluster_info_page(si, cluster_info, page_nr);
> +	}
>  	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
>  		inc_cluster_info_page(si, cluster_info, i);
>  
> -- 
> 2.30.0
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry()
  2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
  2025-05-25 17:05   ` Kairui Song
@ 2025-05-30  5:24   ` Baoquan He
  1 sibling, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-05-30  5:24 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel

On 05/22/25 at 08:25pm, Kemeng Shi wrote:
> As cluster_next_cpu was already dropped, the associated comment is stale
> now.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> ---
>  mm/swapfile.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 63ab9f14b2c6..8525515fb06c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -956,9 +956,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  	}
>  
>  	/*
> -	 * We don't have free cluster but have some clusters in
> -	 * discarding, do discard now and reclaim them, then
> -	 * reread cluster_next_cpu since we dropped si->lock
> +	 * We don't have free cluster but have some clusters in discarding,
> +	 * do discard now and reclaim them.
>  	 */
>  	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
>  		goto new_cluster;
> -- 
> 2.30.0
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-05-25 17:08   ` Kairui Song
@ 2025-06-11  7:54     ` Kemeng Shi
  2025-07-17 23:21       ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11  7:54 UTC (permalink / raw)
  To: Kairui Song; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel



on 5/26/2025 1:08 AM, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>> We use maxpages from read_swap_header() to initialize swap_info_struct,
>> however the maxpages might be reduced in setup_swap_extents() and the
>> si->max is assigned with the reduced maxpages from the
>> setup_swap_extents().
>>
>> Obviously, this could lead to memory waste as we allocated memory based on
>> larger maxpages, besides, this could lead to a potensial deadloop as
>> following:
>> 1) When calling setup_clusters() with larger maxpages, unavailable pages
>> within range [si->max, larger maxpages) are not accounted with
>> inc_cluster_info_page(). As a result, these pages are assumed available
>> but can not be allocated. The cluster contains these pages can be moved
>> to frag_clusters list after it's all available pages were allocated.
>> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
>> list, cluster_alloc_swap_entry() assume order 0 allocation will never
>> failed and will enter a deadloop by keep trying to allocate page from the
>> only cluster in frag_clusters which contains no actually available page.
>>
>> Call setup_swap_extents() to get the final maxpages before swap_info_struct
>> initialization to fix the issue.
>>
>> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>>
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>>  1 file changed, 20 insertions(+), 27 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 75b69213c2e7..a82f4ebefca3 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>>         return maxpages;
>>  }
>>
>> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
>> -                                       union swap_header *swap_header,
>> -                                       unsigned char *swap_map,
>> -                                       unsigned long maxpages,
>> -                                       sector_t *span)
>> +static int setup_swap_map(struct swap_info_struct *si,
>> +                         union swap_header *swap_header,
>> +                         unsigned char *swap_map,
>> +                         unsigned long maxpages)
>>  {
>> -       unsigned int nr_good_pages;
>>         unsigned long i;
>> -       int nr_extents;
>> -
>> -       nr_good_pages = maxpages - 1;   /* omit header page */
>>
>> +       swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
>>                 unsigned int page_nr = swap_header->info.badpages[i];
>>                 if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>                         return -EINVAL;
>>                 if (page_nr < maxpages) {
>>                         swap_map[page_nr] = SWAP_MAP_BAD;
>> -                       nr_good_pages--;
>> +                       si->pages--;
>>                 }
>>         }
>>
>> -       if (nr_good_pages) {
>> -               swap_map[0] = SWAP_MAP_BAD;
>> -               si->max = maxpages;
>> -               si->pages = nr_good_pages;
>> -               nr_extents = setup_swap_extents(si, span);
>> -               if (nr_extents < 0)
>> -                       return nr_extents;
>> -               nr_good_pages = si->pages;
>> -       }
>> -       if (!nr_good_pages) {
>> +       if (!si->pages) {
>>                 pr_warn("Empty swap-file\n");
>>                 return -EINVAL;
>>         }
>>
>>
>> -       return nr_extents;
>> +       return 0;
>>  }
>>
>>  #define SWAP_CLUSTER_INFO_COLS                                         \
>> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>>          * Mark unusable pages as unavailable. The clusters aren't
>>          * marked free yet, so no list operations are involved yet.
>>          *
>> -        * See setup_swap_map_and_extents(): header page, bad pages,
>> +        * See setup_swap_map(): header page, bad pages,
>>          * and the EOF part of the last cluster.
>>          */
>>         inc_cluster_info_page(si, cluster_info, 0);
>> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>                 goto bad_swap_unlock_inode;
>>         }
>>
>> +       si->max = maxpages;
>> +       si->pages = maxpages - 1;
>> +       nr_extents = setup_swap_extents(si, &span);
>> +       if (nr_extents < 0) {
>> +               error = nr_extents;
>> +               goto bad_swap_unlock_inode;
>> +       }
>> +       maxpages = si->max;
> 
Hello,
> There seems to be a trivial problem here, previously the si->pages
> will be seen by swap_activate after bad blocks have been counted and
> si->pages means the actual available slots. But now si->pages will be
> seen by swap_active as `maxpages - 1`.
> 
> One current side effect now is the span value will not be updated
> properly so the pr_info in swap on may print a larger value, if the
> swap header contains badblocks and swapfile is on nfs/cifs.
Thanks for point this out. But I think the larger value is actually
correct result.
In summary, there are two kinds of swapfile_activate operations.
1. Filesystem style: Treat all blocks logical continuity and find
useable physical extents in logical range. In this way, si->pages
will be actual useable physical blocks and span will be "1 +
highest_block - lowest_block".
2. Block device style: Treat all blocks physically continue and
only one single extent is added. In this way, si->pages will be
si->max and span will be "si->pages - 1".
Actually, si->pages and si->max is only used in block device style
and span value is set with si->pages. As a result, span value in
block device style will become a larger value as you mentioned.

I think larger value is correct based on:
1. Span value in filesystem style is "1 + highest_block -
lowest_block" which is the range cover all possible phisical blocks
including the badblocks.
2. For block device style, si->pages is the actual useable block
number and is already in pr_info. The orignal span value before
this patch is also refer to useable block number which is redundant
in pr_info.
> 
> This should not be a problem but it's better to mention or add
> comments about it
I'd like to mention this change as a fix in changelog in next version.
> 
> And I think it's better to add a sanity check here to check if
> si->pages still equal to si->max - 1,  setup_swap_map_and_extents /
> setup_swap_map assumes the header section was already counted. This
> also helps indicate the setup_swap_extents may shrink and modify these
> two values.
Sure, will add this in next version.
> 
> BTW, I was thinking that we should get rid of the whole extents design
> after the swap table series is ready, so mTHP allocation will be
> usable for swap over fs too.
I also noticed this limitation but have not taken a deep look. Look
forward to your solution in future.
> 
>>         /* OK, set up the swap map and apply the bad block list */
>>         swap_map = vzalloc(maxpages);
>>         if (!swap_map) {
>> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>         if (error)
>>                 goto bad_swap_unlock_inode;
>>
>> -       nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
>> -                                               maxpages, &span);
>> -       if (unlikely(nr_extents < 0)) {
>> -               error = nr_extents;
>> +       error = setup_swap_map(si, swap_header, swap_map, maxpages);
>> +       if (error)
>>                 goto bad_swap_unlock_inode;
>> -       }
>>
>>         /*
>>          * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
>> --
>> 2.30.0
>>
> 
> Other than that:
> 
> Reviewed-by: Kairui Song <kasong@tencent.com>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters()
  2025-05-25 18:44   ` Kairui Song
  2025-05-30  2:55     ` Baoquan He
@ 2025-06-11  8:27     ` Kemeng Shi
  1 sibling, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11  8:27 UTC (permalink / raw)
  To: Kairui Song; +Cc: akpm, bhe, hannes, linux-mm, linux-kernel



on 5/26/2025 2:44 AM, Kairui Song wrote:
> On Thu, May 22, 2025 at 11:32 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>>
>> In setup_swap_map(), we only ensure badpages are in range (0, last_page].
>> As maxpages might be < last_page, setup_clusters() will encounter a
>> buffer overflow when a badpage is >= maxpages.
>> Only call inc_cluster_info_page() for badpage which is < maxpages to
>> fix the issue.
>>
>> Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/swapfile.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index a82f4ebefca3..63ab9f14b2c6 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>>          * and the EOF part of the last cluster.
>>          */
>>         inc_cluster_info_page(si, cluster_info, 0);
>> -       for (i = 0; i < swap_header->info.nr_badpages; i++)
>> -               inc_cluster_info_page(si, cluster_info,
>> -                                     swap_header->info.badpages[i]);
>> +       for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> +               unsigned int page_nr = swap_header->info.badpages[i];
>> +
>> +               if (page_nr >= maxpages)
>> +                       continue;
>> +               inc_cluster_info_page(si, cluster_info, page_nr);
> 
> I think we might need a pr_err or pr_warn here, this means mkswap
> marked the wrong region as a bad block? Or some fs side things went
> wrong.
As Baoquan metioned that there is already warning in read_swap_header().
Besides, I think last_page in swap_header already indicates the range
which is acceptable to fs and we only need to explicitly handle page_nr
which is > last_page.
> 
> 
>> +       }
>>         for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
>>                 inc_cluster_info_page(si, cluster_info, i);
>>
>> --
>> 2.30.0
>>
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-05-30  2:50   ` Baoquan He
@ 2025-06-11  8:27     ` Kemeng Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-06-11  8:27 UTC (permalink / raw)
  To: Baoquan He; +Cc: akpm, kasong, hannes, linux-mm, linux-kernel



on 5/30/2025 10:50 AM, Baoquan He wrote:
> On 05/22/25 at 08:25pm, Kemeng Shi wrote:
>> We use maxpages from read_swap_header() to initialize swap_info_struct,
>> however the maxpages might be reduced in setup_swap_extents() and the
>> si->max is assigned with the reduced maxpages from the
>> setup_swap_extents().
>> Obviously, this could lead to memory waste as we allocated memory based on
>> larger maxpages, besides, this could lead to a potensial deadloop as
>                                                  ^ typo, potential
Thanks, will fix this in next version.
>> following:
>> 1) When calling setup_clusters() with larger maxpages, unavailable pages
>> within range [si->max, larger maxpages) are not accounted with
>> inc_cluster_info_page(). As a result, these pages are assumed available
>> but can not be allocated. The cluster contains these pages can be moved
>> to frag_clusters list after it's all available pages were allocated.
>> 2) When the cluster mentioned in 1) is the only cluster in frag_clusters
>> list, cluster_alloc_swap_entry() assume order 0 allocation will never
>> failed and will enter a deadloop by keep trying to allocate page from the
>> only cluster in frag_clusters which contains no actually available page.
>>
>> Call setup_swap_extents() to get the final maxpages before swap_info_struct
>> initialization to fix the issue.
>>
>> Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
>> ---
>>  mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
>>  1 file changed, 20 insertions(+), 27 deletions(-)
> 
> Reviedwed-by: Baoquan He <bhe@redhat.com>
> 
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 75b69213c2e7..a82f4ebefca3 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>>  	return maxpages;
>>  }
>>  
>> -static int setup_swap_map_and_extents(struct swap_info_struct *si,
>> -					union swap_header *swap_header,
>> -					unsigned char *swap_map,
>> -					unsigned long maxpages,
>> -					sector_t *span)
>> +static int setup_swap_map(struct swap_info_struct *si,
>> +			  union swap_header *swap_header,
>> +			  unsigned char *swap_map,
>> +			  unsigned long maxpages)
>>  {
>> -	unsigned int nr_good_pages;
>>  	unsigned long i;
>> -	int nr_extents;
>> -
>> -	nr_good_pages = maxpages - 1;	/* omit header page */
>>  
>> +	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>>  		unsigned int page_nr = swap_header->info.badpages[i];
>>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>  			return -EINVAL;
>>  		if (page_nr < maxpages) {
>>  			swap_map[page_nr] = SWAP_MAP_BAD;
>> -			nr_good_pages--;
>> +			si->pages--;
>>  		}
>>  	}
>>  
>> -	if (nr_good_pages) {
>> -		swap_map[0] = SWAP_MAP_BAD;
>> -		si->max = maxpages;
>> -		si->pages = nr_good_pages;
>> -		nr_extents = setup_swap_extents(si, span);
>> -		if (nr_extents < 0)
>> -			return nr_extents;
>> -		nr_good_pages = si->pages;
>> -	}
>> -	if (!nr_good_pages) {
>> +	if (!si->pages) {
>>  		pr_warn("Empty swap-file\n");
>>  		return -EINVAL;
>>  	}
>>  
>> -	return nr_extents;
>> +	return 0;
>>  }
>>  
>>  #define SWAP_CLUSTER_INFO_COLS						\
>> @@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>>  	 * Mark unusable pages as unavailable. The clusters aren't
>>  	 * marked free yet, so no list operations are involved yet.
>>  	 *
>> -	 * See setup_swap_map_and_extents(): header page, bad pages,
>> +	 * See setup_swap_map(): header page, bad pages,
>>  	 * and the EOF part of the last cluster.
>>  	 */
>>  	inc_cluster_info_page(si, cluster_info, 0);
>> @@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		goto bad_swap_unlock_inode;
>>  	}
>>  
>> +	si->max = maxpages;
>> +	si->pages = maxpages - 1;
>> +	nr_extents = setup_swap_extents(si, &span);
>> +	if (nr_extents < 0) {
>> +		error = nr_extents;
>> +		goto bad_swap_unlock_inode;
>> +	}
>> +	maxpages = si->max;
>> +
>>  	/* OK, set up the swap map and apply the bad block list */
>>  	swap_map = vzalloc(maxpages);
>>  	if (!swap_map) {
>> @@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  	if (error)
>>  		goto bad_swap_unlock_inode;
>>  
>> -	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
>> -						maxpages, &span);
>> -	if (unlikely(nr_extents < 0)) {
>> -		error = nr_extents;
>> +	error = setup_swap_map(si, swap_header, swap_map, maxpages);
>> +	if (error)
>>  		goto bad_swap_unlock_inode;
>> -	}
>>  
>>  	/*
>>  	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
>> -- 
>> 2.30.0
>>
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-06-11  7:54     ` Kemeng Shi
@ 2025-07-17 23:21       ` Andrew Morton
  2025-07-18  6:12         ` Kemeng Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2025-07-17 23:21 UTC (permalink / raw)
  To: Kemeng Shi; +Cc: Kairui Song, bhe, hannes, linux-mm, linux-kernel

On Wed, 11 Jun 2025 15:54:21 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:

> 
> 
> on 5/26/2025 1:08 AM, Kairui Song wrote:

Nearly two months!

> Sure, will add this in next version.

Do we actually need a new version?  Having rescanned the v1 review I'm
inclined to merge this series as-is?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
  2025-07-17 23:21       ` Andrew Morton
@ 2025-07-18  6:12         ` Kemeng Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Kemeng Shi @ 2025-07-18  6:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Kairui Song, bhe, hannes, linux-mm, linux-kernel



on 7/18/2025 7:21 AM, Andrew Morton wrote:
> On Wed, 11 Jun 2025 15:54:21 +0800 Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> 
>>
>>
>> on 5/26/2025 1:08 AM, Kairui Song wrote:
> 
> Nearly two months!
> 
>> Sure, will add this in next version.
> 
> Do we actually need a new version?  Having rescanned the v1 review I'm
> inclined to merge this series as-is?
> 
> 
So sorry for the late. I will write and send a v2 version soon.



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-07-18  6:13 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 12:25 [PATCH 0/4] Some randome fixes and cleanups to swapfile Kemeng Shi
2025-05-22 12:25 ` [PATCH 1/4] mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc() Kemeng Shi
2025-05-22  3:55   ` Kairui Song
2025-05-30  1:31   ` Baoquan He
2025-05-22 12:25 ` [PATCH 2/4] mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop Kemeng Shi
2025-05-25 17:08   ` Kairui Song
2025-06-11  7:54     ` Kemeng Shi
2025-07-17 23:21       ` Andrew Morton
2025-07-18  6:12         ` Kemeng Shi
2025-05-30  2:50   ` Baoquan He
2025-06-11  8:27     ` Kemeng Shi
2025-05-22 12:25 ` [PATCH 3/4] mm: swap: fix potensial buffer overflow in setup_clusters() Kemeng Shi
2025-05-25 18:44   ` Kairui Song
2025-05-30  2:55     ` Baoquan He
2025-06-11  8:27     ` Kemeng Shi
2025-05-30  2:56   ` Baoquan He
2025-05-22 12:25 ` [PATCH 4/4] mm: swap: remove stale comment stale comment in cluster_alloc_swap_entry() Kemeng Shi
2025-05-25 17:05   ` Kairui Song
2025-05-30  5:24   ` Baoquan He
2025-05-22 21:41 ` [PATCH 0/4] Some randome fixes and cleanups to swapfile Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).