* [PATCH] mm, swap: avoid leaving unused extend table after alloc race
@ 2026-05-13 9:21 ` Kairui Song
0 siblings, 0 replies; 4+ messages in thread
From: Kairui Song via B4 Relay @ 2026-05-13 9:21 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Breno Leitao, Chris Li, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Kairui Song, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Allocating an extend table requires dropping the ci lock first.
While the lock is dropped, a concurrent put can decrease the slot's
swap count to a value that is no longer maxed out, so the extend
table is no longer required. The current allocation path still
attach the new extend table to the cluster anyway, leaving it unused.
It's not really leaked, the next maxed out count on the same cluster
reuses the table, and frees it properly. Swapoff will also clean it up.
The worst case is one unused page pinned per cluster until the next
maxed-out allocation or swapoff. To eliminate the waste, re-check
under the ci lock that the extend table is still needed before
publishing it, and free the local allocation otherwise. The added
overhead is ignorable.
Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4840fd40f36f..451d20bb9f47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1442,8 +1442,10 @@ static bool swap_sync_discard(void)
}
static int swap_extend_table_alloc(struct swap_info_struct *si,
- struct swap_cluster_info *ci, gfp_t gfp)
+ struct swap_cluster_info *ci,
+ unsigned int ci_off, gfp_t gfp)
{
+ int count;
void *table;
table = kzalloc(sizeof(ci->extend_table[0]) * SWAPFILE_CLUSTER, gfp);
@@ -1451,11 +1453,27 @@ static int swap_extend_table_alloc(struct swap_info_struct *si,
return -ENOMEM;
spin_lock(&ci->lock);
- if (!ci->extend_table)
- ci->extend_table = table;
- else
- kfree(table);
+ /*
+ * Extend table allocation requires releasing ci lock first so it's
+ * possible that the slot has been freed, no longer overflowed, or
+ * a concurrent extend table allocation has already succeeded, so
+ * the allocation is no longer needed.
+ */
+ if (!cluster_table_is_alloced(ci))
+ goto out_free;
+ count = swp_tb_get_count(__swap_table_get(ci, ci_off));
+ if (count < (SWP_TB_COUNT_MAX - 1))
+ goto out_free;
+ if (ci->extend_table)
+ goto out_free;
+
+ ci->extend_table = table;
+ spin_unlock(&ci->lock);
+ return 0;
+
+out_free:
spin_unlock(&ci->lock);
+ kfree(table);
return 0;
}
@@ -1471,7 +1489,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
return 0;
ci = __swap_offset_to_cluster(si, offset);
- ret = swap_extend_table_alloc(si, ci, gfp);
+ ret = swap_extend_table_alloc(si, ci, swp_cluster_offset(entry), gfp);
put_swap_device(si);
return ret;
@@ -1664,7 +1682,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
if (unlikely(err)) {
if (err == -ENOMEM) {
spin_unlock(&ci->lock);
- err = swap_extend_table_alloc(si, ci, GFP_ATOMIC);
+ err = swap_extend_table_alloc(si, ci, ci_off, GFP_ATOMIC);
spin_lock(&ci->lock);
if (!err)
goto restart;
---
base-commit: 972c53e0ec3abfc6f5fe2cb503640710fb23cf95
change-id: 20260512-swap-extend-table-fix-ed7d1f458dac
Best regards,
--
Kairui Song <kasong@tencent.com>
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH] mm, swap: avoid leaving unused extend table after alloc race
@ 2026-05-13 9:21 ` Kairui Song
0 siblings, 0 replies; 4+ messages in thread
From: Kairui Song @ 2026-05-13 9:21 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Breno Leitao, Chris Li, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Kairui Song, linux-kernel,
Kairui Song
Allocating an extend table requires dropping the ci lock first.
While the lock is dropped, a concurrent put can decrease the slot's
swap count to a value that is no longer maxed out, so the extend
table is no longer required. The current allocation path still
attach the new extend table to the cluster anyway, leaving it unused.
It's not really leaked, the next maxed out count on the same cluster
reuses the table, and frees it properly. Swapoff will also clean it up.
The worst case is one unused page pinned per cluster until the next
maxed-out allocation or swapoff. To eliminate the waste, re-check
under the ci lock that the extend table is still needed before
publishing it, and free the local allocation otherwise. The added
overhead is ignorable.
Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4840fd40f36f..451d20bb9f47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1442,8 +1442,10 @@ static bool swap_sync_discard(void)
}
static int swap_extend_table_alloc(struct swap_info_struct *si,
- struct swap_cluster_info *ci, gfp_t gfp)
+ struct swap_cluster_info *ci,
+ unsigned int ci_off, gfp_t gfp)
{
+ int count;
void *table;
table = kzalloc(sizeof(ci->extend_table[0]) * SWAPFILE_CLUSTER, gfp);
@@ -1451,11 +1453,27 @@ static int swap_extend_table_alloc(struct swap_info_struct *si,
return -ENOMEM;
spin_lock(&ci->lock);
- if (!ci->extend_table)
- ci->extend_table = table;
- else
- kfree(table);
+ /*
+ * Extend table allocation requires releasing ci lock first so it's
+ * possible that the slot has been freed, no longer overflowed, or
+ * a concurrent extend table allocation has already succeeded, so
+ * the allocation is no longer needed.
+ */
+ if (!cluster_table_is_alloced(ci))
+ goto out_free;
+ count = swp_tb_get_count(__swap_table_get(ci, ci_off));
+ if (count < (SWP_TB_COUNT_MAX - 1))
+ goto out_free;
+ if (ci->extend_table)
+ goto out_free;
+
+ ci->extend_table = table;
+ spin_unlock(&ci->lock);
+ return 0;
+
+out_free:
spin_unlock(&ci->lock);
+ kfree(table);
return 0;
}
@@ -1471,7 +1489,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
return 0;
ci = __swap_offset_to_cluster(si, offset);
- ret = swap_extend_table_alloc(si, ci, gfp);
+ ret = swap_extend_table_alloc(si, ci, swp_cluster_offset(entry), gfp);
put_swap_device(si);
return ret;
@@ -1664,7 +1682,7 @@ static int swap_dup_entries_cluster(struct swap_info_struct *si,
if (unlikely(err)) {
if (err == -ENOMEM) {
spin_unlock(&ci->lock);
- err = swap_extend_table_alloc(si, ci, GFP_ATOMIC);
+ err = swap_extend_table_alloc(si, ci, ci_off, GFP_ATOMIC);
spin_lock(&ci->lock);
if (!err)
goto restart;
---
base-commit: 972c53e0ec3abfc6f5fe2cb503640710fb23cf95
change-id: 20260512-swap-extend-table-fix-ed7d1f458dac
Best regards,
--
Kairui Song <kasong@tencent.com>
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: [PATCH] mm, swap: avoid leaving unused extend table after alloc race
2026-05-13 9:21 ` Kairui Song
(?)
@ 2026-05-13 12:49 ` Breno Leitao
-1 siblings, 0 replies; 4+ messages in thread
From: Breno Leitao @ 2026-05-13 12:49 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Kairui Song, linux-kernel
On Wed, May 13, 2026 at 05:21:11PM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Allocating an extend table requires dropping the ci lock first.
> While the lock is dropped, a concurrent put can decrease the slot's
> swap count to a value that is no longer maxed out, so the extend
> table is no longer required. The current allocation path still
> attach the new extend table to the cluster anyway, leaving it unused.
>
> It's not really leaked, the next maxed out count on the same cluster
> reuses the table, and frees it properly. Swapoff will also clean it up.
>
> The worst case is one unused page pinned per cluster until the next
> maxed-out allocation or swapoff. To eliminate the waste, re-check
> under the ci lock that the extend table is still needed before
> publishing it, and free the local allocation otherwise. The added
> overhead is ignorable.
>
> Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count")
> Reported-by: Breno Leitao <leitao@debian.org>
> Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/
> Signed-off-by: Kairui Song <kasong@tencent.com>
Tested-by: Breno Leitao <leitao@debian.org>
Thanks for the fix,
--breno
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH] mm, swap: avoid leaving unused extend table after alloc race
2026-05-13 9:21 ` Kairui Song
(?)
(?)
@ 2026-05-15 8:21 ` Kairui Song
-1 siblings, 0 replies; 4+ messages in thread
From: Kairui Song @ 2026-05-15 8:21 UTC (permalink / raw)
To: linux-mm
Cc: kasong, Andrew Morton, Breno Leitao, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, Youngjun Park, linux-kernel
Forwarding sashiko review:
> > @@ -1451,11 +1453,27 @@ static int swap_extend_table_alloc(struct swap_info_struct *si,
> > return -ENOMEM;
> >
> > spin_lock(&ci->lock);
> > - if (!ci->extend_table)
> > - ci->extend_table = table;
> > - else
> > - kfree(table);
> > + /*
> > + * Extend table allocation requires releasing ci lock first so it's
> > + * possible that the slot has been freed, no longer overflowed, or
> > + * a concurrent extend table allocation has already succeeded, so
> > + * the allocation is no longer needed.
> > + */
> > + if (!cluster_table_is_alloced(ci))
> > + goto out_free;
> > + count = swp_tb_get_count(__swap_table_get(ci, ci_off));
> > + if (count < (SWP_TB_COUNT_MAX - 1))
> > + goto out_free;
>
> Is it possible for the extend table to still be leaked here due to a
> subsequent race?
> For example, if swap_retry_table_alloc() publishes the table because count is
> still SWP_TB_COUNT_MAX - 1:
> 1. Thread A in swap_retry_table_alloc() publishes the extend_table.
> 2. Before Thread A reacquires the page table lock and retries, Thread B
> unmaps the PTE.
> 3. Thread B calls swap_free(), decrementing the count from
> SWP_TB_COUNT_MAX - 1 to SWP_TB_COUNT_MAX - 2.
> 4. __swap_cluster_put_entry() only frees the extend table when the count
> decrements from SWP_TB_COUNT_MAX to SWP_TB_COUNT_MAX - 1, so the table
> remains attached.
> 5. Thread A resumes, sees pte_none, and skips the entry.
> Since the count never reached SWP_TB_COUNT_MAX, does the table remain
> permanently attached until the cluster is freed, triggering a WARN_ON_ONCE
> and a memory leak?
That seems possible indeed. I can adjust this a bit and avoid that
potential race too. The change is minor, I will send V2 shortly.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 451d20bb9f47..365b4caeef4b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1536,13 +1536,21 @@ static void __swap_cluster_put_entry(struct
swap_cluster_info *ci,
if (count == (SWP_TB_COUNT_MAX - 1)) {
ci->extend_table[ci_off] = 0;
__swap_table_set(ci, ci_off,
__swp_tb_mk_count(swp_tb, count));
- swap_extend_table_try_free(ci);
} else {
ci->extend_table[ci_off] = count;
}
} else {
__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb,
--count));
}
+
+ /*
+ * `SWP_TB_COUNT_MAX - 1` triggers extend table allocation. If the
+ * count was above that then the extend table is no longer needed. And
+ * if we just put the count value from that value, it's possible that
+ * a pending dup just attached a extend table.
+ */
+ if (unlikely(count == SWP_TB_COUNT_MAX - 2 || count ==
SWP_TB_COUNT_MAX - 1))
+ swap_extend_table_try_free(ci);
}
^ permalink raw reply related [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-05-15 8:22 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-13 9:21 [PATCH] mm, swap: avoid leaving unused extend table after alloc race Kairui Song via B4 Relay
2026-05-13 9:21 ` Kairui Song
2026-05-13 12:49 ` Breno Leitao
2026-05-15 8:21 ` Kairui Song
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.