From: Kairui Song <ryncsn@gmail.com>
To: Youngjun Park <youngjun.park@lge.com>
Cc: rafael@kernel.org, akpm@linux-foundation.org, chrisl@kernel.org,
kasong@tencent.com, pavel@kernel.org, shikemeng@huaweicloud.com,
nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org,
usama.arif@linux.dev, linux-pm@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference
Date: Fri, 20 Mar 2026 00:34:02 +0800 [thread overview]
Message-ID: <abp7aDgYLrxF3Me8@KASONG-MC4> (raw)
In-Reply-To: <20260317181318.2517015-2-youngjun.park@lge.com>
On Wed, Mar 18, 2026 at 03:13:16AM +0800, Youngjun Park wrote:
> Hibernation via uswsusp (/dev/snapshot ioctls) has a race: between
> setting the resume swap area and allocating a swap slot, user-space is
> not yet frozen, so swapoff can run and cause an incorrect slot allocation.
>
> Fix this by keeping swap_type_of() as a static helper that requires
> swap_lock to be held, and introducing new interfaces that wrap it with
> proper locking and reference management:
>
> - get_hibernation_swap_type(): Lookup under swap_lock + acquire a swap
> device reference to block swapoff (used by uswsusp).
> - find_hibernation_swap_type(): Lookup under swap_lock only, no
> reference. Used by the sysfs path where user-space is already frozen,
> making swapoff impossible.
> - put_hibernation_swap_type(): Release the reference.
>
> Because the reference is held via get_swap_device(), swapoff will block
> at wait_for_completion_interruptible() until put_hibernation_swap_type()
> releases it. The wait is interruptible, so swapoff can be cancelled by
> a signal.
>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> include/linux/swap.h | 4 +-
> kernel/power/swap.c | 2 +-
> kernel/power/user.c | 15 ++++++--
> mm/swapfile.c | 92 ++++++++++++++++++++++++++++++++++++--------
> 4 files changed, 92 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7a09df6977a5..cf8cfdaf34a7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -433,7 +433,9 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -int swap_type_of(dev_t device, sector_t offset);
> +int get_hibernation_swap_type(dev_t device, sector_t offset);
> +int find_hibernation_swap_type(dev_t device, sector_t offset);
> +void put_hibernation_swap_type(int type);
Hi Youngjun,
Thanks! This set of API looks better that before.
>
> -/*
> - * Find the swap type that corresponds to given device (if any).
> - *
> - * @offset - number of the PAGE_SIZE-sized block of the device, starting
> - * from 0, in which the swap header is expected to be located.
> - *
> - * This is needed for the suspend to disk (aka swsusp).
> - */
> -int swap_type_of(dev_t device, sector_t offset)
> +static int swap_type_of(dev_t device, sector_t offset)
Maybe rename it while at it, it's a internal helper now and
no one is using it.
> {
> int type;
>
> + lockdep_assert_held(&swap_lock);
> +
> if (!device)
> return -1;
>
> - spin_lock(&swap_lock);
> for (type = 0; type < nr_swapfiles; type++) {
> struct swap_info_struct *sis = swap_info[type];
>
> @@ -2164,16 +2157,70 @@ int swap_type_of(dev_t device, sector_t offset)
> if (device == sis->bdev->bd_dev) {
> struct swap_extent *se = first_se(sis);
>
> - if (se->start_block == offset) {
> - spin_unlock(&swap_lock);
> + if (se->start_block == offset)
> return type;
> - }
> }
> }
> - spin_unlock(&swap_lock);
> return -ENODEV;
> }
>
> +/*
> + * Finds the swap type and safely acquires a reference to the swap device
> + * to prevent race conditions with swapoff.
> + *
> + * This should be used in environments like uswsusp where a race condition
> + * exists between configuring the resume device and allocating a swap slot.
> + * For sysfs hibernation where user-space is frozen (making swapoff
> + * impossible), use find_hibernation_swap_type() instead.
> + *
> + * The caller must drop the reference using put_hibernation_swap_type().
> + */
You can follow the kernel doc format.
> +int get_hibernation_swap_type(dev_t device, sector_t offset)
> +{
> + int type;
> + struct swap_info_struct *sis;
> +
> + spin_lock(&swap_lock);
> + type = swap_type_of(device, offset);
> + sis = swap_type_to_info(type);
> + if (!sis || !get_swap_device_info(sis))
> + type = -1;
> +
> + spin_unlock(&swap_lock);
> + return type;
> +}
> +
> +/*
> + * Drops the reference to the swap device previously acquired by
> + * get_hibernation_swap_type().
> + */
> +void put_hibernation_swap_type(int type)
> +{
> + struct swap_info_struct *sis;
> +
> + sis = swap_type_to_info(type);
> + if (!sis)
> + return;
I think it should never see sis == NULL here?
> +
> + put_swap_device(sis);
> +}
> +
> +/*
> + * Simple lookup without acquiring a reference. Used by the sysfs
> + * hibernation path where user-space is already frozen, making
> + * swapoff impossible.
> + */
> +int find_hibernation_swap_type(dev_t device, sector_t offset)
> +{
> + int type;
> +
> + spin_lock(&swap_lock);
> + type = swap_type_of(device, offset);
> + spin_unlock(&swap_lock);
> +
> + return type;
> +}
> +
> int find_first_swap(dev_t *device)
> {
> int type;
> @@ -2971,10 +3018,23 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> * spinlock) will be waited too. This makes it easy to
> * prevent folio_test_swapcache() and the following swap cache
> * operations from racing with swapoff.
> + *
> + * Note: if a hibernation session is actively holding a swap
> + * device reference, swapoff will block here until the reference
> + * is released via put_hibernation_swap_type() or the wait is
> + * interrupted by a signal.
This looks really bad... It means swapoff will hung unless user send a
signal to stop it? How does that happen?
If hibernate may hold a reference without allocating any slot for a
long time, then maybe we better introduce a new si->flags bit to
indicate a device is used by hibernate, so swapoff can abort early
with a -EBUSY.
If hibernate only holds the swap device reference after it allocated
any slot, then we can avoid use a flag but use a hibernate type in
swap table instead:
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 8415ffbe2b9c..025e004fc2e7 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -25,6 +25,7 @@ struct swap_table {
* PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot
* Pointer: |----------- Pointer ----------|100| - (Unused)
* Bad: |------------- 1 -------------|1000| - Bad slot
+ * Exclusive:|---- 0 ----| ------ 1 -----|10000| - Exclusive usage
*
* SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
*
@@ -46,6 +47,8 @@ struct swap_table {
* because only the lower three bits can be used as a marker for 8 bytes
* aligned pointers.
*
+ * - Exclusive usage: e.g. for hibernation.
+ *
* - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
*/
Both readahead and swapoff (unuse scan) will abort upon seeing such slot,
we solve both readahead and swapoff issue together.
How do you think? We can use the flag idea first.
> */
> percpu_ref_kill(&p->users);
> synchronize_rcu();
> - wait_for_completion(&p->comp);
> + err = wait_for_completion_interruptible(&p->comp);
> + if (err) {
> + percpu_ref_resurrect(&p->users);
> + synchronize_rcu();
Do we need a synchronize_rcu here? It's required above because we are
releasing the resources so all RCU readers must exit from grace
period. But here we are aborting and not releasing anything so
there is no such need?
> + reinit_completion(&p->comp);
> + reinsert_swap_info(p);
> + goto out_dput;
> + }
> +
>
> flush_work(&p->discard_work);
> flush_work(&p->reclaim_work);
> --
> 2.34.1
next prev parent reply other threads:[~2026-03-19 16:34 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
2026-03-17 18:13 ` [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
2026-03-19 16:34 ` Kairui Song [this message]
2026-03-20 7:59 ` YoungJun Park
2026-03-17 18:13 ` [PATCH v4 2/3] mm/swap: remove redundant swap device reference in alloc/free Youngjun Park
2026-03-17 18:13 ` [PATCH v4 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
2026-03-17 19:16 ` [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Andrew Morton
2026-03-18 2:16 ` YoungJun Park
2026-03-19 13:33 ` Rafael J. Wysocki
2026-03-19 13:48 ` YoungJun Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=abp7aDgYLrxF3Me8@KASONG-MC4 \
--to=ryncsn@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=kasong@tencent.com \
--cc=linux-mm@kvack.org \
--cc=linux-pm@vger.kernel.org \
--cc=nphamcs@gmail.com \
--cc=pavel@kernel.org \
--cc=rafael@kernel.org \
--cc=shikemeng@huaweicloud.com \
--cc=usama.arif@linux.dev \
--cc=youngjun.park@lge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox