* [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
@ 2026-03-17 18:13 ` Youngjun Park
2026-03-19 16:34 ` Kairui Song
2026-03-17 18:13 ` [PATCH v4 2/3] mm/swap: remove redundant swap device reference in alloc/free Youngjun Park
` (3 subsequent siblings)
4 siblings, 1 reply; 10+ messages in thread
From: Youngjun Park @ 2026-03-17 18:13 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-pm, linux-mm
Hibernation via uswsusp (/dev/snapshot ioctls) has a race: between
setting the resume swap area and allocating a swap slot, user-space is
not yet frozen, so swapoff can run and cause an incorrect slot allocation.
Fix this by keeping swap_type_of() as a static helper that requires
swap_lock to be held, and introducing new interfaces that wrap it with
proper locking and reference management:
- get_hibernation_swap_type(): Lookup under swap_lock + acquire a swap
device reference to block swapoff (used by uswsusp).
- find_hibernation_swap_type(): Lookup under swap_lock only, no
reference. Used by the sysfs path where user-space is already frozen,
making swapoff impossible.
- put_hibernation_swap_type(): Release the reference.
Because the reference is held via get_swap_device(), swapoff will block
at wait_for_completion_interruptible() until put_hibernation_swap_type()
releases it. The wait is interruptible, so swapoff can be cancelled by
a signal.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
include/linux/swap.h | 4 +-
kernel/power/swap.c | 2 +-
kernel/power/user.c | 15 ++++++--
mm/swapfile.c | 92 ++++++++++++++++++++++++++++++++++++--------
4 files changed, 92 insertions(+), 21 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7a09df6977a5..cf8cfdaf34a7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -433,7 +433,9 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-int swap_type_of(dev_t device, sector_t offset);
+int get_hibernation_swap_type(dev_t device, sector_t offset);
+int find_hibernation_swap_type(dev_t device, sector_t offset);
+void put_hibernation_swap_type(int type);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 2e64869bb5a0..cc4764149e8f 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -341,7 +341,7 @@ static int swsusp_swap_check(void)
* This is called before saving the image.
*/
if (swsusp_resume_device)
- res = swap_type_of(swsusp_resume_device, swsusp_resume_block);
+ res = find_hibernation_swap_type(swsusp_resume_device, swsusp_resume_block);
else
res = find_first_swap(&swsusp_resume_device);
if (res < 0)
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4401cfe26e5c..3e41544b99d5 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -71,7 +71,7 @@ static int snapshot_open(struct inode *inode, struct file *filp)
memset(&data->handle, 0, sizeof(struct snapshot_handle));
if ((filp->f_flags & O_ACCMODE) == O_RDONLY) {
/* Hibernating. The image device should be accessible. */
- data->swap = swap_type_of(swsusp_resume_device, 0);
+ data->swap = get_hibernation_swap_type(swsusp_resume_device, 0);
data->mode = O_RDONLY;
data->free_bitmaps = false;
error = pm_notifier_call_chain_robust(PM_HIBERNATION_PREPARE, PM_POST_HIBERNATION);
@@ -90,8 +90,10 @@ static int snapshot_open(struct inode *inode, struct file *filp)
data->free_bitmaps = !error;
}
}
- if (error)
+ if (error) {
+ put_hibernation_swap_type(data->swap);
hibernate_release();
+ }
data->frozen = false;
data->ready = false;
@@ -115,6 +117,7 @@ static int snapshot_release(struct inode *inode, struct file *filp)
data = filp->private_data;
data->dev = 0;
free_all_swap_pages(data->swap);
+ put_hibernation_swap_type(data->swap);
if (data->frozen) {
pm_restore_gfp_mask();
free_basic_memory_bitmaps();
@@ -235,11 +238,17 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
offset = swap_area.offset;
}
+ /*
+ * Put the reference if a swap area was already
+ * set by SNAPSHOT_SET_SWAP_AREA.
+ */
+ put_hibernation_swap_type(data->swap);
+
/*
* User space encodes device types as two-byte values,
* so we need to recode them
*/
- data->swap = swap_type_of(swdev, offset);
+ data->swap = get_hibernation_swap_type(swdev, offset);
if (data->swap < 0)
return swdev ? -ENODEV : -EINVAL;
data->dev = swdev;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 71a7d6959f3e..7baa0f270cff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -134,7 +134,7 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
- if (type >= MAX_SWAPFILES)
+ if (type < 0 || type >= MAX_SWAPFILES)
return NULL;
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
@@ -2139,22 +2139,15 @@ void swap_free_hibernation_slot(swp_entry_t entry)
put_swap_device(si);
}
-/*
- * Find the swap type that corresponds to given device (if any).
- *
- * @offset - number of the PAGE_SIZE-sized block of the device, starting
- * from 0, in which the swap header is expected to be located.
- *
- * This is needed for the suspend to disk (aka swsusp).
- */
-int swap_type_of(dev_t device, sector_t offset)
+static int swap_type_of(dev_t device, sector_t offset)
{
int type;
+ lockdep_assert_held(&swap_lock);
+
if (!device)
return -1;
- spin_lock(&swap_lock);
for (type = 0; type < nr_swapfiles; type++) {
struct swap_info_struct *sis = swap_info[type];
@@ -2164,16 +2157,70 @@ int swap_type_of(dev_t device, sector_t offset)
if (device == sis->bdev->bd_dev) {
struct swap_extent *se = first_se(sis);
- if (se->start_block == offset) {
- spin_unlock(&swap_lock);
+ if (se->start_block == offset)
return type;
- }
}
}
- spin_unlock(&swap_lock);
return -ENODEV;
}
+/*
+ * Finds the swap type and safely acquires a reference to the swap device
+ * to prevent race conditions with swapoff.
+ *
+ * This should be used in environments like uswsusp where a race condition
+ * exists between configuring the resume device and allocating a swap slot.
+ * For sysfs hibernation where user-space is frozen (making swapoff
+ * impossible), use find_hibernation_swap_type() instead.
+ *
+ * The caller must drop the reference using put_hibernation_swap_type().
+ */
+int get_hibernation_swap_type(dev_t device, sector_t offset)
+{
+ int type;
+ struct swap_info_struct *sis;
+
+ spin_lock(&swap_lock);
+ type = swap_type_of(device, offset);
+ sis = swap_type_to_info(type);
+ if (!sis || !get_swap_device_info(sis))
+ type = -1;
+
+ spin_unlock(&swap_lock);
+ return type;
+}
+
+/*
+ * Drops the reference to the swap device previously acquired by
+ * get_hibernation_swap_type().
+ */
+void put_hibernation_swap_type(int type)
+{
+ struct swap_info_struct *sis;
+
+ sis = swap_type_to_info(type);
+ if (!sis)
+ return;
+
+ put_swap_device(sis);
+}
+
+/*
+ * Simple lookup without acquiring a reference. Used by the sysfs
+ * hibernation path where user-space is already frozen, making
+ * swapoff impossible.
+ */
+int find_hibernation_swap_type(dev_t device, sector_t offset)
+{
+ int type;
+
+ spin_lock(&swap_lock);
+ type = swap_type_of(device, offset);
+ spin_unlock(&swap_lock);
+
+ return type;
+}
+
int find_first_swap(dev_t *device)
{
int type;
@@ -2971,10 +3018,23 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
* spinlock) will be waited too. This makes it easy to
* prevent folio_test_swapcache() and the following swap cache
* operations from racing with swapoff.
+ *
+ * Note: if a hibernation session is actively holding a swap
+ * device reference, swapoff will block here until the reference
+ * is released via put_hibernation_swap_type() or the wait is
+ * interrupted by a signal.
*/
percpu_ref_kill(&p->users);
synchronize_rcu();
- wait_for_completion(&p->comp);
+ err = wait_for_completion_interruptible(&p->comp);
+ if (err) {
+ percpu_ref_resurrect(&p->users);
+ synchronize_rcu();
+ reinit_completion(&p->comp);
+ reinsert_swap_info(p);
+ goto out_dput;
+ }
+
flush_work(&p->discard_work);
flush_work(&p->reclaim_work);
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference
2026-03-17 18:13 ` [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
@ 2026-03-19 16:34 ` Kairui Song
2026-03-20 7:59 ` YoungJun Park
0 siblings, 1 reply; 10+ messages in thread
From: Kairui Song @ 2026-03-19 16:34 UTC (permalink / raw)
To: Youngjun Park
Cc: rafael, akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe,
baohua, usama.arif, linux-pm, linux-mm
On Wed, Mar 18, 2026 at 03:13:16AM +0800, Youngjun Park wrote:
> Hibernation via uswsusp (/dev/snapshot ioctls) has a race: between
> setting the resume swap area and allocating a swap slot, user-space is
> not yet frozen, so swapoff can run and cause an incorrect slot allocation.
>
> Fix this by keeping swap_type_of() as a static helper that requires
> swap_lock to be held, and introducing new interfaces that wrap it with
> proper locking and reference management:
>
> - get_hibernation_swap_type(): Lookup under swap_lock + acquire a swap
> device reference to block swapoff (used by uswsusp).
> - find_hibernation_swap_type(): Lookup under swap_lock only, no
> reference. Used by the sysfs path where user-space is already frozen,
> making swapoff impossible.
> - put_hibernation_swap_type(): Release the reference.
>
> Because the reference is held via get_swap_device(), swapoff will block
> at wait_for_completion_interruptible() until put_hibernation_swap_type()
> releases it. The wait is interruptible, so swapoff can be cancelled by
> a signal.
>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> include/linux/swap.h | 4 +-
> kernel/power/swap.c | 2 +-
> kernel/power/user.c | 15 ++++++--
> mm/swapfile.c | 92 ++++++++++++++++++++++++++++++++++++--------
> 4 files changed, 92 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7a09df6977a5..cf8cfdaf34a7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -433,7 +433,9 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -int swap_type_of(dev_t device, sector_t offset);
> +int get_hibernation_swap_type(dev_t device, sector_t offset);
> +int find_hibernation_swap_type(dev_t device, sector_t offset);
> +void put_hibernation_swap_type(int type);
Hi Youngjun,
Thanks! This set of API looks better that before.
>
> -/*
> - * Find the swap type that corresponds to given device (if any).
> - *
> - * @offset - number of the PAGE_SIZE-sized block of the device, starting
> - * from 0, in which the swap header is expected to be located.
> - *
> - * This is needed for the suspend to disk (aka swsusp).
> - */
> -int swap_type_of(dev_t device, sector_t offset)
> +static int swap_type_of(dev_t device, sector_t offset)
Maybe rename it while at it, it's a internal helper now and
no one is using it.
> {
> int type;
>
> + lockdep_assert_held(&swap_lock);
> +
> if (!device)
> return -1;
>
> - spin_lock(&swap_lock);
> for (type = 0; type < nr_swapfiles; type++) {
> struct swap_info_struct *sis = swap_info[type];
>
> @@ -2164,16 +2157,70 @@ int swap_type_of(dev_t device, sector_t offset)
> if (device == sis->bdev->bd_dev) {
> struct swap_extent *se = first_se(sis);
>
> - if (se->start_block == offset) {
> - spin_unlock(&swap_lock);
> + if (se->start_block == offset)
> return type;
> - }
> }
> }
> - spin_unlock(&swap_lock);
> return -ENODEV;
> }
>
> +/*
> + * Finds the swap type and safely acquires a reference to the swap device
> + * to prevent race conditions with swapoff.
> + *
> + * This should be used in environments like uswsusp where a race condition
> + * exists between configuring the resume device and allocating a swap slot.
> + * For sysfs hibernation where user-space is frozen (making swapoff
> + * impossible), use find_hibernation_swap_type() instead.
> + *
> + * The caller must drop the reference using put_hibernation_swap_type().
> + */
You can follow the kernel doc format.
> +int get_hibernation_swap_type(dev_t device, sector_t offset)
> +{
> + int type;
> + struct swap_info_struct *sis;
> +
> + spin_lock(&swap_lock);
> + type = swap_type_of(device, offset);
> + sis = swap_type_to_info(type);
> + if (!sis || !get_swap_device_info(sis))
> + type = -1;
> +
> + spin_unlock(&swap_lock);
> + return type;
> +}
> +
> +/*
> + * Drops the reference to the swap device previously acquired by
> + * get_hibernation_swap_type().
> + */
> +void put_hibernation_swap_type(int type)
> +{
> + struct swap_info_struct *sis;
> +
> + sis = swap_type_to_info(type);
> + if (!sis)
> + return;
I think it should never see sis == NULL here?
> +
> + put_swap_device(sis);
> +}
> +
> +/*
> + * Simple lookup without acquiring a reference. Used by the sysfs
> + * hibernation path where user-space is already frozen, making
> + * swapoff impossible.
> + */
> +int find_hibernation_swap_type(dev_t device, sector_t offset)
> +{
> + int type;
> +
> + spin_lock(&swap_lock);
> + type = swap_type_of(device, offset);
> + spin_unlock(&swap_lock);
> +
> + return type;
> +}
> +
> int find_first_swap(dev_t *device)
> {
> int type;
> @@ -2971,10 +3018,23 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> * spinlock) will be waited too. This makes it easy to
> * prevent folio_test_swapcache() and the following swap cache
> * operations from racing with swapoff.
> + *
> + * Note: if a hibernation session is actively holding a swap
> + * device reference, swapoff will block here until the reference
> + * is released via put_hibernation_swap_type() or the wait is
> + * interrupted by a signal.
This looks really bad... It means swapoff will hung unless user send a
signal to stop it? How does that happen?
If hibernate may hold a reference without allocating any slot for a
long time, then maybe we better introduce a new si->flags bit to
indicate a device is used by hibernate, so swapoff can abort early
with a -EBUSY.
If hibernate only holds the swap device reference after it allocated
any slot, then we can avoid use a flag but use a hibernate type in
swap table instead:
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 8415ffbe2b9c..025e004fc2e7 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -25,6 +25,7 @@ struct swap_table {
* PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot
* Pointer: |----------- Pointer ----------|100| - (Unused)
* Bad: |------------- 1 -------------|1000| - Bad slot
+ * Exclusive:|---- 0 ----| ------ 1 -----|10000| - Exclusive usage
*
* SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
*
@@ -46,6 +47,8 @@ struct swap_table {
* because only the lower three bits can be used as a marker for 8 bytes
* aligned pointers.
*
+ * - Exclusive usage: e.g. for hibernation.
+ *
* - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
*/
Both readahead and swapoff (unuse scan) will abort upon seeing such slot,
we solve both readahead and swapoff issue together.
How do you think? We can use the flag idea first.
> */
> percpu_ref_kill(&p->users);
> synchronize_rcu();
> - wait_for_completion(&p->comp);
> + err = wait_for_completion_interruptible(&p->comp);
> + if (err) {
> + percpu_ref_resurrect(&p->users);
> + synchronize_rcu();
Do we need a synchronize_rcu here? It's required above because we are
releasing the resources so all RCU readers must exit from grace
period. But here we are aborting and not releasing anything so
there is no such need?
> + reinit_completion(&p->comp);
> + reinsert_swap_info(p);
> + goto out_dput;
> + }
> +
>
> flush_work(&p->discard_work);
> flush_work(&p->reclaim_work);
> --
> 2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference
2026-03-19 16:34 ` Kairui Song
@ 2026-03-20 7:59 ` YoungJun Park
0 siblings, 0 replies; 10+ messages in thread
From: YoungJun Park @ 2026-03-20 7:59 UTC (permalink / raw)
To: Kairui Song
Cc: rafael, akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe,
baohua, usama.arif, linux-pm, linux-mm
On Fri, Mar 20, 2026 at 12:34:02AM +0800, Kairui Song wrote:
> On Wed, Mar 18, 2026 at 03:13:16AM +0800, Youngjun Park wrote:
> Hi Youngjun,
>
> Thanks! This set of API looks better that before.
Hello Kairui! :)
> >
> > -/*
> > - * Find the swap type that corresponds to given device (if any).
> > - *
> > - * @offset - number of the PAGE_SIZE-sized block of the device, starting
> > - * from 0, in which the swap header is expected to be located.
> > - *
> > - * This is needed for the suspend to disk (aka swsusp).
> > - */
> > -int swap_type_of(dev_t device, sector_t offset)
> > +static int swap_type_of(dev_t device, sector_t offset)
>
> Maybe rename it while at it, it's a internal helper now and
> no one is using it.
I will change!
> > +/*
> > + * Finds the swap type and safely acquires a reference to the swap device
> > + * to prevent race conditions with swapoff.
> > + *
> > + * This should be used in environments like uswsusp where a race condition
> > + * exists between configuring the resume device and allocating a swap slot.
> > + * For sysfs hibernation where user-space is frozen (making swapoff
> > + * impossible), use find_hibernation_swap_type() instead.
> > + *
> > + * The caller must drop the reference using put_hibernation_swap_type().
> > + */
>
> You can follow the kernel doc format.
I will convert this to proper kernel-doc format as well. Thanks.
> > +int get_hibernation_swap_type(dev_t device, sector_t offset)
> > +{
> > + int type;
> > + struct swap_info_struct *sis;
> > +
> > + spin_lock(&swap_lock);
> > + type = swap_type_of(device, offset);
> > + sis = swap_type_to_info(type);
> > + if (!sis || !get_swap_device_info(sis))
> > + type = -1;
> > +
> > + spin_unlock(&swap_lock);
> > + return type;
> > +}
> > +
> > +/*
> > + * Drops the reference to the swap device previously acquired by
> > + * get_hibernation_swap_type().
> > + */
> > +void put_hibernation_swap_type(int type)
> > +{
> > + struct swap_info_struct *sis;
> > +
> > + sis = swap_type_to_info(type);
> > + if (!sis)
> > + return;
>
> I think it should never see sis == NULL here?
It can happen in the following cases:
1. The snapshot device is not configured at open time (i.e. no resume=
parameter and userspace intends to configure it later via ioctl).
2. The process exits or snapshot_release() is called with type = -1.
Although callers would normally check `type >= 0`, this helper
already handles invalid types defensively, so an extra check at the
call site is not strictly necessary.
> This looks really bad... It means swapoff will hung unless user send a
> signal to stop it? How does that happen?
>
> If hibernate may hold a reference without allocating any slot for a
> long time, then maybe we better introduce a new si->flags bit to
> indicate a device is used by hibernate, so swapoff can abort early
> with a -EBUSY.
>
> If hibernate only holds the swap device reference after it allocated
> any slot, then we can avoid use a flag but use a hibernate type in
> swap table instead:
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index 8415ffbe2b9c..025e004fc2e7 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -25,6 +25,7 @@ struct swap_table {
> * PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot
> * Pointer: |----------- Pointer ----------|100| - (Unused)
> * Bad: |------------- 1 -------------|1000| - Bad slot
> + * Exclusive:|---- 0 ----| ------ 1 -----|10000| - Exclusive usage
> *
> * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
> *
> @@ -46,6 +47,8 @@ struct swap_table {
> * because only the lower three bits can be used as a marker for 8 bytes
> * aligned pointers.
> *
> + * - Exclusive usage: e.g. for hibernation.
> + *
> * - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
> */
>
> Both readahead and swapoff (unuse scan) will abort upon seeing such slot,
> we solve both readahead and swapoff issue together.
>
> How do you think? We can use the flag idea first.
That makes sense. I also considered the pinning flag approach (and
mentioned my previously rejected strategy in response to Andrew’s
question).
Based on your feedback, I’m now convinced that introducing a dedicated
SWP_HIBERNATION pinning flag in `si->flags` is the right direction. It
makes the interaction with swapoff explicit and avoids long waits or
implicit dependency on slot allocation timing.
In the future, this pinning flag could potentially be reused in other
swap paths if we ever need to distinguish hibernation-specific
allocations. (this part is just imagination)
Regarding the swap table “exclusive bit” idea: I don’t think it fits
well here because there is a window between selecting the swap device
type and actually allocating a swap slot. During that gap, the user
context may change, so encoding exclusivity at the slot level would not
fully cover the race.
So I will proceed with the pinning flag approach !
> > */
> > percpu_ref_kill(&p->users);
> > synchronize_rcu();
> > - wait_for_completion(&p->comp);
> > + err = wait_for_completion_interruptible(&p->comp);
> > + if (err) {
> > + percpu_ref_resurrect(&p->users);
> > + synchronize_rcu();
>
> Do we need a synchronize_rcu here? It's required above because we are
> releasing the resources so all RCU readers must exit from grace
> period. But here we are aborting and not releasing anything so
> there is no such need?
This strategy has been dropped with the introduction of the
SWP_HIBERNATION pinning flag.
For clarification: the `synchronize_rcu()` here was intended to wait for
the final percpu_ref callback, since that callback runs in RCU context.
However, as you pointed out, since we are aborting rather than releasing
resources, that extra grace period is likely unnecessary.
I will update the patch accordingly and include additional testing
results.
Thanks again for the valuable review.
Best regards,
Youngjun
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v4 2/3] mm/swap: remove redundant swap device reference in alloc/free
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
2026-03-17 18:13 ` [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
@ 2026-03-17 18:13 ` Youngjun Park
2026-03-17 18:13 ` [PATCH v4 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Youngjun Park @ 2026-03-17 18:13 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-pm, linux-mm
In the previous commit, uswsusp was modified to acquire the swap device
reference at the time of determining the swap type. As a result, it is
no longer necessary to repeatedly acquire and release the reference to
protect against swapoff every time a swap slot is allocated.
For hibernation via the sysfs interface, user-space processes are already
frozen, making swapoff inherently impossible. Thus, acquiring and
releasing the reference during allocation is unnecessary. Furthermore,
even after returning from suspend, processes are not yet thawed when
swap slots are freed, meaning reference management is not required at
that stage either.
Therefore, remove the redundant swap device reference acquire and release
operations from the hibernation swap allocation and free functions.
Additionally, remove the SWP_WRITEOK check before allocation. This check
is redundant because the cluster allocation logic already handles it.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
mm/swapfile.c | 62 +++++++++++++++++++++++++--------------------------
1 file changed, 30 insertions(+), 32 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7baa0f270cff..73159535b085 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2078,7 +2078,12 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
}
#ifdef CONFIG_HIBERNATION
-/* Allocate a slot for hibernation */
+/*
+ * Allocate a slot for hibernation.
+ *
+ * Note: The caller must ensure the swap device is stable, either by
+ * holding a reference or by freezing user-space before calling this.
+ */
swp_entry_t swap_alloc_hibernation_slot(int type)
{
struct swap_info_struct *pcp_si, *si = swap_type_to_info(type);
@@ -2089,46 +2094,40 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
if (!si)
goto fail;
- /* This is called for allocating swap entry, not cache */
- if (get_swap_device_info(si)) {
- if (si->flags & SWP_WRITEOK) {
- /*
- * Try the local cluster first if it matches the device. If
- * not, try grab a new cluster and override local cluster.
- */
- local_lock(&percpu_swap_cluster.lock);
- pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
- pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
- if (pcp_si == si && pcp_offset) {
- ci = swap_cluster_lock(si, pcp_offset);
- if (cluster_is_usable(ci, 0))
- offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
- else
- swap_cluster_unlock(ci);
- }
- if (!offset)
- offset = cluster_alloc_swap_entry(si, NULL);
- local_unlock(&percpu_swap_cluster.lock);
- if (offset)
- entry = swp_entry(si->type, offset);
- }
- put_swap_device(si);
+ /*
+ * Try the local cluster first if it matches the device. If
+ * not, try grab a new cluster and override local cluster.
+ */
+ local_lock(&percpu_swap_cluster.lock);
+ pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
+ pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+ if (pcp_si == si && pcp_offset) {
+ ci = swap_cluster_lock(si, pcp_offset);
+ if (cluster_is_usable(ci, 0))
+ offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
+ else
+ swap_cluster_unlock(ci);
}
+ if (!offset)
+ offset = cluster_alloc_swap_entry(si, NULL);
+ local_unlock(&percpu_swap_cluster.lock);
+ if (offset)
+ entry = swp_entry(si->type, offset);
+
fail:
return entry;
}
-/* Free a slot allocated by swap_alloc_hibernation_slot */
+/*
+ * Free a slot allocated by swap_alloc_hibernation_slot.
+ * As with allocation, the caller must ensure the swap device is stable.
+ */
void swap_free_hibernation_slot(swp_entry_t entry)
{
- struct swap_info_struct *si;
+ struct swap_info_struct *si = __swap_entry_to_info(entry);
struct swap_cluster_info *ci;
pgoff_t offset = swp_offset(entry);
- si = get_swap_device(entry);
- if (WARN_ON(!si))
- return;
-
ci = swap_cluster_lock(si, offset);
__swap_cluster_put_entry(ci, offset % SWAPFILE_CLUSTER);
__swap_cluster_free_entries(si, ci, offset % SWAPFILE_CLUSTER, 1);
@@ -2136,7 +2135,6 @@ void swap_free_hibernation_slot(swp_entry_t entry)
/* In theory readahead might add it to the swap cache by accident */
__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- put_swap_device(si);
}
static int swap_type_of(dev_t device, sector_t offset)
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v4 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
2026-03-17 18:13 ` [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
2026-03-17 18:13 ` [PATCH v4 2/3] mm/swap: remove redundant swap device reference in alloc/free Youngjun Park
@ 2026-03-17 18:13 ` Youngjun Park
2026-03-17 19:16 ` [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Andrew Morton
2026-03-19 13:33 ` Rafael J. Wysocki
4 siblings, 0 replies; 10+ messages in thread
From: Youngjun Park @ 2026-03-17 18:13 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-pm, linux-mm
Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0:
WARNING: kernel/power/main.c:44 at pm_restore_gfp_mask+0xd7/0xf0
CPU: 0 UID: 0 PID: 373 Comm: s2disk
Call Trace:
snapshot_ioctl+0x964/0xbd0
__x64_sys_ioctl+0x724/0x1320
...
The uswsusp path calls pm_restore_gfp_mask() defensively in
SNAPSHOT_CREATE_IMAGE and SNAPSHOT_UNFREEZE where the GFP mask may
or may not be restricted depending on context (first call vs retry,
hibernate vs resume). Before the stacking patch this was a silent
no-op; now it triggers a WARNING.
Introduce pm_restore_gfp_mask_safe() that skips the call when
saved_gfp_count is 0. This is preferred over tracking the restrict
state in snapshot_ioctl, as incorrect tracking risks leaving the
GFP mask permanently restricted.
Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
include/linux/suspend.h | 1 +
kernel/power/main.c | 7 +++++++
kernel/power/user.c | 4 ++--
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/suspend.h b/include/linux/suspend.h
index b02876f1ae38..7777931d88a5 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -454,6 +454,7 @@ extern void pm_report_hw_sleep_time(u64 t);
extern void pm_report_max_hw_sleep(u64 t);
void pm_restrict_gfp_mask(void);
void pm_restore_gfp_mask(void);
+void pm_restore_gfp_mask_safe(void);
#define pm_notifier(fn, pri) { \
static struct notifier_block fn##_nb = \
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 5f8c9e12eaec..e610a8c8b7ff 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -36,6 +36,13 @@
static unsigned int saved_gfp_count;
static gfp_t saved_gfp_mask;
+void pm_restore_gfp_mask_safe(void)
+{
+ if (!saved_gfp_count)
+ return;
+ pm_restore_gfp_mask();
+}
+
void pm_restore_gfp_mask(void)
{
WARN_ON(!mutex_is_locked(&system_transition_mutex));
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 3e41544b99d5..41cff6a89a1c 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -306,7 +306,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
case SNAPSHOT_UNFREEZE:
if (!data->frozen || data->ready)
break;
- pm_restore_gfp_mask();
+ pm_restore_gfp_mask_safe();
free_basic_memory_bitmaps();
data->free_bitmaps = false;
thaw_processes();
@@ -318,7 +318,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
error = -EPERM;
break;
}
- pm_restore_gfp_mask();
+ pm_restore_gfp_mask_safe();
error = hibernation_snapshot(data->platform_support);
if (!error) {
error = put_user(in_suspend, (int __user *)arg);
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
` (2 preceding siblings ...)
2026-03-17 18:13 ` [PATCH v4 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
@ 2026-03-17 19:16 ` Andrew Morton
2026-03-18 2:16 ` YoungJun Park
2026-03-19 13:33 ` Rafael J. Wysocki
4 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-03-17 19:16 UTC (permalink / raw)
To: Youngjun Park
Cc: rafael, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-pm, linux-mm
On Wed, 18 Mar 2026 03:13:15 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> Currently, in the uswsusp path, only the swap type value is retrieved at
> lookup time without holding a reference. If swapoff races after the type
> is acquired, subsequent slot allocations operate on a stale swap device.
Has this race been experienced or at least demonstrated? Or is this
patchset derived from reading the code?
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap
2026-03-17 19:16 ` [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Andrew Morton
@ 2026-03-18 2:16 ` YoungJun Park
0 siblings, 0 replies; 10+ messages in thread
From: YoungJun Park @ 2026-03-18 2:16 UTC (permalink / raw)
To: Andrew Morton
Cc: rafael, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-pm, linux-mm
On Tue, Mar 17, 2026 at 12:16:28PM -0700, Andrew Morton wrote:
> On Wed, 18 Mar 2026 03:13:15 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > Currently, in the uswsusp path, only the swap type value is retrieved at
> > lookup time without holding a reference. If swapoff races after the type
> > is acquired, subsequent slot allocations operate on a stale swap device.
>
> Has this race been experienced or at least demonstrated? Or is this
> patchset derived from reading the code?
>
Hello Andrew,
I found and reproduced this race condition
(using a custom test program for verifying reproducibility):
Pre-condition.
- swapon /dev/sdb (intended for hibernation)
Process 1 (test program) Process 2
------------------- ---------
ioctl(SNAPSHOT_SET_SWAP_AREA)
swapoff /dev/sdb
swapon /dev/sdc
ioctl(SNAPSHOT_ALLOC_SWAP_PAGE)
(The same race applies if the swap device is set up at `open` time
via the `resume` parameter.)
Process 1 operates on a different swap device (or fails if no new swap is
enabled), breaking hibernation. The practical impact is that the subsequent
resume operation will fail. While rare in practice, this can be easily
prevented as shown in the patch.
I initially sent this as an RFC because there were no real-world bug reports.
However, after review and testing, I am submitting it as a formal patch
because it has clear benefits: it prevents a potential bug by enforcing the
code's underlying assumption that `swapoff` must not occur after
`SNAPSHOT_SET_SWAP_AREA` (or after `open`), and it removes unnecessary
reference get/put operations.
I also considered, but rejected, these alternatives:
- Enforcing SNAPSHOT_FREEZE first: Breaks existing user-space behavior
(e.g., `s2disk` in `suspend-utils` calls `SNAPSHOT_SET_SWAP_AREA`
before freezing).
- Marking the swap device (e.g., SWP_HIBERNATION) to prevent swapoff:
Using the existing reference counting mechanism is a cleaner approach.
Best regards,
Youngjun Park
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap
2026-03-17 18:13 [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Youngjun Park
` (3 preceding siblings ...)
2026-03-17 19:16 ` [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap Andrew Morton
@ 2026-03-19 13:33 ` Rafael J. Wysocki
2026-03-19 13:48 ` YoungJun Park
4 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2026-03-19 13:33 UTC (permalink / raw)
To: Youngjun Park
Cc: rafael, akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe,
baohua, usama.arif, linux-pm, linux-mm
On Tue, Mar 17, 2026 at 7:13 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Currently, in the uswsusp path, only the swap type value is retrieved at
> lookup time without holding a reference. If swapoff races after the type
> is acquired, subsequent slot allocations operate on a stale swap device.
>
> Additionally, grabbing and releasing the swap device reference on every
> slot allocation is inefficient across the entire hibernation swap path.
>
> This patch series addresses these issues:
> - Patch 1: Fixes the swapoff race in uswsusp by holding the swap device
> reference from the point the swap device is looked up.
> - Patch 2: Removes the overhead of per-slot reference counting in alloc/free
> paths and cleans up the redundant SWP_WRITEOK check.
> - Patch 3: Fixes a spurious WARNING in the uswsusp GFP mask restore path.
> (Founded during uswsusp test)
>
> Links:
> RFC v1: https://lore.kernel.org/linux-mm/20260305202413.1888499-1-usama.arif@linux.dev/T/#m3693d45180f14f441b6951984f4b4bfd90ec0c9d
> RFC v2: https://lore.kernel.org/linux-mm/20260306024608.1720991-1-youngjun.park@lge.com/
> RFC v3: https://lore.kernel.org/linux-mm/20260312112511.3596781-1-youngjun.park@lge.com/
>
> Testing:
> - Hibernate/resume via sysfs
> (echo reboot > /sys/power/disk && echo disk > /sys/power/state)
> - Hibernate with suspend via sysfs
> (echo suspend > /sys/power/disk && echo disk > /sys/power/state)
> - Hibernate/resume via uswsusp (suspend-utils s2disk/resume on QEMU)
> - Verified swap I/O works correctly after resume.
> - Verified swapoff succeeds after snapshot resume completes.
> - Verified pm_restore_gfp_mask() WARNING no longer triggers (Patch 3).
> - swapoff during active uswsusp session:
> - Verified swapoff blocks while uswsusp holds swap reference.
> - Verified swapoff can be cancelled by signal (e.g. Ctrl+C).
> - Verified swapoff succeeds after uswsusp process terminates.
>
> Changelog:
> rfc v3 -> v4:
> - Introduced get/find/put_hibernation_swap_type() helpers per Kairui's
> feedback. find_ for lookup-only, get/put for reference management.
> - Switched to swap_type_to_info() and added type < 0 check per
> Kairui's suggestion.
> - Fixed get_hibernation_swap_type() return when ref == false (Reviewd by Kairui)
> - Made swapoff wait interruptible to prevent hang when uswsusp
> holds a swap reference.
> - Fixed spurious WARN_ON in pm_restore_gfp_mask() by introducing
> pm_restore_gfp_mask_safe() (Patch 3).
> - Updated commit messages and added comments for clarity.
> - Rebased onto latest mm-new tree.
>
> Note: Kairui suggested adding WARN on NULL in put_hibernation_swap_type(),
> but kept silent return instead, as type can legitimately be -1 when
> snapshot_open() fails to find a matching swap device. swap_type_to_info()
> returns NULL for type < 0, so the cleanup path stays simple.
>
> rfc v2 -> rfc v3:
> - Split into 2 patches per Chris Li's feedback.
> - Simplified by not holding reference in normal hibernation path
> per Chris Li's suggestion.
> - Removed redundant SWP_WRITEOK check.
> - Rebased onto f543926f9d0c3f6dfb354adfe7fbaeedd1277c6b.
>
> rfc v1 -> rfc v2:
> - Squashed into single patch per Usama Arif's feedback.
>
> Youngjun Park (3):
> mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap
> reference
> mm/swap: remove redundant swap device reference in alloc/free
> PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
>
> include/linux/suspend.h | 1 +
> include/linux/swap.h | 4 +-
> kernel/power/main.c | 7 ++
> kernel/power/swap.c | 2 +-
> kernel/power/user.c | 19 +++--
> mm/swapfile.c | 154 +++++++++++++++++++++++++++-------------
> 6 files changed, 132 insertions(+), 55 deletions(-)
>
> --
Which kernel version does this series apply to?
It doesn't apply for me on top of 7.0-rc4, so please consider rebasing.
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH v4 0/3] mm/swap, PM: hibernate: fix swapoff race and optimize swap
2026-03-19 13:33 ` Rafael J. Wysocki
@ 2026-03-19 13:48 ` YoungJun Park
0 siblings, 0 replies; 10+ messages in thread
From: YoungJun Park @ 2026-03-19 13:48 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-pm, linux-mm
On Thu, Mar 19, 2026 at 02:33:30PM +0100, Rafael J. Wysocki wrote:
> On Tue, Mar 17, 2026 at 7:13 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > Currently, in the uswsusp path, only the swap type value is retrieved at
> > lookup time without holding a reference. If swapoff races after the type
> > is acquired, subsequent slot allocations operate on a stale swap device.
> >
> > Additionally, grabbing and releasing the swap device reference on every
> > slot allocation is inefficient across the entire hibernation swap path.
> >
> > This patch series addresses these issues:
> > - Patch 1: Fixes the swapoff race in uswsusp by holding the swap device
> > reference from the point the swap device is looked up.
> > - Patch 2: Removes the overhead of per-slot reference counting in alloc/free
> > paths and cleans up the redundant SWP_WRITEOK check.
> > - Patch 3: Fixes a spurious WARNING in the uswsusp GFP mask restore path.
> > (Founded during uswsusp test)
> >
> > Links:
> > RFC v1: https://lore.kernel.org/linux-mm/20260305202413.1888499-1-usama.arif@linux.dev/T/#m3693d45180f14f441b6951984f4b4bfd90ec0c9d
> > RFC v2: https://lore.kernel.org/linux-mm/20260306024608.1720991-1-youngjun.park@lge.com/
> > RFC v3: https://lore.kernel.org/linux-mm/20260312112511.3596781-1-youngjun.park@lge.com/
> >
> > Testing:
> > - Hibernate/resume via sysfs
> > (echo reboot > /sys/power/disk && echo disk > /sys/power/state)
> > - Hibernate with suspend via sysfs
> > (echo suspend > /sys/power/disk && echo disk > /sys/power/state)
> > - Hibernate/resume via uswsusp (suspend-utils s2disk/resume on QEMU)
> > - Verified swap I/O works correctly after resume.
> > - Verified swapoff succeeds after snapshot resume completes.
> > - Verified pm_restore_gfp_mask() WARNING no longer triggers (Patch 3).
> > - swapoff during active uswsusp session:
> > - Verified swapoff blocks while uswsusp holds swap reference.
> > - Verified swapoff can be cancelled by signal (e.g. Ctrl+C).
> > - Verified swapoff succeeds after uswsusp process terminates.
> >
> > Changelog:
> > rfc v3 -> v4:
> > - Introduced get/find/put_hibernation_swap_type() helpers per Kairui's
> > feedback. find_ for lookup-only, get/put for reference management.
> > - Switched to swap_type_to_info() and added type < 0 check per
> > Kairui's suggestion.
> > - Fixed get_hibernation_swap_type() return when ref == false (Reviewd by Kairui)
> > - Made swapoff wait interruptible to prevent hang when uswsusp
> > holds a swap reference.
> > - Fixed spurious WARN_ON in pm_restore_gfp_mask() by introducing
> > pm_restore_gfp_mask_safe() (Patch 3).
> > - Updated commit messages and added comments for clarity.
> > - Rebased onto latest mm-new tree.
> >
> > Note: Kairui suggested adding WARN on NULL in put_hibernation_swap_type(),
> > but kept silent return instead, as type can legitimately be -1 when
> > snapshot_open() fails to find a matching swap device. swap_type_to_info()
> > returns NULL for type < 0, so the cleanup path stays simple.
> >
> > rfc v2 -> rfc v3:
> > - Split into 2 patches per Chris Li's feedback.
> > - Simplified by not holding reference in normal hibernation path
> > per Chris Li's suggestion.
> > - Removed redundant SWP_WRITEOK check.
> > - Rebased onto f543926f9d0c3f6dfb354adfe7fbaeedd1277c6b.
> >
> > rfc v1 -> rfc v2:
> > - Squashed into single patch per Usama Arif's feedback.
> >
> > Youngjun Park (3):
> > mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap
> > reference
> > mm/swap: remove redundant swap device reference in alloc/free
> > PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
> >
> > include/linux/suspend.h | 1 +
> > include/linux/swap.h | 4 +-
> > kernel/power/main.c | 7 ++
> > kernel/power/swap.c | 2 +-
> > kernel/power/user.c | 19 +++--
> > mm/swapfile.c | 154 +++++++++++++++++++++++++++-------------
> > 6 files changed, 132 insertions(+), 55 deletions(-)
> >
> > --
>
> Which kernel version does this series apply to?
>
> It doesn't apply for me on top of 7.0-rc4, so please consider rebasing.
It is based on mm-new
I will soon rebase on the top of 7.0-rc4 with build & test.
Best regards,
Youngjun Park
^ permalink raw reply [flat|nested] 10+ messages in thread