* [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path
@ 2026-03-19 14:24 Youngjun Park
2026-03-19 14:24 ` [PATCH v5 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Youngjun Park @ 2026-03-19 14:24 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-mm, linux-pm
Currently, in the uswsusp path, only the swap type value is retrieved at
lookup time without holding a reference. If swapoff races after the type
is acquired, subsequent slot allocations operate on a stale swap device.
Additionally, grabbing and releasing the swap device reference on every
slot allocation is inefficient across the entire hibernation swap path.
This patch series addresses these issues:
- Patch 1: Fixes the swapoff race in uswsusp by holding the swap device
reference from the point the swap device is looked up.
- Patch 2: Removes the overhead of per-slot reference counting in alloc/free
paths and cleans up the redundant SWP_WRITEOK check.
- Patch 3: Fixes a spurious WARNING in the uswsusp GFP mask restore path.
(Found during uswsusp testing)
Links:
RFC v1: https://lore.kernel.org/linux-mm/20260305202413.1888499-1-usama.arif@linux.dev/T/#m3693d45180f14f441b6951984f4b4bfd90ec0c9d
RFC v2: https://lore.kernel.org/linux-mm/20260306024608.1720991-1-youngjun.park@lge.com/
RFC v3: https://lore.kernel.org/linux-mm/20260312112511.3596781-1-youngjun.park@lge.com/
v4: https://lore.kernel.org/linux-mm/abv+rjgyArqZ2uym@yjaykim-PowerEdge-T330/T/#m924fa3e58d0f0da488300653163ee8db7e870e4a
Testing:
- Hibernate/resume via sysfs
(echo reboot > /sys/power/disk && echo disk > /sys/power/state)
- Hibernate with suspend via sysfs
(echo suspend > /sys/power/disk && echo disk > /sys/power/state)
- Hibernate/resume via uswsusp (suspend-utils s2disk/resume on QEMU)
- Verified swap I/O works correctly after resume.
- Verified swapoff succeeds after snapshot resume completes.
- Verified pm_restore_gfp_mask() WARNING no longer triggers (Patch 3).
- swapoff during active uswsusp session:
- Verified swapoff blocks while uswsusp holds swap reference.
- Verified swapoff can be cancelled by signal (e.g. Ctrl+C).
- Verified swapoff succeeds after uswsusp process terminates.
Changelog:
v4 -> v5:
- Rebased onto v7.0-rc4 (per Rafael J. Wysocki's comment).
- No functional changes; only rebase conflicts were resolved.
rfc v3 -> v4:
- Introduced get/find/put_hibernation_swap_type() helpers per Kairui's
feedback. find_ for lookup-only, get/put for reference management.
- Switched to swap_type_to_info() and added type < 0 check per
Kairui's suggestion.
- Fixed get_hibernation_swap_type() return when ref == false (reviewed by Kairui)
- Made swapoff wait interruptible to prevent hang when uswsusp
holds a swap reference.
- Fixed spurious WARN_ON in pm_restore_gfp_mask() by introducing
pm_restore_gfp_mask_safe() (Patch 3).
- Updated commit messages and added comments for clarity.
- Rebased onto latest mm-new tree.
Note: Kairui suggested adding a WARN on NULL in put_hibernation_swap_type(),
but I kept the silent return instead, because type can legitimately be -1
when snapshot_open() fails to find a matching swap device. swap_type_to_info()
returns NULL for type < 0, so the cleanup path stays simple.
rfc v2 -> rfc v3:
- Split into 2 patches per Chris Li's feedback.
- Simplified by not holding reference in normal hibernation path
per Chris Li's suggestion.
- Removed redundant SWP_WRITEOK check.
- Rebased onto f543926f9d0c3f6dfb354adfe7fbaeedd1277c6b.
rfc v1 -> rfc v2:
- Squashed into single patch per Usama Arif's feedback.
Youngjun Park (3):
mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap
reference
mm/swap: remove redundant swap device reference in alloc/free
PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
include/linux/suspend.h | 1 +
include/linux/swap.h | 4 +-
kernel/power/main.c | 7 +++
kernel/power/swap.c | 2 +-
kernel/power/user.c | 19 ++++--
mm/swapfile.c | 135 ++++++++++++++++++++++++++++------------
6 files changed, 122 insertions(+), 46 deletions(-)
base-commit: f338e77383789c0cae23ca3d48adcc5e9e137e3c
--
2.34.1
* [PATCH v5 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference
2026-03-19 14:24 [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Youngjun Park
@ 2026-03-19 14:24 ` Youngjun Park
2026-03-19 14:24 ` [PATCH v5 2/3] mm/swap: remove redundant swap device reference in alloc/free Youngjun Park
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Youngjun Park @ 2026-03-19 14:24 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-mm, linux-pm
Hibernation via uswsusp (/dev/snapshot ioctls) has a race: between
setting the resume swap area and allocating a swap slot, user-space is
not yet frozen, so swapoff can run and cause an incorrect slot allocation.
Fix this by keeping swap_type_of() as a static helper that requires
swap_lock to be held, and introducing new interfaces that wrap it with
proper locking and reference management:
- get_hibernation_swap_type(): Lookup under swap_lock + acquire a swap
device reference to block swapoff (used by uswsusp).
- find_hibernation_swap_type(): Lookup under swap_lock only, no
reference. Used by the sysfs path where user-space is already frozen,
making swapoff impossible.
- put_hibernation_swap_type(): Release the reference.
Because the reference is held via get_swap_device(), swapoff will block
at wait_for_completion_interruptible() until put_hibernation_swap_type()
releases it. The wait is interruptible, so swapoff can be cancelled by
a signal.
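To illustrate the intended lifetime rules, here is a rough user-space model
(not kernel code). All model_* names are hypothetical, and the model is
single-threaded: model_swapoff() returns MODEL_BLOCKED at the point where the
real swapoff would sleep in wait_for_completion_interruptible().

```c
/*
 * User-space model only. The model_* names are hypothetical and do not
 * correspond to kernel symbols.
 */
#include <assert.h>
#include <stdbool.h>

struct model_swap_dev {
	int users;	/* outstanding references held on the device */
	bool dying;	/* set once swapoff has started tearing it down */
};

enum model_result { MODEL_DONE, MODEL_BLOCKED, MODEL_INTERRUPTED };

/* Models get_hibernation_swap_type(): fails while swapoff is in progress. */
bool model_get(struct model_swap_dev *dev)
{
	if (dev->dying)
		return false;
	dev->users++;
	return true;
}

/* Models put_hibernation_swap_type(): drops one reference. */
void model_put(struct model_swap_dev *dev)
{
	assert(dev->users > 0);
	dev->users--;
}

/*
 * Models one swapoff attempt. With no holders the teardown proceeds.
 * With a holder, swapoff either keeps waiting or, if a signal is
 * pending, backs out and puts the device back into service.
 */
enum model_result model_swapoff(struct model_swap_dev *dev, bool signal_pending)
{
	dev->dying = true;		/* analogue of percpu_ref_kill() */
	if (dev->users == 0)
		return MODEL_DONE;
	if (signal_pending) {
		dev->dying = false;	/* analogue of percpu_ref_resurrect() */
		return MODEL_INTERRUPTED;
	}
	return MODEL_BLOCKED;		/* the real swapoff would sleep here */
}
```

In this model, a holder that called model_get() keeps model_swapoff() from
returning MODEL_DONE until the matching model_put(), which mirrors the
behavior described above for the uswsusp path.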
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
include/linux/swap.h | 4 +-
kernel/power/swap.c | 2 +-
kernel/power/user.c | 15 ++++++--
mm/swapfile.c | 92 ++++++++++++++++++++++++++++++++++++--------
4 files changed, 92 insertions(+), 21 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..4266356f928c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,7 +452,9 @@ static inline long get_nr_swap_pages(void)
extern void si_swapinfo(struct sysinfo *);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-int swap_type_of(dev_t device, sector_t offset);
+int get_hibernation_swap_type(dev_t device, sector_t offset);
+int find_hibernation_swap_type(dev_t device, sector_t offset);
+void put_hibernation_swap_type(int type);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 2e64869bb5a0..cc4764149e8f 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -341,7 +341,7 @@ static int swsusp_swap_check(void)
* This is called before saving the image.
*/
if (swsusp_resume_device)
- res = swap_type_of(swsusp_resume_device, swsusp_resume_block);
+ res = find_hibernation_swap_type(swsusp_resume_device, swsusp_resume_block);
else
res = find_first_swap(&swsusp_resume_device);
if (res < 0)
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4401cfe26e5c..3e41544b99d5 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -71,7 +71,7 @@ static int snapshot_open(struct inode *inode, struct file *filp)
memset(&data->handle, 0, sizeof(struct snapshot_handle));
if ((filp->f_flags & O_ACCMODE) == O_RDONLY) {
/* Hibernating. The image device should be accessible. */
- data->swap = swap_type_of(swsusp_resume_device, 0);
+ data->swap = get_hibernation_swap_type(swsusp_resume_device, 0);
data->mode = O_RDONLY;
data->free_bitmaps = false;
error = pm_notifier_call_chain_robust(PM_HIBERNATION_PREPARE, PM_POST_HIBERNATION);
@@ -90,8 +90,10 @@ static int snapshot_open(struct inode *inode, struct file *filp)
data->free_bitmaps = !error;
}
}
- if (error)
+ if (error) {
+ put_hibernation_swap_type(data->swap);
hibernate_release();
+ }
data->frozen = false;
data->ready = false;
@@ -115,6 +117,7 @@ static int snapshot_release(struct inode *inode, struct file *filp)
data = filp->private_data;
data->dev = 0;
free_all_swap_pages(data->swap);
+ put_hibernation_swap_type(data->swap);
if (data->frozen) {
pm_restore_gfp_mask();
free_basic_memory_bitmaps();
@@ -235,11 +238,17 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
offset = swap_area.offset;
}
+ /*
+ * Put the reference if a swap area was already
+ * set by SNAPSHOT_SET_SWAP_AREA.
+ */
+ put_hibernation_swap_type(data->swap);
+
/*
* User space encodes device types as two-byte values,
* so we need to recode them
*/
- data->swap = swap_type_of(swdev, offset);
+ data->swap = get_hibernation_swap_type(swdev, offset);
if (data->swap < 0)
return swdev ? -ENODEV : -EINVAL;
data->dev = swdev;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94af29d1de88..5069074ab11b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -133,7 +133,7 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
- if (type >= MAX_SWAPFILES)
+ if (type < 0 || type >= MAX_SWAPFILES)
return NULL;
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
@@ -1972,22 +1972,15 @@ void swap_free_hibernation_slot(swp_entry_t entry)
put_swap_device(si);
}
-/*
- * Find the swap type that corresponds to given device (if any).
- *
- * @offset - number of the PAGE_SIZE-sized block of the device, starting
- * from 0, in which the swap header is expected to be located.
- *
- * This is needed for the suspend to disk (aka swsusp).
- */
-int swap_type_of(dev_t device, sector_t offset)
+static int swap_type_of(dev_t device, sector_t offset)
{
int type;
+ lockdep_assert_held(&swap_lock);
+
if (!device)
return -1;
- spin_lock(&swap_lock);
for (type = 0; type < nr_swapfiles; type++) {
struct swap_info_struct *sis = swap_info[type];
@@ -1997,16 +1990,70 @@ int swap_type_of(dev_t device, sector_t offset)
if (device == sis->bdev->bd_dev) {
struct swap_extent *se = first_se(sis);
- if (se->start_block == offset) {
- spin_unlock(&swap_lock);
+ if (se->start_block == offset)
return type;
- }
}
}
- spin_unlock(&swap_lock);
return -ENODEV;
}
+/*
+ * Finds the swap type and safely acquires a reference to the swap device
+ * to prevent race conditions with swapoff.
+ *
+ * This should be used in environments like uswsusp where a race condition
+ * exists between configuring the resume device and allocating a swap slot.
+ * For sysfs hibernation where user-space is frozen (making swapoff
+ * impossible), use find_hibernation_swap_type() instead.
+ *
+ * The caller must drop the reference using put_hibernation_swap_type().
+ */
+int get_hibernation_swap_type(dev_t device, sector_t offset)
+{
+ int type;
+ struct swap_info_struct *sis;
+
+ spin_lock(&swap_lock);
+ type = swap_type_of(device, offset);
+ sis = swap_type_to_info(type);
+ if (!sis || !get_swap_device_info(sis))
+ type = -1;
+
+ spin_unlock(&swap_lock);
+ return type;
+}
+
+/*
+ * Drops the reference to the swap device previously acquired by
+ * get_hibernation_swap_type().
+ */
+void put_hibernation_swap_type(int type)
+{
+ struct swap_info_struct *sis;
+
+ sis = swap_type_to_info(type);
+ if (!sis)
+ return;
+
+ put_swap_device(sis);
+}
+
+/*
+ * Simple lookup without acquiring a reference. Used by the sysfs
+ * hibernation path where user-space is already frozen, making
+ * swapoff impossible.
+ */
+int find_hibernation_swap_type(dev_t device, sector_t offset)
+{
+ int type;
+
+ spin_lock(&swap_lock);
+ type = swap_type_of(device, offset);
+ spin_unlock(&swap_lock);
+
+ return type;
+}
+
int find_first_swap(dev_t *device)
{
int type;
@@ -2837,10 +2884,23 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
* spinlock) will be waited too. This makes it easy to
* prevent folio_test_swapcache() and the following swap cache
* operations from racing with swapoff.
+ *
+ * Note: if a hibernation session is actively holding a swap
+ * device reference, swapoff will block here until the reference
+ * is released via put_hibernation_swap_type() or the wait is
+ * interrupted by a signal.
*/
percpu_ref_kill(&p->users);
synchronize_rcu();
- wait_for_completion(&p->comp);
+ err = wait_for_completion_interruptible(&p->comp);
+ if (err) {
+ percpu_ref_resurrect(&p->users);
+ synchronize_rcu();
+ reinit_completion(&p->comp);
+ reinsert_swap_info(p);
+ goto out_dput;
+ }
+
flush_work(&p->discard_work);
flush_work(&p->reclaim_work);
--
2.34.1
* [PATCH v5 2/3] mm/swap: remove redundant swap device reference in alloc/free
2026-03-19 14:24 [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Youngjun Park
2026-03-19 14:24 ` [PATCH v5 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
@ 2026-03-19 14:24 ` Youngjun Park
2026-03-19 14:24 ` [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
2026-03-20 2:50 ` [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Andrew Morton
3 siblings, 0 replies; 8+ messages in thread
From: Youngjun Park @ 2026-03-19 14:24 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-mm, linux-pm
In the previous commit, uswsusp was modified to acquire the swap device
reference at the time of determining the swap type. As a result, it is
no longer necessary to repeatedly acquire and release the reference to
protect against swapoff every time a swap slot is allocated.
For hibernation via the sysfs interface, user-space processes are already
frozen, making swapoff inherently impossible. Thus, acquiring and
releasing the reference during allocation is unnecessary. Furthermore,
even after returning from suspend, processes are not yet thawed when
swap slots are freed, meaning reference management is not required at
that stage either.
Therefore, remove the redundant swap device reference acquire and release
operations from the hibernation swap allocation and free functions.
Additionally, remove the SWP_WRITEOK check before allocation. This check
is redundant because the cluster allocation logic already handles it.
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
mm/swapfile.c | 43 ++++++++++++++++++++-----------------------
1 file changed, 20 insertions(+), 23 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5069074ab11b..f1188743037a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1923,7 +1923,12 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
}
#ifdef CONFIG_HIBERNATION
-/* Allocate a slot for hibernation */
+/*
+ * Allocate a slot for hibernation.
+ *
+ * Note: The caller must ensure the swap device is stable, either by
+ * holding a reference or by freezing user-space before calling this.
+ */
swp_entry_t swap_alloc_hibernation_slot(int type)
{
struct swap_info_struct *si = swap_type_to_info(type);
@@ -1933,43 +1938,35 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
if (!si)
goto fail;
- /* This is called for allocating swap entry, not cache */
- if (get_swap_device_info(si)) {
- if (si->flags & SWP_WRITEOK) {
- /*
- * Grab the local lock to be compliant
- * with swap table allocation.
- */
- local_lock(&percpu_swap_cluster.lock);
- offset = cluster_alloc_swap_entry(si, NULL);
- local_unlock(&percpu_swap_cluster.lock);
- if (offset)
- entry = swp_entry(si->type, offset);
- }
- put_swap_device(si);
- }
+ /*
+ * Grab the local lock to be compliant
+ * with swap table allocation.
+ */
+ local_lock(&percpu_swap_cluster.lock);
+ offset = cluster_alloc_swap_entry(si, NULL);
+ local_unlock(&percpu_swap_cluster.lock);
+ if (offset)
+ entry = swp_entry(si->type, offset);
fail:
return entry;
}
-/* Free a slot allocated by swap_alloc_hibernation_slot */
+/*
+ * Free a slot allocated by swap_alloc_hibernation_slot.
+ * As with allocation, the caller must ensure the swap device is stable.
+ */
void swap_free_hibernation_slot(swp_entry_t entry)
{
- struct swap_info_struct *si;
+ struct swap_info_struct *si = __swap_entry_to_info(entry);
struct swap_cluster_info *ci;
pgoff_t offset = swp_offset(entry);
- si = get_swap_device(entry);
- if (WARN_ON(!si))
- return;
-
ci = swap_cluster_lock(si, offset);
swap_put_entry_locked(si, ci, offset);
swap_cluster_unlock(ci);
/* In theory readahead might add it to the swap cache by accident */
__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- put_swap_device(si);
}
static int swap_type_of(dev_t device, sector_t offset)
--
2.34.1
* [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
2026-03-19 14:24 [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Youngjun Park
2026-03-19 14:24 ` [PATCH v5 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Youngjun Park
2026-03-19 14:24 ` [PATCH v5 2/3] mm/swap: remove redundant swap device reference in alloc/free Youngjun Park
@ 2026-03-19 14:24 ` Youngjun Park
2026-03-19 19:55 ` Rafael J. Wysocki
2026-03-20 2:50 ` [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Andrew Morton
3 siblings, 1 reply; 8+ messages in thread
From: Youngjun Park @ 2026-03-19 14:24 UTC (permalink / raw)
To: rafael, akpm
Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
youngjun.park, usama.arif, linux-mm, linux-pm
Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0:
WARNING: kernel/power/main.c:44 at pm_restore_gfp_mask+0xd7/0xf0
CPU: 0 UID: 0 PID: 373 Comm: s2disk
Call Trace:
snapshot_ioctl+0x964/0xbd0
__x64_sys_ioctl+0x724/0x1320
...
The uswsusp path calls pm_restore_gfp_mask() defensively in
SNAPSHOT_CREATE_IMAGE and SNAPSHOT_UNFREEZE where the GFP mask may
or may not be restricted depending on context (first call vs retry,
hibernate vs resume). Before the stacking patch this was a silent
no-op; now it triggers a WARNING.
Introduce pm_restore_gfp_mask_safe() that skips the call when
saved_gfp_count is 0. This is preferred over tracking the restrict
state in snapshot_ioctl, as incorrect tracking risks leaving the
GFP mask permanently restricted.
Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
include/linux/suspend.h | 1 +
kernel/power/main.c | 7 +++++++
kernel/power/user.c | 4 ++--
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/suspend.h b/include/linux/suspend.h
index b02876f1ae38..7777931d88a5 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -454,6 +454,7 @@ extern void pm_report_hw_sleep_time(u64 t);
extern void pm_report_max_hw_sleep(u64 t);
void pm_restrict_gfp_mask(void);
void pm_restore_gfp_mask(void);
+void pm_restore_gfp_mask_safe(void);
#define pm_notifier(fn, pri) { \
static struct notifier_block fn##_nb = \
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 5f8c9e12eaec..e610a8c8b7ff 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -36,6 +36,13 @@
static unsigned int saved_gfp_count;
static gfp_t saved_gfp_mask;
+void pm_restore_gfp_mask_safe(void)
+{
+ if (!saved_gfp_count)
+ return;
+ pm_restore_gfp_mask();
+}
+
void pm_restore_gfp_mask(void)
{
WARN_ON(!mutex_is_locked(&system_transition_mutex));
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 3e41544b99d5..41cff6a89a1c 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -306,7 +306,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
case SNAPSHOT_UNFREEZE:
if (!data->frozen || data->ready)
break;
- pm_restore_gfp_mask();
+ pm_restore_gfp_mask_safe();
free_basic_memory_bitmaps();
data->free_bitmaps = false;
thaw_processes();
@@ -318,7 +318,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
error = -EPERM;
break;
}
- pm_restore_gfp_mask();
+ pm_restore_gfp_mask_safe();
error = hibernation_snapshot(data->platform_support);
if (!error) {
error = put_user(in_suspend, (int __user *)arg);
--
2.34.1
* Re: [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
2026-03-19 14:24 ` [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
@ 2026-03-19 19:55 ` Rafael J. Wysocki
2026-03-20 8:18 ` YoungJun Park
0 siblings, 1 reply; 8+ messages in thread
From: Rafael J. Wysocki @ 2026-03-19 19:55 UTC (permalink / raw)
To: Youngjun Park
Cc: rafael, akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe,
baohua, usama.arif, linux-mm, linux-pm
On Thu, Mar 19, 2026 at 3:24 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask()
> stacking") introduced refcount-based GFP mask management that warns
> when pm_restore_gfp_mask() is called with saved_gfp_count == 0:
>
> WARNING: kernel/power/main.c:44 at pm_restore_gfp_mask+0xd7/0xf0
> CPU: 0 UID: 0 PID: 373 Comm: s2disk
> Call Trace:
> snapshot_ioctl+0x964/0xbd0
> __x64_sys_ioctl+0x724/0x1320
> ...
>
> The uswsusp path calls pm_restore_gfp_mask() defensively in
> SNAPSHOT_CREATE_IMAGE and SNAPSHOT_UNFREEZE where the GFP mask may
> or may not be restricted depending on context (first call vs retry,
> hibernate vs resume). Before the stacking patch this was a silent
> no-op; now it triggers a WARNING.
>
> Introduce pm_restore_gfp_mask_safe() that skips the call when
> saved_gfp_count is 0. This is preferred over tracking the restrict
> state in snapshot_ioctl, as incorrect tracking risks leaving the
> GFP mask permanently restricted.
>
> Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> include/linux/suspend.h | 1 +
> kernel/power/main.c | 7 +++++++
> kernel/power/user.c | 4 ++--
> 3 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/suspend.h b/include/linux/suspend.h
> index b02876f1ae38..7777931d88a5 100644
> --- a/include/linux/suspend.h
> +++ b/include/linux/suspend.h
> @@ -454,6 +454,7 @@ extern void pm_report_hw_sleep_time(u64 t);
> extern void pm_report_max_hw_sleep(u64 t);
> void pm_restrict_gfp_mask(void);
> void pm_restore_gfp_mask(void);
> +void pm_restore_gfp_mask_safe(void);
>
> #define pm_notifier(fn, pri) { \
> static struct notifier_block fn##_nb = \
> diff --git a/kernel/power/main.c b/kernel/power/main.c
> index 5f8c9e12eaec..e610a8c8b7ff 100644
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -36,6 +36,13 @@
> static unsigned int saved_gfp_count;
> static gfp_t saved_gfp_mask;
>
> +void pm_restore_gfp_mask_safe(void)
> +{
> + if (!saved_gfp_count)
> + return;
> + pm_restore_gfp_mask();
> +}
> +
> void pm_restore_gfp_mask(void)
> {
> WARN_ON(!mutex_is_locked(&system_transition_mutex));
> diff --git a/kernel/power/user.c b/kernel/power/user.c
> index 3e41544b99d5..41cff6a89a1c 100644
> --- a/kernel/power/user.c
> +++ b/kernel/power/user.c
> @@ -306,7 +306,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
> case SNAPSHOT_UNFREEZE:
> if (!data->frozen || data->ready)
> break;
> - pm_restore_gfp_mask();
> + pm_restore_gfp_mask_safe();
> free_basic_memory_bitmaps();
> data->free_bitmaps = false;
> thaw_processes();
> @@ -318,7 +318,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
> error = -EPERM;
> break;
> }
> - pm_restore_gfp_mask();
> + pm_restore_gfp_mask_safe();
> error = hibernation_snapshot(data->platform_support);
> if (!error) {
> error = put_user(in_suspend, (int __user *)arg);
> --
AFAICS, this patch doesn't depend on the rest of the series, so I can
apply it separately unless there is a problem with that.
However, for the other 2 patches in the series, I'd need some tags
(preferably Reviewed-by) from mm people.
Thanks!
* Re: [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path
2026-03-19 14:24 [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Youngjun Park
` (2 preceding siblings ...)
2026-03-19 14:24 ` [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path Youngjun Park
@ 2026-03-20 2:50 ` Andrew Morton
2026-03-20 9:49 ` YoungJun Park
3 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2026-03-20 2:50 UTC (permalink / raw)
To: Youngjun Park
Cc: rafael, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-mm, linux-pm
On Thu, 19 Mar 2026 23:24:01 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> Currently, in the uswsusp path, only the swap type value is retrieved at
> lookup time without holding a reference. If swapoff races after the type
> is acquired, subsequent slot allocations operate on a stale swap device.
>
> Additionally, grabbing and releasing the swap device reference on every
> slot allocation is inefficient across the entire hibernation swap path.
AI review has questions:
https://sashiko.dev/#/patchset/20260319142404.3683019-1-youngjun.park%40lge.com
* Re: [PATCH v5 3/3] PM: hibernate: fix spurious GFP mask WARNING in uswsusp path
2026-03-19 19:55 ` Rafael J. Wysocki
@ 2026-03-20 8:18 ` YoungJun Park
0 siblings, 0 replies; 8+ messages in thread
From: YoungJun Park @ 2026-03-20 8:18 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: akpm, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-mm, linux-pm
On Thu, Mar 19, 2026 at 08:55:43PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 19, 2026 at 3:24 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask()
> > stacking") introduced refcount-based GFP mask management that warns
> > when pm_restore_gfp_mask() is called with saved_gfp_count == 0:
> >
> > WARNING: kernel/power/main.c:44 at pm_restore_gfp_mask+0xd7/0xf0
> > CPU: 0 UID: 0 PID: 373 Comm: s2disk
> > Call Trace:
> > snapshot_ioctl+0x964/0xbd0
> > __x64_sys_ioctl+0x724/0x1320
> > ...
> >
> > The uswsusp path calls pm_restore_gfp_mask() defensively in
> > SNAPSHOT_CREATE_IMAGE and SNAPSHOT_UNFREEZE where the GFP mask may
> > or may not be restricted depending on context (first call vs retry,
> > hibernate vs resume). Before the stacking patch this was a silent
> > no-op; now it triggers a WARNING.
> >
> > Introduce pm_restore_gfp_mask_safe() that skips the call when
> > saved_gfp_count is 0. This is preferred over tracking the restrict
> > state in snapshot_ioctl, as incorrect tracking risks leaving the
> > GFP mask permanently restricted.
> >
> > Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > ---
> > include/linux/suspend.h | 1 +
> > kernel/power/main.c | 7 +++++++
> > kernel/power/user.c | 4 ++--
> > 3 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/suspend.h b/include/linux/suspend.h
> > index b02876f1ae38..7777931d88a5 100644
> > --- a/include/linux/suspend.h
> > +++ b/include/linux/suspend.h
> > @@ -454,6 +454,7 @@ extern void pm_report_hw_sleep_time(u64 t);
> > extern void pm_report_max_hw_sleep(u64 t);
> > void pm_restrict_gfp_mask(void);
> > void pm_restore_gfp_mask(void);
> > +void pm_restore_gfp_mask_safe(void);
> >
> > #define pm_notifier(fn, pri) { \
> > static struct notifier_block fn##_nb = \
> > diff --git a/kernel/power/main.c b/kernel/power/main.c
> > index 5f8c9e12eaec..e610a8c8b7ff 100644
> > --- a/kernel/power/main.c
> > +++ b/kernel/power/main.c
> > @@ -36,6 +36,13 @@
> > static unsigned int saved_gfp_count;
> > static gfp_t saved_gfp_mask;
> >
> > +void pm_restore_gfp_mask_safe(void)
> > +{
> > + if (!saved_gfp_count)
> > + return;
> > + pm_restore_gfp_mask();
> > +}
> > +
> > void pm_restore_gfp_mask(void)
> > {
> > WARN_ON(!mutex_is_locked(&system_transition_mutex));
> > diff --git a/kernel/power/user.c b/kernel/power/user.c
> > index 3e41544b99d5..41cff6a89a1c 100644
> > --- a/kernel/power/user.c
> > +++ b/kernel/power/user.c
> > @@ -306,7 +306,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
> > case SNAPSHOT_UNFREEZE:
> > if (!data->frozen || data->ready)
> > break;
> > - pm_restore_gfp_mask();
> > + pm_restore_gfp_mask_safe();
> > free_basic_memory_bitmaps();
> > data->free_bitmaps = false;
> > thaw_processes();
> > @@ -318,7 +318,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd,
> > error = -EPERM;
> > break;
> > }
> > - pm_restore_gfp_mask();
> > + pm_restore_gfp_mask_safe();
> > error = hibernation_snapshot(data->platform_support);
> > if (!error) {
> > error = put_user(in_suspend, (int __user *)arg);
> > --
>
> AFAICS, this patch doesn't depend on the rest of the series, so I can
> apply it separately unless there is a problem with that.
>
> However, for the other 2 patches in the series, I'd need some tags
> (preferably Reviewed-by) from mm people.
>
> Thanks!
Hi Rafael,
While double-checking the code based on Andrew’s AI-assisted review,
I noticed I missed one case.
If userspace issues SNAPSHOT_FREEZE and then closes the device,
snapshot_release() may call pm_restore_gfp_mask() without a matching
restriction, which reproduces the same WARN. So we should switch
snapshot_release() to pm_restore_gfp_mask_safe() as well.
Also, since the safe wrapper may return early when saved_gfp_count == 0,
the locking assertion would be skipped in that path. To preserve the
invariant, it is better to keep:
WARN_ON(!mutex_is_locked(&system_transition_mutex));
in the wrapper too.
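As a rough user-space model of what I have in mind (not the actual patch):
all model_* names and mask values are hypothetical, and the restrict/restore
bodies are only my simplified reading of the stacking logic. The boolean
model_transition_locked stands in for system_transition_mutex, and
model_warnings counts WARN_ON hits.

```c
/* User-space model only; model_* names and values are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

unsigned int model_allowed_mask = MODEL_GFP_IO | MODEL_GFP_FS;
unsigned int model_saved_mask;
unsigned int model_saved_count;
bool model_transition_locked;	/* stands in for system_transition_mutex */
unsigned int model_warnings;	/* number of WARN_ON hits so far */

static bool model_warn_on(bool cond)
{
	if (cond)
		model_warnings++;
	return cond;
}

/* Stacked restriction: only the first caller changes the mask. */
void model_restrict(void)
{
	model_warn_on(!model_transition_locked);
	if (model_saved_count++)
		return;
	model_saved_mask = model_allowed_mask;
	model_allowed_mask &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* Warns on an unbalanced call; only the last caller restores the mask. */
void model_restore(void)
{
	model_warn_on(!model_transition_locked);
	if (model_warn_on(!model_saved_count) || --model_saved_count)
		return;
	model_allowed_mask = model_saved_mask;
	model_saved_mask = 0;
}

/*
 * Safe wrapper: checks the lock first, so the assertion still fires on
 * the early-return path, then skips the call when nothing is restricted.
 */
void model_restore_safe(void)
{
	model_warn_on(!model_transition_locked);
	if (!model_saved_count)
		return;
	model_restore();
}
```

With the lock assertion placed before the early return, an unbalanced call
stays silent only when the caller holds the lock.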
Skipping the assertion there was intentional in v5, but after review I think keeping it is better.
I will update the patch and resend.
Best regards,
Youngjun Park
* Re: [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path
2026-03-20 2:50 ` [PATCH v5 0/3] Fix swapoff race and cleanup in hibernation swap path Andrew Morton
@ 2026-03-20 9:49 ` YoungJun Park
0 siblings, 0 replies; 8+ messages in thread
From: YoungJun Park @ 2026-03-20 9:49 UTC (permalink / raw)
To: Andrew Morton
Cc: rafael, chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
usama.arif, linux-mm, linux-pm
On Thu, Mar 19, 2026 at 07:50:32PM -0700, Andrew Morton wrote:
> On Thu, 19 Mar 2026 23:24:01 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > Currently, in the uswsusp path, only the swap type value is retrieved at
> > lookup time without holding a reference. If swapoff races after the type
> > is acquired, subsequent slot allocations operate on a stale swap device.
> >
> > Additionally, grabbing and releasing the swap device reference on every
> > slot allocation is inefficient across the entire hibernation swap path.
>
> AI review has questions:
> https://sashiko.dev/#/patchset/20260319142404.3683019-1-youngjun.park%40lge.com
Hi Andrew,
Thanks for sharing the AI review. Some comments are indeed good catches.
I believe the comments on patch 2 are not valid.
The comments on patch 3 are right.
Regarding patch 2, the AI review raised the following concerns:
> If this race occurs, swapoff clears SWP_WRITEOK and calls try_to_unuse()
> to scan and evict swap slots. Without the SWP_WRITEOK check here, it
> looks like uswsusp could successfully allocate slots while
> try_to_unuse() is actively scanning.
Right, but...
> If try_to_unuse() encounters these newly allocated slots, wouldn't it
> attempt to page them in and subsequently free them via
> folio_free_swap()?
I believe this concern is not valid.
swapoff calls try_to_unuse(), which scans swap_map[] entries:
(this logic differs slightly between kernel versions, but it is fundamentally the same)
...
for (i = prev + 1; i < si->max; i++) {
count = READ_ONCE(si->swap_map[i]);
if (count && swap_count(count) != SWAP_MAP_BAD)
break;
}
...
However, hibernation allocations marked with SWAP_HAS_CACHE are not
visible via this swap_map[] scanning logic, so they cannot be found
or reclaimed in that path.
As a result, swapoff keeps retrying while swap_usage_in_pages(si) is
non-zero, until a signal is delivered, rather than freeing those entries.
I will also verify this behavior at runtime to be absolutely sure.
For patch 3, I agree with the comment and have responded in Rafael’s
thread.
I will update the series shortly.
Thanks,
Youngjun Park