[RFC PATCH 0/4] vfs freeze/thaw on suspend/resume

linux-fsdevel.vger.kernel.org archive mirror

* [RFC PATCH 0/4] vfs freeze/thaw on suspend/resume
@ 2025-03-27 14:06 James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read James Bottomley
                   ` (3 more replies)
  0 siblings, 4 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-27 14:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

This sequence is posted as an RFC because it needs to be combined with
Luis' patch set:

https://lore.kernel.org/linux-fsdevel/20250326112220.1988619-1-mcgrof@kernel.org/

In particular I've done nothing to replace the kthread freezing in
filesystems.  I can say that this works flawlessly on 6.14 with my
limited test rig (I only have access to my laptop while at LSF/MM).
My test VM is ext4 root with a file on the root attached to a loop
device and mounted (to test nesting) while running a fio workload on
the upper ext4.

The rwsem rework is absolutely necessary: without it, hibernate fails
immediately because systemd-journald tries to record the kernel
messages and blocks in sb_start_write() in TASK_UNINTERRUPTIBLE,
which inhibits hibernation.

My goal in doing this is to be able to add a thaw_super() callback to
efivarfs and remove our deadlock-prone PM notifier.

Regards,

James

---

James Bottomley (4):
  locking/percpu-rwsem: add freezable alternative to down_read
  vfs: make sb_start_write freezable
  fs/super.c: introduce reverse superblock iterator and use it in
    emergency remount
  vfs: add filesystem freeze/thaw callbacks for power management

 fs/super.c                    | 109 ++++++++++++++++++++++++++++------
 include/linux/fs.h            |   8 ++-
 include/linux/percpu-rwsem.h  |  20 +++++--
 kernel/locking/percpu-rwsem.c |  13 ++--
 kernel/power/hibernate.c      |  12 ++++
 kernel/power/suspend.c        |   4 ++
 6 files changed, 137 insertions(+), 29 deletions(-)

-- 
2.43.0


* [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-03-27 14:06 [RFC PATCH 0/4] vfs freeze/thaw on suspend/resume James Bottomley
@ 2025-03-27 14:06 ` James Bottomley
  2025-03-31 19:51   ` James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 2/4] vfs: make sb_start_write freezable James Bottomley
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-27 14:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

Percpu-rwsems are used for superblock locking.  However, we know the
read percpu-rwsem we take for sb_start_write() on a frozen filesystem
must not inhibit the system from suspending or hibernating.  That
means it needs to wait with TASK_UNINTERRUPTIBLE | TASK_FREEZABLE.

Introduce a new percpu_down_read_freezable() that allows us to control
whether TASK_FREEZABLE is added to the wait flags.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

---

Since this is an RFC, added the percpu-rwsem maintainers for
information and guidance to check if we're on the right track or
whether they would prefer an alternative API.
---
 include/linux/percpu-rwsem.h  | 20 ++++++++++++++++----
 kernel/locking/percpu-rwsem.c | 13 ++++++++-----
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c012df33a9f0..a55fe709b832 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -42,9 +42,10 @@ is_static struct percpu_rw_semaphore name = {				\
 #define DEFINE_STATIC_PERCPU_RWSEM(name)	\
 	__DEFINE_PERCPU_RWSEM(name, static)
 
-extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool);
+extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool, bool);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_read_internal(struct percpu_rw_semaphore *sem,
+					     bool freezable)
 {
 	might_sleep();
 
@@ -62,7 +63,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		__percpu_down_read(sem, false); /* Unconditional memory barrier */
+		__percpu_down_read(sem, false, freezable); /* Unconditional memory barrier */
 	/*
 	 * The preempt_enable() prevents the compiler from
 	 * bleeding the critical section out.
@@ -70,6 +71,17 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	preempt_enable();
 }
 
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	percpu_down_read_internal(sem, false);
+}
+
+static inline void percpu_down_read_freezable(struct percpu_rw_semaphore *sem,
+					      bool freeze)
+{
+	percpu_down_read_internal(sem, freeze);
+}
+
 static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;
@@ -81,7 +93,7 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+		ret = __percpu_down_read(sem, true, false); /* Unconditional memory barrier */
 	preempt_enable();
 	/*
 	 * The barrier() from preempt_enable() prevents the compiler from
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 6083883c4fe0..890837b73476 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -138,7 +138,8 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 	return !reader; /* wake (readers until) 1 writer */
 }
 
-static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
+static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
+			      bool freeze)
 {
 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 	bool wait;
@@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	spin_unlock_irq(&sem->waiters.lock);
 
 	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE |
+				  (freeze ? TASK_FREEZABLE : 0));
 		if (!smp_load_acquire(&wq_entry.private))
 			break;
 		schedule();
@@ -164,7 +166,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	__set_current_state(TASK_RUNNING);
 }
 
-bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
+bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try,
+				bool freeze)
 {
 	if (__percpu_down_read_trylock(sem))
 		return true;
@@ -174,7 +177,7 @@ bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 
 	trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_READ);
 	preempt_enable();
-	percpu_rwsem_wait(sem, /* .reader = */ true);
+	percpu_rwsem_wait(sem, /* .reader = */ true, freeze);
 	preempt_disable();
 	trace_contention_end(sem, 0);
 
@@ -237,7 +240,7 @@ void __sched percpu_down_write(struct percpu_rw_semaphore *sem)
 	 */
 	if (!__percpu_down_write_trylock(sem)) {
 		trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_WRITE);
-		percpu_rwsem_wait(sem, /* .reader = */ false);
+		percpu_rwsem_wait(sem, /* .reader = */ false, false);
 		contended = true;
 	}
 
-- 
2.43.0
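
When a task state is composed conditionally, as in the percpu_rwsem_wait()
hunk above, the conditional needs its own parentheses: in C, `|` binds
tighter than `?:`, so `A | cond ? B : 0` groups as `(A | cond) ? B : 0`.
A minimal userspace sketch of the difference follows. The DEMO_* values and
function names are stand-ins invented for illustration, not the kernel's
definitions.

```c
#include <assert.h>

/* Stand-in values for illustration only; not the kernel's definitions. */
#define DEMO_UNINTERRUPTIBLE 0x0002u
#define DEMO_FREEZABLE       0x2000u

/*
 * How the compiler groups `DEMO_UNINTERRUPTIBLE | freeze ? DEMO_FREEZABLE : 0`
 * when no extra parentheses are written: the OR is evaluated first and its
 * (always non-zero) result selects DEMO_FREEZABLE.
 */
static unsigned int state_as_parsed(int freeze)
{
	return (DEMO_UNINTERRUPTIBLE | freeze) ? DEMO_FREEZABLE : 0;
}

/* Intended meaning: always uninterruptible, optionally freezable. */
static unsigned int state_parenthesized(int freeze)
{
	return DEMO_UNINTERRUPTIBLE | (freeze ? DEMO_FREEZABLE : 0);
}
```

With the first grouping the uninterruptible bit is lost and the freezable
bit is set regardless of the argument, so the explicit parentheses are what
make the flag optional.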



* [RFC PATCH 2/4] vfs: make sb_start_write freezable
  2025-03-27 14:06 [RFC PATCH 0/4] vfs freeze/thaw on suspend/resume James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read James Bottomley
@ 2025-03-27 14:06 ` James Bottomley
  2025-03-27 17:36   ` Jan Kara
  2025-03-27 14:06 ` [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
  3 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-27 14:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

If a write happens on a frozen filesystem, the writer waiting on
s_writers.rw_sem gets stuck in TASK_UNINTERRUPTIBLE and inhibits
suspending or hibernating the system.  Since we want to freeze
filesystems first, then tasks, we need this condition not to inhibit
suspend/hibernate, which means the wait has to have the TASK_FREEZABLE
flag as well.  Use the freezable version of percpu-rwsem to ensure this.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 include/linux/fs.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd84d1c3b8af..cbbb704eff74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
-	percpu_down_read(sb->s_writers.rw_sem + level - 1);
+	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
+				   level == SB_FREEZE_WRITE);
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
-- 
2.43.0



* [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount
  2025-03-27 14:06 [RFC PATCH 0/4] vfs freeze/thaw on suspend/resume James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read James Bottomley
  2025-03-27 14:06 ` [RFC PATCH 2/4] vfs: make sb_start_write freezable James Bottomley
@ 2025-03-27 14:06 ` James Bottomley
  2025-03-28 11:56   ` Christian Brauner
  2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
  3 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-27 14:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

Originally proposed by Amir as an extract from the android kernel:

https://lore.kernel.org/linux-fsdevel/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/

Since suspend/resume requires a reverse iterator, I'm dusting it off.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 fs/super.c | 48 +++++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 5a7db4a556e3..76785509d906 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -887,28 +887,38 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+#define ITERATE_SUPERS(f, rev)					\
+	struct super_block *sb, *p = NULL;			\
+								\
+	spin_lock(&sb_lock);					\
+								\
+	list_for_each_entry##rev(sb, &super_blocks, s_list) {	\
+		if (super_flags(sb, SB_DYING))			\
+			continue;				\
+		sb->s_count++;					\
+		spin_unlock(&sb_lock);				\
+								\
+		f(sb);						\
+								\
+		spin_lock(&sb_lock);				\
+		if (p)						\
+			__put_super(p);				\
+		p = sb;						\
+	}							\
+	if (p)							\
+		__put_super(p);					\
+	spin_unlock(&sb_lock);
+
 static void __iterate_supers(void (*f)(struct super_block *))
 {
-	struct super_block *sb, *p = NULL;
-
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (super_flags(sb, SB_DYING))
-			continue;
-		sb->s_count++;
-		spin_unlock(&sb_lock);
-
-		f(sb);
+	ITERATE_SUPERS(f,)
+}
 
-		spin_lock(&sb_lock);
-		if (p)
-			__put_super(p);
-		p = sb;
-	}
-	if (p)
-		__put_super(p);
-	spin_unlock(&sb_lock);
+static void __iterate_supers_rev(void (*f)(struct super_block *))
+{
+	ITERATE_SUPERS(f, _reverse)
 }
+
 /**
  *	iterate_supers - call function for all active superblocks
  *	@f: function to call
@@ -1132,7 +1142,7 @@ static void do_emergency_remount_callback(struct super_block *sb)
 
 static void do_emergency_remount(struct work_struct *work)
 {
-	__iterate_supers(do_emergency_remount_callback);
+	__iterate_supers_rev(do_emergency_remount_callback);
 	kfree(work);
 	printk("Emergency Remount complete\n");
 }
-- 
2.43.0
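
The ITERATE_SUPERS() macro above relies on token pasting: the `rev`
argument is either empty or `_reverse`, and `##` glues it onto the name of
the loop macro. A small userspace sketch of the same preprocessor pattern
follows, using array indices instead of the kernel's list helpers. All
names here (for_each_idx, VISIT_ALL, order_log and so on) are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Two loop macros whose names differ only by the _reverse suffix. */
#define for_each_idx(i, n)         for ((i) = 0; (i) < (n); (i)++)
#define for_each_idx_reverse(i, n) for ((i) = (n); (i)-- > 0; )

/*
 * Shared body.  `rev` is empty or _reverse and is pasted onto
 * for_each_idx, so one macro yields both traversal directions.
 */
#define VISIT_ALL(items, n, f, ctx, rev)		\
	size_t i;					\
	for_each_idx##rev(i, n)				\
		f((items)[i], ctx);

struct order_log {
	int seen[8];
	size_t len;
};

/* Callback: remember the order in which items were visited. */
static void record(int item, struct order_log *log)
{
	if (log->len < 8)
		log->seen[log->len++] = item;
}

static void visit_forward(const int *items, size_t n, struct order_log *log)
{
	VISIT_ALL(items, n, record, log,)
}

static void visit_reverse(const int *items, size_t n, struct order_log *log)
{
	VISIT_ALL(items, n, record, log, _reverse)
}
```

Calling visit_forward() and visit_reverse() on the same array records the
items in opposite orders while sharing a single loop body.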



* [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-27 14:06 [RFC PATCH 0/4] vfs freeze/thaw on suspend/resume James Bottomley
                   ` (2 preceding siblings ...)
  2025-03-27 14:06 ` [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount James Bottomley
@ 2025-03-27 14:06 ` James Bottomley
  2025-03-27 18:20   ` Jan Kara
                     ` (2 more replies)
  3 siblings, 3 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-27 14:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

Introduce a freeze function that iterates superblocks in reverse
order, freezing filesystems.  A filesystem is treated as freezable if
it has either an s_bdev or a freeze_super method.  So that this can be
used in efivarfs, whether the freeze is for hibernate is also passed
in via the new FREEZE_FOR_HIBERNATE flag.

Thawing is done in the opposite order to freezing (so superblock
traversal in regular order) and the whole thing is plumbed into power
management.  The original ksys_sync() is preserved, so the whole
freezing step is optional: if it fails we're no worse off than we are
today, and a failure doesn't inhibit suspend/hibernate.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h       |  5 ++++
 kernel/power/hibernate.c | 12 ++++++++
 kernel/power/suspend.c   |  4 +++
 4 files changed, 82 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 76785509d906..b4b0986414b0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev)
 	return sb;
 }
 
+/*
+ * Kernel freezing and thawing is only done in the power management
+ * subsystem and is thus single threaded (so we don't have to worry
+ * here about multiple calls to filesystems_freeze/thaw()).
+ */
+
+static int freeze_flags;
+
+static void filesystems_freeze_callback(struct super_block *sb)
+{
+	/* errors don't fail suspend so ignore them */
+	if (sb->s_op->freeze_super)
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
+				       | FREEZE_HOLDER_KERNEL
+				       | freeze_flags);
+	else if (sb->s_bdev)
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
+			     | freeze_flags);
+	else {
+		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
+		return;
+	}
+
+	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
+	sync_blockdev(sb->s_bdev);
+	pr_info("done.");
+}
+
+/**
+ * filesystems_freeze - freeze callback for power management
+ *
+ * Freeze all active filesystems (in reverse superblock order)
+ */
+void filesystems_freeze(bool for_hibernate)
+{
+	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
+	__iterate_supers_rev(filesystems_freeze_callback);
+}
+
+static void filesystems_thaw_callback(struct super_block *sb)
+{
+	if (sb->s_op->thaw_super)
+		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
+				     | FREEZE_HOLDER_KERNEL
+				     | freeze_flags);
+	else if (sb->s_bdev)
+		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
+			   | freeze_flags);
+}
+
+/**
+ * filesystems_thaw - thaw callback for power management
+ *
+ * Thaw all active filesystems (in forward superblock order)
+ */
+void filesystems_thaw(bool for_hibernate)
+{
+	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
+	__iterate_supers(filesystems_thaw_callback);
+}
+
 /**
  * fs_bdev_freeze - freeze owning filesystem of block device
  * @bdev: block device
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cbbb704eff74..de154e9379ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.
@@ -2285,6 +2286,7 @@ enum freeze_holder {
 	FREEZE_HOLDER_KERNEL	= (1U << 0),
 	FREEZE_HOLDER_USERSPACE	= (1U << 1),
 	FREEZE_MAY_NEST		= (1U << 2),
+	FREEZE_FOR_HIBERNATE	= (1U << 3),
 };
 
 struct super_operations {
@@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
 
 int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
 
+void filesystems_freeze(bool for_hibernate);
+void filesystems_thaw(bool for_hibernate);
+
 #endif /* _LINUX_FS_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 10a01af63a80..fc2106e6685a 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -778,7 +778,12 @@ int hibernate(void)
 
 	ksys_sync_helper();
 
+	pr_info("about to freeze filesystems\n");
+	filesystems_freeze(true);
+	pr_info("filesystem freeze done\n");
+
 	error = freeze_processes();
+	pr_info("process freeze done\n");
 	if (error)
 		goto Exit;
 
@@ -788,7 +793,9 @@ int hibernate(void)
 	if (error)
 		goto Thaw;
 
+	pr_info("About to create snapshot\n");
 	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
+	pr_info("snapshot done\n");
 	if (error || freezer_test_done)
 		goto Free_bitmaps;
 
@@ -842,6 +849,8 @@ int hibernate(void)
 	}
 	thaw_processes();
 
+	filesystems_thaw(true);
+
 	/* Don't bother checking whether freezer_test_done is true */
 	freezer_test_done = false;
  Exit:
@@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 
 	thaw_processes();
 
+	filesystems_thaw(true);
+
 exit:
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
 
@@ -1041,6 +1052,7 @@ static int software_resume(void)
 
 	error = load_image_and_restore();
 	thaw_processes();
+	filesystems_thaw(true);
  Finish:
 	pm_notifier_call_chain(PM_POST_RESTORE);
  Restore:
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 09f8397bae15..34cc5b0c408c 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state)
 static void suspend_finish(void)
 {
 	suspend_thaw_processes();
+	filesystems_thaw(false);
 	pm_notifier_call_chain(PM_POST_SUSPEND);
 	pm_restore_console();
 }
@@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state)
 		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
 		ksys_sync_helper();
 		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
+		filesystems_freeze(false);
 	}
 
 	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
@@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state)
 	pm_pr_dbg("Finishing wakeup.\n");
 	suspend_finish();
  Unlock:
+	if (sync_on_suspend_enabled)
+		filesystems_thaw(false);
 	mutex_unlock(&system_transition_mutex);
 	return error;
 }
-- 
2.43.0
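
To illustrate why the freeze walks superblocks in reverse while the thaw
walks them forward, here is a toy userspace model of the nested test setup
from the cover letter (a filesystem on a loop device backed by a file on
the root filesystem). It rests on one stated assumption, namely that a
filesystem can only be frozen cleanly while the filesystem holding its
backing file is still writable. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy superblock; array order stands in for creation order. */
struct nest_sb {
	int backing;	/* index of the fs holding our backing file, or -1 */
	bool frozen;
};

/* Freeze one filesystem; fails if its backing filesystem is already frozen. */
static bool nest_freeze_one(struct nest_sb *sbs, size_t i)
{
	int b = sbs[i].backing;

	if (b >= 0 && sbs[b].frozen)
		return false;	/* assumed: final flush cannot reach the backing file */
	sbs[i].frozen = true;
	return true;
}

/* Walk all filesystems in the given direction; return how many were frozen. */
static size_t nest_freeze_all(struct nest_sb *sbs, size_t n, bool reverse)
{
	size_t done = 0;

	for (size_t k = 0; k < n; k++) {
		size_t i = reverse ? n - 1 - k : k;

		if (nest_freeze_one(sbs, i))
			done++;
	}
	return done;
}
```

In this model, with the root at index 0 and the loop-backed filesystem at
index 1, the reverse walk freezes both, whereas the forward walk freezes
the root first and then cannot freeze the nested filesystem.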



* Re: [RFC PATCH 2/4] vfs: make sb_start_write freezable
  2025-03-27 14:06 ` [RFC PATCH 2/4] vfs: make sb_start_write freezable James Bottomley
@ 2025-03-27 17:36   ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-27 17:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Thu 27-03-25 10:06:11, James Bottomley wrote:
> If a write happens on a frozen filesystem, the s_writers.rw_sem gets
> stuck in TASK_UNINTERRUPTIBLE and inhibits suspending or hibernating
> the system.  Since we want to freeze filesystems first then tasks, we
> need this condition not to inhibit suspend/hibernate, which means the
> wait has to have the TASK_FREEZABLE flag as well.  Use the freezable
> version of percpu-rwsem to ensure this.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dd84d1c3b8af..cbbb704eff74 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read(sb->s_writers.rw_sem + level - 1);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> +				   level == SB_FREEZE_WRITE);
>  }
>  
>  static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
@ 2025-03-27 18:20   ` Jan Kara
  2025-03-28 14:21     ` James Bottomley
  2025-03-28 10:08   ` Christian Brauner
  2025-03-28 12:01   ` Christian Brauner
  2 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-03-27 18:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Thu 27-03-25 10:06:13, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I didn't
find anything using that flag. What do you plan to use it for? Does your
efivars use case need it? I find passing this detail about the caller
down to all filesystems a bit awkward. Isn't it possible to extract the
information "hibernate is ongoing" from the PM subsystem?

> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;

Frankly, the global variable to propagate flags is pretty ugly... If we
really have to propagate some context into the iterator callback, rather do
it explicitly like iterate_supers() does it.
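
A rough userspace sketch of what explicit propagation could look like: the
iterator takes an opaque argument and hands it to the callback, in the
style of iterate_supers(f, arg), so no file-scope variable is needed. The
demo_* types, flag values and function names are hypothetical stand-ins,
not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's types or values. */
enum demo_freeze_flags {
	DEMO_HOLDER_KERNEL = 1u << 0,
	DEMO_MAY_NEST      = 1u << 2,
	DEMO_FOR_HIBERNATE = 1u << 3,
};

struct demo_sb {
	const char *name;
	unsigned int frozen_with;	/* flags seen by the last freeze call */
};

/* Reverse iterator that forwards an opaque argument to the callback. */
static void demo_iterate_rev(struct demo_sb *sbs, size_t n,
			     void (*f)(struct demo_sb *, void *), void *arg)
{
	while (n-- > 0)
		f(&sbs[n], arg);
}

/* The callback reads its extra flags from arg, not from a global. */
static void demo_freeze_cb(struct demo_sb *sb, void *arg)
{
	unsigned int extra = *(unsigned int *)arg;

	sb->frozen_with = DEMO_MAY_NEST | DEMO_HOLDER_KERNEL | extra;
}

static void demo_filesystems_freeze(struct demo_sb *sbs, size_t n,
				    int for_hibernate)
{
	unsigned int extra = for_hibernate ? DEMO_FOR_HIBERNATE : 0;

	demo_iterate_rev(sbs, n, demo_freeze_cb, &extra);
}
```

Because the flags live on the caller's stack, this shape does not depend on
the callers being single threaded.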

> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);

Style nit - braces around above blocks would be IMHO appropriate.

> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);
> +	pr_info("done.");
> +}

Generally this callback is not safe because it can race with filesystem
unmount and calling ->freeze_super() after the filesystem's ->put_super()
was called may have all sorts of interesting effects (freeze_super() itself
will just bail with a warning, which is better but not great either).

The cleanest way I see to make the iteration safe is to grab an active sb
reference (like grab_super() does) for the duration of the freeze_super()
calls. Another possibility would be to grab the sb->s_umount rwsem exclusively,
as Luis does in his series, but that requires a bit of locking surgery,
and ->freeze_super() handlers make this particularly nasty these days, so I
think the active sb reference is going to be nicer.
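
As a rough illustration of the active-reference idea only, here is a
userspace sketch using C11 atomics. It models nothing more than "take a
reference unless the count has already dropped to zero, and skip the
filesystem otherwise". The real grab_super() path involves more than this,
and the ref_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy superblock with an active count; 0 means teardown has begun. */
struct ref_sb {
	atomic_int active;
	bool frozen;
};

/* Take a reference only if the count is still non-zero. */
static bool ref_get_active(struct ref_sb *sb)
{
	int old = atomic_load(&sb->active);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&sb->active, &old, old + 1))
			return true;
	}
	return false;
}

static void ref_put_active(struct ref_sb *sb)
{
	atomic_fetch_sub(&sb->active, 1);
}

/* Returns true if the filesystem was frozen, false if it was skipped. */
static bool ref_freeze_one(struct ref_sb *sb)
{
	if (!ref_get_active(sb))
		return false;	/* already being torn down: leave it alone */
	sb->frozen = true;	/* stand-in for the actual freeze call */
	ref_put_active(sb);
	return true;
}
```

A superblock whose count is already zero is skipped, so the freeze stand-in
never runs on a filesystem that has started teardown.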

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
  2025-03-27 18:20   ` Jan Kara
@ 2025-03-28 10:08   ` Christian Brauner
  2025-03-28 14:14     ` James Bottomley
  2025-03-28 12:01   ` Christian Brauner
  2 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 10:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h       |  5 ++++
>  kernel/power/hibernate.c | 12 ++++++++
>  kernel/power/suspend.c   |  4 +++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 76785509d906..b4b0986414b0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev)
>  	return sb;
>  }
>  
> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;

Ugh, please don't use a global flag for this.

> +
> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);
> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);
> +	pr_info("done.");
> +}
> +
> +/**
> + * filesystems_freeze - freeze callback for power management
> + *
> + * Freeze all active filesystems (in reverse superblock order)
> + */
> +void filesystems_freeze(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers_rev(filesystems_freeze_callback);
> +}
> +
> +static void filesystems_thaw_callback(struct super_block *sb)
> +{
> +	if (sb->s_op->thaw_super)
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> +				     | FREEZE_HOLDER_KERNEL
> +				     | freeze_flags);
> +	else if (sb->s_bdev)
> +		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			   | freeze_flags);
> +}
> +
> +/**
> + * filesystems_thaw - thaw callback for power management
> + *
> + * Thaw all active filesystems (in forward superblock order)
> + */
> +void filesystems_thaw(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers(filesystems_thaw_callback);

This doesn't work; I've explained in my reply to Luis why it doesn't
work and what the alternatives are:

A concurrent umount() can wipe the filesystem behind your back. So you
either need an active superblock reference or you need to communicate
that the superblock is locked through the new flag I proposed (naming
irrelevant for now).

> +}
> +
>  /**
>   * fs_bdev_freeze - freeze owning filesystem of block device
>   * @bdev: block device
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cbbb704eff74..de154e9379ec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2285,6 +2286,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_FOR_HIBERNATE	= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
>  
>  int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
>  
> +void filesystems_freeze(bool for_hibernate);
> +void filesystems_thaw(bool for_hibernate);
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..fc2106e6685a 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -778,7 +778,12 @@ int hibernate(void)
>  
>  	ksys_sync_helper();
>  
> +	pr_info("about to freeze filesystems\n");
> +	filesystems_freeze(true);
> +	pr_info("filesystem freeze done\n");
> +
>  	error = freeze_processes();
> +	pr_info("process freeze done\n");
>  	if (error)
>  		goto Exit;
>  
> @@ -788,7 +793,9 @@ int hibernate(void)
>  	if (error)
>  		goto Thaw;
>  
> +	pr_info("About to create snapshot\n");
>  	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
> +	pr_info("snapshot done\n");
>  	if (error || freezer_test_done)
>  		goto Free_bitmaps;
>  
> @@ -842,6 +849,8 @@ int hibernate(void)
>  	}
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  exit:
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
> @@ -1041,6 +1052,7 @@ static int software_resume(void)
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw(true);
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 09f8397bae15..34cc5b0c408c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw(false);
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state)
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
> +		filesystems_freeze(false);
>  	}
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
> @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	if (sync_on_suspend_enabled)
> +		filesystems_thaw(false);
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> -- 
> 2.43.0
> 

* Re: [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount
  2025-03-27 14:06 ` [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount James Bottomley
@ 2025-03-28 11:56   ` Christian Brauner
  2025-03-28 12:38     ` James Bottomley
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
  0 siblings, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 11:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Thu, Mar 27, 2025 at 10:06:12AM -0400, James Bottomley wrote:
> Originally proposed by Amir as an extract from the android kernel:
> 
> https://lore.kernel.org/linux-fsdevel/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> 
> Since suspend/resume requires a reverse iterator, I'm dusting it off.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  fs/super.c | 48 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 5a7db4a556e3..76785509d906 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -887,28 +887,38 @@ void drop_super_exclusive(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(drop_super_exclusive);
>  
> +#define ITERATE_SUPERS(f, rev)					\

I'm not fond of the macro magic here.
I've taken some of your patches and am massaging them.

* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
  2025-03-27 18:20   ` Jan Kara
  2025-03-28 10:08   ` Christian Brauner
@ 2025-03-28 12:01   ` Christian Brauner
  2025-03-28 14:40     ` James Bottomley
  2 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 12:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
> Introduce a freeze function, which iterates superblocks in reverse
> order freezing filesystems.  The indicator a filesystem is freezable
> is either possessing a s_bdev or a freeze_super method.  So this can
> be used in efivarfs, whether the freeze is for hibernate is also
> passed in via the new FREEZE_FOR_HIBERNATE flag.
> 
> Thawing is done opposite to freezing (so superblock traversal in
> regular order) and the whole thing is plumbed into power management.
> The original ksys_sync() is preserved so the whole freezing step is
> optional (if it fails we're no worse off than we are today) so it
> doesn't inhibit suspend/hibernate if there's a failure.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  fs/super.c               | 61 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h       |  5 ++++
>  kernel/power/hibernate.c | 12 ++++++++
>  kernel/power/suspend.c   |  4 +++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 76785509d906..b4b0986414b0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1461,6 +1461,67 @@ static struct super_block *get_bdev_super(struct block_device *bdev)
>  	return sb;
>  }
>  
> +/*
> + * Kernel freezing and thawing is only done in the power management
> + * subsystem and is thus single threaded (so we don't have to worry
> + * here about multiple calls to filesystems_freeze/thaw().
> + */
> +
> +static int freeze_flags;
> +
> +static void filesystems_freeze_callback(struct super_block *sb)
> +{
> +	/* errors don't fail suspend so ignore them */
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> +				       | FREEZE_HOLDER_KERNEL
> +				       | freeze_flags);
> +	else if (sb->s_bdev)
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			     | freeze_flags);
> +	else {
> +		pr_info("Ignoring filesystem %s\n", sb->s_type->name);
> +		return;
> +	}
> +
> +	pr_info("frozen %s, now syncing block ...", sb->s_type->name);
> +	sync_blockdev(sb->s_bdev);

Unnecessary, either the filesystem is responsible for this if it
provides its own ->freeze_super() or freeze_super() does it in
sync_filesystem.

> +	pr_info("done.");
> +}
> +
> +/**
> + * filesystems_freeze - freeze callback for power management
> + *
> + * Freeze all active filesystems (in reverse superblock order)
> + */
> +void filesystems_freeze(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers_rev(filesystems_freeze_callback);
> +}
> +
> +static void filesystems_thaw_callback(struct super_block *sb)
> +{
> +	if (sb->s_op->thaw_super)
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> +				     | FREEZE_HOLDER_KERNEL
> +				     | freeze_flags);
> +	else if (sb->s_bdev)
> +		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> +			   | freeze_flags);
> +}
> +
> +/**
> + * filesystems_thaw - thaw callback for power management
> + *
> + * Thaw all active filesystems (in forward superblock order)
> + */
> +void filesystems_thaw(bool for_hibernate)
> +{
> +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> +	__iterate_supers(filesystems_thaw_callback);
> +}
> +
>  /**
>   * fs_bdev_freeze - freeze owning filesystem of block device
>   * @bdev: block device
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cbbb704eff74..de154e9379ec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2272,6 +2272,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_FOR_HIBERNATE: set if freeze is from power management hibernate
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2285,6 +2286,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_FOR_HIBERNATE	= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -3919,4 +3921,7 @@ static inline bool vfs_empty_path(int dfd, const char __user *path)
>  
>  int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter);
>  
> +void filesystems_freeze(bool for_hibernate);
> +void filesystems_thaw(bool for_hibernate);
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 10a01af63a80..fc2106e6685a 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -778,7 +778,12 @@ int hibernate(void)
>  
>  	ksys_sync_helper();
>  
> +	pr_info("about to freeze filesystems\n");
> +	filesystems_freeze(true);
> +	pr_info("filesystem freeze done\n");
> +
>  	error = freeze_processes();
> +	pr_info("process freeze done\n");
>  	if (error)
>  		goto Exit;
>  
> @@ -788,7 +793,9 @@ int hibernate(void)
>  	if (error)
>  		goto Thaw;
>  
> +	pr_info("About to create snapshot\n");
>  	error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
> +	pr_info("snapshot done\n");
>  	if (error || freezer_test_done)
>  		goto Free_bitmaps;
>  
> @@ -842,6 +849,8 @@ int hibernate(void)
>  	}
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> @@ -939,6 +948,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  
>  	thaw_processes();
>  
> +	filesystems_thaw(true);
> +
>  exit:
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
> @@ -1041,6 +1052,7 @@ static int software_resume(void)
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw(true);
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 09f8397bae15..34cc5b0c408c 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -544,6 +544,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw(false);
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -581,6 +582,7 @@ static int enter_state(suspend_state_t state)
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
> +		filesystems_freeze(false);
>  	}
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
> @@ -603,6 +605,8 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	if (sync_on_suspend_enabled)
> +		filesystems_thaw(false);
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> -- 
> 2.43.0
> 

* Re: [RFC PATCH 3/4] fs/super.c: introduce reverse superblock iterator and use it in emergency remount
  2025-03-28 11:56   ` Christian Brauner
@ 2025-03-28 12:38     ` James Bottomley
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
  1 sibling, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-28 12:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, 2025-03-28 at 12:56 +0100, Christian Brauner wrote:
> On Thu, Mar 27, 2025 at 10:06:12AM -0400, James Bottomley wrote:
> > Originally proposed by Amir as an extract from the android kernel:
> > 
> > https://lore.kernel.org/linux-fsdevel/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> > 
> > Since suspend/resume requires a reverse iterator, I'm dusting it
> > off.
> > 
> > Signed-off-by: James Bottomley
> > <James.Bottomley@HansenPartnership.com>
> > ---
> >  fs/super.c | 48 +++++++++++++++++++++++++++++-------------------
> >  1 file changed, 29 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 5a7db4a556e3..76785509d906 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -887,28 +887,38 @@ void drop_super_exclusive(struct super_block
> > *sb)
> >  }
> >  EXPORT_SYMBOL(drop_super_exclusive);
> >  
> > +#define ITERATE_SUPERS(f, rev)					\
> 
> I'm not fond of the macro magic here.
> I've taken some of your patches and am massaging them.

I'm not either, so if you have an alternative, I'm all ears.  The
problem I had is that list_for_each_entry() and _reverse are designed
to take a code block, so it's very difficult to swap one for the other
without macroizing.  The internal logic is complex enough that I didn't
want to duplicate it.
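
For readers following along, one way to avoid both the macro and a
duplicated loop body is to choose the successor pointer from a direction
flag.  What follows is a minimal userspace sketch under that assumption,
not the code from either patch series.  struct node, node_add_tail(),
walk(), struct trace and record() are hypothetical names made up for
illustration, and the sb_lock and reference counting the real iterators
need are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on struct list_head. */
struct node {
	struct node *next, *prev;
	int id;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

/* Hypothetical helper: append n just before head, i.e. at the tail. */
static void node_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical helper: one loop body serves both directions. */
static void walk(struct node *head, bool reverse,
		 void (*f)(struct node *, void *), void *arg)
{
	struct node *pos = reverse ? head->prev : head->next;

	while (pos != head) {
		/* Pick the successor before calling f, in case f unlinks pos. */
		struct node *succ = reverse ? pos->prev : pos->next;

		f(pos, arg);
		pos = succ;
	}
}

/* Example callback: record the visit order in a caller-supplied buffer. */
struct trace {
	int ids[8];
	size_t n;
};

static void record(struct node *n, void *arg)
{
	struct trace *t = arg;

	if (t->n < 8)
		t->ids[t->n++] = n->id;
}
```

Calling walk(&head, true, record, &t) visits the entries tail first,
which is the order the freeze side wants; passing false gives the
forward order used for thaw.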

Regards,

James


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 10:08   ` Christian Brauner
@ 2025-03-28 14:14     ` James Bottomley
  2025-03-28 15:52       ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-28 14:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, 2025-03-28 at 11:08 +0100, Christian Brauner wrote:
> On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
[...]
> > +
> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> > +	pr_info("done.");
> > +}
> > +
> > +/**
> > + * filesystems_freeze - freeze callback for power management
> > + *
> > + * Freeze all active filesystems (in reverse superblock order)
> > + */
> > +void filesystems_freeze(bool for_hibernate)
> > +{
> > +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> > +	__iterate_supers_rev(filesystems_freeze_callback);
> > +}
> > +
> > +static void filesystems_thaw_callback(struct super_block *sb)
> > +{
> > +	if (sb->s_op->thaw_super)
> > +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> > +				     | FREEZE_HOLDER_KERNEL
> > +				     | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		thaw_super(sb,	FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			   | freeze_flags);
> > +}
> > +
> > +/**
> > + * filesystems_thaw - thaw callback for power management
> > + *
> > + * Thaw all active filesystems (in forward superblock order)
> > + */
> > +void filesystems_thaw(bool for_hibernate)
> > +{
> > +	freeze_flags = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;
> > +	__iterate_supers(filesystems_thaw_callback);
> 
> This doesn't work and I've explained in my reply to Luis how this
> doesn't work and what the alternatives are:
> 
> A concurrent umount() can wipe the filesystem behind your back. So
> you either need an active superblock reference or you need to
> communicate that the superblock is locked through the new flag I
> proposed (naming irrelevant for now).

Since this is a hybrid thread between power management and VFS, could I
just summarize what I think the various superblock locks are before
discussing the actual problem (important because the previous threads
always gave the impression of petering out for fear of vfs locking).

s_count: outermost of the superblock locks refcounting the superblock
structure itself, making no guarantee that any of the underlying
filesystem superblock structures are attached (i.e. kill_sb() may have
been called).  Taken by incrementing under the global sb_lock and
decremented using a put_super() variant.

s_active: an atomic reference counting the underlying filesystem
specific superblock structures.  If you hold s_active, kill_sb cannot
be called.  Acquired by atomic_inc_not_zero() with a possible failure
if it is zero and released by deactivate_super() and its variants.

s_umount: rwsem and innermost of the superblock locks. Used to protect
various operations from races.  Taken exclusively with down_write and
shared with down_read. Private functions internal to super.c wrap this
with grab_super and super_lock_shared/excl() wrappers.

The explicit freeze/thaw_super() functions require the s_umount rwsem
in down_write or exclusive mode and take it as the first step in their
operation.  Looking at the locking in fs_bdev_freeze/thaw() implies
that the super_operations freeze_super/thaw_super *don't* need this
taken (presumably they handle it internally).
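
To make the s_count versus s_active distinction concrete, here is a
small userspace model.  It is only a sketch: struct toy_sb and every
toy_*() function are invented for illustration, C11 atomics stand in
for the kernel's atomic_t, and s_umount is not modelled at all.  The
point it tries to show is that an active reference can only be taken
while the count is still non-zero, so a caller that gets one knows the
filesystem-specific state is still attached while its callback runs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct super_block; only the two counters are modelled. */
struct toy_sb {
	int s_count;		/* struct lifetime; sb_lock assumed held by caller */
	atomic_int s_active;	/* filesystem-specific state still attached */
	bool frozen;		/* set by the example callback below */
};

/* Like atomic_inc_not_zero(): take a reference unless the count is zero. */
static bool toy_grab_active(struct toy_sb *sb)
{
	int old = atomic_load(&sb->s_active);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&sb->s_active, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; true means this was the last one (teardown point). */
static bool toy_put_active(struct toy_sb *sb)
{
	return atomic_fetch_sub(&sb->s_active, 1) == 1;
}

/* Example callback standing in for a freeze operation. */
static void toy_mark_frozen(struct toy_sb *sb)
{
	sb->frozen = true;
}

/* Run f only while an active reference pins the filesystem state. */
static bool toy_call_if_active(struct toy_sb *sb, void (*f)(struct toy_sb *))
{
	if (!toy_grab_active(sb))
		return false;	/* already being torn down: skip it */
	f(sb);
	/* A real caller would have to handle this being the final put. */
	toy_put_active(sb);
	return true;
}
```

In this model a superblock whose s_active has already dropped to zero
is simply skipped, even though s_count may still keep the structure
itself on the list.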

Regards,

James

* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-27 18:20   ` Jan Kara
@ 2025-03-28 14:21     ` James Bottomley
  2025-03-28 14:36       ` James Bottomley
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-28 14:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Thu, 2025-03-27 at 19:20 +0100, Jan Kara wrote:
> On Thu 27-03-25 10:06:13, James Bottomley wrote:
> > Introduce a freeze function, which iterates superblocks in reverse
> > order freezing filesystems.  The indicator a filesystem is
> > freezable is either possessing a s_bdev or a freeze_super method. 
> > So this can be used in efivarfs, whether the freeze is for
> > hibernate is also passed in via the new FREEZE_FOR_HIBERNATE flag.
> > 
> > Thawing is done opposite to freezing (so superblock traversal in
> > regular order) and the whole thing is plumbed into power
> > management.
> > The original ksys_sync() is preserved so the whole freezing step is
> > optional (if it fails we're no worse off than we are today) so it
> > doesn't inhibit suspend/hibernate if there's a failure.
> > 
> > Signed-off-by: James Bottomley
> > <James.Bottomley@HansenPartnership.com>
> 
> OK, I've seen you are setting the new FREEZE_FOR_HIBERNATE flag but I
> didn't find anything using that flag. What do you plan to use it for?
> Does your efivars use case need it? I find passing down this detail
> about the caller down to all filesystems a bit awkward. Isn't it
> possible to extract the information "hibernate is ongoing" from PM
> subsystem?

That's right.  I'm happy to post my patch below, but it depends on Al
accepting the simple_next_child() proposal, so it doesn't apply to any
tree.

> > +/*
> > + * Kernel freezing and thawing is only done in the power
> > management
> > + * subsystem and is thus single threaded (so we don't have to
> > worry
> > + * here about multiple calls to filesystems_freeze/thaw().
> > + */
> > +
> > +static int freeze_flags;
> 
> Frankly, the global variable to propagate flags is pretty ugly... If
> we really have to propagate some context into the iterator callback,
> rather do it explicitly like iterate_supers() does it.

Christian said the same thing.  I can do it, but if you look in the
power management subsystem, it relies on single threading and has a lot
of global variables like this, so I thought of this as a
simplification.
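
For reference, the explicit variant being asked for could look roughly
like the userspace sketch below.  The flag values are copied from the
enum freeze_holder hunk quoted in this thread; struct toy_sb,
toy_iterate(), toy_freeze_cb() and toy_filesystems_freeze() are
hypothetical stand-ins, with an array walk in place of the real
superblock list.

```c
#include <stddef.h>

/* Flag values as in the enum freeze_holder hunk quoted in this thread. */
enum freeze_holder {
	FREEZE_HOLDER_KERNEL	= (1U << 0),
	FREEZE_HOLDER_USERSPACE	= (1U << 1),
	FREEZE_MAY_NEST		= (1U << 2),
	FREEZE_FOR_HIBERNATE	= (1U << 3),
};

/* Toy superblock: just remembers the flags it was last frozen with. */
struct toy_sb {
	unsigned int frozen_with;
};

/* The callback gets its context through arg, not a file-scope variable. */
static void toy_freeze_cb(struct toy_sb *sb, void *arg)
{
	const unsigned int *extra = arg;

	sb->frozen_with = FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL | *extra;
}

/* Hypothetical stand-in for an iterate_supers()-style walker. */
static void toy_iterate(struct toy_sb *sbs, size_t n,
			void (*f)(struct toy_sb *, void *), void *arg)
{
	for (size_t i = 0; i < n; i++)
		f(&sbs[i], arg);
}

static void toy_filesystems_freeze(struct toy_sb *sbs, size_t n,
				   int for_hibernate)
{
	/* The context lives on the caller's stack for the whole walk. */
	unsigned int extra = for_hibernate ? FREEZE_FOR_HIBERNATE : 0;

	toy_iterate(sbs, n, toy_freeze_cb, &extra);
}
```

The only state shared between the caller and the callback is the
pointer handed to the walker, so nothing here depends on the power
management path being single threaded.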

> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> 
> Style nit - braces around above blocks would be IMHO appropriate.
> 
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> > +	pr_info("done.");
> > +}
> 
> Generally this callback is not safe because it can race with
> filesystem unmount and calling ->freeze_super() after the
> filesystem's ->put_super() was called may have all sorts of
> interesting effects (freeze_super() itself will just bail with a
> warning, which is better but not great either).
> 
> The cleanest way I see how to make the iteration safe is to grab
> active sb reference (like grab_super() does it) for the duration of
> freeze_super() calls. Another possibility would be to grab sb-
> >s_umount rwsem exclusively as Luis does it in his series but that
> requires a bit of locking surgery and ->freeze_super() handlers make
> this particularly nasty these days so I think active sb reference is
> going to be nicer these days.

Before getting into the how of this, could you just confirm that my
understanding of what the various locks do:

https://lore.kernel.org/cd5c3d8aab9c5fb37fa018cb3302ecf7d2bdb140.camel@HansenPartnership.com

is correct?

Regards,

James


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 14:21     ` James Bottomley
@ 2025-03-28 14:36       ` James Bottomley
  0 siblings, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-28 14:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Fri, 2025-03-28 at 10:21 -0400, James Bottomley wrote:
> [...]
> That's right.  I'm happy to post my patch below, but it depends on Al
> accepting the simple_next_child() proposal, so it doesn't apply to
> any
> tree.

And here's the patch I forgot to attach.

Regards,

James

---

diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h
index ac6a1dd0a6a5..f913b6824289 100644
--- a/fs/efivarfs/internal.h
+++ b/fs/efivarfs/internal.h
@@ -17,7 +17,6 @@ struct efivarfs_fs_info {
 	struct efivarfs_mount_opts mount_opts;
 	struct super_block *sb;
 	struct notifier_block nb;
-	struct notifier_block pm_nb;
 };
 
 struct efi_variable {
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 7d47f8d7ad1d..3398ec5c60b3 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -119,12 +119,15 @@ static int efivarfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return 0;
 }
+
+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who);
 static const struct super_operations efivarfs_ops = {
 	.statfs = efivarfs_statfs,
 	.drop_inode = generic_delete_inode,
 	.alloc_inode = efivarfs_alloc_inode,
 	.free_inode = efivarfs_free_inode,
 	.show_options = efivarfs_show_options,
+	.thaw_super = efivarfs_thaw,
 };
 
 /*
@@ -367,8 +370,6 @@ static int efivarfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (err)
 		return err;
 
-	register_pm_notifier(&sfi->pm_nb);
-
 	return efivar_init(efivarfs_callback, sb, true);
 }
 
@@ -432,24 +433,17 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
-static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
-			      void *ptr)
+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who)
 {
-	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
-						    pm_nb);
 	static bool rescan_done = true;
-	struct dentry *parent = sfi->sb->s_root;
+	struct dentry *parent = sb->s_root;
 	struct dentry *child = NULL;
 
-	if (action == PM_HIBERNATION_PREPARE) {
-		rescan_done = false;
-		return NOTIFY_OK;
-	} else if (action != PM_POST_HIBERNATION) {
-		return NOTIFY_DONE;
-	}
+	if ((who & FREEZE_FOR_HIBERNATE) == 0)
+		return 0;
 
 	if (rescan_done)
-		return NOTIFY_DONE;
+		return 0;
 
 	pr_info("efivarfs: resyncing variable state\n");
 
@@ -488,9 +482,9 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 	 * then loop over variables, creating them if there's no matching
 	 * dentry
 	 */
-	efivar_init(efivarfs_check_missing, sfi->sb, false);
+	efivar_init(efivarfs_check_missing, sb, false);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
 static int efivarfs_init_fs_context(struct fs_context *fc)
@@ -510,9 +504,6 @@ static int efivarfs_init_fs_context(struct fs_context *fc)
 	fc->s_fs_info = sfi;
 	fc->ops = &efivarfs_context_ops;
 
-	sfi->pm_nb.notifier_call = efivarfs_pm_notify;
-	sfi->pm_nb.priority = 0;
-
 	return 0;
 }
 
@@ -521,7 +512,6 @@ static void efivarfs_kill_sb(struct super_block *sb)
 	struct efivarfs_fs_info *sfi = sb->s_fs_info;
 
 	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb);
-	unregister_pm_notifier(&sfi->pm_nb);
 	kill_litter_super(sb);
 
 	kfree(sfi);


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 12:01   ` Christian Brauner
@ 2025-03-28 14:40     ` James Bottomley
  0 siblings, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-28 14:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, 2025-03-28 at 13:01 +0100, Christian Brauner wrote:
> On Thu, Mar 27, 2025 at 10:06:13AM -0400, James Bottomley wrote:
[...]
> > diff --git a/fs/super.c b/fs/super.c
> > index 76785509d906..b4b0986414b0 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1461,6 +1461,67 @@ static struct super_block
> > *get_bdev_super(struct block_device *bdev)
> >  	return sb;
> >  }
> >  
> > +/*
> > + * Kernel freezing and thawing is only done in the power
> > management
> > + * subsystem and is thus single threaded (so we don't have to
> > worry
> > + * here about multiple calls to filesystems_freeze/thaw().
> > + */
> > +
> > +static int freeze_flags;
> > +
> > +static void filesystems_freeze_callback(struct super_block *sb)
> > +{
> > +	/* errors don't fail suspend so ignore them */
> > +	if (sb->s_op->freeze_super)
> > +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST
> > +				       | FREEZE_HOLDER_KERNEL
> > +				       | freeze_flags);
> > +	else if (sb->s_bdev)
> > +		freeze_super(sb, FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL
> > +			     | freeze_flags);
> > +	else {
> > +		pr_info("Ignoring filesystem %s\n", sb->s_type-
> > >name);
> > +		return;
> > +	}
> > +
> > +	pr_info("frozen %s, now syncing block ...", sb->s_type-
> > >name);
> > +	sync_blockdev(sb->s_bdev);
> 
> Unnecessary, either the filesystem is responsible for this if it
> provides its own ->freeze_super() or freeze_super() does it in
> sync_filesystem.

I simply copied it from super.c:fs_bdev_freeze(), so is the
sync_blockdev() in there unnecessary as well?

Regards,

James


* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 14:14     ` James Bottomley
@ 2025-03-28 15:52       ` Christian Brauner
  2025-03-28 16:15         ` James Bottomley
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 15:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

> Since this is a hybrid thread between power management and VFS, could I
> just summarize what I think the various superblock locks are before
> discussing the actual problem (important because the previous threads
> always gave the impression of petering out for fear of vfs locking).
> 
> s_count: outermost of the superblock locks refcounting the superblock
> structure itself, making no guarantee that any of the underlying
> filesystem superblock structures are attached (i.e. kill_sb() may have
> been called).  Taken by incrementing under the global sb_lock and
> decremented using a put_super() variant.

and protects the presence of the superblock on the global super lists.

> 
> s_active: an atomic reference counting the underlying filesystem
> specific superblock structures.  If you hold s_active, kill_sb cannot
> be called.  Acquired by atomic_inc_not_zero() with a possible failure
> if it is zero and released by deactivate_super() and its variants.

or deactivate_locked_super() depending on whether s_umount is held or
not.

> 
> s_umount: rwsem and innermost of the superblock locks. Used to protect

No, it's not innermost. sb_lock is a spinlock and obviously doesn't
nest with the semaphore. s_umount is almost always the outermost lock
for what we're discussing here. It is even the outermost lock relative
to most block device locks.

It's also intimately tied into mount code and has implications for the
dcache and icache. That's all orthogonal to this thread.

> various operations from races.  Taken exclusively with down_write and
> shared with down_read. Private functions internal to super.c wrap this
> with grab_super and super_lock_shared/excl() wrappers.

See also the Documentation/filesystems/lock I added.

> 
> The explicit freeze/thaw_super() functions require the s_umount rwsem
> in down_write or exclusive mode and take it as the first step in their
> operation.  Looking at the locking in fs_bdev_freeze/thaw() implies
> that the super_operations freeze_super/thaw_super *don't* need this
> taken (presumably they handle it internally).

Block device locking cannot acquire the s_umount as that would cause
lock inversion with the block device open_mutex. The locking scheme
using sb_lock and the holder mutex allows safely acquiring the
superblock. It's orthogonal to what you're doing though.

* [PATCH 0/6] Extend freeze support to suspend and hibernate
  2025-03-28 11:56   ` Christian Brauner
  2025-03-28 12:38     ` James Bottomley
@ 2025-03-28 16:15     ` Christian Brauner
  2025-03-28 16:15       ` [PATCH 1/6] super: remove pointless s_root checks Christian Brauner
                         ` (6 more replies)
  1 sibling, 7 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Add the necessary infrastructure changes to support freezing for suspend
and hibernate.

Just got back from LSFMM, so there is a jetlag-increased possibility of
bugs. This should be all that's needed to wire up power (minus the prep
patches).

---
Christian Brauner (6):
      super: remove pointless s_root checks
      super: simplify user_get_super()
      super: skip dying superblocks early
      super: use a common iterator (Part 1)
      super: use common iterator (Part 2)
      super: add filesystem freezing helpers for suspend and hibernate

 fs/super.c         | 199 ++++++++++++++++++++++++++++++++---------------------
 include/linux/fs.h |   4 +-
 2 files changed, 125 insertions(+), 78 deletions(-)
---
base-commit: acb4f33713b9f6cadb6143f211714c343465411c
change-id: 20250328-work-freeze-0a446869cd62


* [PATCH 1/6] super: remove pointless s_root checks
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-28 16:15       ` [PATCH 2/6] super: simplify user_get_super() Christian Brauner
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

The locking guarantees that the superblock is alive and sb->s_root is
still set. Remove the pointless check.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 97a17f9d9023..dc14f4bf73a6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -930,8 +930,7 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 
 		locked = super_lock_shared(sb);
 		if (locked) {
-			if (sb->s_root)
-				f(sb, arg);
+			f(sb, arg);
 			super_unlock_shared(sb);
 		}
 
@@ -967,11 +966,8 @@ void iterate_supers_type(struct file_system_type *type,
 		spin_unlock(&sb_lock);
 
 		locked = super_lock_shared(sb);
-		if (locked) {
-			if (sb->s_root)
-				f(sb, arg);
-			super_unlock_shared(sb);
-		}
+		if (locked)
+			f(sb, arg);
 
 		spin_lock(&sb_lock);
 		if (p)
@@ -991,18 +987,15 @@ struct super_block *user_get_super(dev_t dev, bool excl)
 
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_dev ==  dev) {
+		if (sb->s_dev == dev) {
 			bool locked;
 
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 			/* still alive? */
 			locked = super_lock(sb, excl);
-			if (locked) {
-				if (sb->s_root)
-					return sb;
-				super_unlock(sb, excl);
-			}
+			if (locked)
+				return sb; /* caller will drop */
 			/* nope, got unmounted */
 			spin_lock(&sb_lock);
 			__put_super(sb);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 2/6] super: simplify user_get_super()
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
  2025-03-28 16:15       ` [PATCH 1/6] super: remove pointless s_root checks Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-28 16:15       ` [PATCH 3/6] super: skip dying superblocks early Christian Brauner
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Make it easier to read and remove one level of indentation.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index dc14f4bf73a6..b1acfc38ba0c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -987,20 +987,21 @@ struct super_block *user_get_super(dev_t dev, bool excl)
 
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_dev == dev) {
-			bool locked;
-
-			sb->s_count++;
-			spin_unlock(&sb_lock);
-			/* still alive? */
-			locked = super_lock(sb, excl);
-			if (locked)
-				return sb; /* caller will drop */
-			/* nope, got unmounted */
-			spin_lock(&sb_lock);
-			__put_super(sb);
-			break;
-		}
+		bool locked;
+
+		if (sb->s_dev != dev)
+			continue;
+
+		sb->s_count++;
+		spin_unlock(&sb_lock);
+
+		locked = super_lock(sb, excl);
+		if (locked)
+			return sb;
+
+		spin_lock(&sb_lock);
+		__put_super(sb);
+		break;
 	}
 	spin_unlock(&sb_lock);
 	return NULL;

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 15:52       ` Christian Brauner
@ 2025-03-28 16:15         ` James Bottomley
  2025-03-29  8:23           ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-28 16:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote:
[...]
> 
> > various operations from races.  Taken exclusively with down_write
> > and shared with down_read. Private functions internal to super.c
> > wrap this with grab_super and super_lock_shared/excl() wrappers.
> 
> See also the Documentation/filesystems/lock I added.

you mean locking.rst which covers s_umount?  It would be nice to add
the others as well.

> > The explicit freeze/thaw_super() functions require the s_umount
> > rwsem in down_write or exclusive mode and take it as the first step
> > in their operation.  Looking at the locking in
> > fs_bdev_freeze/thaw() implies that the super_operations
> > freeze_super/thaw_super *don't* need this taken (presumably they
> > handle it internally).
> 
> Block device locking cannot acquire the s_umount as that would cause
> lock inversion with the block device open_mutex. The locking scheme
> using sb_lock and the holder mutex allow safely acquiring the
> superblock. It's orthogonal to what you're doing though.

OK, but based on the above and the fact that the code has to call
either the super op freeze/thaw_super or the global call, I think this
can be handled in the callback with something like the following,
rather than trying to thread an exclusive s_umount through:

static void filesystems_thaw_callback(struct super_block *sb)
{
	if (unlikely(!atomic_inc_not_zero(&sb->s_active)))
		return;

	if (sb->s_op->thaw_super)
		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
				     | FREEZE_HOLDER_KERNEL
				     | freeze_flags);
	else if (sb->s_bdev)
		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
			   | freeze_flags);

	deactivate_super(sb);
}


Regards,

James



^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH 3/6] super: skip dying superblocks early
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
  2025-03-28 16:15       ` [PATCH 1/6] super: remove pointless s_root checks Christian Brauner
  2025-03-28 16:15       ` [PATCH 2/6] super: simplify user_get_super() Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-28 16:15       ` [PATCH 4/6] super: use a common iterator (Part 1) Christian Brauner
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Make all iterators uniform by performing an early check whether the
superblock is dying.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index b1acfc38ba0c..c67ea3cdda41 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -925,6 +925,9 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 	list_for_each_entry(sb, &super_blocks, s_list) {
 		bool locked;
 
+		if (super_flags(sb, SB_DYING))
+			continue;
+
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
@@ -962,6 +965,9 @@ void iterate_supers_type(struct file_system_type *type,
 	hlist_for_each_entry(sb, &type->fs_supers, s_instances) {
 		bool locked;
 
+		if (super_flags(sb, SB_DYING))
+			continue;
+
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 4/6] super: use a common iterator (Part 1)
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
                         ` (2 preceding siblings ...)
  2025-03-28 16:15       ` [PATCH 3/6] super: skip dying superblocks early Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-28 16:15       ` [PATCH 5/6] super: use common iterator (Part 2) Christian Brauner
                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Use a common iterator for all callbacks.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 67 +++++++++++-------------------------------------------
 include/linux/fs.h |  6 ++++-
 2 files changed, 18 insertions(+), 55 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index c67ea3cdda41..0dd208804a74 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -887,37 +887,7 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
-static void __iterate_supers(void (*f)(struct super_block *))
-{
-	struct super_block *sb, *p = NULL;
-
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (super_flags(sb, SB_DYING))
-			continue;
-		sb->s_count++;
-		spin_unlock(&sb_lock);
-
-		f(sb);
-
-		spin_lock(&sb_lock);
-		if (p)
-			__put_super(p);
-		p = sb;
-	}
-	if (p)
-		__put_super(p);
-	spin_unlock(&sb_lock);
-}
-/**
- *	iterate_supers - call function for all active superblocks
- *	@f: function to call
- *	@arg: argument to pass to it
- *
- *	Scans the superblock list and calls given function, passing it
- *	locked superblock and given argument.
- */
-void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl)
 {
 	struct super_block *sb, *p = NULL;
 
@@ -927,14 +897,13 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 
 		if (super_flags(sb, SB_DYING))
 			continue;
-
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		locked = super_lock_shared(sb);
+		locked = super_lock(sb, excl);
 		if (locked) {
 			f(sb, arg);
-			super_unlock_shared(sb);
+			super_unlock(sb, excl);
 		}
 
 		spin_lock(&sb_lock);
@@ -1111,11 +1080,9 @@ int reconfigure_super(struct fs_context *fc)
 	return retval;
 }
 
-static void do_emergency_remount_callback(struct super_block *sb)
+static void do_emergency_remount_callback(struct super_block *sb, void *unused)
 {
-	bool locked = super_lock_excl(sb);
-
-	if (locked && sb->s_root && sb->s_bdev && !sb_rdonly(sb)) {
+	if (sb->s_bdev && !sb_rdonly(sb)) {
 		struct fs_context *fc;
 
 		fc = fs_context_for_reconfigure(sb->s_root,
@@ -1126,13 +1093,11 @@ static void do_emergency_remount_callback(struct super_block *sb)
 			put_fs_context(fc);
 		}
 	}
-	if (locked)
-		super_unlock_excl(sb);
 }
 
 static void do_emergency_remount(struct work_struct *work)
 {
-	__iterate_supers(do_emergency_remount_callback);
+	__iterate_supers(do_emergency_remount_callback, NULL, true);
 	kfree(work);
 	printk("Emergency Remount complete\n");
 }
@@ -1148,24 +1113,18 @@ void emergency_remount(void)
 	}
 }
 
-static void do_thaw_all_callback(struct super_block *sb)
+static void do_thaw_all_callback(struct super_block *sb, void *unused)
 {
-	bool locked = super_lock_excl(sb);
-
-	if (locked && sb->s_root) {
-		if (IS_ENABLED(CONFIG_BLOCK))
-			while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
-				pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
-		thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
-		return;
-	}
-	if (locked)
-		super_unlock_excl(sb);
+	if (IS_ENABLED(CONFIG_BLOCK))
+		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
+			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
+	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+	return;
 }
 
 static void do_thaw_all(struct work_struct *work)
 {
-	__iterate_supers(do_thaw_all_callback);
+	__iterate_supers(do_thaw_all_callback, NULL, true);
 	kfree(work);
 	printk(KERN_WARNING "Emergency Thaw complete\n");
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..0351500b71d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3515,7 +3515,11 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl);
+static inline void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+{
+	__iterate_supers(f, arg, false);
+}
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 5/6] super: use common iterator (Part 2)
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
                         ` (3 preceding siblings ...)
  2025-03-28 16:15       ` [PATCH 4/6] super: use a common iterator (Part 1) Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-28 18:58         ` James Bottomley
  2025-03-28 16:15       ` [PATCH 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
  6 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Use a common iterator for all callbacks. We could go for something even
more elaborate (advance step-by-step similar to iov_iter) but I really
don't think this is warranted.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 76 +++++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  6 +----
 2 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0dd208804a74..58c95210e66c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -887,24 +887,71 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
-void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl)
+enum super_iter_flags_t {
+	SUPER_ITER_EXCL		= (1U << 0),
+	SUPER_ITER_GRAB		= (1U << 1) | SUPER_ITER_EXCL,
+	SUPER_ITER_REVERSE	= (1U << 2),
+};
+
+static inline struct super_block *first_super(enum super_iter_flags_t flags)
+{
+	if (flags & SUPER_ITER_REVERSE)
+		return list_last_entry(&super_blocks, struct super_block, s_list);
+	return list_first_entry(&super_blocks, struct super_block, s_list);
+}
+
+static inline struct super_block *next_super(struct super_block *sb,
+					     enum super_iter_flags_t flags)
+{
+	if (flags & SUPER_ITER_REVERSE)
+		return list_prev_entry(sb, s_list);
+	return list_next_entry(sb, s_list);
+}
+
+static inline void super_cb_locked(struct super_block *sb,
+				   void (*f)(struct super_block *, void *),
+				   void *arg, bool excl)
+{
+        if (super_lock(sb, excl)) {
+                f(sb, arg);
+                super_unlock(sb, excl);
+        }
+}
+
+static inline void super_cb_grabbed(struct super_block *sb,
+				    void (*f)(struct super_block *, void *),
+				    void *arg)
+{
+	if (super_lock_excl(sb)) {
+		bool active = atomic_inc_not_zero(&sb->s_active);
+		super_unlock_excl(sb);
+		if (active)
+			f(sb, arg);
+		deactivate_super(sb);
+	}
+}
+
+#define invalid_super list_entry_is_head
+
+static void __iterate_supers(void (*f)(struct super_block *, void *), void *arg,
+			     enum super_iter_flags_t flags)
 {
 	struct super_block *sb, *p = NULL;
+	bool excl = flags & SUPER_ITER_EXCL;
 
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		bool locked;
+	guard(spinlock)(&sb_lock);
 
+	for (sb = first_super(flags); !invalid_super(sb, &super_blocks, s_list);
+	     sb = next_super(sb, flags)) {
 		if (super_flags(sb, SB_DYING))
 			continue;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		locked = super_lock(sb, excl);
-		if (locked) {
-			f(sb, arg);
-			super_unlock(sb, excl);
-		}
+                if (flags & SUPER_ITER_GRAB)
+                        super_cb_grabbed(sb, f, arg);
+                else
+                        super_cb_locked(sb, f, arg, excl);
 
 		spin_lock(&sb_lock);
 		if (p)
@@ -913,7 +960,11 @@ void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool e
 	}
 	if (p)
 		__put_super(p);
-	spin_unlock(&sb_lock);
+}
+
+void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+{
+	__iterate_supers(f, arg, 0);
 }
 
 /**
@@ -1097,7 +1148,8 @@ static void do_emergency_remount_callback(struct super_block *sb, void *unused)
 
 static void do_emergency_remount(struct work_struct *work)
 {
-	__iterate_supers(do_emergency_remount_callback, NULL, true);
+	__iterate_supers(do_emergency_remount_callback, NULL,
+			 SUPER_ITER_EXCL | SUPER_ITER_REVERSE);
 	kfree(work);
 	printk("Emergency Remount complete\n");
 }
@@ -1124,7 +1176,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
 
 static void do_thaw_all(struct work_struct *work)
 {
-	__iterate_supers(do_thaw_all_callback, NULL, true);
+	__iterate_supers(do_thaw_all_callback, NULL, SUPER_ITER_EXCL);
 	kfree(work);
 	printk(KERN_WARNING "Emergency Thaw complete\n");
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0351500b71d2..c475fa874055 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3515,11 +3515,7 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl);
-static inline void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
-{
-	__iterate_supers(f, arg, false);
-}
+extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH 6/6] super: add filesystem freezing helpers for suspend and hibernate
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
                         ` (4 preceding siblings ...)
  2025-03-28 16:15       ` [PATCH 5/6] super: use common iterator (Part 2) Christian Brauner
@ 2025-03-28 16:15       ` Christian Brauner
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
  6 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-28 16:15 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Allow the power subsystem to support filesystem freeze for
suspend and hibernate.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 34 ++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 36 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 58c95210e66c..a2942b21d661 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1197,6 +1197,40 @@ void emergency_thaw_all(void)
 	}
 }
 
+static void filesystems_freeze_callback(struct super_block *sb, void *flagsp)
+{
+	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
+		return;
+
+	if (sb->s_op->freeze_super)
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+	else
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+}
+
+void filesystems_freeze(bool hibernate)
+{
+	__iterate_supers(filesystems_freeze_callback, NULL,
+			 SUPER_ITER_GRAB | SUPER_ITER_REVERSE);
+}
+
+static void filesystems_thaw_callback(struct super_block *sb, void *flagsp)
+{
+	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
+		return;
+
+	if (sb->s_op->thaw_super)
+		sb->s_op->thaw_super(sb, FREEZE_HOLDER_KERNEL);
+	else
+		thaw_super(sb, FREEZE_HOLDER_KERNEL);
+}
+
+void filesystems_thaw(bool hibernate)
+{
+	__iterate_supers(filesystems_thaw_callback, NULL,
+			 SUPER_ITER_GRAB | SUPER_ITER_REVERSE);
+}
+
 static DEFINE_IDA(unnamed_dev_ida);
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c475fa874055..29bd28491eff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3518,6 +3518,8 @@ extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
+void filesystems_freeze(bool hibernate);
+void filesystems_thaw(bool hibernate);
 
 extern int dcache_dir_open(struct inode *, struct file *);
 extern int dcache_dir_close(struct inode *, struct file *);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH 5/6] super: use common iterator (Part 2)
  2025-03-28 16:15       ` [PATCH 5/6] super: use common iterator (Part 2) Christian Brauner
@ 2025-03-28 18:58         ` James Bottomley
  2025-03-29  7:34           ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-28 18:58 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack
  Cc: linux-kernel, mcgrof, hch, david, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Fri, 2025-03-28 at 17:15 +0100, Christian Brauner wrote:
[...]
> +static inline void super_cb_grabbed(struct super_block *sb,
> +				    void (*f)(struct super_block *,
> void *),
> +				    void *arg)
> +{
> +	if (super_lock_excl(sb)) {
> +		bool active = atomic_inc_not_zero(&sb->s_active);
> +		super_unlock_excl(sb);
> +		if (active)
> +			f(sb, arg);
> +		deactivate_super(sb);

I don't think this can be right: if we fail to increment s_active
because it's zero, we shouldn't call deactivate_super(), should we?

Regards,

James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 5/6] super: use common iterator (Part 2)
  2025-03-28 18:58         ` James Bottomley
@ 2025-03-29  7:34           ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  7:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, jack, linux-kernel, mcgrof, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, Mar 28, 2025 at 02:58:29PM -0400, James Bottomley wrote:
> On Fri, 2025-03-28 at 17:15 +0100, Christian Brauner wrote:
> [...]
> > +static inline void super_cb_grabbed(struct super_block *sb,
> > +				    void (*f)(struct super_block *,
> > void *),
> > +				    void *arg)
> > +{
> > +	if (super_lock_excl(sb)) {
> > +		bool active = atomic_inc_not_zero(&sb->s_active);
> > +		super_unlock_excl(sb);
> > +		if (active)
> > +			f(sb, arg);
> > +		deactivate_super(sb);
> 
> I don't think this can be right: if we fail to increment s_active
> because it's zero, we shouldn't call deactivate_super(), should we?

Fixed in-tree. Thanks.
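
A minimal userspace sketch of the reference pairing raised above, assuming
nothing about what the in-tree fix actually looks like. `mock_sb`,
`mock_inc_not_zero()`, `mock_deactivate()` and `mock_cb_grabbed()` are
hypothetical stand-ins for `struct super_block`, `atomic_inc_not_zero()`,
`deactivate_super()` and `super_cb_grabbed()`; locking is left out on
purpose. It only illustrates that the drop should be reached when, and only
when, the grab succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct super_block. */
struct mock_sb {
	int s_active;	/* models sb->s_active */
	int cb_calls;	/* how many times the callback ran */
};

/* Models atomic_inc_not_zero(): take a reference only if the count is non-zero. */
static bool mock_inc_not_zero(int *count)
{
	if (*count == 0)
		return false;
	(*count)++;
	return true;
}

/* Models deactivate_super(): drop exactly one active reference. */
static void mock_deactivate(struct mock_sb *sb)
{
	assert(sb->s_active > 0);	/* dropping a reference we never took is a bug */
	sb->s_active--;
}

/* Example callback: just records that it was invoked. */
static void mock_cb(struct mock_sb *sb)
{
	sb->cb_calls++;
}

/* Call f() only with an active reference held, and drop only what was taken. */
static void mock_cb_grabbed(struct mock_sb *sb, void (*f)(struct mock_sb *))
{
	if (!mock_inc_not_zero(&sb->s_active))
		return;		/* nothing grabbed, so nothing to drop */

	f(sb);
	mock_deactivate(sb);
}
```

With the early return, a superblock whose active count has already reached
zero is skipped without touching the count, while a live one ends with the
same count it started with.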

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management
  2025-03-28 16:15         ` James Bottomley
@ 2025-03-29  8:23           ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:23 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, linux-kernel, mcgrof, jack, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Fri, Mar 28, 2025 at 12:15:55PM -0400, James Bottomley wrote:
> On Fri, 2025-03-28 at 16:52 +0100, Christian Brauner wrote:
> [...]
> > 
> > > various operations from races.  Taken exclusively with down_write
> > > and shared with down_read. Private functions internal to super.c
> > > wrap this with grab_super and super_lock_shared/excl() wrappers.
> > 
> > See also the Documentation/filesystems/lock I added.
> 
> you mean locking.rst which covers s_umount?  It would be nice to add
> the others as well.
> 
> > > The explicit freeze/thaw_super() functions require the s_umount
> > > rwsem in down_write or exclusive mode and take it as the first step
> > > in their operation.  Looking at the locking in
> > > fs_bdev_freeze/thaw() implies that the super_operations
> > > freeze_super/thaw_super *don't* need this taken (presumably they
> > > handle it internally).
> > 
> > Block device locking cannot acquire the s_umount as that would cause
> > lock inversion with the block device open_mutex. The locking scheme
> > using sb_lock and the holder mutex allow safely acquiring the
> > superblock. It's orthogonal to what you're doing though.
> 
> OK, but based on the above and the fact that the code has to call
> either the super op freeze/thaw_super or the global call, I think this
> can be handled in the callback with something like the following,
> rather than trying to thread an exclusive s_umount through:

Eww, no. We're not going to open-code that in two different places.

> static void filesystems_thaw_callback(struct super_block *sb)
> {
> 	if (unlikely(!atomic_inc_not_zero(&sb->s_active)))
> 		return;
> 
> 	if (sb->s_op->thaw_super)
> 		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST
> 				     | FREEZE_HOLDER_KERNEL
> 				     | freeze_flags);
> 	else if (sb->s_bdev)
> 		thaw_super(sb,	FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL
> 			   | freeze_flags);
> 
> 	deactivate_super(sb);
> }

This is broken. The freeze/thaw functions cannot be called with s_umount
held, otherwise they deadlock. And taking an active reference count
without holding s_umount isn't supported, as we're optimistically dropping
reference counts. We're not introducing exceptions to that scheme for no
good reason.

The other option is to move everything into the caller and bring back
get_active_super() and then add SUPER_ITER_UNLOCKED instead of
SUPER_ITER_GRAB. That's what I've done now.
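
The ordering constraint described above can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the v2 code:
`mock_sb`, `mock_thaw_super()` and `mock_thaw_one()` are hypothetical names,
`umount_held` is a plain flag standing in for an exclusively held s_umount,
and the assertion stands in for what would be a deadlock in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct super_block. */
struct mock_sb {
	bool umount_held;	/* models s_umount held exclusively */
	int s_active;		/* models sb->s_active */
	int thaw_calls;		/* how many times the thaw ran */
};

/* Models thaw_super(): it takes s_umount itself, so callers must not hold it. */
static void mock_thaw_super(struct mock_sb *sb)
{
	assert(!sb->umount_held);	/* would deadlock in the kernel */
	sb->umount_held = true;
	sb->thaw_calls++;
	sb->umount_held = false;
}

/*
 * Take the active reference while holding the lock, drop the lock, and only
 * then call into the thaw path. Returns false if the superblock was skipped.
 */
static bool mock_thaw_one(struct mock_sb *sb)
{
	bool active;

	sb->umount_held = true;		/* models super_lock_excl() */
	active = sb->s_active > 0;
	if (active)
		sb->s_active++;		/* models atomic_inc_not_zero() */
	sb->umount_held = false;	/* models super_unlock_excl() */

	if (!active)
		return false;

	mock_thaw_super(sb);		/* lock is no longer held here */
	sb->s_active--;			/* models deactivate_super() */
	return true;
}
```

The reference is what keeps the superblock usable across the window where
the lock has been dropped and the thaw path has not yet taken it again.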

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-28 16:15     ` [PATCH 0/6] Extend freeze support to suspend and hibernate Christian Brauner
                         ` (5 preceding siblings ...)
  2025-03-28 16:15       ` [PATCH 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
@ 2025-03-29  8:42       ` Christian Brauner
  2025-03-29  8:42         ` [PATCH v2 1/6] super: remove pointless s_root checks Christian Brauner
                           ` (7 more replies)
  6 siblings, 8 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Add the necessary infrastructure changes to support freezing for suspend
and hibernate.

Just got back from LSFMM, so I'm still jetlagged and the likelihood of
bugs is increased. This should be all that's needed to wire up power.

This will be in vfs-6.16.super shortly.

---
Changes in v2:
- Don't grab a reference in the iterator; make that a requirement for the
  callers that need custom behavior.
- Link to v1: https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org

---
Christian Brauner (6):
      super: remove pointless s_root checks
      super: simplify user_get_super()
      super: skip dying superblocks early
      super: use a common iterator (Part 1)
      super: use common iterator (Part 2)
      super: add filesystem freezing helpers for suspend and hibernate

 fs/super.c         | 201 ++++++++++++++++++++++++++++++++---------------------
 include/linux/fs.h |   4 +-
 2 files changed, 126 insertions(+), 79 deletions(-)
---
base-commit: acb4f33713b9f6cadb6143f211714c343465411c
change-id: 20250328-work-freeze-0a446869cd62


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v2 1/6] super: remove pointless s_root checks
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-31  9:57           ` Jan Kara
  2025-06-11 16:26           ` Darrick J. Wong
  2025-03-29  8:42         ` [PATCH v2 2/6] super: simplify user_get_super() Christian Brauner
                           ` (6 subsequent siblings)
  7 siblings, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

The locking guarantees that the superblock is alive and sb->s_root is
still set. Remove the pointless check.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 97a17f9d9023..dc14f4bf73a6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -930,8 +930,7 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 
 		locked = super_lock_shared(sb);
 		if (locked) {
-			if (sb->s_root)
-				f(sb, arg);
+			f(sb, arg);
 			super_unlock_shared(sb);
 		}
 
@@ -967,11 +966,8 @@ void iterate_supers_type(struct file_system_type *type,
 		spin_unlock(&sb_lock);
 
 		locked = super_lock_shared(sb);
-		if (locked) {
-			if (sb->s_root)
-				f(sb, arg);
-			super_unlock_shared(sb);
-		}
+		if (locked)
+			f(sb, arg);
 
 		spin_lock(&sb_lock);
 		if (p)
@@ -991,18 +987,15 @@ struct super_block *user_get_super(dev_t dev, bool excl)
 
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_dev ==  dev) {
+		if (sb->s_dev == dev) {
 			bool locked;
 
 			sb->s_count++;
 			spin_unlock(&sb_lock);
 			/* still alive? */
 			locked = super_lock(sb, excl);
-			if (locked) {
-				if (sb->s_root)
-					return sb;
-				super_unlock(sb, excl);
-			}
+			if (locked)
+				return sb; /* caller will drop */
 			/* nope, got unmounted */
 			spin_lock(&sb_lock);
 			__put_super(sb);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 2/6] super: simplify user_get_super()
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
  2025-03-29  8:42         ` [PATCH v2 1/6] super: remove pointless s_root checks Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-31  9:58           ` Jan Kara
  2025-03-29  8:42         ` [PATCH v2 3/6] super: skip dying superblocks early Christian Brauner
                           ` (5 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Make it easier to read and remove one level of indentation.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index dc14f4bf73a6..b1acfc38ba0c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -987,20 +987,21 @@ struct super_block *user_get_super(dev_t dev, bool excl)
 
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_dev == dev) {
-			bool locked;
-
-			sb->s_count++;
-			spin_unlock(&sb_lock);
-			/* still alive? */
-			locked = super_lock(sb, excl);
-			if (locked)
-				return sb; /* caller will drop */
-			/* nope, got unmounted */
-			spin_lock(&sb_lock);
-			__put_super(sb);
-			break;
-		}
+		bool locked;
+
+		if (sb->s_dev != dev)
+			continue;
+
+		sb->s_count++;
+		spin_unlock(&sb_lock);
+
+		locked = super_lock(sb, excl);
+		if (locked)
+			return sb;
+
+		spin_lock(&sb_lock);
+		__put_super(sb);
+		break;
 	}
 	spin_unlock(&sb_lock);
 	return NULL;

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 3/6] super: skip dying superblocks early
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
  2025-03-29  8:42         ` [PATCH v2 1/6] super: remove pointless s_root checks Christian Brauner
  2025-03-29  8:42         ` [PATCH v2 2/6] super: simplify user_get_super() Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-31 10:00           ` Jan Kara
  2025-03-29  8:42         ` [PATCH v2 4/6] super: use a common iterator (Part 1) Christian Brauner
                           ` (4 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Make all iterators uniform by performing an early check whether the
superblock is dying.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index b1acfc38ba0c..c67ea3cdda41 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -925,6 +925,9 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 	list_for_each_entry(sb, &super_blocks, s_list) {
 		bool locked;
 
+		if (super_flags(sb, SB_DYING))
+			continue;
+
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
@@ -962,6 +965,9 @@ void iterate_supers_type(struct file_system_type *type,
 	hlist_for_each_entry(sb, &type->fs_supers, s_instances) {
 		bool locked;
 
+		if (super_flags(sb, SB_DYING))
+			continue;
+
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 4/6] super: use a common iterator (Part 1)
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
                           ` (2 preceding siblings ...)
  2025-03-29  8:42         ` [PATCH v2 3/6] super: skip dying superblocks early Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-31 10:01           ` Jan Kara
  2025-03-29  8:42         ` [PATCH v2 5/6] super: use common iterator (Part 2) Christian Brauner
                           ` (3 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Use a common iterator for all callbacks.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 67 +++++++++++-------------------------------------------
 include/linux/fs.h |  6 ++++-
 2 files changed, 18 insertions(+), 55 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index c67ea3cdda41..0dd208804a74 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -887,37 +887,7 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
-static void __iterate_supers(void (*f)(struct super_block *))
-{
-	struct super_block *sb, *p = NULL;
-
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (super_flags(sb, SB_DYING))
-			continue;
-		sb->s_count++;
-		spin_unlock(&sb_lock);
-
-		f(sb);
-
-		spin_lock(&sb_lock);
-		if (p)
-			__put_super(p);
-		p = sb;
-	}
-	if (p)
-		__put_super(p);
-	spin_unlock(&sb_lock);
-}
-/**
- *	iterate_supers - call function for all active superblocks
- *	@f: function to call
- *	@arg: argument to pass to it
- *
- *	Scans the superblock list and calls given function, passing it
- *	locked superblock and given argument.
- */
-void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl)
 {
 	struct super_block *sb, *p = NULL;
 
@@ -927,14 +897,13 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 
 		if (super_flags(sb, SB_DYING))
 			continue;
-
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		locked = super_lock_shared(sb);
+		locked = super_lock(sb, excl);
 		if (locked) {
 			f(sb, arg);
-			super_unlock_shared(sb);
+			super_unlock(sb, excl);
 		}
 
 		spin_lock(&sb_lock);
@@ -1111,11 +1080,9 @@ int reconfigure_super(struct fs_context *fc)
 	return retval;
 }
 
-static void do_emergency_remount_callback(struct super_block *sb)
+static void do_emergency_remount_callback(struct super_block *sb, void *unused)
 {
-	bool locked = super_lock_excl(sb);
-
-	if (locked && sb->s_root && sb->s_bdev && !sb_rdonly(sb)) {
+	if (sb->s_bdev && !sb_rdonly(sb)) {
 		struct fs_context *fc;
 
 		fc = fs_context_for_reconfigure(sb->s_root,
@@ -1126,13 +1093,11 @@ static void do_emergency_remount_callback(struct super_block *sb)
 			put_fs_context(fc);
 		}
 	}
-	if (locked)
-		super_unlock_excl(sb);
 }
 
 static void do_emergency_remount(struct work_struct *work)
 {
-	__iterate_supers(do_emergency_remount_callback);
+	__iterate_supers(do_emergency_remount_callback, NULL, true);
 	kfree(work);
 	printk("Emergency Remount complete\n");
 }
@@ -1148,24 +1113,18 @@ void emergency_remount(void)
 	}
 }
 
-static void do_thaw_all_callback(struct super_block *sb)
+static void do_thaw_all_callback(struct super_block *sb, void *unused)
 {
-	bool locked = super_lock_excl(sb);
-
-	if (locked && sb->s_root) {
-		if (IS_ENABLED(CONFIG_BLOCK))
-			while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
-				pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
-		thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
-		return;
-	}
-	if (locked)
-		super_unlock_excl(sb);
+	if (IS_ENABLED(CONFIG_BLOCK))
+		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
+			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
+	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+	return;
 }
 
 static void do_thaw_all(struct work_struct *work)
 {
-	__iterate_supers(do_thaw_all_callback);
+	__iterate_supers(do_thaw_all_callback, NULL, true);
 	kfree(work);
 	printk(KERN_WARNING "Emergency Thaw complete\n");
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..0351500b71d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3515,7 +3515,11 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl);
+static inline void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+{
+	__iterate_supers(f, arg, false);
+}
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 5/6] super: use common iterator (Part 2)
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
                           ` (3 preceding siblings ...)
  2025-03-29  8:42         ` [PATCH v2 4/6] super: use a common iterator (Part 1) Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-31 10:07           ` Jan Kara
  2025-03-29  8:42         ` [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
                           ` (2 subsequent siblings)
  7 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Use a common iterator for all callbacks. We could go for something even
more elaborate (advance step-by-step similar to iov_iter) but I really
don't think this is warranted.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
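Note for readers: the flag-driven direction choice in first_super() and
next_super() can be tried out in userspace. The following is only a
sketch under stated assumptions. struct node, ITER_REVERSE, list_init(),
list_add_tail(), iterate_nodes() and record_id() are made-up names for
illustration, not the kernel's list.h API, and the sketch deliberately
omits the sb_lock dropping, s_count pinning and SB_DYING checks that the
real __iterate_supers() performs.

```c
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct list_head API. */
struct node {
	struct node *prev, *next;
	int id;
};

enum iter_flags {
	ITER_REVERSE = 1U << 0,	/* walk from tail to head */
};

/* Initialise an empty circular list whose sentinel is @head. */
static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

/* Link @n in just before the sentinel, i.e. at the tail. */
static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Starting element depends on the direction flag. */
static struct node *first_node(struct node *head, unsigned int flags)
{
	return (flags & ITER_REVERSE) ? head->prev : head->next;
}

/* Advance one step in the direction selected by @flags. */
static struct node *next_node(struct node *n, unsigned int flags)
{
	return (flags & ITER_REVERSE) ? n->prev : n->next;
}

/* Call @f on every element; the walk ends on reaching the sentinel. */
static void iterate_nodes(struct node *head,
			  void (*f)(struct node *, void *),
			  void *arg, unsigned int flags)
{
	struct node *n;

	for (n = first_node(head, flags); n != head; n = next_node(n, flags))
		f(n, arg);
}

/* Example callback: record the ids in the order they are visited. */
struct id_log {
	int ids[8];
	size_t len;
};

static void record_id(struct node *n, void *arg)
{
	struct id_log *log = arg;

	if (log->len < 8)
		log->ids[log->len++] = n->id;
}
```

Adding three nodes with ids 1, 2, 3 and iterating with flags == 0 visits
them in insertion order; passing ITER_REVERSE visits them newest first,
which is the order the emergency remount path asks for in this patch.
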
 fs/super.c         | 49 ++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  6 +-----
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0dd208804a74..666a2a16df87 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -887,21 +887,47 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
-void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl)
+enum super_iter_flags_t {
+	SUPER_ITER_EXCL		= (1U << 0),
+	SUPER_ITER_UNLOCKED	= (1U << 1),
+	SUPER_ITER_REVERSE	= (1U << 2),
+};
+
+static inline struct super_block *first_super(enum super_iter_flags_t flags)
+{
+	if (flags & SUPER_ITER_REVERSE)
+		return list_last_entry(&super_blocks, struct super_block, s_list);
+	return list_first_entry(&super_blocks, struct super_block, s_list);
+}
+
+static inline struct super_block *next_super(struct super_block *sb,
+					     enum super_iter_flags_t flags)
+{
+	if (flags & SUPER_ITER_REVERSE)
+		return list_prev_entry(sb, s_list);
+	return list_next_entry(sb, s_list);
+}
+
+#define invalid_super list_entry_is_head
+
+static void __iterate_supers(void (*f)(struct super_block *, void *), void *arg,
+			     enum super_iter_flags_t flags)
 {
 	struct super_block *sb, *p = NULL;
+	bool excl = flags & SUPER_ITER_EXCL;
 
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &super_blocks, s_list) {
-		bool locked;
+	guard(spinlock)(&sb_lock);
 
+	for (sb = first_super(flags); !invalid_super(sb, &super_blocks, s_list);
+	     sb = next_super(sb, flags)) {
 		if (super_flags(sb, SB_DYING))
 			continue;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
-		locked = super_lock(sb, excl);
-		if (locked) {
+		if (flags & SUPER_ITER_UNLOCKED) {
+			f(sb, arg);
+		} else if (super_lock(sb, excl)) {
 			f(sb, arg);
 			super_unlock(sb, excl);
 		}
@@ -913,7 +939,11 @@ void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool e
 	}
 	if (p)
 		__put_super(p);
-	spin_unlock(&sb_lock);
+}
+
+void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+{
+	__iterate_supers(f, arg, 0);
 }
 
 /**
@@ -1097,7 +1127,8 @@ static void do_emergency_remount_callback(struct super_block *sb, void *unused)
 
 static void do_emergency_remount(struct work_struct *work)
 {
-	__iterate_supers(do_emergency_remount_callback, NULL, true);
+	__iterate_supers(do_emergency_remount_callback, NULL,
+			 SUPER_ITER_EXCL | SUPER_ITER_REVERSE);
 	kfree(work);
 	printk("Emergency Remount complete\n");
 }
@@ -1124,7 +1155,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
 
 static void do_thaw_all(struct work_struct *work)
 {
-	__iterate_supers(do_thaw_all_callback, NULL, true);
+	__iterate_supers(do_thaw_all_callback, NULL, SUPER_ITER_EXCL);
 	kfree(work);
 	printk(KERN_WARNING "Emergency Thaw complete\n");
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0351500b71d2..c475fa874055 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3515,11 +3515,7 @@ extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl);
-static inline void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
-{
-	__iterate_supers(f, arg, false);
-}
+extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
                           ` (4 preceding siblings ...)
  2025-03-29  8:42         ` [PATCH v2 5/6] super: use common iterator (Part 2) Christian Brauner
@ 2025-03-29  8:42         ` Christian Brauner
  2025-03-29  8:46           ` Christian Brauner
  2025-03-31 10:23           ` Jan Kara
  2025-03-29 14:04         ` [PATCH v2 0/6] Extend freeze support to " James Bottomley
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
  7 siblings, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:42 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

Allow the power subsystem to support filesystem freeze for
suspend and hibernate.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 57 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 666a2a16df87..4364b763e91f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1176,6 +1176,61 @@ void emergency_thaw_all(void)
 	}
 }
 
+static inline bool get_active_super(struct super_block *sb)
+{
+	bool active;
+
+	if (super_lock_excl(sb)) {
+		active = atomic_inc_not_zero(&sb->s_active);
+		super_unlock_excl(sb);
+	}
+	return active;
+}
+
+static void filesystems_freeze_callback(struct super_block *sb, void *unused)
+{
+	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
+		return;
+
+	if (!get_active_super(sb))
+		return;
+
+	if (sb->s_op->freeze_super)
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+	else
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+
+	deactivate_super(sb);
+}
+
+void filesystems_freeze(bool hibernate)
+{
+	__iterate_supers(filesystems_freeze_callback, NULL,
+			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
+}
+
+static void filesystems_thaw_callback(struct super_block *sb, void *unused)
+{
+	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
+		return;
+
+	if (!get_active_super(sb))
+		return;
+
+	if (sb->s_op->thaw_super)
+		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+	else
+		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+
+	deactivate_super(sb);
+}
+
+void filesystems_thaw(bool hibernate)
+{
+	__iterate_supers(filesystems_thaw_callback, NULL,
+			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
+}
+
 static DEFINE_IDA(unnamed_dev_ida);
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c475fa874055..29bd28491eff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3518,6 +3518,8 @@ extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
+void filesystems_freeze(bool hibernate);
+void filesystems_thaw(bool hibernate);
 
 extern int dcache_dir_open(struct inode *, struct file *);
 extern int dcache_dir_close(struct inode *, struct file *);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate
  2025-03-29  8:42         ` [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
@ 2025-03-29  8:46           ` Christian Brauner
  2025-03-31 10:23           ` Jan Kara
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-29  8:46 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: linux-kernel, James Bottomley, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Sat, Mar 29, 2025 at 09:42:19AM +0100, Christian Brauner wrote:
> Allow the power subsystem to support filesystem freeze for
> suspend and hibernate.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/super.c         | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  2 ++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 666a2a16df87..4364b763e91f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1176,6 +1176,61 @@ void emergency_thaw_all(void)
>  	}
>  }
>  
> +static inline bool get_active_super(struct super_block *sb)
> +{
> +	bool active;

Typo on my end. This is ofc bool active = false;
And fixed.
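
For anyone following along, here is why the initialiser matters, as a
userspace sketch rather than kernel code. It assumes C11 atomics in
place of atomic_inc_not_zero(), and struct fake_sb, its lockable field,
inc_not_zero() and get_active() are hypothetical stand-ins. When the
lock attempt fails, the assignment is skipped and the function returns
whatever the variable held, so it has to start out false.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a superblock; only the fields needed here. */
struct fake_sb {
	atomic_int s_active;	/* active reference count */
	bool lockable;		/* models whether super_lock_excl() succeeds */
};

/* Increment @v unless it is zero; returns true if the increment happened. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

static bool get_active(struct fake_sb *sb)
{
	bool active = false;	/* must be initialised: the lock may fail */

	if (sb->lockable)
		active = inc_not_zero(&sb->s_active);
	return active;
}
```
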

> +
> +	if (super_lock_excl(sb)) {
> +		active = atomic_inc_not_zero(&sb->s_active);
> +		super_unlock_excl(sb);
> +	}
> +	return active;
> +}
> +
> +static void filesystems_freeze_callback(struct super_block *sb, void *unused)
> +{
> +	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
> +		return;
> +
> +	if (!get_active_super(sb))
> +		return;
> +
> +	if (sb->s_op->freeze_super)
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +	else
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +
> +	deactivate_super(sb);
> +}
> +
> +void filesystems_freeze(bool hibernate)
> +{
> +	__iterate_supers(filesystems_freeze_callback, NULL,
> +			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
> +}
> +
> +static void filesystems_thaw_callback(struct super_block *sb, void *unused)
> +{
> +	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
> +		return;
> +
> +	if (!get_active_super(sb))
> +		return;
> +
> +	if (sb->s_op->thaw_super)
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +	else
> +		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +
> +	deactivate_super(sb);
> +}
> +
> +void filesystems_thaw(bool hibernate)
> +{
> +	__iterate_supers(filesystems_thaw_callback, NULL,
> +			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
> +}
> +
>  static DEFINE_IDA(unnamed_dev_ida);
>  
>  /**
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c475fa874055..29bd28491eff 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3518,6 +3518,8 @@ extern void drop_super_exclusive(struct super_block *sb);
>  extern void iterate_supers(void (*f)(struct super_block *, void *), void *arg);
>  extern void iterate_supers_type(struct file_system_type *,
>  			        void (*)(struct super_block *, void *), void *);
> +void filesystems_freeze(bool hibernate);
> +void filesystems_thaw(bool hibernate);
>  
>  extern int dcache_dir_open(struct inode *, struct file *);
>  extern int dcache_dir_close(struct inode *, struct file *);
> 
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
                           ` (5 preceding siblings ...)
  2025-03-29  8:42         ` [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
@ 2025-03-29 14:04         ` James Bottomley
  2025-03-29 17:02           ` James Bottomley
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
  7 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-29 14:04 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack
  Cc: linux-kernel, mcgrof, hch, david, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> Add the necessary infrastructure changes to support freezing for
> suspend and hibernate.
> 
> Just got back from LSFMM. So still jetlagged and likelihood of bugs
> increased. This should be all that's needed to wire up power.
> 
> This will be in vfs-6.16.super shortly.
> 
> ---
> Changes in v2:
> - Don't grab a reference in the iterator; make that a requirement for
> the callers that need custom behavior.
> - Link to v1:
> https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org

Given I've been a bit quiet on this, I thought I'd better explain
what's going on: I do have these built, but I made the mistake of doing
a dist-upgrade on my testing VM master image and it pulled in a version
of systemd (257.4-3) that has a broken hibernate.  Since I upgraded in
place I don't have the old image so I'm spending my time currently
debugging systemd ... normal service will hopefully resume shortly.

Regards,

James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-29 14:04         ` [PATCH v2 0/6] Extend freeze support to " James Bottomley
@ 2025-03-29 17:02           ` James Bottomley
  2025-03-30  8:33             ` Christian Brauner
  2025-03-31 10:36             ` Jan Kara
  0 siblings, 2 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-29 17:02 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack
  Cc: linux-kernel, mcgrof, hch, david, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > Add the necessary infrastructure changes to support freezing for
> > suspend and hibernate.
> > 
> > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > increased. This should be all that's needed to wire up power.
> > 
> > This will be in vfs-6.16.super shortly.
> > 
> > ---
> > Changes in v2:
> > - Don't grab a reference in the iterator; make that a requirement for
> > the callers that need custom behavior.
> > - Link to v1:
> > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> 
> Given I've been a bit quiet on this, I thought I'd better explain
> what's going on: I do have these built, but I made the mistake of
> doing a dist-upgrade on my testing VM master image and it pulled in a
> version of systemd (257.4-3) that has a broken hibernate.  Since I
> upgraded in place I don't have the old image so I'm spending my time
> currently debugging systemd ... normal service will hopefully resume
> shortly.

I found the systemd bug

https://github.com/systemd/systemd/issues/36888

And hacked around it, so I can confirm a simple hibernate/resume works
provided the sb_start_write() patches are applied (and the hooks are
plumbed in to pm).

There is an oddity: the systemd-journald process that would usually
hang hibernate in D wait goes into R but seems to be hung and can't be
killed by the watchdog even with a -9.  Its stack trace says it's
still stuck in sb_start_write:

[<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
[<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
[<0>] do_page_mkwrite+0x38/0xa0
[<0>] do_wp_page+0xd5/0xba0
[<0>] __handle_mm_fault+0xa29/0xca0
[<0>] handle_mm_fault+0x16a/0x2d0
[<0>] do_user_addr_fault+0x3ab/0x810
[<0>] exc_page_fault+0x68/0x150
[<0>] asm_exc_page_fault+0x22/0x30

So I think there's something funny going on in thaw.

Regards,

James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-29 17:02           ` James Bottomley
@ 2025-03-30  8:33             ` Christian Brauner
  2025-03-30 11:53               ` Christian Brauner
  2025-03-30 14:00               ` James Bottomley
  2025-03-31 10:36             ` Jan Kara
  1 sibling, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-30  8:33 UTC (permalink / raw)
  To: James Bottomley, jack
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Sat, Mar 29, 2025 at 01:02:32PM -0400, James Bottomley wrote:
> On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > Add the necessary infrastructure changes to support freezing for
> > > suspend and hibernate.
> > > 
> > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > increased. This should be all that's needed to wire up power.
> > > 
> > > This will be in vfs-6.16.super shortly.
> > > 
> > > ---
> > > Changes in v2:
> > > - Don't grab a reference in the iterator; make that a requirement for
> > > the callers that need custom behavior.
> > > - Link to v1:
> > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > 
> > Given I've been a bit quiet on this, I thought I'd better explain
> > what's going on: I do have these built, but I made the mistake of
> > doing a dist-upgrade on my testing VM master image and it pulled in a
> > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > upgraded in place I don't have the old image so I'm spending my time
> > currently debugging systemd ... normal service will hopefully resume
> > shortly.
> 
> I found the systemd bug
> 
> https://github.com/systemd/systemd/issues/36888

I don't think that's a systemd bug.

> And hacked around it, so I can confirm a simple hibernate/resume works
> provided the sb_start_write() patches are applied (and the hooks are
> plumbed in to pm).
> 
> There is an oddity: the systemd-journald process that would usually
> hang hibernate in D wait goes into R but seems to be hung and can't be
> killed by the watchdog even with a -9.  Its stack trace says it's
> still stuck in sb_start_write:
> 
> [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> [<0>] do_page_mkwrite+0x38/0xa0
> [<0>] do_wp_page+0xd5/0xba0
> [<0>] __handle_mm_fault+0xa29/0xca0
> [<0>] handle_mm_fault+0x16a/0x2d0
> [<0>] do_user_addr_fault+0x3ab/0x810
> [<0>] exc_page_fault+0x68/0x150
> [<0>] asm_exc_page_fault+0x22/0x30
> 
> So I think there's something funny going on in thaw.

My uneducated guess is that it's probably an issue with ext4 freezing
and unfreezing. xfs stops workqueues after all writes and pagefault
writers have stopped. This is done in ->sync_fs() when it's called from
freeze_super(). They are restarted when ->unfreeze_fs is called.

But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
that should be safe to do but I'm not sure if there can't be other work
coming in on it before the actual freeze call. Jan will be able to
explain this a lot better. I don't have time today to figure out what
this does.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-30  8:33             ` Christian Brauner
@ 2025-03-30 11:53               ` Christian Brauner
  2025-03-30 14:00               ` James Bottomley
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-30 11:53 UTC (permalink / raw)
  To: James Bottomley, jack
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

[-- Attachment #1: Type: text/plain, Size: 4207 bytes --]

On Sun, Mar 30, 2025 at 10:33:53AM +0200, Christian Brauner wrote:
> On Sat, Mar 29, 2025 at 01:02:32PM -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > > increased. This should be all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab a reference in the iterator; make that a requirement for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > > upgraded in place I don't have the old image so I'm spending my time
> > > currently debugging systemd ... normal service will hopefully resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> 
> I don't think that's a systemd bug.
> 
> > And hacked around it, so I can confirm a simple hibernate/resume works
> > provided the sb_start_write() patches are applied (and the hooks are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't be
> > killed by the watchdog even with a -9.  Its stack trace says it's
> > still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> My uneducated guess is that it's probably an issue with ext4 freezing
> and unfreezing. xfs stops workqueues after all writes and pagefault
> writers have stopped. This is done in ->sync_fs() when it's called from
> freeze_super(). They are restarted when ->unfreeze_fs is called.
> 
> But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
> that should be safe to do but I'm not sure if there can't be other work
> coming in on it before the actual freeze call. Jan will be able to
> explain this a lot better. I don't have time today to figure out what
> this does.

Though I'm just looking at the patch snippet you posted for how you
hooked up efivarfs in https://lore.kernel.org/r/a7e6dee45ac11519c33a297797990fce6bb32bff.camel@HansenPartnership.com
and that looks pretty broken and is probably the root cause. You have:

+static int efivarfs_thaw(struct super_block *sb, enum freeze_holder who);
 static const struct super_operations efivarfs_ops = {
        .statfs = efivarfs_statfs,
        .drop_inode = generic_delete_inode,
        .alloc_inode = efivarfs_alloc_inode,
        .free_inode = efivarfs_free_inode,
        .show_options = efivarfs_show_options,
+       .thaw_super = efivarfs_thaw,
 };

Which adds ->thaw_super() without ->freeze_super() which means that
->thaw_super() is never called for efivarfs.
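
To make that concrete, here is a userspace sketch of the gate at the top
of filesystems_thaw_callback() from patch 6/6. It is a model only:
struct fake_sb, struct fake_super_ops, thaw_one(), count_thaw() and
noop_freeze() are hypothetical names, the hooks take simplified
arguments, and the fallback to the generic thaw_super() helper is left
out.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_sb;

/* Hypothetical subset of super_operations with simplified signatures. */
struct fake_super_ops {
	int (*freeze_super)(struct fake_sb *sb);
	int (*freeze_fs)(struct fake_sb *sb);
	int (*thaw_super)(struct fake_sb *sb);
};

struct fake_sb {
	const struct fake_super_ops *s_op;
	int thaw_calls;		/* how often ->thaw_super() actually ran */
};

/* Example hooks used to observe whether the thaw path is reached. */
static int count_thaw(struct fake_sb *sb)
{
	sb->thaw_calls++;
	return 0;
}

static int noop_freeze(struct fake_sb *sb)
{
	(void)sb;
	return 0;
}

/*
 * Models the early return in filesystems_thaw_callback(): a superblock
 * that advertises neither freeze hook is skipped before ->thaw_super()
 * is even looked at. Returns true if the thaw path was entered.
 */
static bool thaw_one(struct fake_sb *sb)
{
	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
		return false;

	if (sb->s_op->thaw_super)
		sb->s_op->thaw_super(sb);
	return true;
}
```

With an ops table that only sets .thaw_super, thaw_one() returns false
and the hook never runs; once a freeze hook is set as well, the thaw
hook is reached.
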

But also it's broken in other ways. You're not waiting for writers to
finish. Which is most often fine because efivarfs shouldn't be written
to that heavily but still this won't work and you need to call the
generic VFS helpers.

I'm appending a draft for how to do this with efivarfs. Note, I don't
have the means/time to test this right now. Would you please plumb in
your recursive removal into my patch and test it? I'm pushing it to
vfs-6.16.super for now (It likely will fail due to unused helpers right
now because I gutted the recursive removal.).

[-- Attachment #2: 0001-DRAFT-efivarfs-support-freeze-thaw-for-suspend-hiber.patch --]
[-- Type: text/x-diff, Size: 7176 bytes --]

From 4cb24e33a63a8f9dd5a2ab56b1b183c1ef26c4d0 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Sun, 30 Mar 2025 13:24:18 +0200
Subject: [PATCH] [DRAFT] efivarfs: support freeze/thaw for suspend/hibernate

The efivarfs subsystem wants to partake in system hibernation and
suspend. To this end it needs to gain freeze/thaw support.

- Don't expose efivarfs freeze/thaw to userspace. It's not just
  pointless, it would also complicate the implementation because we would
  need to handle userspace-initiated freezes in combination with
  hibernation-initiated freezes. IOW, userspace could freeze efivarfs
  and we get a notification about an imminent freeze request from the
  power subsystem but since we're already frozen by userspace we never
  actually sync variables. So this is useful on two fronts.

- Unregister the notifier before we call kill_litter_super() because
  by that time the filesystem is already dead, so there is no need to
  bother reacting to hibernation. We won't resurrect it anyway.

- Let the notifier set a global variable to indicate that hibernation is
  ongoing and resync variable state when efivars is actually going to be
  unfrozen via efivarfs_thaw_super()'s call to efivarfs_unfreeze_fs().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/efivarfs/super.c | 141 ++++++++++++++++----------------------------
 1 file changed, 51 insertions(+), 90 deletions(-)

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 0486e9b68bc6..ce0f7ebeed1d 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -119,12 +119,20 @@ static int efivarfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return 0;
 }
+
+static int efivarfs_freeze_super(struct super_block *sb, enum freeze_holder who);
+static int efivarfs_thaw_super(struct super_block *sb, enum freeze_holder who);
+static int efivarfs_unfreeze_fs(struct super_block *sb);
+
 static const struct super_operations efivarfs_ops = {
 	.statfs = efivarfs_statfs,
 	.drop_inode = generic_delete_inode,
 	.alloc_inode = efivarfs_alloc_inode,
 	.free_inode = efivarfs_free_inode,
 	.show_options = efivarfs_show_options,
+	.freeze_super = efivarfs_freeze_super,
+	.thaw_super = efivarfs_thaw_super,
+	.unfreeze_fs = efivarfs_unfreeze_fs,
 };
 
 /*
@@ -368,7 +376,6 @@ static int efivarfs_fill_super(struct super_block *sb, struct fs_context *fc)
 		return err;
 
 	register_pm_notifier(&sfi->pm_nb);
-
 	return efivar_init(efivarfs_callback, sb, true);
 }
 
@@ -474,111 +481,61 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
-static void efivarfs_deactivate_super_work(struct work_struct *work)
-{
-	struct super_block *s = container_of(work, struct super_block,
-					     destroy_work);
-	/*
-	 * note: here s->destroy_work is free for reuse (which
-	 * will happen in deactivate_super)
-	 */
-	deactivate_super(s);
-}
-
 static struct file_system_type efivarfs_type;
 
-static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
-			      void *ptr)
+static int efivarfs_freeze_super(struct super_block *sb, enum freeze_holder who)
 {
-	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
-						    pm_nb);
-	struct path path;
-	struct efivarfs_ctx ectx = {
-		.ctx = {
-			.actor	= efivarfs_actor,
-		},
-		.sb = sfi->sb,
-	};
-	struct file *file;
-	struct super_block *s = sfi->sb;
-	static bool rescan_done = true;
-
-	if (action == PM_HIBERNATION_PREPARE) {
-		rescan_done = false;
-		return NOTIFY_OK;
-	} else if (action != PM_POST_HIBERNATION) {
-		return NOTIFY_DONE;
-	}
+	/* We only support freezing from the kernel. */
+	if (!(who & FREEZE_HOLDER_KERNEL))
+		return -EOPNOTSUPP;
 
-	if (rescan_done)
-		return NOTIFY_DONE;
+	return freeze_super(sb, who);
+}
 
-	/* ensure single superblock is alive and pin it */
-	if (!atomic_inc_not_zero(&s->s_active))
-		return NOTIFY_DONE;
+static int efivarfs_thaw_super(struct super_block *sb, enum freeze_holder who)
+{
+	/* We only support thawing from the kernel. */
+	if (!(who & FREEZE_HOLDER_KERNEL))
+		return -EOPNOTSUPP;
 
-	pr_info("efivarfs: resyncing variable state\n");
+	return thaw_super(sb, who);
+}
 
-	path.dentry = sfi->sb->s_root;
+/*
+ * Only accessed by the power management notifier before ->unfreeze_fs()
+ * is ever called so this is serialized through the power management
+ * system.
+ */
+static bool need_unfreeze_fs = false;
 
-	/*
-	 * do not add SB_KERNMOUNT which a single superblock could
-	 * expose to userspace and which also causes MNT_INTERNAL, see
-	 * below
-	 */
-	path.mnt = vfs_kern_mount(&efivarfs_type, 0,
-				  efivarfs_type.name, NULL);
-	if (IS_ERR(path.mnt)) {
-		pr_err("efivarfs: internal mount failed\n");
-		/*
-		 * We may be the last pinner of the superblock but
-		 * calling efivarfs_kill_sb from within the notifier
-		 * here would deadlock trying to unregister it
-		 */
-		INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work);
-		schedule_work(&s->destroy_work);
-		return PTR_ERR(path.mnt);
+static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, void *ptr)
+{
+	if (action == PM_HIBERNATION_PREPARE) {
+		need_unfreeze_fs = true;
+		return NOTIFY_OK;
+	} else if (action == PM_POST_HIBERNATION) {
+		need_unfreeze_fs = false;
+		return NOTIFY_OK;
 	}
 
-	/* path.mnt now has pin on superblock, so this must be above one */
-	atomic_dec(&s->s_active);
+	return NOTIFY_DONE;
+}
 
-	file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
-				current_cred());
-	/*
-	 * safe even if last put because no MNT_INTERNAL means this
-	 * will do delayed deactivate_super and not deadlock
-	 */
-	mntput(path.mnt);
-	if (IS_ERR(file))
-		return NOTIFY_DONE;
+static int efivarfs_unfreeze_fs(struct super_block *sb)
+{
+	/* This isn't a hibernation call so there's nothing for us to do. */
+	if (!need_unfreeze_fs)
+		return 0;
 
-	rescan_done = true;
+	pr_info("efivarfs: resyncing variable state\n");
 
 	/*
-	 * First loop over the directory and verify each entry exists,
-	 * removing it if it doesn't
+	 * TODO: Now do the variable resyncing thing. vfs_kern_mount()
+	 * won't work because we'd deadlock with ->thaw_super() fwiw.
 	 */
-	file->f_pos = 2;	/* skip . and .. */
-	do {
-		ectx.dentry = NULL;
-		iterate_dir(file, &ectx.ctx);
-		if (ectx.dentry) {
-			pr_info("efivarfs: removing variable %pd\n",
-				ectx.dentry);
-			simple_recursive_removal(ectx.dentry, NULL);
-			dput(ectx.dentry);
-		}
-	} while (ectx.dentry);
-	fput(file);
 
-	/*
-	 * then loop over variables, creating them if there's no matching
-	 * dentry
-	 */
-	efivar_init(efivarfs_check_missing, sfi->sb, false);
+	return 0;
 
-	return NOTIFY_OK;
 }
 
 static int efivarfs_init_fs_context(struct fs_context *fc)
@@ -609,8 +566,12 @@ static void efivarfs_kill_sb(struct super_block *sb)
 	struct efivarfs_fs_info *sfi = sb->s_fs_info;
 
 	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb);
-	kill_litter_super(sb);
+	/*
+	 * Unregister the pm notifier right now as that superblock is
+	 * already dead.
+	 */
 	unregister_pm_notifier(&sfi->pm_nb);
+	kill_litter_super(sb);
 
 	kfree(sfi);
 }
-- 
2.47.2



* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-30  8:33             ` Christian Brauner
  2025-03-30 11:53               ` Christian Brauner
@ 2025-03-30 14:00               ` James Bottomley
  2025-03-31  9:13                 ` Christian Brauner
  1 sibling, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-30 14:00 UTC (permalink / raw)
  To: Christian Brauner, jack
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Sun, 2025-03-30 at 10:33 +0200, Christian Brauner wrote:
[...]
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> 
> I don't think that's a systemd bug.

Heh, well I have zero interest in refereeing a turf war between systemd
and dracut over mismatched expectations.  The point for anyone who
wants to run hibernate tests is that until they both sort this out the
bug can be fixed by removing the system identifier check from
systemd-hibernate-resume-generator.

> > And hacked around it, so I can confirm a simple hibernate/resume
> > works provided the sb_start_write() patches are applied (and the
> > hooks are plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't
> > be killed by the watchdog even with a -9.  Its stack trace says
> > it's still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> My uneducated guess is that it's probably an issue with ext4 freezing
> and unfreezing. xfs stops workqueues after all writes and pagefault
> writers have stopped. This is done in ->sync_fs() when it's called
> from freeze_super(). They are restarted when ->unfreeze_fs is called.

It is possible, but I note that if I do

fsfreeze --freeze /

I can produce exactly the above stack trace in systemd-journald, but if
I unfreeze root it continues on normally.  Thus I think this is some
type of bad interaction with the process freezing that goes on in
hibernate.  I'm going to see if I can replicate using the cgroup
freezer.

> But for ext4 in ->sync_fs() the rsv_conversion_wq is flushed. I think
> that should be safe to do but I'm not sure if there can't be other
> work coming in on it before the actual freeze call. Jan will be able
> to explain this a lot better. I don't have time today to figure out
> what this does.

Understood.  The above is for Jan if he'd like to think about it.

Regards,

James



* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-30 14:00               ` James Bottomley
@ 2025-03-31  9:13                 ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31  9:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: jack, linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Sun, Mar 30, 2025 at 10:00:56AM -0400, James Bottomley wrote:
> On Sun, 2025-03-30 at 10:33 +0200, Christian Brauner wrote:
> [...]
> > > I found the systemd bug
> > > 
> > > https://github.com/systemd/systemd/issues/36888
> > 
> > I don't think that's a systemd bug.
> 
> Heh, well I have zero interest in refereeing a turf war between systemd
> and dracut over mismatched expectations.  The point for anyone who
> wants to run hibernate tests is that until they both sort this out the
> bug can be fixed by removing the system identifier check from
> systemd-hibernate-resume-generator.
> 
> > > And hacked around it, so I can confirm a simple hibernate/resume
> > > works provided the sb_start_write() patches are applied (and the
> > > hooks are plumbed in to pm).
> > > 
> > > There is an oddity: the systemd-journald process that would usually
> > > hang hibernate in D wait goes into R but seems to be hung and can't
> > > be killed by the watchdog even with a -9.  Its stack trace says
> > > it's still stuck in sb_start_write:
> > > 
> > > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > > [<0>] do_page_mkwrite+0x38/0xa0
> > > [<0>] do_wp_page+0xd5/0xba0
> > > [<0>] __handle_mm_fault+0xa29/0xca0
> > > [<0>] handle_mm_fault+0x16a/0x2d0
> > > [<0>] do_user_addr_fault+0x3ab/0x810
> > > [<0>] exc_page_fault+0x68/0x150
> > > [<0>] asm_exc_page_fault+0x22/0x30
> > > 
> > > So I think there's something funny going on in thaw.
> > 
> > My uneducated guess is that it's probably an issue with ext4 freezing
> > and unfreezing. xfs stops workqueues after all writes and pagefault
> > writers have stopped. This is done in ->sync_fs() when it's called
> > from freeze_super(). They are restarted when ->unfreeze_fs is called.
> 
> It is possible, but I note that if I do
> 
> fsfreeze --freeze /

Freezing the root filesystem from userspace will inevitably lead to an
odd form of deadlock eventually. Either the first accidental request for
opening something as writable or even the call to fsfreeze --unfreeze /
may deadlock.

The most likely explanation for this stacktrace is that the root
filesystem isn't unfrozen. In userspace it's easy enough to trigger by
leaving the filesystem frozen without also freezing userspace processes
accessing that filesystem:

[  243.232205] INFO: task systemd-journal:539 blocked for more than 120 seconds.
[  243.239491]       Not tainted 6.14.0-g9ad3884269ca #131
[  243.243771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.248517] task:systemd-journal state:D stack:0     pid:539   tgid:539   ppid:1      task_flags:0x400100 flags:0x00000006
[  243.253480] Call Trace:
[  243.254641]  <TASK>
[  243.255663]  __schedule+0x61e/0x1080
[  243.257071]  ? percpu_rwsem_wait+0x149/0x1b0
[  243.258473]  schedule+0x3a/0x120
[  243.259533]  percpu_rwsem_wait+0x155/0x1b0
[  243.260844]  ? __pfx_percpu_rwsem_wake_function+0x10/0x10
[  243.262620]  __percpu_down_read+0x83/0x1c0
[  243.263968]  btrfs_page_mkwrite+0x45b/0x890 [btrfs]
[  243.266828]  ? find_held_lock+0x2b/0x80
[  243.267765]  do_page_mkwrite+0x4a/0xb0
[  243.268698]  do_wp_page+0x331/0xdc0
[  243.269559]  __handle_mm_fault+0xb15/0x11d0
[  243.270566]  handle_mm_fault+0xb8/0x2b0
[  243.271557]  do_user_addr_fault+0x20a/0x700
[  243.272574]  exc_page_fault+0x6a/0x200
[  243.273462]  asm_exc_page_fault+0x26/0x30

This happens because systemd-journald mmaps the journal file. It
triggers a pagefault which wants to get pagefault-based write access to
the file. But it can't because pagefaults are frozen. So it hangs, and
since the task itself isn't frozen it will trigger hung task warnings.

IOW, the most likely explanation is that the root filesystem wasn't
unfrozen and systemd-journald wasn't frozen.
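
A small userspace model of why the page fault blocks, offered as a
sketch rather than the kernel's implementation. The enum mirrors the
ordering of the SB_FREEZE_* levels in include/linux/fs.h; struct
model_sb and model_may_write() are hypothetical names, and the real
code blocks on a percpu rwsem instead of returning a boolean.

```c
#include <stdbool.h>

/* Freeze levels, in the same order as the kernel's SB_FREEZE_* values. */
enum freeze_level {
	UNFROZEN = 0,
	FREEZE_WRITE = 1,	/* write syscalls are blocked */
	FREEZE_PAGEFAULT = 2,	/* write page faults are blocked as well */
	FREEZE_FS = 3,		/* internal filesystem work is blocked */
	FREEZE_COMPLETE = 4,
};

/* Hypothetical, minimal stand-in for the freeze state of a superblock. */
struct model_sb {
	enum freeze_level frozen;
};

/*
 * A writer entering at 'level' may proceed only while the filesystem is
 * frozen to a lower level. In the kernel the writer would sleep here
 * until the filesystem is thawed.
 */
static bool model_may_write(const struct model_sb *sb, enum freeze_level level)
{
	return sb->frozen < level;
}
```

With frozen == FREEZE_COMPLETE the page-fault path
(model_may_write(sb, FREEZE_PAGEFAULT)) is refused, which corresponds
to the task sitting in percpu_rwsem_wait() in the trace above until
someone thaws the filesystem.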


* Re: [PATCH v2 1/6] super: remove pointless s_root checks
  2025-03-29  8:42         ` [PATCH v2 1/6] super: remove pointless s_root checks Christian Brauner
@ 2025-03-31  9:57           ` Jan Kara
  2025-06-11 16:26           ` Darrick J. Wong
  1 sibling, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-31  9:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:14, Christian Brauner wrote:
> The locking guarantees that the superblock is alive and sb->s_root is
> still set. Remove the pointless check.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. In fact most sb->s_root checks in fs/super.c look pointless
these days since AFAICT if you have SB_BORN && !SB_DYING superblock (as
super_lock_*() ascertains), then sb->s_root != NULL. Anyway feel free to
add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 19 ++++++-------------
>  1 file changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 97a17f9d9023..dc14f4bf73a6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -930,8 +930,7 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
>  
>  		locked = super_lock_shared(sb);
>  		if (locked) {
> -			if (sb->s_root)
> -				f(sb, arg);
> +			f(sb, arg);
>  			super_unlock_shared(sb);
>  		}
>  
> @@ -967,11 +966,8 @@ void iterate_supers_type(struct file_system_type *type,
>  		spin_unlock(&sb_lock);
>  
>  		locked = super_lock_shared(sb);
> -		if (locked) {
> -			if (sb->s_root)
> -				f(sb, arg);
> -			super_unlock_shared(sb);
> -		}
> +		if (locked)
> +			f(sb, arg);
>  
>  		spin_lock(&sb_lock);
>  		if (p)
> @@ -991,18 +987,15 @@ struct super_block *user_get_super(dev_t dev, bool excl)
>  
>  	spin_lock(&sb_lock);
>  	list_for_each_entry(sb, &super_blocks, s_list) {
> -		if (sb->s_dev ==  dev) {
> +		if (sb->s_dev == dev) {
>  			bool locked;
>  
>  			sb->s_count++;
>  			spin_unlock(&sb_lock);
>  			/* still alive? */
>  			locked = super_lock(sb, excl);
> -			if (locked) {
> -				if (sb->s_root)
> -					return sb;
> -				super_unlock(sb, excl);
> -			}
> +			if (locked)
> +				return sb; /* caller will drop */
>  			/* nope, got unmounted */
>  			spin_lock(&sb_lock);
>  			__put_super(sb);
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/6] super: simplify user_get_super()
  2025-03-29  8:42         ` [PATCH v2 2/6] super: simplify user_get_super() Christian Brauner
@ 2025-03-31  9:58           ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-31  9:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:15, Christian Brauner wrote:
> Make it easier to read and remove one level of indentation.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index dc14f4bf73a6..b1acfc38ba0c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -987,20 +987,21 @@ struct super_block *user_get_super(dev_t dev, bool excl)
>  
>  	spin_lock(&sb_lock);
>  	list_for_each_entry(sb, &super_blocks, s_list) {
> -		if (sb->s_dev == dev) {
> -			bool locked;
> -
> -			sb->s_count++;
> -			spin_unlock(&sb_lock);
> -			/* still alive? */
> -			locked = super_lock(sb, excl);
> -			if (locked)
> -				return sb; /* caller will drop */
> -			/* nope, got unmounted */
> -			spin_lock(&sb_lock);
> -			__put_super(sb);
> -			break;
> -		}
> +		bool locked;
> +
> +		if (sb->s_dev != dev)
> +			continue;
> +
> +		sb->s_count++;
> +		spin_unlock(&sb_lock);
> +
> +		locked = super_lock(sb, excl);
> +		if (locked)
> +			return sb;
> +
> +		spin_lock(&sb_lock);
> +		__put_super(sb);
> +		break;
>  	}
>  	spin_unlock(&sb_lock);
>  	return NULL;
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 3/6] super: skip dying superblocks early
  2025-03-29  8:42         ` [PATCH v2 3/6] super: skip dying superblocks early Christian Brauner
@ 2025-03-31 10:00           ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-31 10:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:16, Christian Brauner wrote:
> Make all iterators uniform by performing an early check whether the
> superblock is dying.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b1acfc38ba0c..c67ea3cdda41 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -925,6 +925,9 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
>  	list_for_each_entry(sb, &super_blocks, s_list) {
>  		bool locked;
>  
> +		if (super_flags(sb, SB_DYING))
> +			continue;
> +
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> @@ -962,6 +965,9 @@ void iterate_supers_type(struct file_system_type *type,
>  	hlist_for_each_entry(sb, &type->fs_supers, s_instances) {
>  		bool locked;
>  
> +		if (super_flags(sb, SB_DYING))
> +			continue;
> +
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 4/6] super: use a common iterator (Part 1)
  2025-03-29  8:42         ` [PATCH v2 4/6] super: use a common iterator (Part 1) Christian Brauner
@ 2025-03-31 10:01           ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-31 10:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:17, Christian Brauner wrote:
> Use a common iterator for all callbacks.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Very nice! Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c         | 67 +++++++++++-------------------------------------------
>  include/linux/fs.h |  6 ++++-
>  2 files changed, 18 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index c67ea3cdda41..0dd208804a74 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -887,37 +887,7 @@ void drop_super_exclusive(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(drop_super_exclusive);
>  
> -static void __iterate_supers(void (*f)(struct super_block *))
> -{
> -	struct super_block *sb, *p = NULL;
> -
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &super_blocks, s_list) {
> -		if (super_flags(sb, SB_DYING))
> -			continue;
> -		sb->s_count++;
> -		spin_unlock(&sb_lock);
> -
> -		f(sb);
> -
> -		spin_lock(&sb_lock);
> -		if (p)
> -			__put_super(p);
> -		p = sb;
> -	}
> -	if (p)
> -		__put_super(p);
> -	spin_unlock(&sb_lock);
> -}
> -/**
> - *	iterate_supers - call function for all active superblocks
> - *	@f: function to call
> - *	@arg: argument to pass to it
> - *
> - *	Scans the superblock list and calls given function, passing it
> - *	locked superblock and given argument.
> - */
> -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> +void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl)
>  {
>  	struct super_block *sb, *p = NULL;
>  
> @@ -927,14 +897,13 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
>  
>  		if (super_flags(sb, SB_DYING))
>  			continue;
> -
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> -		locked = super_lock_shared(sb);
> +		locked = super_lock(sb, excl);
>  		if (locked) {
>  			f(sb, arg);
> -			super_unlock_shared(sb);
> +			super_unlock(sb, excl);
>  		}
>  
>  		spin_lock(&sb_lock);
> @@ -1111,11 +1080,9 @@ int reconfigure_super(struct fs_context *fc)
>  	return retval;
>  }
>  
> -static void do_emergency_remount_callback(struct super_block *sb)
> +static void do_emergency_remount_callback(struct super_block *sb, void *unused)
>  {
> -	bool locked = super_lock_excl(sb);
> -
> -	if (locked && sb->s_root && sb->s_bdev && !sb_rdonly(sb)) {
> +	if (sb->s_bdev && !sb_rdonly(sb)) {
>  		struct fs_context *fc;
>  
>  		fc = fs_context_for_reconfigure(sb->s_root,
> @@ -1126,13 +1093,11 @@ static void do_emergency_remount_callback(struct super_block *sb)
>  			put_fs_context(fc);
>  		}
>  	}
> -	if (locked)
> -		super_unlock_excl(sb);
>  }
>  
>  static void do_emergency_remount(struct work_struct *work)
>  {
> -	__iterate_supers(do_emergency_remount_callback);
> +	__iterate_supers(do_emergency_remount_callback, NULL, true);
>  	kfree(work);
>  	printk("Emergency Remount complete\n");
>  }
> @@ -1148,24 +1113,18 @@ void emergency_remount(void)
>  	}
>  }
>  
> -static void do_thaw_all_callback(struct super_block *sb)
> +static void do_thaw_all_callback(struct super_block *sb, void *unused)
>  {
> -	bool locked = super_lock_excl(sb);
> -
> -	if (locked && sb->s_root) {
> -		if (IS_ENABLED(CONFIG_BLOCK))
> -			while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
> -				pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> -		thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
> -		return;
> -	}
> -	if (locked)
> -		super_unlock_excl(sb);
> +	if (IS_ENABLED(CONFIG_BLOCK))
> +		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
> +			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> +	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
> +	return;
>  }
>  
>  static void do_thaw_all(struct work_struct *work)
>  {
> -	__iterate_supers(do_thaw_all_callback);
> +	__iterate_supers(do_thaw_all_callback, NULL, true);
>  	kfree(work);
>  	printk(KERN_WARNING "Emergency Thaw complete\n");
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..0351500b71d2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3515,7 +3515,11 @@ extern void put_filesystem(struct file_system_type *fs);
>  extern struct file_system_type *get_fs_type(const char *name);
>  extern void drop_super(struct super_block *sb);
>  extern void drop_super_exclusive(struct super_block *sb);
> -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> +void __iterate_supers(void (*f)(struct super_block *, void *), void *arg, bool excl);
> +static inline void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> +{
> +	__iterate_supers(f, arg, false);
> +}
>  extern void iterate_supers_type(struct file_system_type *,
>  			        void (*)(struct super_block *, void *), void *);
>  
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 5/6] super: use common iterator (Part 2)
  2025-03-29  8:42         ` [PATCH v2 5/6] super: use common iterator (Part 2) Christian Brauner
@ 2025-03-31 10:07           ` Jan Kara
  2025-03-31 10:15             ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-03-31 10:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:18, Christian Brauner wrote:
> Use a common iterator for all callbacks. We could go for something even
> more elaborate (advance step-by-step similar to iov_iter) but I really
> don't think this is warranted.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good, one nit below. With that fixed feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> +#define invalid_super list_entry_is_head

Why do you have this invalid_super define? I find it rather confusing in
the loop below and list_entry_is_head() would be much more
understandable...

								Honza

> +
> +static void __iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> +			     enum super_iter_flags_t flags)
>  {
>  	struct super_block *sb, *p = NULL;
> +	bool excl = flags & SUPER_ITER_EXCL;
>  
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &super_blocks, s_list) {
> -		bool locked;
> +	guard(spinlock)(&sb_lock);
>  
> +	for (sb = first_super(flags); !invalid_super(sb, &super_blocks, s_list);
> +	     sb = next_super(sb, flags)) {
>  		if (super_flags(sb, SB_DYING))
>  			continue;
>  		sb->s_count++;
>  		spin_unlock(&sb_lock);
>  
> -		locked = super_lock(sb, excl);
> -		if (locked) {
> +		if (flags & SUPER_ITER_UNLOCKED) {
> +			f(sb, arg);
> +		} else if (super_lock(sb, excl)) {
>  			f(sb, arg);
>  			super_unlock(sb, excl);
>  		}
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 5/6] super: use common iterator (Part 2)
  2025-03-31 10:07           ` Jan Kara
@ 2025-03-31 10:15             ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 10:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, James Bottomley, mcgrof, hch, david,
	rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Mon, Mar 31, 2025 at 12:07:12PM +0200, Jan Kara wrote:
> On Sat 29-03-25 09:42:18, Christian Brauner wrote:
> > Use a common iterator for all callbacks. We could go for something even
> > more elaborate (advance step-by-step similar to iov_iter) but I really
> > don't think this is warranted.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> Looks good, one nit below. With that fixed feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > +#define invalid_super list_entry_is_head
> 
> Why do you have this invalid_super define? I find it rather confusing in
> the loop below and list_entry_is_head() would be much more
> understandable...

Fair, I just wanted a shorter name but I'll change it to
list_entry_is_head() and push it out.
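
For reference, a self-contained userspace sketch of the loop shape with
list_entry_is_head() spelled out. This is an illustration only: struct
model_sb, first_sb(), next_sb() and walk() are hypothetical, and the
list helpers are reimplemented here in the style of
include/linux/list.h rather than taken from it.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* True when 'pos' is not a real entry but the list head itself. */
#define list_entry_is_head(pos, head, member) (&(pos)->member == (head))

struct model_sb {
	int id;
	struct list_head s_list;
};

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static struct model_sb *first_sb(struct list_head *head, int reverse)
{
	return container_of(reverse ? head->prev : head->next,
			    struct model_sb, s_list);
}

static struct model_sb *next_sb(struct model_sb *sb, int reverse)
{
	return container_of(reverse ? sb->s_list.prev : sb->s_list.next,
			    struct model_sb, s_list);
}

/* Records the ids in visiting order and returns how many were seen. */
static int walk(struct list_head *head, int reverse, int *out, int max)
{
	struct model_sb *sb;
	int n = 0;

	for (sb = first_sb(head, reverse);
	     !list_entry_is_head(sb, head, s_list) && n < max;
	     sb = next_sb(sb, reverse))
		out[n++] = sb->id;

	return n;
}
```

The termination test reads the same for both directions, which is the
point of naming the helper directly in the loop condition.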

> 
> 								Honza
> 
> > +
> > +static void __iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> > +			     enum super_iter_flags_t flags)
> >  {
> >  	struct super_block *sb, *p = NULL;
> > +	bool excl = flags & SUPER_ITER_EXCL;
> >  
> > -	spin_lock(&sb_lock);
> > -	list_for_each_entry(sb, &super_blocks, s_list) {
> > -		bool locked;
> > +	guard(spinlock)(&sb_lock);
> >  
> > +	for (sb = first_super(flags); !invalid_super(sb, &super_blocks, s_list);
> > +	     sb = next_super(sb, flags)) {
> >  		if (super_flags(sb, SB_DYING))
> >  			continue;
> >  		sb->s_count++;
> >  		spin_unlock(&sb_lock);
> >  
> > -		locked = super_lock(sb, excl);
> > -		if (locked) {
> > +		if (flags & SUPER_ITER_UNLOCKED) {
> > +			f(sb, arg);
> > +		} else if (super_lock(sb, excl)) {
> >  			f(sb, arg);
> >  			super_unlock(sb, excl);
> >  		}
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate
  2025-03-29  8:42         ` [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate Christian Brauner
  2025-03-29  8:46           ` Christian Brauner
@ 2025-03-31 10:23           ` Jan Kara
  2025-03-31 10:25             ` Christian Brauner
  1 sibling, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-03-31 10:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 09:42:19, Christian Brauner wrote:
> Allow the power subsystem to support filesystem freeze for
> suspend and hibernate.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

One comment below. Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> +void filesystems_thaw(bool hibernate)
> +{
> +	__iterate_supers(filesystems_thaw_callback, NULL,
> +			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
> +}

I think we should thaw in normal superblock order, not in reverse one? To
thaw the bottommost filesystem first? The filesystem thaw callback can
write to the underlying device and this could cause deadlocks...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 6/6] super: add filesystem freezing helpers for suspend and hibernate
  2025-03-31 10:23           ` Jan Kara
@ 2025-03-31 10:25             ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 10:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, James Bottomley, mcgrof, hch, david,
	rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Mon, Mar 31, 2025 at 12:23:04PM +0200, Jan Kara wrote:
> On Sat 29-03-25 09:42:19, Christian Brauner wrote:
> > Allow the power subsystem to support filesystem freeze for
> > suspend and hibernate.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> One comment below. Otherwise feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > +void filesystems_thaw(bool hibernate)
> > +{
> > +	__iterate_supers(filesystems_thaw_callback, NULL,
> > +			 SUPER_ITER_UNLOCKED | SUPER_ITER_REVERSE);
> > +}
> 
> I think we should thaw in normal superblock order, not in reverse one? To
> thaw the bottommost filesystem first? The filesystem thaw callback can
> write to the underlying device and this could cause deadlocks...

Yep, I already fixed that up in vfs-6.16.super yesterday.
Sorry, forgot to mention that here. Thanks for noticing!
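
A toy userspace model of the ordering concern, as a sketch only. It
assumes that a lower filesystem always sits earlier in the superblock
list than the filesystem stacked on top of it, and it models "thaw
blocks on a write to a still-frozen lower filesystem" as a failed
thaw. struct model_sb, model_thaw() and model_thaw_all() are
hypothetical names.

```c
#include <stdbool.h>

#define NR_SB 3

struct model_sb {
	int lower;	/* index of the filesystem underneath, -1 if none */
	bool frozen;
};

/*
 * Thawing may write to the lower filesystem. If that one is still
 * frozen the write would block, which is reported here as failure.
 */
static bool model_thaw(struct model_sb *sbs, int i)
{
	if (sbs[i].lower >= 0 && sbs[sbs[i].lower].frozen)
		return false;

	sbs[i].frozen = false;
	return true;
}

/* Returns how many superblocks could be thawed without blocking. */
static int model_thaw_all(struct model_sb *sbs, bool reverse)
{
	int thawed = 0;

	for (int n = 0; n < NR_SB; n++) {
		int i = reverse ? NR_SB - 1 - n : n;

		if (model_thaw(sbs, i))
			thawed++;
	}
	return thawed;
}
```

For a three-deep stack (root, a filesystem on a loop file on root, and
one more on top of that), the forward walk thaws all three, while the
reverse walk gets stuck on the two upper ones and only thaws the
bottom filesystem.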


* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-29 17:02           ` James Bottomley
  2025-03-30  8:33             ` Christian Brauner
@ 2025-03-31 10:36             ` Jan Kara
  2025-03-31 14:49               ` James Bottomley
  2025-03-31 23:33               ` Christian Brauner
  1 sibling, 2 replies; 120+ messages in thread
From: Jan Kara @ 2025-03-31 10:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, linux-fsdevel, jack, linux-kernel, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Sat 29-03-25 13:02:32, James Bottomley wrote:
> On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > Add the necessary infrastructure changes to support freezing for
> > > suspend and hibernate.
> > > 
> > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > increased. This should all that's needed to wire up power.
> > > 
> > > This will be in vfs-6.16.super shortly.
> > > 
> > > ---
> > > Changes in v2:
> > > - Don't grab reference in the iterator make that a requirement for
> > > the callers that need custom behavior.
> > > - Link to v1:
> > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > 
> > Given I've been a bit quiet on this, I thought I'd better explain
> > what's going on: I do have these built, but I made the mistake of
> > doing a dist-upgrade on my testing VM master image and it pulled in a
> > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > upgraded in place I don't have the old image so I'm spending my time
> > currently debugging systemd ... normal service will hopefully resume
> > shortly.
> 
> I found the systemd bug
> 
> https://github.com/systemd/systemd/issues/36888
> 
> And hacked around it, so I can confirm a simple hibernate/resume works
> provided the sd_start_write() patches are applied (and the hooks are
> plumbed in to pm).
> 
> There is an oddity: the systemd-journald process that would usually
> hang hibernate in D wait goes into R but seems to be hung and can't be
> killed by the watchdog even with a -9.  It's stack trace says it's
> still stuck in sb_start_write:
> 
> [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> [<0>] do_page_mkwrite+0x38/0xa0
> [<0>] do_wp_page+0xd5/0xba0
> [<0>] __handle_mm_fault+0xa29/0xca0
> [<0>] handle_mm_fault+0x16a/0x2d0
> [<0>] do_user_addr_fault+0x3ab/0x810
> [<0>] exc_page_fault+0x68/0x150
> [<0>] asm_exc_page_fault+0x22/0x30
> 
> So I think there's something funny going on in thaw.

As Christian wrote, it seems systemd-journald does a memory store to an
mmapped file and gets blocked in sb_start_write() while handling the page
fault. What's strange is that R state. Is the task really executing on some
CPU, or does it only have 'R' state (i.e., it got woken but never scheduled)?
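
As an illustration of the access pattern in question, the following is a
small userspace sketch of a store through a shared file mapping. The helper
name demo_mmap_store(), the /tmp path and the 4096-byte length are
assumptions made for the example. On a filesystem that implements
->page_mkwrite, such a store can enter the write-fault path seen in the
trace; on a frozen filesystem it would be expected to block there.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a temporary file, map it shared and writable, and store one byte
 * through the mapping. Returns 0 on success and -1 on any failure.
 */
static int demo_mmap_store(void)
{
	char path[] = "/tmp/demo-mmap-XXXXXX";
	int fd = mkstemp(path);
	int ret = -1;
	char *map;

	if (fd < 0)
		return -1;
	unlink(path);	/* the open descriptor keeps the file alive */

	if (ftruncate(fd, 4096) == 0) {
		map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			   fd, 0);
		if (map != MAP_FAILED) {
			/* A plain store: no write() syscall is involved. */
			map[0] = 'x';
			ret = (map[0] == 'x') ? 0 : -1;
			munmap(map, 4096);
		}
	}
	close(fd);
	return ret;
}
```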

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* [PATCH 0/2] efivarfs: support freeze/thaw
  2025-03-29  8:42       ` [PATCH v2 0/6] Extend freeze support to " Christian Brauner
                           ` (6 preceding siblings ...)
  2025-03-29 14:04         ` [PATCH v2 0/6] Extend freeze support to " James Bottomley
@ 2025-03-31 12:42         ` Christian Brauner
  2025-03-31 12:42           ` [PATCH 1/2] libfs: export find_next_child() Christian Brauner
                             ` (3 more replies)
  7 siblings, 4 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 12:42 UTC (permalink / raw)
  To: linux-fsdevel, jack, Ard Biesheuvel
  Cc: Christian Brauner, linux-efi, linux-kernel, James Bottomley,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

Allow efivarfs to take part in resyncing variable state during system
hibernation and suspend. Add freeze/thaw support.

This is a pretty straightforward implementation. We simply add regular
freeze/thaw support for both userspace and the kernel. This works
without any big issues and congrats afaict efivars is the first
pseudofilesystem that adds support for filesystem freezing and thawing.

The simplicity comes from the fact that we simply always resync variable
state after efivarfs has been frozen. It doesn't matter whether that's
because of suspend, userspace initiated freeze or hibernation. Efivars
is simple enough that it doesn't matter that we walk all dentries. There
are no directories and there aren't insane amounts of entries and both
freeze/thaw are already heavy-handed operations. If userspace initiated
a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user
namespace (as that's where efivarfs is mounted) so it can't be triggered
by random userspace. IOW, we really really don't care.

@Ard, if you're fine with this (and agree with the patch) I'd carry this
on a stable branch vfs-6.16.super that you can pull into efivarfs once
-rc1 is out.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      libfs: export find_next_child()
      efivarfs: support freeze/thaw

 fs/efivarfs/internal.h |   1 -
 fs/efivarfs/super.c    | 196 +++++++++++++------------------------------------
 fs/internal.h          |   1 +
 fs/libfs.c             |   3 +-
 4 files changed, 54 insertions(+), 147 deletions(-)
---
base-commit: 8876e79faf32838d05488996b896cb40247a4a8a
change-id: 20250331-work-freeze-ae6260c405b9

* [PATCH 1/2] libfs: export find_next_child()
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
@ 2025-03-31 12:42           ` Christian Brauner
  2025-03-31 12:42           ` [PATCH 2/2] efivarfs: support freeze/thaw Christian Brauner
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 12:42 UTC (permalink / raw)
  To: linux-fsdevel, jack, Ard Biesheuvel
  Cc: Christian Brauner, linux-efi, linux-kernel, James Bottomley,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

Export find_next_child() so it can be used by efivarfs.
Keep it internal for now. There's no reason to advertise this
kernel-wide.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/internal.h | 1 +
 fs/libfs.c    | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/internal.h b/fs/internal.h
index b9b3e29a73fd..b9949707a152 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -343,3 +343,4 @@ static inline bool path_mounted(const struct path *path)
 void file_f_owner_release(struct file *file);
 bool file_seek_cur_needs_f_lock(struct file *file);
 int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map);
+struct dentry *find_next_child(struct dentry *parent, struct dentry *prev);
diff --git a/fs/libfs.c b/fs/libfs.c
index 6393d7c49ee6..f2ef377d2665 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -583,7 +583,7 @@ const struct file_operations simple_offset_dir_operations = {
 	.fsync		= noop_fsync,
 };
 
-static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev)
+struct dentry *find_next_child(struct dentry *parent, struct dentry *prev)
 {
 	struct dentry *child = NULL, *d;
 
@@ -603,6 +603,7 @@ static struct dentry *find_next_child(struct dentry *parent, struct dentry *prev
 	dput(prev);
 	return child;
 }
+EXPORT_SYMBOL(find_next_child);
 
 void simple_recursive_removal(struct dentry *dentry,
                               void (*callback)(struct dentry *))

-- 
2.47.2


* [PATCH 2/2] efivarfs: support freeze/thaw
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
  2025-03-31 12:42           ` [PATCH 1/2] libfs: export find_next_child() Christian Brauner
@ 2025-03-31 12:42           ` Christian Brauner
  2025-03-31 14:46             ` James Bottomley
  2025-04-01 19:31             ` James Bottomley
  2025-03-31 14:05           ` [PATCH 0/2] " Ard Biesheuvel
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
  3 siblings, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 12:42 UTC (permalink / raw)
  To: linux-fsdevel, jack, Ard Biesheuvel
  Cc: Christian Brauner, linux-efi, linux-kernel, James Bottomley,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

Allow efivarfs to take part in resyncing variable state during system
hibernation and suspend. Add freeze/thaw support.

This is a pretty straightforward implementation. We simply add regular
freeze/thaw support for both userspace and the kernel. This works
without any big issues and congrats afaict efivars is the first
pseudofilesystem that adds support for filesystem freezing and thawing.

The simplicity comes from the fact that we simply always resync variable
state after efivarfs has been frozen. It doesn't matter whether that's
because of suspend, userspace initiated freeze or hibernation. Efivars
is simple enough that it doesn't matter that we walk all dentries. There
are no directories and there aren't insane amounts of entries and both
freeze/thaw are already heavy-handed operations. We really really don't
need to care.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/efivarfs/internal.h |   1 -
 fs/efivarfs/super.c    | 196 +++++++++++++------------------------------------
 2 files changed, 51 insertions(+), 146 deletions(-)

diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h
index ac6a1dd0a6a5..f913b6824289 100644
--- a/fs/efivarfs/internal.h
+++ b/fs/efivarfs/internal.h
@@ -17,7 +17,6 @@ struct efivarfs_fs_info {
 	struct efivarfs_mount_opts mount_opts;
 	struct super_block *sb;
 	struct notifier_block nb;
-	struct notifier_block pm_nb;
 };
 
 struct efi_variable {
diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 0486e9b68bc6..567e849a03fe 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -20,6 +20,7 @@
 #include <linux/printk.h>
 
 #include "internal.h"
+#include "../internal.h"
 
 static int efivarfs_ops_notifier(struct notifier_block *nb, unsigned long event,
 				 void *data)
@@ -119,12 +120,18 @@ static int efivarfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 
 	return 0;
 }
+
+static int efivarfs_freeze_fs(struct super_block *sb);
+static int efivarfs_unfreeze_fs(struct super_block *sb);
+
 static const struct super_operations efivarfs_ops = {
 	.statfs = efivarfs_statfs,
 	.drop_inode = generic_delete_inode,
 	.alloc_inode = efivarfs_alloc_inode,
 	.free_inode = efivarfs_free_inode,
 	.show_options = efivarfs_show_options,
+	.freeze_fs = efivarfs_freeze_fs,
+	.unfreeze_fs = efivarfs_unfreeze_fs,
 };
 
 /*
@@ -367,8 +374,6 @@ static int efivarfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (err)
 		return err;
 
-	register_pm_notifier(&sfi->pm_nb);
-
 	return efivar_init(efivarfs_callback, sb, true);
 }
 
@@ -393,48 +398,6 @@ static const struct fs_context_operations efivarfs_context_ops = {
 	.reconfigure	= efivarfs_reconfigure,
 };
 
-struct efivarfs_ctx {
-	struct dir_context ctx;
-	struct super_block *sb;
-	struct dentry *dentry;
-};
-
-static bool efivarfs_actor(struct dir_context *ctx, const char *name, int len,
-			   loff_t offset, u64 ino, unsigned mode)
-{
-	unsigned long size;
-	struct efivarfs_ctx *ectx = container_of(ctx, struct efivarfs_ctx, ctx);
-	struct qstr qstr = { .name = name, .len = len };
-	struct dentry *dentry = d_hash_and_lookup(ectx->sb->s_root, &qstr);
-	struct inode *inode;
-	struct efivar_entry *entry;
-	int err;
-
-	if (IS_ERR_OR_NULL(dentry))
-		return true;
-
-	inode = d_inode(dentry);
-	entry = efivar_entry(inode);
-
-	err = efivar_entry_size(entry, &size);
-	size += sizeof(__u32);	/* attributes */
-	if (err)
-		size = 0;
-
-	inode_lock_nested(inode, I_MUTEX_CHILD);
-	i_size_write(inode, size);
-	inode_unlock(inode);
-
-	if (!size) {
-		ectx->dentry = dentry;
-		return false;
-	}
-
-	dput(dentry);
-
-	return true;
-}
-
 static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 				  unsigned long name_size, void *data)
 {
@@ -474,111 +437,59 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
-static void efivarfs_deactivate_super_work(struct work_struct *work)
-{
-	struct super_block *s = container_of(work, struct super_block,
-					     destroy_work);
-	/*
-	 * note: here s->destroy_work is free for reuse (which
-	 * will happen in deactivate_super)
-	 */
-	deactivate_super(s);
-}
-
 static struct file_system_type efivarfs_type;
 
-static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
-			      void *ptr)
+static int efivarfs_freeze_fs(struct super_block *sb)
 {
-	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
-						    pm_nb);
-	struct path path;
-	struct efivarfs_ctx ectx = {
-		.ctx = {
-			.actor	= efivarfs_actor,
-		},
-		.sb = sfi->sb,
-	};
-	struct file *file;
-	struct super_block *s = sfi->sb;
-	static bool rescan_done = true;
-
-	if (action == PM_HIBERNATION_PREPARE) {
-		rescan_done = false;
-		return NOTIFY_OK;
-	} else if (action != PM_POST_HIBERNATION) {
-		return NOTIFY_DONE;
-	}
-
-	if (rescan_done)
-		return NOTIFY_DONE;
-
-	/* ensure single superblock is alive and pin it */
-	if (!atomic_inc_not_zero(&s->s_active))
-		return NOTIFY_DONE;
-
-	pr_info("efivarfs: resyncing variable state\n");
+	/* Nothing for us to do. */
+	return 0;
+}
 
-	path.dentry = sfi->sb->s_root;
+static int efivarfs_unfreeze_fs(struct super_block *sb)
+{
+	struct dentry *child = NULL;
 
 	/*
-	 * do not add SB_KERNMOUNT which a single superblock could
-	 * expose to userspace and which also causes MNT_INTERNAL, see
-	 * below
+	 * Unconditionally resync the variable state on a thaw request.
+	 * Given the size of efivarfs it really doesn't matter to simply
+	 * iterate through all of the entries and resync. Freeze/thaw
+	 * requests are rare enough for that to not matter and the
+	 * number of entries is pretty low too. So we really don't care.
 	 */
-	path.mnt = vfs_kern_mount(&efivarfs_type, 0,
-				  efivarfs_type.name, NULL);
-	if (IS_ERR(path.mnt)) {
-		pr_err("efivarfs: internal mount failed\n");
-		/*
-		 * We may be the last pinner of the superblock but
-		 * calling efivarfs_kill_sb from within the notifier
-		 * here would deadlock trying to unregister it
-		 */
-		INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work);
-		schedule_work(&s->destroy_work);
-		return PTR_ERR(path.mnt);
+	pr_info("efivarfs: resyncing variable state\n");
+	for (;;) {
+		int err;
+		size_t size;
+		struct inode *inode;
+		struct efivar_entry *entry;
+
+		child = find_next_child(sb->s_root, child);
+		if (!child)
+			break;
+
+		inode = d_inode(child);
+		entry = efivar_entry(inode);
+
+		err = efivar_entry_size(entry, &size);
+		if (err)
+			size = 0;
+		else
+			size += sizeof(__u32);
+
+		inode_lock(inode);
+		i_size_write(inode, size);
+		inode_unlock(inode);
+
+		if (!err)
+			continue;
+
+		/* The variable doesn't exist anymore, delete it. */
+		simple_recursive_removal(child, NULL);
 	}
 
-	/* path.mnt now has pin on superblock, so this must be above one */
-	atomic_dec(&s->s_active);
-
-	file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
-				current_cred());
-	/*
-	 * safe even if last put because no MNT_INTERNAL means this
-	 * will do delayed deactivate_super and not deadlock
-	 */
-	mntput(path.mnt);
-	if (IS_ERR(file))
-		return NOTIFY_DONE;
-
-	rescan_done = true;
-
-	/*
-	 * First loop over the directory and verify each entry exists,
-	 * removing it if it doesn't
-	 */
-	file->f_pos = 2;	/* skip . and .. */
-	do {
-		ectx.dentry = NULL;
-		iterate_dir(file, &ectx.ctx);
-		if (ectx.dentry) {
-			pr_info("efivarfs: removing variable %pd\n",
-				ectx.dentry);
-			simple_recursive_removal(ectx.dentry, NULL);
-			dput(ectx.dentry);
-		}
-	} while (ectx.dentry);
-	fput(file);
-
-	/*
-	 * then loop over variables, creating them if there's no matching
-	 * dentry
-	 */
-	efivar_init(efivarfs_check_missing, sfi->sb, false);
-
-	return NOTIFY_OK;
+	efivar_init(efivarfs_check_missing, sb, false);
+	pr_info("efivarfs: finished resyncing variable state\n");
+	return 0;
 }
 
 static int efivarfs_init_fs_context(struct fs_context *fc)
@@ -598,9 +509,6 @@ static int efivarfs_init_fs_context(struct fs_context *fc)
 	fc->s_fs_info = sfi;
 	fc->ops = &efivarfs_context_ops;
 
-	sfi->pm_nb.notifier_call = efivarfs_pm_notify;
-	sfi->pm_nb.priority = 0;
-
 	return 0;
 }
 
@@ -608,9 +516,7 @@ static void efivarfs_kill_sb(struct super_block *sb)
 {
 	struct efivarfs_fs_info *sfi = sb->s_fs_info;
 
-	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb);
 	kill_litter_super(sb);
-	unregister_pm_notifier(&sfi->pm_nb);
 
 	kfree(sfi);
 }

-- 
2.47.2


* Re: [PATCH 0/2] efivarfs: support freeze/thaw
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
  2025-03-31 12:42           ` [PATCH 1/2] libfs: export find_next_child() Christian Brauner
  2025-03-31 12:42           ` [PATCH 2/2] efivarfs: support freeze/thaw Christian Brauner
@ 2025-03-31 14:05           ` Ard Biesheuvel
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
  3 siblings, 0 replies; 120+ messages in thread
From: Ard Biesheuvel @ 2025-03-31 14:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-efi, linux-kernel, James Bottomley,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

On Mon, 31 Mar 2025 at 14:42, Christian Brauner <brauner@kernel.org> wrote:
>
> Allow efivarfs to partake to resync variable state during system
> hibernation and suspend. Add freeze/thaw support.
>
> This is a pretty straightforward implementation. We simply add regular
> freeze/thaw support for both userspace and the kernel. This works
> without any big issues and congrats afaict efivars is the first
> pseudofilesystem that adds support for filesystem freezing and thawing.
>
> The simplicity comes from the fact that we simply always resync variable
> state after efivarfs has been frozen. It doesn't matter whether that's
> because of suspend, userspace initiated freeze or hibernation. Efivars
> is simple enough that it doesn't matter that we walk all dentries. There
> are no directories and there aren't insane amounts of entries and both
> freeze/thaw are already heavy-handed operations. If userspace initiated
> a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user
> namespace (as that's where efivarfs is mounted) so it can't be triggered
> by random userspace. IOW, we really really don't care.
>
> @Ard, if you're fine with this (and agree with the patch) I'd carry this
> on a stable branch vfs-6.16.super that you can pull into efivarfs once
> -rc1 is out.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (2):
>       libfs: export find_next_child()
>       efivarfs: support freeze/thaw
>

This looks fine to me: I'm an EFI expert, not a VFS expert, so I am quite
pleased that you have taken the time to implement this properly.

Acked-by: Ard Biesheuvel <ardb@kernel.org>

I don't anticipate a lot of parallel development going on in efivarfs
so taking this through the VFS tree is fine. I'll let you know if/when
I merge it into the EFI tree so feel free to rebase/tweak the branch
otherwise.

* Re: [PATCH 2/2] efivarfs: support freeze/thaw
  2025-03-31 12:42           ` [PATCH 2/2] efivarfs: support freeze/thaw Christian Brauner
@ 2025-03-31 14:46             ` James Bottomley
  2025-03-31 15:03               ` Christian Brauner
  2025-04-01 19:31             ` James Bottomley
  1 sibling, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-31 14:46 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack, Ard Biesheuvel
  Cc: linux-efi, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Mon, 2025-03-31 at 14:42 +0200, Christian Brauner wrote:
> Allow efivarfs to partake to resync variable state during system
> hibernation and suspend. Add freeze/thaw support.
> 
> This is a pretty straightforward implementation. We simply add
> regular freeze/thaw support for both userspace and the kernel. This
> works without any big issues and congrats afaict efivars is the first
> pseudofilesystem that adds support for filesystem freezing and
> thawing.
> 
> The simplicity comes from the fact that we simply always resync
> variable state after efivarfs has been frozen. It doesn't matter
> whether that's because of suspend, userspace initiated freeze or
> hibernation. Efivars is simple enough that it doesn't matter that we
> walk all dentries. There are no directories and there aren't insane
> amounts of entries and both freeze/thaw are already heavy-handed
> operations. We really really don't need to care.

Just as a point of order: this can't work until freeze/thaw is actually
plumbed into suspend/resume.

> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/efivarfs/internal.h |   1 -
>  fs/efivarfs/super.c    | 196 +++++++++++++--------------------------
> ----------
>  2 files changed, 51 insertions(+), 146 deletions(-)
> 
> diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h
> index ac6a1dd0a6a5..f913b6824289 100644
> --- a/fs/efivarfs/internal.h
> +++ b/fs/efivarfs/internal.h
> @@ -17,7 +17,6 @@ struct efivarfs_fs_info {
>  	struct efivarfs_mount_opts mount_opts;
>  	struct super_block *sb;
>  	struct notifier_block nb;
> -	struct notifier_block pm_nb;
>  };
>  
>  struct efi_variable {
> diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
> index 0486e9b68bc6..567e849a03fe 100644
> --- a/fs/efivarfs/super.c
> +++ b/fs/efivarfs/super.c
> @@ -20,6 +20,7 @@
>  #include <linux/printk.h>
>  
>  #include "internal.h"
> +#include "../internal.h"
>  
>  static int efivarfs_ops_notifier(struct notifier_block *nb, unsigned
> long event,
>  				 void *data)
> @@ -119,12 +120,18 @@ static int efivarfs_statfs(struct dentry
> *dentry, struct kstatfs *buf)
>  
>  	return 0;
>  }
> +
> +static int efivarfs_freeze_fs(struct super_block *sb);
> +static int efivarfs_unfreeze_fs(struct super_block *sb);
> +
>  static const struct super_operations efivarfs_ops = {
>  	.statfs = efivarfs_statfs,
>  	.drop_inode = generic_delete_inode,
>  	.alloc_inode = efivarfs_alloc_inode,
>  	.free_inode = efivarfs_free_inode,
>  	.show_options = efivarfs_show_options,
> +	.freeze_fs = efivarfs_freeze_fs,

Why is it necessary to have a freeze_fs operation?  The current code in
super.c:freeze_super() reads:

	if (sb->s_op->freeze_fs) {
		ret = sb->s_op->freeze_fs(sb);

So it would seem that setting this to NULL has exactly the same effect
as providing a null method.

> +	.unfreeze_fs = efivarfs_unfreeze_fs,
>  };
>  
>  /*
> 
[...]
> @@ -608,9 +516,7 @@ static void efivarfs_kill_sb(struct super_block
> *sb)
>  {
>  	struct efivarfs_fs_info *sfi = sb->s_fs_info;
>  
> -	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi-
> >nb);

This is an extraneous deletion of an unrelated notifier which efivarfs
still needs to listen for ops updates from the efi subsystem.

Regards,

James


* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-31 10:36             ` Jan Kara
@ 2025-03-31 14:49               ` James Bottomley
  2025-03-31 23:33               ` Christian Brauner
  1 sibling, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-03-31 14:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Mon, 2025-03-31 at 12:36 +0200, Jan Kara wrote:
> On Sat 29-03-25 13:02:32, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing
> > > > for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of
> > > > bugs
> > > > increased. This should all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab reference in the iterator make that a requirement
> > > > for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled
> > > in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since
> > > I
> > > upgraded in place I don't have the old image so I'm spending my
> > > time
> > > currently debugging systemd ... normal service will hopefully
> > > resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> > 
> > And hacked around it, so I can confirm a simple hibernate/resume
> > works
> > provided the sd_start_write() patches are applied (and the hooks
> > are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't
> > be killed by the watchdog even with a -9.  It's stack trace says
> > it's still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> As Christian wrote, it seems systemd-journald does a memory store to
> mmapped file and gets blocked on sb_start_write() while doing the
> page fault. What's strange is that R state. Is the task really
> executing on some CPU or it only has 'R' state (i.e., got woken but
> never scheduled)?

Yes, ps shows it definitely stuck in R state.  The trace above identifies
the rwsem wait as being at set_current_state(), which seems to imply
it never returns from schedule() even though it's in state R.

I've actually managed to reproduce this now just doing filesystem
freeze and thaw without using the freezer, so I'll continue
investigating.

Regards,

James


* Re: [PATCH 2/2] efivarfs: support freeze/thaw
  2025-03-31 14:46             ` James Bottomley
@ 2025-03-31 15:03               ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 15:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

On Mon, Mar 31, 2025 at 10:46:43AM -0400, James Bottomley wrote:
> On Mon, 2025-03-31 at 14:42 +0200, Christian Brauner wrote:
> > Allow efivarfs to partake to resync variable state during system
> > hibernation and suspend. Add freeze/thaw support.
> > 
> > This is a pretty straightforward implementation. We simply add
> > regular freeze/thaw support for both userspace and the kernel. This
> > works without any big issues and congrats afaict efivars is the first
> > pseudofilesystem that adds support for filesystem freezing and
> > thawing.
> > 
> > The simplicity comes from the fact that we simply always resync
> > variable state after efivarfs has been frozen. It doesn't matter
> > whether that's because of suspend, userspace initiated freeze or
> > hibernation. Efivars is simple enough that it doesn't matter that we
> > walk all dentries. There are no directories and there aren't insane
> > amounts of entries and both freeze/thaw are already heavy-handed
> > operations. We really really don't need to care.
> 
> Just as a point of order: this can't actually work until freeze/thaw is
> actually plumbed into suspend/resume.
> 
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/efivarfs/internal.h |   1 -
> >  fs/efivarfs/super.c    | 196 +++++++++++++--------------------------
> > ----------
> >  2 files changed, 51 insertions(+), 146 deletions(-)
> > 
> > diff --git a/fs/efivarfs/internal.h b/fs/efivarfs/internal.h
> > index ac6a1dd0a6a5..f913b6824289 100644
> > --- a/fs/efivarfs/internal.h
> > +++ b/fs/efivarfs/internal.h
> > @@ -17,7 +17,6 @@ struct efivarfs_fs_info {
> >  	struct efivarfs_mount_opts mount_opts;
> >  	struct super_block *sb;
> >  	struct notifier_block nb;
> > -	struct notifier_block pm_nb;
> >  };
> >  
> >  struct efi_variable {
> > diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
> > index 0486e9b68bc6..567e849a03fe 100644
> > --- a/fs/efivarfs/super.c
> > +++ b/fs/efivarfs/super.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/printk.h>
> >  
> >  #include "internal.h"
> > +#include "../internal.h"
> >  
> >  static int efivarfs_ops_notifier(struct notifier_block *nb, unsigned
> > long event,
> >  				 void *data)
> > @@ -119,12 +120,18 @@ static int efivarfs_statfs(struct dentry
> > *dentry, struct kstatfs *buf)
> >  
> >  	return 0;
> >  }
> > +
> > +static int efivarfs_freeze_fs(struct super_block *sb);
> > +static int efivarfs_unfreeze_fs(struct super_block *sb);
> > +
> >  static const struct super_operations efivarfs_ops = {
> >  	.statfs = efivarfs_statfs,
> >  	.drop_inode = generic_delete_inode,
> >  	.alloc_inode = efivarfs_alloc_inode,
> >  	.free_inode = efivarfs_free_inode,
> >  	.show_options = efivarfs_show_options,
> > +	.freeze_fs = efivarfs_freeze_fs,
> 
> Why is it necessary to have a freeze_fs operation?  The current code in
> super.c:freeze_super() reads:

Fwiw, I've explained this already in prior mails. It's the same behavior as
for the ioctl, where we check whether the filesystem provides either a
->freeze_fs or ->freeze_super method. If neither is provided, the
filesystem is assumed to not have freeze support.

> 
> 	if (sb->s_op->freeze_fs) {
> 		ret = sb->s_op->freeze_fs(sb);
> 
> So it would seem that setting this to NULL has exactly the same effect
> as providing a null method.

No, it would cause freeze to not be called.

IOW, any filesystem that provides neither a freeze_super nor a
freeze_fs method doesn't support freeze (that's how the ioctls work as
well), which allows us to only call into filesystems that are able to
properly freeze so we don't need pointless FS_* flags. With only thaw
provided, we would end up thawing something that was never frozen. So both
are provided, and the freeze methods function as the indicator of whether
freezing/thawing is supported.

That could be changed but not in this series. We could also provide
noop_freeze just like we have noop_sync but again, not for this series.
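
A minimal userspace sketch of the rule described above, assuming pared-down
stand-ins for the kernel structures. The names demo_sb, demo_super_ops,
demo_sb_can_freeze() and demo_noop_freeze() are hypothetical and do not
exist in the kernel; only the "freeze method present means freeze is
supported" check is taken from the discussion.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_sb;

/* Hypothetical subset of the superblock operations relevant to freezing. */
struct demo_super_ops {
	int (*freeze_super)(struct demo_sb *sb);
	int (*freeze_fs)(struct demo_sb *sb);
	int (*unfreeze_fs)(struct demo_sb *sb);
};

struct demo_sb {
	const struct demo_super_ops *s_op;
};

/*
 * A filesystem counts as freezable only if it supplies one of the freeze
 * methods; an unfreeze method alone is not enough.
 */
static bool demo_sb_can_freeze(const struct demo_sb *sb)
{
	return sb->s_op->freeze_super || sb->s_op->freeze_fs;
}

/* Empty method, in the spirit of the no-op efivarfs_freeze_fs(). */
static int demo_noop_freeze(struct demo_sb *sb)
{
	(void)sb;
	return 0;
}
```

In this model a filesystem that only sets unfreeze_fs is skipped entirely,
which is why the efivarfs patch supplies an empty freeze method alongside
the thaw method that does the real work.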

> 
> > +	.unfreeze_fs = efivarfs_unfreeze_fs,
> >  };
> >  
> >  /*
> > 
> [...]
> > @@ -608,9 +516,7 @@ static void efivarfs_kill_sb(struct super_block
> > *sb)
> >  {
> >  	struct efivarfs_fs_info *sfi = sb->s_fs_info;
> >  
> > -	blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi-
> > >nb);
> 
> This is an extraneous deletion of an unrelated notifier which efivarfs
> still needs to listen for ops updates from the efi subsystem.

At first I was bewildered because I thought you were talking about pm_nb
for some reason and was ready to explode. Man, I need a post LSFMM
vacation. :)

Thanks for spotting this. This is now fixed by adding back:

blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb);

* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-03-27 14:06 ` [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read James Bottomley
@ 2025-03-31 19:51   ` James Bottomley
  2025-03-31 23:32     ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-03-31 19:51 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: mcgrof, jack, hch, david, rafael, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
[...]
> -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool
> reader)
> +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool
> reader,
> +			      bool freeze)
>  {
>  	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
>  	bool wait;
> @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct
> percpu_rw_semaphore *sem, bool reader)
>  	spin_unlock_irq(&sem->waiters.lock);
>  
>  	while (wait) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_UNINTERRUPTIBLE |
> +				  freeze ? TASK_FREEZABLE : 0);

This is a bit embarrassing, the bug I've been chasing is here: the ?
operator is lower in precedence than | meaning this expression always
evaluates to TASK_FREEZABLE and nothing else (which is why the process
goes into R state and never wakes up).
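
A small userspace sketch of the precedence problem, assuming
placeholder flag values (the `SKETCH_*` constants are illustrative, not
copied from the kernel headers). The first helper spells out, with
explicit parentheses, how the posted expression is parsed; the second
shows the intended grouping.

```c
#include <stdbool.h>

/* Illustrative placeholder values; the real ones live in the kernel. */
#define SKETCH_TASK_UNINTERRUPTIBLE 0x0002u
#define SKETCH_TASK_FREEZABLE       0x2000u

/*
 * How "A | freeze ? B : 0" parses: '|' binds tighter than '?:', so the
 * condition is (A | freeze), which is always non-zero.
 */
static unsigned int state_as_parsed(bool freeze)
{
	return (SKETCH_TASK_UNINTERRUPTIBLE | freeze) ? SKETCH_TASK_FREEZABLE : 0;
}

/* Intended grouping: the conditional only selects the optional flag. */
static unsigned int state_intended(bool freeze)
{
	return SKETCH_TASK_UNINTERRUPTIBLE | (freeze ? SKETCH_TASK_FREEZABLE : 0);
}
```

With these definitions `state_as_parsed()` yields the freezable flag
alone for both inputs, while `state_intended()` always keeps the
uninterruptible bit.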

Let me fix that and redo all the testing.

Regards,

James


* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-03-31 19:51   ` James Bottomley
@ 2025-03-31 23:32     ` Christian Brauner
  2025-04-01  1:13       ` James Bottomley
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 23:32 UTC (permalink / raw)
  To: James Bottomley, jack
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> [...]
> > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool
> > reader)
> > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool
> > reader,
> > +			      bool freeze)
> >  {
> >  	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> >  	bool wait;
> > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct
> > percpu_rw_semaphore *sem, bool reader)
> >  	spin_unlock_irq(&sem->waiters.lock);
> >  
> >  	while (wait) {
> > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > +				  freeze ? TASK_FREEZABLE : 0);
> 
> This is a bit embarrassing, the bug I've been chasing is here: the ?
> operator is lower in precedence than | meaning this expression always
> evaluates to TASK_FREEZABLE and nothing else (which is why the process
> goes into R state and never wakes up).
> 
> Let me fix that and redo all the testing.

I don't think that's it. I think you're missing making pagefault writers such
as systemd-journald freezable:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..528e73f192ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
        percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-                                  level == SB_FREEZE_WRITE);
+                                  (level == SB_FREEZE_WRITE ||
+                                   level == SB_FREEZE_PAGEFAULT));
 }

 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
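
The effect of the hunk above can be sketched as a pair of predicates.
This is a userspace illustration only: the `SKETCH_*` enum restates the
freeze levels under the assumption that write is level 1 and pagefault
is level 2, and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Freeze levels restated for the sketch; values are an assumption. */
enum sketch_freeze_level {
	SKETCH_SB_FREEZE_WRITE = 1,
	SKETCH_SB_FREEZE_PAGEFAULT = 2,
	SKETCH_SB_FREEZE_FS = 3,
};

/* Before the change: only ordinary writers wait in a freezable state. */
static bool waits_freezably_before(int level)
{
	return level == SKETCH_SB_FREEZE_WRITE;
}

/* After the change: writers coming in through a page fault do as well. */
static bool waits_freezably_after(int level)
{
	return level == SKETCH_SB_FREEZE_WRITE ||
	       level == SKETCH_SB_FREEZE_PAGEFAULT;
}
```

A task such as systemd-journald that stores to an mmapped file blocks
at the pagefault level, which is the case the second predicate adds.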

* Re: [PATCH v2 0/6] Extend freeze support to suspend and hibernate
  2025-03-31 10:36             ` Jan Kara
  2025-03-31 14:49               ` James Bottomley
@ 2025-03-31 23:33               ` Christian Brauner
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-03-31 23:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: James Bottomley, linux-fsdevel, linux-kernel, mcgrof, hch, david,
	rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Mon, Mar 31, 2025 at 12:36:27PM +0200, Jan Kara wrote:
> On Sat 29-03-25 13:02:32, James Bottomley wrote:
> > On Sat, 2025-03-29 at 10:04 -0400, James Bottomley wrote:
> > > On Sat, 2025-03-29 at 09:42 +0100, Christian Brauner wrote:
> > > > Add the necessary infrastructure changes to support freezing for
> > > > suspend and hibernate.
> > > > 
> > > > Just got back from LSFMM. So still jetlagged and likelihood of bugs
> > > > increased. This should be all that's needed to wire up power.
> > > > 
> > > > This will be in vfs-6.16.super shortly.
> > > > 
> > > > ---
> > > > Changes in v2:
> > > > - Don't grab reference in the iterator make that a requirement for
> > > > the callers that need custom behavior.
> > > > - Link to v1:
> > > > https://lore.kernel.org/r/20250328-work-freeze-v1-0-a2c3a6b0e7a6@kernel.org
> > > 
> > > Given I've been a bit quiet on this, I thought I'd better explain
> > > what's going on: I do have these built, but I made the mistake of
> > > doing a dist-upgrade on my testing VM master image and it pulled in a
> > > version of systemd (257.4-3) that has a broken hibernate.  Since I
> > > upgraded in place I don't have the old image so I'm spending my time
> > > currently debugging systemd ... normal service will hopefully resume
> > > shortly.
> > 
> > I found the systemd bug
> > 
> > https://github.com/systemd/systemd/issues/36888
> > 
> > And hacked around it, so I can confirm a simple hibernate/resume works
> > provided the sb_start_write() patches are applied (and the hooks are
> > plumbed in to pm).
> > 
> > There is an oddity: the systemd-journald process that would usually
> > hang hibernate in D wait goes into R but seems to be hung and can't be
> > killed by the watchdog even with a -9.  Its stack trace says it's
> > still stuck in sb_start_write:
> > 
> > [<0>] percpu_rwsem_wait.constprop.10+0xd1/0x140
> > [<0>] ext4_page_mkwrite+0x3c1/0x560 [ext4]
> > [<0>] do_page_mkwrite+0x38/0xa0
> > [<0>] do_wp_page+0xd5/0xba0
> > [<0>] __handle_mm_fault+0xa29/0xca0
> > [<0>] handle_mm_fault+0x16a/0x2d0
> > [<0>] do_user_addr_fault+0x3ab/0x810
> > [<0>] exc_page_fault+0x68/0x150
> > [<0>] asm_exc_page_fault+0x22/0x30
> > 
> > So I think there's something funny going on in thaw.
> 
> As Christian wrote, it seems systemd-journald does a memory store to
> mmapped file and gets blocked on sb_start_write() while doing the page
> fault. What's strange is that R state. Is the task really executing on some
> CPU or it only has 'R' state (i.e., got woken but never scheduled)?

I think the issue is that we need to also make pagefault based writers
such as systemd-journald freezable:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..528e73f192ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
        percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-                                  level == SB_FREEZE_WRITE);
+                                  (level == SB_FREEZE_WRITE ||
+                                   level == SB_FREEZE_PAGEFAULT));
 }

* [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-03-31 12:42         ` [PATCH 0/2] efivarfs: support freeze/thaw Christian Brauner
                             ` (2 preceding siblings ...)
  2025-03-31 14:05           ` [PATCH 0/2] " Ard Biesheuvel
@ 2025-04-01  0:32           ` Christian Brauner
  2025-04-01  0:32             ` [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing Christian Brauner
                               ` (9 more replies)
  3 siblings, 10 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack, rafael
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

The whole shebang can also be found at:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze

I know nothing about power or hibernation. I've tested it as best as I
could. Works for me (TM).

I need to catch some actual sleep now...

---

Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Otherwise it risks thawing filesystems it didn't own. This could be
done differently by, e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.
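
A minimal userspace sketch of the ownership idea, assuming a heavily
simplified superblock with a single freeze and a single recorded owner.
`struct sketch_sb`, `sketch_freeze()` and `sketch_thaw()` are
hypothetical names, and the error codes are chosen for illustration
only; the real implementation is more involved.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified superblock: one freeze, one recorded owner. */
struct sketch_sb {
	bool frozen;
	const void *freeze_owner;	/* NULL if the freezer named no owner */
};

/* Freeze and remember who asked; an already frozen fs reports -EBUSY. */
static int sketch_freeze(struct sketch_sb *sb, const void *owner)
{
	if (sb->frozen)
		return -EBUSY;
	sb->frozen = true;
	sb->freeze_owner = owner;
	return 0;
}

/* Thaw only if the caller is the recorded owner of the freeze. */
static int sketch_thaw(struct sketch_sb *sb, const void *owner)
{
	if (!sb->frozen)
		return -EINVAL;
	if (sb->freeze_owner != owner)
		return -EPERM;	/* error code chosen for illustration only */
	sb->frozen = false;
	sb->freeze_owner = NULL;
	return 0;
}
```

In this model a resume path that passes its own owner token cannot
thaw a filesystem that userspace froze earlier, which is the property
the paragraph above asks for, without keeping a separate list.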

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
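
The best-effort policy can be sketched in userspace as follows. This is
an illustration under stated assumptions: `struct sketch_fs` and
`pm_freeze_all()` are hypothetical, and each entry simply carries the
result a freeze attempt would have returned.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-filesystem record used only by this sketch. */
struct sketch_fs {
	int freeze_result;	/* what a freeze attempt would return */
	int frozen_by_pm;	/* set when this pass froze the filesystem */
};

/* Best effort: freeze what we can, skip every failure, never abort. */
static size_t pm_freeze_all(struct sketch_fs *fs, size_t n)
{
	size_t frozen = 0;

	for (size_t i = 0; i < n; i++) {
		if (fs[i].freeze_result != 0)
			continue;	/* -EBUSY or any other error: skip */
		fs[i].frozen_by_pm = 1;
		frozen++;
	}
	return frozen;
}
```

The function has no error return on purpose: in this model a failed
freeze never blocks suspend, it only leaves that filesystem unfrozen.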

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
      fs: add owner of freeze/thaw
      fs: allow pagefault based writers to be frozen
      power: freeze filesystems during suspend/resume

Luis Chamberlain (3):
      ext4: replace kthread freezing with auto fs freezing
      btrfs: replace kthread freezing with auto fs freezing
      xfs: replace kthread freezing with auto fs freezing

 fs/btrfs/disk-io.c          |  4 +--
 fs/btrfs/scrub.c            |  2 +-
 fs/ext4/mballoc.c           |  2 +-
 fs/ext4/super.c             |  3 --
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++-----
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/super.c                  | 82 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_discard.c        |  2 +-
 fs/xfs/xfs_log.c            |  3 +-
 fs/xfs/xfs_log_cil.c        |  2 +-
 fs/xfs/xfs_mru_cache.c      |  2 +-
 fs/xfs/xfs_notify_failure.c |  6 ++--
 fs/xfs/xfs_pwork.c          |  2 +-
 fs/xfs/xfs_super.c          | 14 ++++----
 fs/xfs/xfs_trans_ail.c      |  3 --
 fs/xfs/xfs_zone_gc.c        |  2 --
 include/linux/fs.h          | 16 ++++++---
 kernel/power/hibernate.c    | 13 ++++++-
 kernel/power/suspend.c      |  8 +++++
 22 files changed, 139 insertions(+), 69 deletions(-)
---
base-commit: a68c99192db8060f383a2680333866c0be688ece
change-id: 20250401-work-freeze-693b5b5a78e0

* [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  9:16               ` Jan Kara
  2025-04-01  0:32             ` [PATCH 2/6] btrfs: " Christian Brauner
                               ` (8 subsequent siblings)
  9 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

From: Luis Chamberlain <mcgrof@kernel.org>

The kernel power management now supports allowing the VFS
to handle filesystem freezing and thawing. Take advantage
of that and remove the kthread freezing. This is needed so that we
really stop IO in flight without races after userspace
has been frozen. Without this we rely on kthread freezing, and
its semantics are loose and error prone.

The filesystem therefore is in charge of properly dealing with
quiescing of the filesystem through its callbacks if it thinks
it knows better than how the VFS handles it.

The following Coccinelle rule was used to remove the now superfluous
freezer calls:

make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/ext4

virtual patch

@ remove_set_freezable @
expression time;
statement S, S2;
expression task, current;
@@

(
-       set_freezable();
|
-       if (try_to_freeze())
-               continue;
|
-       try_to_freeze();
|
-       freezable_schedule();
+       schedule();
|
-       freezable_schedule_timeout(time);
+       schedule_timeout(time);
|
-       if (freezing(task)) { S }
|
-       if (freezing(task)) { S }
-       else
	    { S2 }
|
-       freezing(current)
)

@ remove_wq_freezable @
expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
identifier fs_wq_fn;
@@

(
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE,
+                              WQ_ARG2,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
+                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
			   ...);
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE
+               WQ_ARG1
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
+               WQ_ARG1 | WQ_ARG3
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
+               WQ_ARG2 | WQ_ARG3
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2
+               WQ_ARG2
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE
+               0
    )
)

@ add_auto_flag @
expression E1;
identifier fs_type;
@@

struct file_system_type fs_type = {
	.fs_flags = E1
+                   | FS_AUTOFREEZE
	,
};

Generated-by: Coccinelle SmPL
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250326112220.1988619-5-mcgrof@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/ext4/mballoc.c | 2 +-
 fs/ext4/super.c   | 3 ---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0d523e9fb3d5..ae235ec5ff3a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
 
 static bool ext4_trim_interrupted(void)
 {
-	return fatal_signal_pending(current) || freezing(current);
+	return fatal_signal_pending(current);
 }
 
 static int ext4_try_to_trim_range(struct super_block *sb,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8122d4ffb3b5..020c818078d7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3778,7 +3778,6 @@ static int ext4_lazyinit_thread(void *arg)
 	unsigned long next_wakeup, cur;
 
 	BUG_ON(NULL == eli);
-	set_freezable();
 
 cont_thread:
 	while (true) {
@@ -3837,8 +3836,6 @@ static int ext4_lazyinit_thread(void *arg)
 		}
 		mutex_unlock(&eli->li_list_mtx);
 
-		try_to_freeze();
-
 		cur = jiffies;
 		if (!next_wakeup_initialized || time_after_eq(cur, next_wakeup)) {
 			cond_resched();

-- 
2.47.2


* [PATCH 2/6] btrfs: replace kthread freezing with auto fs freezing
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
  2025-04-01  0:32             ` [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  0:32             ` [PATCH 3/6] xfs: " Christian Brauner
                               ` (7 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

From: Luis Chamberlain <mcgrof@kernel.org>

The kernel power management now supports allowing the VFS
to handle filesystem freezing and thawing. Take advantage
of that and remove the kthread freezing. This is needed so that we
really stop IO in flight without races after userspace
has been frozen. Without this we rely on kthread freezing, and
its semantics are loose and error prone.

The filesystem therefore is in charge of properly dealing with
quiescing of the filesystem through its callbacks if it thinks
it knows better than how the VFS handles it.

The following Coccinelle rule was used to remove the now superfluous
freezer calls:

make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/btrfs

virtual patch

@ remove_set_freezable @
expression time;
statement S, S2;
expression task, current;
@@

(
-       set_freezable();
|
-       if (try_to_freeze())
-               continue;
|
-       try_to_freeze();
|
-       freezable_schedule();
+       schedule();
|
-       freezable_schedule_timeout(time);
+       schedule_timeout(time);
|
-       if (freezing(task)) { S }
|
-       if (freezing(task)) { S }
-       else
	    { S2 }
|
-       freezing(current)
)

@ remove_wq_freezable @
expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
identifier fs_wq_fn;
@@

(
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE,
+                              WQ_ARG2,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
+                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
			   ...);
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE
+               WQ_ARG1
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
+               WQ_ARG1 | WQ_ARG3
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
+               WQ_ARG2 | WQ_ARG3
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2
+               WQ_ARG2
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE
+               0
    )
)

@ add_auto_flag @
expression E1;
identifier fs_type;
@@

struct file_system_type fs_type = {
	.fs_flags = E1
+                   | FS_AUTOFREEZE
	,
};

Generated-by: Coccinelle SmPL
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250326112220.1988619-6-mcgrof@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/scrub.c   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1a916716cefe..bce3ae569fe0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1962,8 +1962,8 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 {
 	u32 max_active = fs_info->thread_pool_size;
-	unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
-	unsigned int ordered_flags = WQ_MEM_RECLAIM | WQ_FREEZABLE;
+	unsigned int flags = WQ_MEM_RECLAIM | WQ_UNBOUND;
+	unsigned int ordered_flags = WQ_MEM_RECLAIM;
 
 	fs_info->workers =
 		btrfs_alloc_workqueue(fs_info, "worker", flags, max_active, 16);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2c5edcee9450..5790177b4c2f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2877,7 +2877,7 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info)
 static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info)
 {
 	struct workqueue_struct *scrub_workers = NULL;
-	unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
+	unsigned int flags = WQ_UNBOUND;
 	int max_active = fs_info->thread_pool_size;
 	int ret = -ENOMEM;
 

-- 
2.47.2


* [PATCH 3/6] xfs: replace kthread freezing with auto fs freezing
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
  2025-04-01  0:32             ` [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing Christian Brauner
  2025-04-01  0:32             ` [PATCH 2/6] btrfs: " Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  1:11               ` Dave Chinner
  2025-04-01  0:32             ` [PATCH 4/6] fs: add owner of freeze/thaw Christian Brauner
                               ` (6 subsequent siblings)
  9 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

From: Luis Chamberlain <mcgrof@kernel.org>

The kernel power management now supports allowing the VFS
to handle filesystem freezing and thawing. Take advantage
of that and remove the kthread freezing. This is needed so that we
really stop IO in flight without races after userspace
has been frozen. Without this we rely on kthread freezing, and
its semantics are loose and error prone.

The filesystem therefore is in charge of properly dealing with
quiescing of the filesystem through its callbacks if it thinks
it knows better than how the VFS handles it.

The following Coccinelle rule was used to remove the now superfluous
freezer calls:

make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/xfs

virtual patch

@ remove_set_freezable @
expression time;
statement S, S2;
expression task, current;
@@

(
-       set_freezable();
|
-       if (try_to_freeze())
-               continue;
|
-       try_to_freeze();
|
-       freezable_schedule();
+       schedule();
|
-       freezable_schedule_timeout(time);
+       schedule_timeout(time);
|
-       if (freezing(task)) { S }
|
-       if (freezing(task)) { S }
-       else
	    { S2 }
|
-       freezing(current)
)

@ remove_wq_freezable @
expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
identifier fs_wq_fn;
@@

(
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE,
+                              WQ_ARG2,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
+                              WQ_ARG2 | WQ_ARG3,
			   ...);
|
    WQ_E = alloc_workqueue(WQ_ARG1,
-                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
+                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
			   ...);
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE
+               WQ_ARG1
|
	    WQ_E =
-               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
+               WQ_ARG1 | WQ_ARG3
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
+               WQ_ARG2 | WQ_ARG3
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE | WQ_ARG2
+               WQ_ARG2
    )
|
    fs_wq_fn(
-               WQ_FREEZABLE
+               0
    )
)

@ add_auto_flag @
expression E1;
identifier fs_type;
@@

struct file_system_type fs_type = {
	.fs_flags = E1
+                   | FS_AUTOFREEZE
	,
};

Generated-by: Coccinelle SmPL
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250326112220.1988619-7-mcgrof@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/xfs/xfs_discard.c   |  2 +-
 fs/xfs/xfs_log.c       |  3 +--
 fs/xfs/xfs_log_cil.c   |  2 +-
 fs/xfs/xfs_mru_cache.c |  2 +-
 fs/xfs/xfs_pwork.c     |  2 +-
 fs/xfs/xfs_super.c     | 14 +++++++-------
 fs/xfs/xfs_trans_ail.c |  3 ---
 fs/xfs/xfs_zone_gc.c   |  2 --
 8 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index c1a306268ae4..1596cf0ecb9b 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -333,7 +333,7 @@ xfs_trim_gather_extents(
 static bool
 xfs_trim_should_stop(void)
 {
-	return fatal_signal_pending(current) || freezing(current);
+	return fatal_signal_pending(current);
 }
 
 /*
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6493bdb57351..317f6db292fb 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1489,8 +1489,7 @@ xlog_alloc_log(
 	log->l_iclog->ic_prev = prev_iclog;	/* re-write 1st prev ptr */
 
 	log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM |
-				    WQ_HIGHPRI),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_HIGHPRI),
 			0, mp->m_super->s_id);
 	if (!log->l_ioend_workqueue)
 		goto out_free_iclog;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1ca406ec1b40..8ff5d68394e6 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1932,7 +1932,7 @@ xlog_cil_init(
 	 * concurrency the log spinlocks will be exposed to.
 	 */
 	cil->xc_push_wq = alloc_workqueue("xfs-cil/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_UNBOUND),
 			4, log->l_mp->m_super->s_id);
 	if (!cil->xc_push_wq)
 		goto out_destroy_cil;
diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
index d0f5b403bdbe..c9a49c6f6129 100644
--- a/fs/xfs/xfs_mru_cache.c
+++ b/fs/xfs/xfs_mru_cache.c
@@ -293,7 +293,7 @@ int
 xfs_mru_cache_init(void)
 {
 	xfs_mru_reap_wq = alloc_workqueue("xfs_mru_cache",
-			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_FREEZABLE), 1);
+			XFS_WQFLAGS(WQ_MEM_RECLAIM), 1);
 	if (!xfs_mru_reap_wq)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
index c283b801cc5d..3f5bf53f8778 100644
--- a/fs/xfs/xfs_pwork.c
+++ b/fs/xfs/xfs_pwork.c
@@ -72,7 +72,7 @@ xfs_pwork_init(
 	trace_xfs_pwork_init(mp, nr_threads, current->pid);
 
 	pctl->wq = alloc_workqueue("%s-%d",
-			WQ_UNBOUND | WQ_SYSFS | WQ_FREEZABLE, nr_threads, tag,
+			WQ_UNBOUND | WQ_SYSFS, nr_threads, tag,
 			current->pid);
 	if (!pctl->wq)
 		return -ENOMEM;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 53944cc7af24..06eb51a3d13b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -565,37 +565,37 @@ xfs_init_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	mp->m_buf_workqueue = alloc_workqueue("xfs-buf/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM),
 			1, mp->m_super->s_id);
 	if (!mp->m_buf_workqueue)
 		goto out;
 
 	mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
 	if (!mp->m_unwritten_workqueue)
 		goto out_destroy_buf;
 
 	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_unwritten;
 
 	mp->m_blockgc_wq = alloc_workqueue("xfs-blockgc/%s",
-			XFS_WQFLAGS(WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			XFS_WQFLAGS(WQ_UNBOUND | WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
 	if (!mp->m_blockgc_wq)
 		goto out_destroy_reclaim;
 
 	mp->m_inodegc_wq = alloc_workqueue("xfs-inodegc/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			XFS_WQFLAGS(WQ_MEM_RECLAIM),
 			1, mp->m_super->s_id);
 	if (!mp->m_inodegc_wq)
 		goto out_destroy_blockgc;
 
 	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
-			XFS_WQFLAGS(WQ_FREEZABLE), 0, mp->m_super->s_id);
+			XFS_WQFLAGS(0), 0, mp->m_super->s_id);
 	if (!mp->m_sync_workqueue)
 		goto out_destroy_inodegc;
 
@@ -2488,7 +2488,7 @@ xfs_init_workqueues(void)
 	 * max_active value for this workqueue.
 	 */
 	xfs_alloc_wq = alloc_workqueue("xfsalloc",
-			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_FREEZABLE), 0);
+			XFS_WQFLAGS(WQ_MEM_RECLAIM), 0);
 	if (!xfs_alloc_wq)
 		return -ENOMEM;
 
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 0fcb1828e598..ad8183db0780 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -636,7 +636,6 @@ xfsaild(
 	unsigned int	noreclaim_flag;
 
 	noreclaim_flag = memalloc_noreclaim_save();
-	set_freezable();
 
 	while (1) {
 		/*
@@ -695,8 +694,6 @@ xfsaild(
 
 		__set_current_state(TASK_RUNNING);
 
-		try_to_freeze();
-
 		tout = xfsaild_push(ailp);
 	}
 
diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
index c5136ea9bb1d..1875b6551ab0 100644
--- a/fs/xfs/xfs_zone_gc.c
+++ b/fs/xfs/xfs_zone_gc.c
@@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
 	}
 
 	__set_current_state(TASK_RUNNING);
-	try_to_freeze();
 
 	if (reset_list)
 		xfs_zone_gc_reset_zones(data, reset_list);
@@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
 	unsigned int		nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
-	set_freezable();
 
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);

-- 
2.47.2


* [PATCH 4/6] fs: add owner of freeze/thaw
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (2 preceding siblings ...)
  2025-04-01  0:32             ` [PATCH 3/6] xfs: " Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  0:32             ` [PATCH 5/6] fs: allow pagefault based writers to be frozen Christian Brauner
                               ` (5 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 +++++++------
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 +++---
 fs/super.c                  | 68 +++++++++++++++++++++++++++++++++++++--------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_notify_failure.c |  6 ++--
 include/linux/fs.h          | 13 ++++++---
 8 files changed, 91 insertions(+), 38 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 2b8f9239bede..3e8af62c9e15 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 	if (err)
 		return err;
 
-	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (err)
 		return err;
 
 	if (f2fs_readonly(sbi->sb)) {
-		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 		if (err)
 			return err;
 		return -EROFS;
@@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 out_err:
 	f2fs_up_write(&sbi->cp_global_sem);
 	f2fs_up_write(&sbi->gc_lock);
-	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return err;
 }
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 44e5658b896c..519943189109 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
 	return sdp->sd_log_error;
 }
 
-static int gfs2_do_thaw(struct gfs2_sbd *sdp)
+static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
 {
 	struct super_block *sb = sdp->sd_vfs;
 	int error;
@@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
 	error = gfs2_freeze_lock_shared(sdp);
 	if (error)
 		goto fail;
-	error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = thaw_super(sb, who, freeze_owner);
 	if (!error)
 		return 0;
 
@@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
 	if (test_bit(SDF_FROZEN, &sdp->sd_flags))
 		goto freeze_failed;
 
-	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto freeze_failed;
 
 	gfs2_freeze_unlock(sdp);
 	set_bit(SDF_FROZEN, &sdp->sd_flags);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto out;
 
@@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
  *
  */
 
-static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 	}
 
 	for (;;) {
-		error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sb, who, freeze_owner);
 		if (error) {
 			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
 				error);
@@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 			break;
 		}
 
-		error = gfs2_do_thaw(sdp);
+		error = gfs2_do_thaw(sdp, who, freeze_owner);
 		if (error)
 			goto out;
 
@@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
  *
  */
 
-static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
+			   const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
 	atomic_inc(&sb->s_active);
 	gfs2_freeze_unlock(sdp);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, who, freeze_owner);
 
 	if (!error) {
 		clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index ecc699f8d9fc..748125653d6c 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 
 	switch (n) {
 	case 0:
-		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	case 1:
-		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	default:
 		return -EINVAL;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index c91fd2b46a77..bedc83fc2f20 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
 
 	/* Freeze */
 	if (sb->s_op->freeze_super)
-		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
-	return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_fsthaw(struct file *filp)
@@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
 
 	/* Thaw */
 	if (sb->s_op->thaw_super)
-		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
-	return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_file_dedupe_range(struct file *file,
diff --git a/fs/super.c b/fs/super.c
index 3c4a496d6438..606072a3fab9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -39,7 +39,8 @@
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
 	if (IS_ENABLED(CONFIG_BLOCK))
 		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
 			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
-	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return;
 }
 
@@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
 
 	if (sb->s_op->freeze_super)
 		error = sb->s_op->freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	if (!error)
 		error = sync_blockdev(bdev);
 	deactivate_super(sb);
@@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
 
 	if (sb->s_op->thaw_super)
 		error = sb->s_op->thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	deactivate_super(sb);
 	return error;
 }
@@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
 }
 
 #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
-#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
+#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
 
 static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
 {
@@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+
+		return (sb->s_writers.freeze_kcount +
+			sb->s_writers.freeze_ucount) == 0;
+	}
+
+	/* This filesystem is already exclusively frozen. */
+	if (sb->s_writers.freeze_owner)
+		return false;
+
 	if (who & FREEZE_HOLDER_KERNEL)
 		return (who & FREEZE_MAY_NEST) ||
 		       sb->s_writers.freeze_kcount == 0;
@@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	return false;
 }
 
+static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
+				const void *freeze_owner)
+{
+	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+			return false;
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+		return sb->s_writers.freeze_owner == freeze_owner;
+	}
+
+	return sb->s_writers.freeze_owner == NULL;
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs may return
@@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
  * Return: If the freeze was successful zero is returned. If the freeze
  *         failed a negative error code is returned.
  */
-int freeze_super(struct super_block *sb, enum freeze_holder who)
+int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
 {
 	int ret;
 
@@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	if (sb_rdonly(sb)) {
 		/* Nothing to do really... */
 		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+		sb->s_writers.freeze_owner = freeze_owner;
 		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 		wake_up_var(&sb->s_writers.frozen);
 		super_unlock_excl(sb);
@@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+	sb->s_writers.freeze_owner = freeze_owner;
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 	wake_up_var(&sb->s_writers.frozen);
 	lockdep_sb_freeze_release(sb);
@@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
  * removes that state without releasing the other state or unlocking the
  * filesystem.
  */
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	int error = -EINVAL;
 
 	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
 		goto out_unlock;
 
+	if (!may_unfreeze(sb, who, freeze_owner))
+		goto out_unlock;
+
 	/*
 	 * All freezers share a single active reference.
 	 * So just unlock in case there are any left.
@@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 
 	if (sb_rdonly(sb)) {
 		sb->s_writers.frozen = SB_UNFROZEN;
+		sb->s_writers.freeze_owner = NULL;
 		wake_up_var(&sb->s_writers.frozen);
 		goto out_deactivate;
 	}
@@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 	}
 
 	sb->s_writers.frozen = SB_UNFROZEN;
+	sb->s_writers.freeze_owner = NULL;
 	wake_up_var(&sb->s_writers.frozen);
 	sb_freeze_unlock(sb, SB_FREEZE_FS);
 out_deactivate:
@@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * thaw_super -- unlock filesystem
  * @sb: the super to thaw
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super()
  * if there are no remaining freezes on the filesystem.
@@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * have been frozen through the block layer via multiple block devices.
  * The filesystem remains frozen until all block devices are unfrozen.
  */
-int thaw_super(struct super_block *sb, enum freeze_holder who)
+int thaw_super(struct super_block *sb, enum freeze_holder who,
+	       const void *freeze_owner)
 {
 	if (!super_lock_excl(sb)) {
 		WARN_ON_ONCE("Dying superblock while thawing!");
 		return -EINVAL;
 	}
-	return thaw_super_locked(sb, who);
+	return thaw_super_locked(sb, who, freeze_owner);
 }
 EXPORT_SYMBOL(thaw_super);
 
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e629663e460a..9b598c5790ad 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -123,7 +123,7 @@ xchk_fsfreeze(
 {
 	int			error;
 
-	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsfreeze(sc, error);
 	return error;
 }
@@ -135,7 +135,7 @@ xchk_fsthaw(
 	int			error;
 
 	/* This should always succeed, we have a kernel freeze */
-	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsthaw(sc, error);
 	return error;
 }
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ed8d8ed42f0a..3545dc1d953c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
 	struct super_block	*sb = mp->m_super;
 	int			error;
 
-	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 	if (error)
 		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
 
@@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
 	int			error;
 
 	if (kernel_frozen) {
-		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 		if (error)
 			xfs_emerg(mp, "still frozen after notify failure, err=%d",
 				error);
@@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
 	 * Also thaw userspace call anyway because the device is about to be
 	 * removed immediately.
 	 */
-	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1aa578412f1b..b379a46b5576 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct sb_writers {
 	unsigned short			frozen;		/* Is sb frozen? */
 	int				freeze_kcount;	/* How many kernel freeze requests? */
 	int				freeze_ucount;	/* How many userspace freeze requests? */
+	const void			*freeze_owner;	/* Owner of the freeze */
 	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };
 
@@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_EXCL: whether actual freezing must be done by the caller
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.
@@ -2283,6 +2285,7 @@ enum freeze_holder {
 	FREEZE_HOLDER_KERNEL	= (1U << 0),
 	FREEZE_HOLDER_USERSPACE	= (1U << 1),
 	FREEZE_MAY_NEST		= (1U << 2),
+	FREEZE_EXCL		= (1U << 3),
 };
 
 struct super_operations {
@@ -2296,9 +2299,9 @@ struct super_operations {
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
-	int (*freeze_super) (struct super_block *, enum freeze_holder who);
+	int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*freeze_fs) (struct super_block *);
-	int (*thaw_super) (struct super_block *, enum freeze_holder who);
+	int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
@@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
 extern int vfs_statfs(const struct path *, struct kstatfs *);
 extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
-int freeze_super(struct super_block *super, enum freeze_holder who);
-int thaw_super(struct super_block *super, enum freeze_holder who);
+int freeze_super(struct super_block *super, enum freeze_holder who,
+		 const void *freeze_owner);
+int thaw_super(struct super_block *super, enum freeze_holder who,
+	       const void *freeze_owner);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
 extern int super_setup_bdi(struct super_block *sb);

-- 
2.47.2



* [PATCH 5/6] fs: allow pagefault based writers to be frozen
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (3 preceding siblings ...)
  2025-04-01  0:32             ` [PATCH 4/6] fs: add owner of freeze/thaw Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  0:32             ` [PATCH 6/6] power: freeze filesystems during suspend/resume Christian Brauner
                               ` (4 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Otherwise tasks such as systemd-journald that mmap a file and write to
it will not be frozen after we've frozen the filesystem.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..528e73f192ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-				   level == SB_FREEZE_WRITE);
+				   (level == SB_FREEZE_WRITE ||
+				    level == SB_FREEZE_PAGEFAULT));
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)

-- 
2.47.2



* [PATCH 6/6] power: freeze filesystems during suspend/resume
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (4 preceding siblings ...)
  2025-04-01  0:32             ` [PATCH 5/6] fs: allow pagefault based writers to be frozen Christian Brauner
@ 2025-04-01  0:32             ` Christian Brauner
  2025-04-01  8:16             ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (3 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  0:32 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Otherwise it risks thawing filesystems it didn't own. This could be
done differently by, e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze for now.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
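
The best-effort policy described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: model_fs,
pm_owner_token, pm_freeze_all() and pm_thaw_all() are hypothetical names,
the freeze_fails field merely simulates a filesystem whose freeze
returns an error, and none of the real superblock iteration or locking
is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filesystem state, reduced to what the policy needs. */
struct model_fs {
	const char *name;
	bool frozen;
	bool freeze_fails;		/* simulate a failing freeze */
	const void *freeze_owner;	/* NULL unless exclusively owned */
};

/* Only the address matters: it identifies the power subsystem. */
const char pm_owner_token = 0;

/* Freeze what we can; skip already-frozen and failing filesystems. */
size_t pm_freeze_all(struct model_fs *fs, size_t n)
{
	size_t i, frozen_by_us = 0;

	assert(fs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (fs[i].frozen)
			continue;	/* someone else's freeze, leave it */
		if (fs[i].freeze_fails)
			continue;	/* best effort: ignore the error */
		fs[i].frozen = true;
		fs[i].freeze_owner = &pm_owner_token;
		frozen_by_us++;
	}
	return frozen_by_us;
}

/* Thaw only the filesystems that carry our owner token. */
size_t pm_thaw_all(struct model_fs *fs, size_t n)
{
	size_t i, thawed = 0;

	assert(fs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!fs[i].frozen || fs[i].freeze_owner != &pm_owner_token)
			continue;
		fs[i].frozen = false;
		fs[i].freeze_owner = NULL;
		thawed++;
	}
	return thawed;
}
```

A filesystem that userspace froze before suspend stays frozen across
pm_thaw_all() in this model, which is the property the owner pointer is
meant to provide without keeping a separate list of frozen filesystems.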

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c               | 14 ++++++++++----
 kernel/power/hibernate.c | 13 ++++++++++++-
 kernel/power/suspend.c   |  8 ++++++++
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 606072a3fab9..dd0d6def4a55 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
 	return active;
 }
 
+static const void *filesystems_freeze_ptr;
+
 static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 {
 	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
@@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->freeze_super)
-		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				       filesystems_freeze_ptr);
 	else
-		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			     filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
@@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->thaw_super)
-		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				     filesystems_freeze_ptr);
 	else
-		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			   filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 50ec26ea696b..1803b7d24757 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -777,6 +777,7 @@ int hibernate(void)
 		goto Restore;
 
 	ksys_sync_helper();
+	filesystems_freeze();
 
 	error = freeze_processes();
 	if (error)
@@ -841,6 +842,7 @@ int hibernate(void)
 			error = load_image_and_restore();
 	}
 	thaw_processes();
+	filesystems_thaw();
 
 	/* Don't bother checking whether freezer_test_done is true */
 	freezer_test_done = false;
@@ -881,6 +883,8 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	if (error)
 		goto restore;
 
+	filesystems_freeze();
+
 	error = freeze_processes();
 	if (error)
 		goto exit;
@@ -940,6 +944,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	thaw_processes();
 
 exit:
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
 
 restore:
@@ -1028,19 +1033,25 @@ static int software_resume(void)
 	if (error)
 		goto Restore;
 
+	filesystems_freeze();
+
 	pm_pr_dbg("Preparing processes for hibernation restore.\n");
 	error = freeze_processes();
-	if (error)
+	if (error) {
+		filesystems_thaw();
 		goto Close_Finish;
+	}
 
 	error = freeze_kernel_threads();
 	if (error) {
 		thaw_processes();
+		filesystems_thaw();
 		goto Close_Finish;
 	}
 
 	error = load_image_and_restore();
 	thaw_processes();
+	filesystems_thaw();
  Finish:
 	pm_notifier_call_chain(PM_POST_RESTORE);
  Restore:
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 8eaec4ab121d..4c476271f7f2 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -30,6 +30,7 @@
 #include <trace/events/power.h>
 #include <linux/compiler.h>
 #include <linux/moduleparam.h>
+#include <linux/fs.h>
 
 #include "power.h"
 
@@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
 	if (error)
 		goto Restore;
 
+	if (sync_on_suspend_enabled)
+		filesystems_freeze();
 	trace_suspend_resume(TPS("freeze_processes"), 0, true);
 	error = suspend_freeze_processes();
 	trace_suspend_resume(TPS("freeze_processes"), 0, false);
@@ -550,6 +553,8 @@ int suspend_devices_and_enter(suspend_state_t state)
 static void suspend_finish(void)
 {
 	suspend_thaw_processes();
+	if (sync_on_suspend_enabled)
+		filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_SUSPEND);
 	pm_restore_console();
 }
@@ -587,6 +592,7 @@ static int enter_state(suspend_state_t state)
 		trace_suspend_resume(TPS("sync_filesystems"), 0, true);
 		ksys_sync_helper();
 		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
+		filesystems_freeze();
 	}
 
 	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
@@ -609,6 +615,8 @@ static int enter_state(suspend_state_t state)
 	pm_pr_dbg("Finishing wakeup.\n");
 	suspend_finish();
  Unlock:
+	if (sync_on_suspend_enabled)
+		filesystems_thaw();
 	mutex_unlock(&system_transition_mutex);
 	return error;
 }

-- 
2.47.2



* Re: [PATCH 3/6] xfs: replace kthread freezing with auto fs freezing
  2025-04-01  0:32             ` [PATCH 3/6] xfs: " Christian Brauner
@ 2025-04-01  1:11               ` Dave Chinner
  2025-04-01  7:17                 ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: Dave Chinner @ 2025-04-01  1:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> The kernel power management now supports allowing the VFS
> to handle filesystem freezing and thawing. Take advantage
> of that and remove the kthread freezing. This is needed so that we
> properly really stop IO in flight without races after userspace
> has been frozen. Without this we rely on kthread freezing and
> its semantics are loose and error prone.
> 
> The filesystem therefore is in charge of properly dealing with
> quiescing of the filesystem through its callbacks if it thinks
> it knows better than how the VFS handles it.
> 
.....

> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 0fcb1828e598..ad8183db0780 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -636,7 +636,6 @@ xfsaild(
>  	unsigned int	noreclaim_flag;
>  
>  	noreclaim_flag = memalloc_noreclaim_save();
> -	set_freezable();
>  
>  	while (1) {
>  		/*
> @@ -695,8 +694,6 @@ xfsaild(
>  
>  		__set_current_state(TASK_RUNNING);
>  
> -		try_to_freeze();
> -
>  		tout = xfsaild_push(ailp);
>  	}
>  

So what about the TASK_FREEZABLE flag that is set in this code
before sleeping?

i.e. this code before we schedule():

                if (tout && tout <= 20)
                        set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
                else
                        set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);

Shouldn't TASK_FREEZABLE go away, too?

> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index c5136ea9bb1d..1875b6551ab0 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
>  	}
>  
>  	__set_current_state(TASK_RUNNING);
> -	try_to_freeze();
>  
>  	if (reset_list)
>  		xfs_zone_gc_reset_zones(data, reset_list);
> @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
>  	unsigned int		nofs_flag;
>  
>  	nofs_flag = memalloc_nofs_save();
> -	set_freezable();
>  
>  	for (;;) {
>  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);

Same question here for this newly merged code, too...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-03-31 23:32     ` Christian Brauner
@ 2025-04-01  1:13       ` James Bottomley
  2025-04-01 11:20         ` Jan Kara
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-04-01  1:13 UTC (permalink / raw)
  To: Christian Brauner, jack
  Cc: linux-fsdevel, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > [...]
> > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > +			      bool freeze)
> > >  {
> > >  	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > >  	bool wait;
> > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > >  	spin_unlock_irq(&sem->waiters.lock);
> > >  
> > >  	while (wait) {
> > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > +				  freeze ? TASK_FREEZABLE : 0);
> > 
> > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > operator is lower in precedence than | meaning this expression always
> > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > goes into R state and never wakes up).
> > 
> > Let me fix that and redo all the testing.
> 
> I don't think that's it. I think you're missing making pagefault writers
> such as systemd-journald freezable:
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..528e73f192ac 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
>         percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -                                  level == SB_FREEZE_WRITE);
> +                                  (level == SB_FREEZE_WRITE ||
> +                                   level == SB_FREEZE_PAGEFAULT));
>  }

Yes, I was about to tell Jan that the condition here simply needs to be
true.  All our rwsem levels need to be freezable to avoid a hibernation
failure.
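
For reference, the operator-precedence pitfall quoted earlier in this
message can be reproduced in userspace. This is a minimal sketch: the
MODEL_* constants are placeholder values chosen for the illustration,
not a statement about the kernel's actual task state bits, and
buggy_state(), fixed_state() and check_states() are hypothetical
helpers. buggy_state() spells out the grouping the C grammar applies to
the unparenthesized expression "A | freeze ? B : 0".

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for the illustration only. */
#define MODEL_TASK_UNINTERRUPTIBLE	0x0002u
#define MODEL_TASK_FREEZABLE		0x2000u

/*
 * "A | freeze ? B : 0" groups as "(A | freeze) ? B : 0" because ?: binds
 * more loosely than |. Since A is non-zero the condition is always true
 * and the result is always B alone.
 */
unsigned int buggy_state(bool freeze)
{
	return (MODEL_TASK_UNINTERRUPTIBLE | freeze) ? MODEL_TASK_FREEZABLE : 0;
}

/* Intended meaning: always A, plus B only when freeze is set. */
unsigned int fixed_state(bool freeze)
{
	return MODEL_TASK_UNINTERRUPTIBLE | (freeze ? MODEL_TASK_FREEZABLE : 0);
}

/* Self-check contrasting the two groupings. */
void check_states(void)
{
	assert(buggy_state(false) == MODEL_TASK_FREEZABLE);
	assert(buggy_state(true) == MODEL_TASK_FREEZABLE);
	assert(fixed_state(false) == MODEL_TASK_UNINTERRUPTIBLE);
	assert(fixed_state(true) ==
	       (MODEL_TASK_UNINTERRUPTIBLE | MODEL_TASK_FREEZABLE));
}
```

The buggy form drops the uninterruptible bit entirely, whatever the
value of freeze, which matches the symptom described in the quoted text.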

Regards,

James



* Re: [PATCH 3/6] xfs: replace kthread freezing with auto fs freezing
  2025-04-01  1:11               ` Dave Chinner
@ 2025-04-01  7:17                 ` Christian Brauner
  2025-04-01 11:35                   ` Dave Chinner
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  7:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 12:11:04PM +1100, Dave Chinner wrote:
> On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > The kernel power management now supports allowing the VFS
> > to handle filesystem freezing and thawing. Take advantage
> > of that and remove the kthread freezing. This is needed so that we
> > properly really stop IO in flight without races after userspace
> > has been frozen. Without this we rely on kthread freezing and
> > its semantics are loose and error prone.
> > 
> > The filesystem therefore is in charge of properly dealing with
> > quiescing of the filesystem through its callbacks if it thinks
> > it knows better than how the VFS handles it.
> > 
> .....
> 
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 0fcb1828e598..ad8183db0780 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -636,7 +636,6 @@ xfsaild(
> >  	unsigned int	noreclaim_flag;
> >  
> >  	noreclaim_flag = memalloc_noreclaim_save();
> > -	set_freezable();
> >  
> >  	while (1) {
> >  		/*
> > @@ -695,8 +694,6 @@ xfsaild(
> >  
> >  		__set_current_state(TASK_RUNNING);
> >  
> > -		try_to_freeze();
> > -
> >  		tout = xfsaild_push(ailp);
> >  	}
> >  
> 
> So what about the TASK_FREEZABLE flag that is set in this code
> before sleeping?
> 
> i.e. this code before we schedule():
> 
>                 if (tout && tout <= 20)
>                         set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
>                 else
>                         set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> 
> Shouldn't TASK_FREEZABLE go away, too?

Thanks for spotting! Yes, late last night I just took Luis' patches
as they are and had only gotten around to testing btrfs. The
Coccinelle scripts seem to have missed those. I'll wait for comments,
then do another pass and send out v2.

> > diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> > index c5136ea9bb1d..1875b6551ab0 100644
> > --- a/fs/xfs/xfs_zone_gc.c
> > +++ b/fs/xfs/xfs_zone_gc.c
> > @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
> >  	}
> >  
> >  	__set_current_state(TASK_RUNNING);
> > -	try_to_freeze();
> >  
> >  	if (reset_list)
> >  		xfs_zone_gc_reset_zones(data, reset_list);
> > @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
> >  	unsigned int		nofs_flag;
> >  
> >  	nofs_flag = memalloc_nofs_save();
> > -	set_freezable();
> >  
> >  	for (;;) {
> >  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
> 
> Same question here for this newly merged code, too...

I'm not sure if this is supposed to be a snipe or not but just in case
this is a hidden question: This isn't merged. Per the cover letter this
is in a work.* branch. Anything that is considered mergeable is in
vfs-6.16.* branches. But since we're pre -rc1 even those branches are
not yet showing up in -next.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (5 preceding siblings ...)
  2025-04-01  0:32             ` [PATCH 6/6] power: freeze filesystems during suspend/resume Christian Brauner
@ 2025-04-01  8:16             ` Christian Brauner
  2025-04-01  9:32             ` Jan Kara
                               ` (2 subsequent siblings)
  9 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  8:16 UTC (permalink / raw)
  To: jack, rafael
  Cc: Ard Biesheuvel, linux-efi, linux-kernel, James Bottomley, mcgrof,
	hch, david, djwong, pavel, peterz, mingo, will, boqun.feng,
	linux-fsdevel

On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation.

I would like to place this behind a Kconfig option and add a
/sys/power/freeze_on_suspend option, both because these changes are
pretty sensitive and to give userspace the ability to experiment with
this for a while until we remove the option. That means we should hold
off on removing the freezer calls in the filesystems until we're happy
that this works reliably enough.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing
  2025-04-01  0:32             ` [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing Christian Brauner
@ 2025-04-01  9:16               ` Jan Kara
  2025-04-01  9:35                 ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-04-01  9:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Tue 01-04-25 02:32:46, Christian Brauner wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> The kernel power management now supports allowing the VFS
> to handle filesystem freezing and thawing. Take advantage
> of that and remove the kthread freezing. This is needed so that we
> properly stop IO in flight without races after userspace
> has been frozen. Without this we rely on kthread freezing, whose
> semantics are loose and error prone.
> 
> The filesystem therefore is in charge of properly dealing with
> quiescing of the filesystem through its callbacks if it thinks
> it knows better than how the VFS handles it.
> 
> The following Coccinelle rule was used to remove the now superfluous
> freezer calls:
> 
> make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/ext4
> 
> virtual patch
> 
> @ remove_set_freezable @
> expression time;
> statement S, S2;
> expression task, current;
> @@
> 
> (
> -       set_freezable();
> |
> -       if (try_to_freeze())
> -               continue;
> |
> -       try_to_freeze();
> |
> -       freezable_schedule();
> +       schedule();
> |
> -       freezable_schedule_timeout(time);
> +       schedule_timeout(time);
> |
> -       if (freezing(task)) { S }
> |
> -       if (freezing(task)) { S }
> -       else
> 	    { S2 }
> |
> -       freezing(current)
> )
> 
> @ remove_wq_freezable @
> expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
> identifier fs_wq_fn;
> @@
> 
> (
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_FREEZABLE,
> +                              WQ_ARG2,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
> +                              WQ_ARG2 | WQ_ARG3,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
> +                              WQ_ARG2 | WQ_ARG3,
> 			   ...);
> |
>     WQ_E = alloc_workqueue(WQ_ARG1,
> -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
> +                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
> 			   ...);
> |
> 	    WQ_E =
> -               WQ_ARG1 | WQ_FREEZABLE
> +               WQ_ARG1
> |
> 	    WQ_E =
> -               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
> +               WQ_ARG1 | WQ_ARG3
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
> +               WQ_ARG2 | WQ_ARG3
>     )
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE | WQ_ARG2
> +               WQ_ARG2
>     )
> |
>     fs_wq_fn(
> -               WQ_FREEZABLE
> +               0
>     )
> )
> 
> @ add_auto_flag @
> expression E1;
> identifier fs_type;
> @@
> 
> struct file_system_type fs_type = {
> 	.fs_flags = E1
> +                   | FS_AUTOFREEZE
> 	,
> };
> 
> Generated-by: Coccinelle SmPL
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Link: https://lore.kernel.org/r/20250326112220.1988619-5-mcgrof@kernel.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/ext4/mballoc.c | 2 +-
>  fs/ext4/super.c   | 3 ---
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 0d523e9fb3d5..ae235ec5ff3a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
>  
>  static bool ext4_trim_interrupted(void)
>  {
> -	return fatal_signal_pending(current) || freezing(current);
> +	return fatal_signal_pending(current);
>  }

This change should not happen. ext4_trim_interrupted() makes sure FITRIM
ioctl doesn't cause hibernation failures and has nothing to do with kthread
freezing...

Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (6 preceding siblings ...)
  2025-04-01  8:16             ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
@ 2025-04-01  9:32             ` Jan Kara
  2025-04-01 13:03               ` Christian Brauner
  2025-04-01 14:14             ` [PATCH 0/6] " Peter Zijlstra
  2025-04-01 17:02             ` James Bottomley
  9 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-04-01  9:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as I
> could. Works for me (TM).
> 
> I need to catch some actual sleep now...
> 
> ---
> 
> Now all the pieces are in place to actually allow the power subsystem to
> freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.
> 
> Otherwise it risks thawing filesystems it didn't own. This could be
> done differently by, e.g., keeping the filesystems that were actually
> frozen on a list and then unfreezing them from that list. This is
> disgustingly unclean though and reeks of an ugly hack.
> 
> If the filesystem is already frozen by the time we've frozen all
> userspace processes we don't care to freeze it again. That's userspace's
> job once the process resumes. We only actually freeze filesystems if we
> absolutely have to and we ignore other failures to freeze.

Hum, I don't follow here. I assumed we'd use FREEZE_MAY_NEST |
FREEZE_HOLDER_KERNEL for freezing from the power subsystem. As far as I
remember, we specifically designed the nesting of freeze counters so that
the power subsystem can be sure freezing succeeds even if the
filesystem is already frozen (by userspace or the kernel) and, similarly,
so that the power subsystem cannot thaw a filesystem frozen by somebody
else. It will just drop its freeze refcount... What am I missing?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread
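
Jan's point about nested freeze counters can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: every
demo_* name is hypothetical, and the per-holder counters merely stand in
for the behaviour that FREEZE_HOLDER_KERNEL, FREEZE_HOLDER_USERSPACE and
FREEZE_MAY_NEST are described as providing in the message above. It is
not the fs/super.c implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two kinds of freeze holder. */
enum demo_holder {
	DEMO_HOLDER_KERNEL,
	DEMO_HOLDER_USERSPACE,
	DEMO_HOLDER_MAX
};

/* Toy superblock: one freeze reference count per holder. */
struct demo_sb {
	unsigned int freeze_count[DEMO_HOLDER_MAX];
};

/* The filesystem stays frozen while any holder has a reference. */
static bool demo_is_frozen(const struct demo_sb *sb)
{
	return sb->freeze_count[DEMO_HOLDER_KERNEL] +
	       sb->freeze_count[DEMO_HOLDER_USERSPACE] > 0;
}

/*
 * Take a freeze reference for @who. Without @may_nest a holder may only
 * freeze once; with it, repeated freezes by the same holder just nest.
 * In this model a freeze by the other holder never makes the call fail.
 * Returns 0 on success, -1 if the freeze is refused.
 */
static int demo_freeze(struct demo_sb *sb, enum demo_holder who, bool may_nest)
{
	if (sb->freeze_count[who] > 0 && !may_nest)
		return -1;
	sb->freeze_count[who]++;
	return 0;
}

/*
 * Drop one reference owned by @who. A holder cannot drop a reference it
 * does not own, so it can never thaw somebody else's freeze.
 * Returns 0 on success, -1 if @who holds no reference.
 */
static int demo_thaw(struct demo_sb *sb, enum demo_holder who)
{
	if (sb->freeze_count[who] == 0)
		return -1;
	sb->freeze_count[who]--;
	return 0;
}

/* Walk through the suspend/resume case discussed above. */
static void demo_suspend_resume_check(void)
{
	struct demo_sb sb = { { 0, 0 } };

	/* Userspace froze the filesystem before suspend started. */
	assert(demo_freeze(&sb, DEMO_HOLDER_USERSPACE, false) == 0);
	/* The power subsystem can still take its own reference. */
	assert(demo_freeze(&sb, DEMO_HOLDER_KERNEL, true) == 0);
	/* On resume it drops only that reference; the fs stays frozen. */
	assert(demo_thaw(&sb, DEMO_HOLDER_KERNEL) == 0);
	assert(demo_is_frozen(&sb));
	/* A second kernel thaw is refused: nothing left that it owns. */
	assert(demo_thaw(&sb, DEMO_HOLDER_KERNEL) == -1);
	/* Userspace thaws its own freeze; only then is the fs thawed. */
	assert(demo_thaw(&sb, DEMO_HOLDER_USERSPACE) == 0);
	assert(!demo_is_frozen(&sb));
}
```

Calling demo_suspend_resume_check() from a test program exercises the
sequence. The design point is that ownership is tracked per holder, so
no list of "filesystems we froze" is needed to avoid thawing a freeze
taken by somebody else.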

* Re: [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing
  2025-04-01  9:16               ` Jan Kara
@ 2025-04-01  9:35                 ` Christian Brauner
  2025-04-01 10:08                   ` Jan Kara
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01  9:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 11:16:18AM +0200, Jan Kara wrote:
> On Tue 01-04-25 02:32:46, Christian Brauner wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > The kernel power management now supports allowing the VFS
> > to handle filesystem freezing and thawing. Take advantage
> > of that and remove the kthread freezing. This is needed so that we
> > properly stop IO in flight without races after userspace
> > has been frozen. Without this we rely on kthread freezing, whose
> > semantics are loose and error prone.
> > 
> > The filesystem therefore is in charge of properly dealing with
> > quiescing of the filesystem through its callbacks if it thinks
> > it knows better than how the VFS handles it.
> > 
> > The following Coccinelle rule was used to remove the now superfluous
> > freezer calls:
> > 
> > make coccicheck MODE=patch SPFLAGS="--in-place --no-show-diff" COCCI=./fs-freeze-cleanup.cocci M=fs/ext4
> > 
> > virtual patch
> > 
> > @ remove_set_freezable @
> > expression time;
> > statement S, S2;
> > expression task, current;
> > @@
> > 
> > (
> > -       set_freezable();
> > |
> > -       if (try_to_freeze())
> > -               continue;
> > |
> > -       try_to_freeze();
> > |
> > -       freezable_schedule();
> > +       schedule();
> > |
> > -       freezable_schedule_timeout(time);
> > +       schedule_timeout(time);
> > |
> > -       if (freezing(task)) { S }
> > |
> > -       if (freezing(task)) { S }
> > -       else
> > 	    { S2 }
> > |
> > -       freezing(current)
> > )
> > 
> > @ remove_wq_freezable @
> > expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
> > identifier fs_wq_fn;
> > @@
> > 
> > (
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_FREEZABLE,
> > +                              WQ_ARG2,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
> > +                              WQ_ARG2 | WQ_ARG3,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
> > +                              WQ_ARG2 | WQ_ARG3,
> > 			   ...);
> > |
> >     WQ_E = alloc_workqueue(WQ_ARG1,
> > -                              WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
> > +                              WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
> > 			   ...);
> > |
> > 	    WQ_E =
> > -               WQ_ARG1 | WQ_FREEZABLE
> > +               WQ_ARG1
> > |
> > 	    WQ_E =
> > -               WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
> > +               WQ_ARG1 | WQ_ARG3
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
> > +               WQ_ARG2 | WQ_ARG3
> >     )
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE | WQ_ARG2
> > +               WQ_ARG2
> >     )
> > |
> >     fs_wq_fn(
> > -               WQ_FREEZABLE
> > +               0
> >     )
> > )
> > 
> > @ add_auto_flag @
> > expression E1;
> > identifier fs_type;
> > @@
> > 
> > struct file_system_type fs_type = {
> > 	.fs_flags = E1
> > +                   | FS_AUTOFREEZE
> > 	,
> > };
> > 
> > Generated-by: Coccinelle SmPL
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Link: https://lore.kernel.org/r/20250326112220.1988619-5-mcgrof@kernel.org
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/ext4/mballoc.c | 2 +-
> >  fs/ext4/super.c   | 3 ---
> >  2 files changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 0d523e9fb3d5..ae235ec5ff3a 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
> >  
> >  static bool ext4_trim_interrupted(void)
> >  {
> > -	return fatal_signal_pending(current) || freezing(current);
> > +	return fatal_signal_pending(current);
> >  }
> 
> This change should not happen. ext4_trim_interrupted() makes sure FITRIM
> ioctl doesn't cause hibernation failures and has nothing to do with kthread
> freezing...
> 
> Otherwise the patch looks good.

Afaict, we don't have to do these changes now. Yes, once fsfreeze
reliably works in the suspend/resume codepaths then we can switch all
that off and remove the old freezer. But we should only do that once we
have some experience with the new filesystem freezing during
suspend/hibernate. So we should place this under a
/sys/power/freeze_filesystems knob and wait a few kernel releases to see
whether we see significant problems. How does that sound to you?

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 1/6] ext4: replace kthread freezing with auto fs freezing
  2025-04-01  9:35                 ` Christian Brauner
@ 2025-04-01 10:08                   ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-01 10:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Tue 01-04-25 11:35:56, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 11:16:18AM +0200, Jan Kara wrote:
> > > ---
> > >  fs/ext4/mballoc.c | 2 +-
> > >  fs/ext4/super.c   | 3 ---
> > >  2 files changed, 1 insertion(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 0d523e9fb3d5..ae235ec5ff3a 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -6782,7 +6782,7 @@ static ext4_grpblk_t ext4_last_grp_cluster(struct super_block *sb,
> > >  
> > >  static bool ext4_trim_interrupted(void)
> > >  {
> > > -	return fatal_signal_pending(current) || freezing(current);
> > > +	return fatal_signal_pending(current);
> > >  }
> > 
> > This change should not happen. ext4_trim_interrupted() makes sure FITRIM
> > ioctl doesn't cause hibernation failures and has nothing to do with kthread
> > freezing...
> > 
> > Otherwise the patch looks good.
> 
> Afaict, we don't have to do these changes now. Yes, once fsfreeze
> reliably works in the suspend/resume codepaths then we can switch all
> that off and remove the old freezer. But we should only do that once we
> have some experience with the new filesystem freezing during
> suspend/hibernate. So we should place this under a
> /sys/power/freeze_filesystems knob and wait a few kernel releases to see
> whether we see significant problems. How does that sound to you?

I agree that enabling this with some knob to allow an easy way out if things
don't work makes sense. And the removal of kthread freezing can be done
somewhat later, when we are more confident that filesystem freezing on
hibernation is solid.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-04-01  1:13       ` James Bottomley
@ 2025-04-01 11:20         ` Jan Kara
  2025-04-01 12:50           ` Christian Brauner
  2025-04-01 12:52           ` James Bottomley
  0 siblings, 2 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-01 11:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, jack, linux-fsdevel, linux-kernel, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Mon 31-03-25 21:13:20, James Bottomley wrote:
> On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> > On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > > [...]
> > > > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > > > +			      bool freeze)
> > > > >  {
> > > > >  	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > > >  	bool wait;
> > > > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > >  	spin_unlock_irq(&sem->waiters.lock);
> > > >  
> > > >  	while (wait) {
> > > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > > +				  freeze ? TASK_FREEZABLE : 0);
> > > 
> > > > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > > > operator is lower in precedence than | meaning this expression always
> > > > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > > > goes into R state and never wakes up).
> > > > 
> > > > Let me fix that and redo all the testing.
> > > 
> > > I don't think that's it. I think you're missing making pagefault writers
> > > such as systemd-journald freezable:
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b379a46b5576..528e73f192ac 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct
> > super_block *sb, int level)
> >  static inline void __sb_start_write(struct super_block *sb, int
> > level)
> >  {
> >         percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > -                                  level == SB_FREEZE_WRITE);
> > +                                  (level == SB_FREEZE_WRITE ||
> > +                                   level == SB_FREEZE_PAGEFAULT));
> >  }
> 
> Yes, I was about to tell Jan that the condition here simply needs to be
> true.  All our rwsem levels need to be freezable to avoid a hibernation
> failure.

So there is one snag with this. The SB_FREEZE_PAGEFAULT level is acquired
under mmap_sem, and the SB_FREEZE_INTERNAL level is possibly acquired under
some other filesystem locks. So if you freeze the filesystem, a task can
block on the frozen filesystem with e.g. mmap_sem held, and if some other
task then blocks on grabbing that mmap_sem, hibernation fails because we'll
be unable to freeze the task waiting for mmap_sem. So if you'd like to
completely avoid these hibernation failures, you'd have to make a slew of
filesystem-related locks use freezable sleeping. I don't think that's
feasible.

I was hoping that failures due to the SB_FREEZE_PAGEFAULT level not being
freezable would be rare enough, but you've proven they are quite frequent.
We can try making the SB_FREEZE_PAGEFAULT level (or even SB_FREEZE_INTERNAL)
freezable and see whether that works well enough...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread
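
The failure mode Jan describes above can be modelled in a few lines of
userspace C. This is a toy sketch, not kernel code: the demo_* names are
hypothetical, and the only assumption carried over from the thread is
that the freezer cannot complete while some task sleeps in a wait that
is not marked freezable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: all we track is whether sleeping on it is freezable. */
struct demo_lock {
	const char *name;
	bool freezable_wait;
};

/* Toy task: waiting_on is NULL when the task is not blocked. */
struct demo_task {
	const char *name;
	const struct demo_lock *waiting_on;
};

/* A task can be frozen if it is not blocked or sleeps freezably. */
static bool demo_can_freeze(const struct demo_task *t)
{
	return t->waiting_on == NULL || t->waiting_on->freezable_wait;
}

/* Return the first task that blocks the freezer, or NULL if none does. */
static const struct demo_task *
demo_freeze_blocker(const struct demo_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!demo_can_freeze(&tasks[i]))
			return &tasks[i];
	return NULL;
}

/*
 * Scenario from the thread: task A takes mmap_sem, then blocks on the
 * frozen filesystem (a freezable wait after the rwsem rework). Task B
 * then blocks on mmap_sem, which A still holds while it is frozen.
 * Returns the name of the task that stops the freezer, or NULL.
 */
static const char *demo_scenario(bool mmap_sem_wait_freezable)
{
	const struct demo_lock sb_rwsem = { "sb_writers", true };
	const struct demo_lock mmap_sem = { "mmap_sem", mmap_sem_wait_freezable };
	const struct demo_task tasks[] = {
		{ "A", &sb_rwsem },	/* frozen while holding mmap_sem */
		{ "B", &mmap_sem },	/* stuck until A runs again */
	};
	const struct demo_task *blocker = demo_freeze_blocker(tasks, 2);

	return blocker ? blocker->name : NULL;
}

/* Self-check of the two cases. */
static void demo_scenario_check(void)
{
	/* Plain mmap_sem wait: a task blocks the freezer. */
	assert(demo_scenario(false) != NULL);
	/* If that wait were freezable too, nothing would block. */
	assert(demo_scenario(true) == NULL);
}
```

In the model, making the superblock wait freezable only moves the
problem one lock further out: task B is the one left in an unfreezable
sleep, which is why Jan notes that fully avoiding such failures would
require freezable sleeping on many filesystem-related locks.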

* Re: [PATCH 3/6] xfs: replace kthread freezing with auto fs freezing
  2025-04-01  7:17                 ` Christian Brauner
@ 2025-04-01 11:35                   ` Dave Chinner
  2025-04-01 12:45                     ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: Dave Chinner @ 2025-04-01 11:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 09:17:12AM +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 12:11:04PM +1100, Dave Chinner wrote:
> > On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> > > diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> > > index c5136ea9bb1d..1875b6551ab0 100644
> > > --- a/fs/xfs/xfs_zone_gc.c
> > > +++ b/fs/xfs/xfs_zone_gc.c
> > > @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
> > >  	}
> > >  
> > >  	__set_current_state(TASK_RUNNING);
> > > -	try_to_freeze();
> > >  
> > >  	if (reset_list)
> > >  		xfs_zone_gc_reset_zones(data, reset_list);
> > > @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
> > >  	unsigned int		nofs_flag;
> > >  
> > >  	nofs_flag = memalloc_nofs_save();
> > > -	set_freezable();
> > >  
> > >  	for (;;) {
> > >  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
> > 
> > Same question here for this newly merged code, too...
>
> I'm not sure if this is supposed to be a snipe or not but just in case
> this is a hidden question:

No, I meant that this is changing shiny new just-merged XFS code
(part of zone device support). It only just arrived this merge
window and is largely just doing the same thing as the older aild
code. It is probably safe to assume that this new code has never
been tested against hibernate...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 3/6] xfs: replace kthread freezing with auto fs freezing
  2025-04-01 11:35                   ` Dave Chinner
@ 2025-04-01 12:45                     ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01 12:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, rafael, djwong, pavel, peterz,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 10:35:58PM +1100, Dave Chinner wrote:
> On Tue, Apr 01, 2025 at 09:17:12AM +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 12:11:04PM +1100, Dave Chinner wrote:
> > > On Tue, Apr 01, 2025 at 02:32:48AM +0200, Christian Brauner wrote:
> > > > diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> > > > index c5136ea9bb1d..1875b6551ab0 100644
> > > > --- a/fs/xfs/xfs_zone_gc.c
> > > > +++ b/fs/xfs/xfs_zone_gc.c
> > > > @@ -993,7 +993,6 @@ xfs_zone_gc_handle_work(
> > > >  	}
> > > >  
> > > >  	__set_current_state(TASK_RUNNING);
> > > > -	try_to_freeze();
> > > >  
> > > >  	if (reset_list)
> > > >  		xfs_zone_gc_reset_zones(data, reset_list);
> > > > @@ -1041,7 +1040,6 @@ xfs_zoned_gcd(
> > > >  	unsigned int		nofs_flag;
> > > >  
> > > >  	nofs_flag = memalloc_nofs_save();
> > > > -	set_freezable();
> > > >  
> > > >  	for (;;) {
> > > >  		set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
> > > 
> > > Same question here for this newly merged code, too...
> >
> > I'm not sure if this is supposed to be a snipe or not but just in case
> > this is a hidden question:
> 
> No, I meant that this is changing shiny new just-merged XFS code
> (part of zone device support). It only just arrived this merge
> window and is largely just doing the same thing as the older aild
> code. It is probably safe to assume that this new code has never
> been tested against hibernate...

Ah, my brain is completely fried. Apparently reading English is a skill
I've lost since coming back from Montreal. Thanks!

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-04-01 11:20         ` Jan Kara
@ 2025-04-01 12:50           ` Christian Brauner
  2025-04-01 12:52           ` James Bottomley
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-01 12:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: James Bottomley, linux-fsdevel, linux-kernel, mcgrof, hch, david,
	rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 01:20:37PM +0200, Jan Kara wrote:
> On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> > > On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > > > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > > > [...]
> > > > > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > > > > +			      bool freeze)
> > > > > >  {
> > > > > >  	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > > > >  	bool wait;
> > > > > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > >  	spin_unlock_irq(&sem->waiters.lock);
> > > > >  
> > > > >  	while (wait) {
> > > > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > > > +				  freeze ? TASK_FREEZABLE : 0);
> > > > 
> > > > > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > > > > operator is lower in precedence than | meaning this expression always
> > > > > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > > > > goes into R state and never wakes up).
> > > > > 
> > > > > Let me fix that and redo all the testing.
> > > > 
> > > > I don't think that's it. I think you're missing making pagefault writers
> > > > such as systemd-journald freezable:
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b379a46b5576..528e73f192ac 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct
> > > super_block *sb, int level)
> > >  static inline void __sb_start_write(struct super_block *sb, int
> > > level)
> > >  {
> > >         percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > -                                  level == SB_FREEZE_WRITE);
> > > +                                  (level == SB_FREEZE_WRITE ||
> > > +                                   level == SB_FREEZE_PAGEFAULT));
> > >  }
> > 
> > Yes, I was about to tell Jan that the condition here simply needs to be
> > true.  All our rwsem levels need to be freezable to avoid a hibernation
> > failure.
> 
> So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired under
> mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under some other
> filesystem locks. So if you freeze the filesystem, a task can block on
> frozen filesystem with e.g. mmap_sem held and if some other task then

Yeah, I wondered about that yesterday.

> blocks on grabbing that mmap_sem, hibernation fails because we'll be unable
> to hibernate the task waiting for mmap_sem. So if you'd like to completely
> avoid these hibernation failures, you'd have to make a slew of filesystem
> related locks use freezable sleeping. I don't think that's feasible.
> 
> I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> freezable would be rare enough but you've proven they are quite frequent.
> We can try making SB_FREEZE_PAGEFAULT level (or even SB_FREEZE_INTERNAL)
> freezable and see whether that works good enough...

I think that's fine and we'll see whether this causes a lot of issues.
I've got the patchset written in a way now that userspace can just
enable or disable freeze during migration.

^ permalink raw reply	[flat|nested] 120+ messages in thread
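
The precedence problem James describes in the quoted text is easy to
reproduce outside the kernel, independently of whether it was the root
cause of the hang. A minimal sketch, assuming made-up flag values:
DEMO_UNINTERRUPTIBLE and DEMO_FREEZABLE are stand-ins chosen for this
example, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits for illustration only (hypothetical values). */
#define DEMO_UNINTERRUPTIBLE	0x0002u
#define DEMO_FREEZABLE		0x2000u

/*
 * Shape of the expression as posted: '|' binds tighter than '?:', so
 * this parses as (DEMO_UNINTERRUPTIBLE | freeze) ? DEMO_FREEZABLE : 0.
 * Some compilers warn about exactly this construct.
 */
static unsigned int demo_state_as_posted(bool freeze)
{
	return DEMO_UNINTERRUPTIBLE | freeze ? DEMO_FREEZABLE : 0;
}

/* Parenthesised so the conditional only selects the optional bit. */
static unsigned int demo_state_fixed(bool freeze)
{
	return DEMO_UNINTERRUPTIBLE | (freeze ? DEMO_FREEZABLE : 0);
}

/* Self-check of the claims made in the comments above. */
static void demo_precedence_check(void)
{
	/* As posted, the result is the freezable bit alone, whatever
	 * value freeze has. */
	assert(demo_state_as_posted(false) == DEMO_FREEZABLE);
	assert(demo_state_as_posted(true) == DEMO_FREEZABLE);
	/* With parentheses, the base state is always kept. */
	assert(demo_state_fixed(false) == DEMO_UNINTERRUPTIBLE);
	assert(demo_state_fixed(true) ==
	       (DEMO_UNINTERRUPTIBLE | DEMO_FREEZABLE));
}
```

The left operand of '?:' is never zero because the base state bit is
always set, so the unparenthesised form drops the base state and returns
the freezable bit for both values of freeze.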

* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-04-01 11:20         ` Jan Kara
  2025-04-01 12:50           ` Christian Brauner
@ 2025-04-01 12:52           ` James Bottomley
  2025-04-02 11:47             ` Jan Kara
  1 sibling, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-04-01 12:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, mcgrof, hch,
	david, rafael, djwong, pavel, peterz, mingo, will, boqun.feng

On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote:
> On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
[...]
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b379a46b5576..528e73f192ac 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> > >  static inline void __sb_start_write(struct super_block *sb, int level)
> > >  {
> > >         percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > -                                  level == SB_FREEZE_WRITE);
> > > +                                  (level == SB_FREEZE_WRITE ||
> > > +                                   level == SB_FREEZE_PAGEFAULT));
> > >  }
> > 
> > Yes, I was about to tell Jan that the condition here simply needs
> > to be true.  All our rwsem levels need to be freezable to avoid a
> > hibernation failure.
> 
> So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired
> under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under
> some other filesystem locks.

Just for SB_FREEZE_INTERNAL, I think there's no case of
sb_start_intwrite() that can ever hold in D wait because by the time we
acquire the semaphore for write, the internal freeze_fs should have
been called and the filesystem should have quiesced itself.  On the
other hand, if that theory itself is true, there's no real need for
sb_start_intwrite() at all because it can never conflict.

>  So if you freeze the filesystem, a task can block on frozen
> filesystem with e.g. mmap_sem held and if some other task then blocks
> on grabbing that mmap_sem, hibernation fails because we'll be unable
> to hibernate the task waiting for mmap_sem. So if you'd like to
> completely avoid these hibernation failures, you'd have to make a
> slew of filesystem related locks use freezable sleeping. I don't
> think that's feasible.

I wouldn't see that because I'm on x86_64 and that takes the vma_lock
in page faults not the mmap_lock.  The granularity of all these locks
is process level, so it's hard to see what they'd be racing with ...
even if I conjecture two threads trying to write to something, they'd
have to have some internal co-ordination which would likely prevent the
second one from writing if the first got stuck on the page fault. 

> I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> freezable would be rare enough but you've proven they are quite
> frequent. We can try making SB_FREEZE_PAGEFAULT level (or even
> SB_FREEZE_INTERNAL) freezable and see whether that works well
> enough...

I'll try to construct a more severe test than systemd-journald ... it
looks to be single threaded in its operation.

Regards,

James


* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01  9:32             ` Jan Kara
@ 2025-04-01 13:03               ` Christian Brauner
  2025-04-01 16:57                 ` Jan Kara
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01 13:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, rafael, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote:
> On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as I
> > could. Works for me (TM).
> > 
> > I need to catch some actual sleep now...
> > 
> > ---
> > 
> > Now all the pieces are in place to actually allow the power subsystem to
> > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > frozen and thawed if the power subsystem does actually own the freeze.
> > 
> > Otherwise it risks thawing filesystems it didn't own. This could be
> > done differently by, e.g., keeping the filesystems that were actually
> > frozen on a list and then unfreezing them from that list. This is
> > disgustingly unclean though and reeks of an ugly hack.
> > 
> > If the filesystem is already frozen by the time we've frozen all
> > userspace processes we don't care to freeze it again. That's userspace's
> > job once the process resumes. We only actually freeze filesystems if we
> > absolutely have to and we ignore other failures to freeze.
> 
> Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST |
> FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I
> remember we have specifically designed nesting of freeze counters so that
> this way power subsystem can be sure freezing succeeds even if the
> filesystem is already frozen (by userspace or the kernel) and similarly
> power subsystem cannot thaw a filesystem frozen by somebody else. It will
> just drop its freeze refcount... What am I missing?

If we have 10 filesystems and suspend/hibernate manages to freeze 5 and
then fails on the 6th for whatever odd reason (current or future) then
power needs to undo the freeze of the first 5 filesystems. We can't just
walk the list again because while it's unlikely that a new filesystem
got added in the meantime we still cannot tell what filesystems the
power subsystem actually managed to get a freeze reference count on that
we need to drop during thaw.

There are various ways out of this ugliness. Either we record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing, or we make sure that the power subsystem exclusively freezes
only the things it can freeze and marks such filesystems as being owned
by power for the duration of the suspend or resume cycle. I opted for
the latter as that seemed the clean thing to do even if it means more
code changes. What are your thoughts on this?
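
To make the second option concrete, here is a minimal userspace sketch,
assuming a toy model rather than the real VFS. struct model_sb, the
model_* helpers and the fail_freeze knob are all hypothetical names
invented for illustration; only the idea of tagging a freeze with an
owner comes from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel's struct super_block. */
struct model_sb {
	bool frozen;              /* true while somebody holds a freeze */
	const void *freeze_owner; /* set only by an exclusive (owned) freeze */
	bool fail_freeze;         /* test knob: make the freeze attempt fail */
};

/* Exclusive freeze: refuses filesystems that are already frozen. */
static int model_freeze_excl(struct model_sb *sb, const void *owner)
{
	assert(owner != NULL);
	if (sb->frozen || sb->fail_freeze)
		return -1;
	sb->frozen = true;
	sb->freeze_owner = owner;
	return 0;
}

/* Exclusive thaw: only undoes a freeze taken by the same owner. */
static int model_thaw_excl(struct model_sb *sb, const void *owner)
{
	if (!sb->frozen || sb->freeze_owner != owner)
		return -1;
	sb->frozen = false;
	sb->freeze_owner = NULL;
	return 0;
}

/* Best-effort pass over all filesystems; returns how many we now own. */
static size_t model_freeze_all(struct model_sb *sbs, size_t n,
			       const void *owner)
{
	size_t owned = 0;

	for (size_t i = 0; i < n; i++)
		if (model_freeze_excl(&sbs[i], owner) == 0)
			owned++;
	return owned;
}

/* Thaw pass needs no side list: the owner mark identifies our freezes. */
static size_t model_thaw_all(struct model_sb *sbs, size_t n,
			     const void *owner)
{
	size_t thawed = 0;

	for (size_t i = 0; i < n; i++)
		if (model_thaw_excl(&sbs[i], owner) == 0)
			thawed++;
	return thawed;
}
```

In this model the thaw pass can safely walk a list that grew in the
meantime, because a filesystem that was frozen by someone else, failed
to freeze, or appeared later never carries the owner mark.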

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (7 preceding siblings ...)
  2025-04-01  9:32             ` Jan Kara
@ 2025-04-01 14:14             ` Peter Zijlstra
  2025-04-01 14:40               ` Christian Brauner
  2025-04-01 17:02             ` James Bottomley
  9 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2025-04-01 14:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, djwong, pavel,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as I
> could. Works for me (TM).
> 
> I need to catch some actual sleep now...
> 
> ---
> 
> Now all the pieces are in place to actually allow the power subsystem to
> freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.

Urgh, I was relying on all kthreads to be freezable for live-patching:

  https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net

So I understand the problem with freezing filesystems, but can't we
leave the TASK_FREEZABLE in the kthreads? The way I understand it, the
power subsystem will first freeze the filesystems before it goes freeze
threads anyway. So them remaining freezable should not affect anything,
right?


* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01 14:14             ` [PATCH 0/6] " Peter Zijlstra
@ 2025-04-01 14:40               ` Christian Brauner
  2025-04-01 14:59                 ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-01 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, djwong, pavel,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as I
> > could. Works for me (TM).
> > 
> > I need to catch some actual sleep now...
> > 
> > ---
> > 
> > Now all the pieces are in place to actually allow the power subsystem to
> > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > frozen and thawed if the power subsystem does actually own the freeze.
> 
> Urgh, I was relying on all kthreads to be freezable for live-patching:
> 
>   https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net
> 
> So I understand the problem with freezing filesystems, but can't we
> leave the TASK_FREEZABLE in the kthreads? The way I understand it, the

Yeah, we can.

> power subsystem will first freeze the filesystems before it goes freeze
> threads anyway. So them remaining freezable should not affect anything,
> right?

Yes. I've dropped the other patches. I've discussed this later
downthread with Jan.

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01 14:40               ` Christian Brauner
@ 2025-04-01 14:59                 ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2025-04-01 14:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, djwong, pavel,
	mingo, will, boqun.feng

On Tue, Apr 01, 2025 at 04:40:33PM +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 04:14:07PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 01, 2025 at 02:32:45AM +0200, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best as I
> > > could. Works for me (TM).
> > > 
> > > I need to catch some actual sleep now...
> > > 
> > > ---
> > > 
> > > Now all the pieces are in place to actually allow the power subsystem to
> > > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > > frozen and thawed if the power subsystem does actually own the freeze.
> > 
> > Urgh, I was relying on all kthreads to be freezable for live-patching:
> > 
> >   https://lkml.kernel.org/r/20250324134909.GA14718@noisy.programming.kicks-ass.net
> > 
> > So I understand the problem with freezing filesystems, but can't we
> > leave the TASK_FREEZABLE in the kthreads? The way I understand it, the
> 
> Yeah, we can.
> 
> > power subsystem will first freeze the filesystems before it goes freeze
> > threads anyway. So them remaining freezable should not affect anything,
> > right?
> 
> Yes. I've dropped the other patches. I've discussed this later
> downthread with Jan.

Thanks!

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01 13:03               ` Christian Brauner
@ 2025-04-01 16:57                 ` Jan Kara
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-04-01 16:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Tue 01-04-25 15:03:33, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 11:32:49AM +0200, Jan Kara wrote:
> > On Tue 01-04-25 02:32:45, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best as I
> > > could. Works for me (TM).
> > > 
> > > I need to catch some actual sleep now...
> > > 
> > > ---
> > > 
> > > Now all the pieces are in place to actually allow the power subsystem to
> > > freeze/thaw filesystems during suspend/resume. Filesystems are only
> > > frozen and thawed if the power subsystem does actually own the freeze.
> > > 
> > > Otherwise it risks thawing filesystems it didn't own. This could be
> > > done differently by, e.g., keeping the filesystems that were actually
> > > frozen on a list and then unfreezing them from that list. This is
> > > disgustingly unclean though and reeks of an ugly hack.
> > > 
> > > If the filesystem is already frozen by the time we've frozen all
> > > userspace processes we don't care to freeze it again. That's userspace's
> > > job once the process resumes. We only actually freeze filesystems if we
> > > absolutely have to and we ignore other failures to freeze.
> > 
> > Hum, I don't follow here. I supposed we'll use FREEZE_MAY_NEST |
> > FREEZE_HOLDER_KERNEL for freezing from power subsystem. As far as I
> > remember we have specifically designed nesting of freeze counters so that
> > this way power subsystem can be sure freezing succeeds even if the
> > filesystem is already frozen (by userspace or the kernel) and similarly
> > power subsystem cannot thaw a filesystem frozen by somebody else. It will
> > just drop its freeze refcount... What am I missing?
> 
> If we have 10 filesystems and suspend/hibernate manages to freeze 5 and
> then fails on the 6th for whatever odd reason (current or future) then
> power needs to undo the freeze of the first 5 filesystems. We can't just
> walk the list again because while it's unlikely that a new filesystem
> got added in the meantime we still cannot tell what filesystems the
> power subsystem actually managed to get a freeze reference count on that
> we need to drop during thaw.
> 
> There are various ways out of this ugliness. Either we record the
> filesystems the power subsystem managed to freeze on a temporary list in
> the callbacks and then walk that list backwards during thaw to undo the
> freezing, or we make sure that the power subsystem exclusively freezes
> only the things it can freeze and marks such filesystems as being owned
> by power for the duration of the suspend or resume cycle. I opted for
> the latter as that seemed the clean thing to do even if it means more
> code changes. What are your thoughts on this?

Ah, I see. Thanks for the explanation. So failure to freeze a filesystem
should be rare (mostly only due to IO errors or similar serious issues),
hence I'd consider it appropriate to fail hibernation if we fail to freeze
some filesystem. The function that's walking all superblocks and freezing
them could just walk from the superblock where freezing failed towards the
end and thaw all filesystems. That way the function also has the nice
property that it either freezes everything or keeps things as they were.

But you've touched on an interesting case I didn't consider: New
superblocks can be added to the end of the list while we are walking it.
These superblocks will not be frozen and on resume (or error recovery) this
will confuse things. Your "freeze owner" stuff deals with this problem
nicely. A somewhat lighter fix for this may be to pass the superblock to
start from / end with to the loops that iterate over superblocks and
freeze / thaw them. It doesn't seem too hacky but if you prefer your freeze
owner approach I won't object.
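
As a rough illustration of the all-or-nothing variant with nested freeze
counts, here is a userspace sketch. It is a toy model under stated
assumptions: struct nest_sb and the nest_* helpers are hypothetical, the
array stands in for the superblock list, and the explicit length n plays
the role of the "superblock to end with" bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a filesystem with a nestable freeze count. */
struct nest_sb {
	int freeze_count; /* number of freeze references currently held */
	bool fail_freeze; /* test knob: make the freeze attempt fail */
};

/* Nested freeze: succeeds even if somebody else already froze it. */
static int nest_freeze(struct nest_sb *sb)
{
	if (sb->fail_freeze)
		return -1;
	sb->freeze_count++;
	return 0;
}

/* Drop one reference; other holders keep the filesystem frozen. */
static void nest_thaw(struct nest_sb *sb)
{
	assert(sb->freeze_count > 0);
	sb->freeze_count--;
}

/*
 * Freeze sbs[0..n). On failure at index i, walk back over the entries
 * already frozen by this call and drop our reference on each, so the
 * caller sees either everything frozen or the state it started with.
 */
static int nest_freeze_range(struct nest_sb *sbs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (nest_freeze(&sbs[i]) != 0) {
			while (i-- > 0)
				nest_thaw(&sbs[i]);
			return -1;
		}
	}
	return 0;
}

/* Thaw exactly the range that was frozen, in reverse order. */
static void nest_thaw_range(struct nest_sb *sbs, size_t n)
{
	for (size_t i = n; i-- > 0; )
		nest_thaw(&sbs[i]);
}
```

Entries appended after the bound n was taken are never touched by either
walk, which is the property the bounded loops are meant to provide.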

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01  0:32           ` [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume Christian Brauner
                               ` (8 preceding siblings ...)
  2025-04-01 14:14             ` [PATCH 0/6] " Peter Zijlstra
@ 2025-04-01 17:02             ` James Bottomley
  2025-04-02  7:46               ` Christian Brauner
  9 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-04-01 17:02 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack, rafael
  Cc: Ard Biesheuvel, linux-efi, linux-kernel, mcgrof, hch, david,
	djwong, pavel, peterz, mingo, will, boqun.feng

On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> The whole shebang can also be found at:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> 
> I know nothing about power or hibernation. I've tested it as best as
> I could. Works for me (TM).

I'm testing the latest you have in work.freeze and it doesn't currently
work for me.  Patch 7b315c39b67d ("power: freeze filesystems during
suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
NULL and tripping over this check 

+static inline bool may_unfreeze(struct super_block *sb, enum
freeze_holder who,
+                               const void *freeze_owner)
+{
+       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+       if (who & FREEZE_EXCL) {
+               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+                       return false;

in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
from hibernate.  Setting it to __builtin_return_address(0) in
filesystems_freeze() makes everything work as expected, so that's what
I'm testing now.

I suppose one minor, minor nit is that the vagaries of English grammar
mean that the verbs fail and succeed don't take the same grammatical
construction, so failed can take the infinitive (failed to thaw)
perfectly well, but succeeded takes a prepositional gerund construction
instead: "succeeded at/in thawing" instead of the infinitive "succeeded
to thaw" ... I've no idea why, but I'd probably blame the Victorians
...

Regards,

James

* Re: [PATCH 2/2] efivarfs: support freeze/thaw
  2025-03-31 12:42           ` [PATCH 2/2] efivarfs: support freeze/thaw Christian Brauner
  2025-03-31 14:46             ` James Bottomley
@ 2025-04-01 19:31             ` James Bottomley
  2025-04-02  7:44               ` Christian Brauner
  1 sibling, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-04-01 19:31 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel, jack, Ard Biesheuvel
  Cc: linux-efi, linux-kernel, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

On Mon, 2025-03-31 at 14:42 +0200, Christian Brauner wrote:
[...]
> +	pr_info("efivarfs: resyncing variable state\n");
> +	for (;;) {
> +		int err;
> +		size_t size;
> +		struct inode *inode;
> +		struct efivar_entry *entry;
> +
> +		child = find_next_child(sb->s_root, child);
> +		if (!child)
> +			break;
> +
> +		inode = d_inode(child);
> +		entry = efivar_entry(inode);
> +
> +		err = efivar_entry_size(entry, &size);
> +		if (err)
> +			size = 0;
> +		else
> +			size += sizeof(__u32);
> +
> +		inode_lock(inode);
> +		i_size_write(inode, size);
> +		inode_unlock(inode);
> +
> +		if (!err)
> +			continue;
> +
> +		/* The variable doesn't exist anymore, delete it. */

The message that should be here got deleted.  We now only print
messages about variables we add not variables we remove.  I get that
the code is a bit chatty here, but it should either print both the
removing and adding messages or print neither, I think.

> +		simple_recursive_removal(child, NULL);
>  	}
[...]
> -			pr_info("efivarfs: removing variable %pd\n",
> -				ectx.dentry);

This is the lost message, although ectx.dentry should become child.

Regards,

James


* Re: [PATCH 2/2] efivarfs: support freeze/thaw
  2025-04-01 19:31             ` James Bottomley
@ 2025-04-02  7:44               ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-02  7:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	mcgrof, hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

On Tue, Apr 01, 2025 at 03:31:13PM -0400, James Bottomley wrote:
> On Mon, 2025-03-31 at 14:42 +0200, Christian Brauner wrote:
> [...]
> > +	pr_info("efivarfs: resyncing variable state\n");
> > +	for (;;) {
> > +		int err;
> > +		size_t size;
> > +		struct inode *inode;
> > +		struct efivar_entry *entry;
> > +
> > +		child = find_next_child(sb->s_root, child);
> > +		if (!child)
> > +			break;
> > +
> > +		inode = d_inode(child);
> > +		entry = efivar_entry(inode);
> > +
> > +		err = efivar_entry_size(entry, &size);
> > +		if (err)
> > +			size = 0;
> > +		else
> > +			size += sizeof(__u32);
> > +
> > +		inode_lock(inode);
> > +		i_size_write(inode, size);
> > +		inode_unlock(inode);
> > +
> > +		if (!err)
> > +			continue;
> > +
> > +		/* The variable doesn't exist anymore, delete it. */
> 
> The message that should be here got deleted.  We now only print
> messages about variables we add not variables we remove.  I get that
> the code is a bit chatty here, but it should either print both the
> removing and adding messages or print neither, I think.

Ok, I added the deletion printk line back.

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01 17:02             ` James Bottomley
@ 2025-04-02  7:46               ` Christian Brauner
  2025-04-08 15:43                 ` James Bottomley
  0 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-02  7:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, mcgrof, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > The whole shebang can also be found at:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > 
> > I know nothing about power or hibernation. I've tested it as best as
> > I could. Works for me (TM).
> 
> I'm testing the latest you have in work.freeze and it doesn't currently
> work for me.  Patch 7b315c39b67d ("power: freeze filesystems during
> suspend/resume") doesn't set filesystems_freeze_ptr so it ends up being
> NULL and tripping over this check 

I haven't pushed the new version there. Sorry about that. I only have it
locally.

> 
> +static inline bool may_unfreeze(struct super_block *sb, enum
> freeze_holder who,
> +                               const void *freeze_owner)
> +{
> +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> +       if (who & FREEZE_EXCL) {
> +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> +                       return false;
> 
> 
> in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to resume
> from hibernate.  Setting it to __builtin_return_address(0) in
> filesystems_freeze() makes everything work as expected, so that's what
> I'm testing now.

+1

I'll send the final version out in a bit.

* Re: [RFC PATCH 1/4] locking/percpu-rwsem: add freezable alternative to down_read
  2025-04-01 12:52           ` James Bottomley
@ 2025-04-02 11:47             ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-02 11:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, linux-kernel, mcgrof,
	hch, david, rafael, djwong, pavel, peterz, mingo, will,
	boqun.feng

On Tue 01-04-25 08:52:02, James Bottomley wrote:
> On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote:
> > On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> [...]
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index b379a46b5576..528e73f192ac 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct
> > > > super_block *sb, int level)
> > > >  static inline void __sb_start_write(struct super_block *sb, int
> > > > level)
> > > >  {
> > > >         percpu_down_read_freezable(sb->s_writers.rw_sem + level -
> > > > 1,
> > > > -                                  level == SB_FREEZE_WRITE);
> > > > +                                  (level == SB_FREEZE_WRITE ||
> > > > +                                   level ==
> > > > SB_FREEZE_PAGEFAULT));
> > > >  }
> > > 
> > > Yes, I was about to tell Jan that the condition here simply needs
> > > to be true.  All our rwsem levels need to be freezable to avoid a
> > > hibernation failure.
> > 
> > So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired
> > under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under
> > some other filesystem locks.
> 
> Just for SB_FREEZE_INTERNAL, I think there's no case of
> sb_start_intwrite() that can ever hold in D wait because by the time we
> acquire the semaphore for write, the internal freeze_fs should have
> been called and the filesystem should have quiesced itself.  On the
> other hand, if that theory itself is true, there's no real need for
> sb_start_intwrite() at all because it can never conflict.

This is not true. Sure, userspace should all be blocked, dirty pages
written back, but you still have filesystem background tasks like lazy
initialization of inode tables, inode garbage collection, regular lazy
updates of statistics in the superblock. These generally happen from
kthreads / work queues and they can still be scheduled and executed
although freeze_super() has started blocking SB_FREEZE_WRITE and
SB_FREEZE_PAGEFAULT levels... And generally this freeze level is there
exactly because it needs to be acquired from locking context which doesn't
allow usage of SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT levels.
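
For reference, the level arithmetic in the quoted diff
(s_writers.rw_sem + level - 1) can be modelled in userspace as below.
This is a sketch under assumptions: the enum values are reproduced from
memory of include/linux/fs.h, where the level this thread calls
SB_FREEZE_INTERNAL is, as far as I recall, named SB_FREEZE_FS; the
helper function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Freeze levels, assumed to match include/linux/fs.h. */
enum {
	SB_UNFROZEN = 0,
	SB_FREEZE_WRITE = 1,     /* writes, dir ops, ioctls frozen */
	SB_FREEZE_PAGEFAULT = 2, /* page faults stopped as well */
	SB_FREEZE_FS = 3,        /* internal fs use (sb_start_intwrite) */
	SB_FREEZE_COMPLETE = 4,
};
#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)

/* One rwsem per level; level N lives at array index N - 1. */
static int level_to_rwsem_index(int level)
{
	assert(level >= SB_FREEZE_WRITE && level <= SB_FREEZE_LEVELS);
	return level - 1;
}

/* Condition before the quoted diff: only the WRITE level is freezable. */
static bool freezable_write_only(int level)
{
	return level == SB_FREEZE_WRITE;
}

/* Condition after the quoted diff: WRITE and PAGEFAULT are freezable. */
static bool freezable_write_and_pagefault(int level)
{
	return level == SB_FREEZE_WRITE || level == SB_FREEZE_PAGEFAULT;
}
```

Neither predicate covers the internal level, which is the one taken from
the kthread / work queue contexts described above.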

> >  So if you freeze the filesystem, a task can block on frozen
> > filesystem with e.g. mmap_sem held and if some other task then blocks
> > on grabbing that mmap_sem, hibernation fails because we'll be unable
> > to hibernate the task waiting for mmap_sem. So if you'd like to
> > completely avoid these hibernation failures, you'd have to make a
> > slew of filesystem related locks use freezable sleeping. I don't
> > think that's feasible.
> 
> I wouldn't see that because I'm on x86_64 and that takes the vma_lock
> in page faults not the mmap_lock.  The granularity of all these locks
> is process level, so it's hard to see what they'd be racing with ...

I agree that because of vma_lock it would be much harder to see this. But
as far as I remember mmap_sem is still a fallback option when we race with
VMA modification even for x86 so this problem is possible to hit, just much
more unlikely.

> even if I conjecture two threads trying to write to something, they'd
> have to have some internal co-ordination which would likely prevent the
> second one from writing if the first got stuck on the page fault. 
> 
> > I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> > freezable would be rare enough but you've proven they are quite
> > frequent. We can try making SB_FREEZE_PAGEFAULT level (or even
> > SB_FREEZE_INTERNAL) freezable and see whether that works well
> > enough...
> 
> I'll try to construct a more severe test than systemd-journald ... it
> looks to be single threaded in its operation.

OK, thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-01 16:57                 ` Jan Kara
@ 2025-04-02 14:07                   ` Christian Brauner
  2025-04-02 14:07                     ` [PATCH v2 1/4] fs: add owner of freeze/thaw Christian Brauner
                                       ` (4 more replies)
  0 siblings, 5 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 14:07 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Now all the pieces are in place to actually allow the power subsystem to
freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

Otherwise it risks thawing filesystems it didn't own. This could be
done differently by, e.g., keeping the filesystems that were actually
frozen on a list and then unfreezing them from that list. This is
disgustingly unclean though and reeks of an ugly hack.

If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's userspace's
job once the process resumes. We only actually freeze filesystems if we
absolutely have to and we ignore other failures to freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
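
The best-effort policy described above can be sketched in userspace as
follows. This is only an illustration: struct be_sb, struct be_result and
the be_* helpers are hypothetical, and the injected error code stands in
for whatever a real filesystem freeze might return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one mounted filesystem. */
struct be_sb {
	bool frozen;    /* already frozen, e.g. by userspace */
	int inject_err; /* 0, or a negative errno the freeze should return */
};

/* Tally of what a best-effort pass did. */
struct be_result {
	size_t frozen;        /* filesystems we froze ourselves */
	size_t skipped_busy;  /* already frozen: left to their owner */
	size_t skipped_error; /* any other failure: ignored */
};

static int be_freeze(struct be_sb *sb)
{
	if (sb->frozen)
		return -EBUSY;
	if (sb->inject_err)
		return sb->inject_err;
	sb->frozen = true;
	return 0;
}

/* Never aborts: every failure is counted and the walk continues. */
static struct be_result be_freeze_all(struct be_sb *sbs, size_t n)
{
	struct be_result res = { 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		int err = be_freeze(&sbs[i]);

		assert(err <= 0);
		if (err == 0)
			res.frozen++;
		else if (err == -EBUSY)
			res.skipped_busy++;
		else
			res.skipped_error++;
	}
	return res;
}
```

Bubbling errors up instead would amount to returning early from the loop
on the first error other than -EBUSY, which is the alternative this
series deliberately does not take.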

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Drop all patches that remove TASK_FREEZABLE.
- Expand commit messages a bit.
- Link to v1: https://lore.kernel.org/r/20250401-work-freeze-v1-0-d000611d4ab0@kernel.org

---
Christian Brauner (4):
      fs: add owner of freeze/thaw
      fs: allow all writers to be frozen
      power: freeze filesystems during suspend/resume
      kernfs: add warning about implementing freeze/thaw

 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++-----
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/kernfs/mount.c           | 15 +++++++++
 fs/super.c                  | 82 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_notify_failure.c |  6 ++--
 include/linux/fs.h          | 16 +++++----
 kernel/power/hibernate.c    | 16 ++++++++-
 kernel/power/main.c         | 31 +++++++++++++++++
 kernel/power/power.h        |  4 +++
 kernel/power/suspend.c      |  7 ++++
 13 files changed, 174 insertions(+), 45 deletions(-)
---
base-commit: 62dfd8d59e2d16873398ede5b1835e302df789b3
change-id: 20250401-work-freeze-693b5b5a78e0

* [PATCH v2 1/4] fs: add owner of freeze/thaw
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
@ 2025-04-02 14:07                     ` Christian Brauner
  2025-04-03 14:56                       ` Jan Kara
  2025-04-02 14:07                     ` [PATCH v2 2/4] fs: allow all writers to be frozen Christian Brauner
                                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 14:07 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Some kernel subsystems must be guaranteed that they are the owner of
the freeze in order to avoid any risk of deadlocks. This is the case for
the power subsystem. Enable it to recognize whether it did actually
freeze the filesystem.

If userspace has 10 filesystems and suspend/hibernate manages to freeze 5
and then fails on the 6th for whatever odd reason (current or future)
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again because while it's unlikely that a new
filesystem got added in the meantime it still cannot tell which
filesystems the power subsystem actually managed to get a freeze
reference count on that needs to be dropped during thaw.

There are various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing, or make sure that the power subsystem exclusively freezes only
the things it can freeze and marks such filesystems as being owned by
power for the duration of the suspend or resume cycle. I opted for the
latter as that seemed the clean thing to do even if it means more code
changes.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/f2fs/gc.c                |  6 ++--
 fs/gfs2/super.c             | 20 ++++++------
 fs/gfs2/sys.c               |  4 +--
 fs/ioctl.c                  |  8 ++---
 fs/super.c                  | 76 ++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/fscounters.c   |  4 +--
 fs/xfs/xfs_notify_failure.c |  6 ++--
 include/linux/fs.h          | 13 +++++---
 8 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 2b8f9239bede..3e8af62c9e15 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 	if (err)
 		return err;
 
-	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (err)
 		return err;
 
 	if (f2fs_readonly(sbi->sb)) {
-		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 		if (err)
 			return err;
 		return -EROFS;
@@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
 out_err:
 	f2fs_up_write(&sbi->cp_global_sem);
 	f2fs_up_write(&sbi->gc_lock);
-	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return err;
 }
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 44e5658b896c..519943189109 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
 	return sdp->sd_log_error;
 }
 
-static int gfs2_do_thaw(struct gfs2_sbd *sdp)
+static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
 {
 	struct super_block *sb = sdp->sd_vfs;
 	int error;
@@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
 	error = gfs2_freeze_lock_shared(sdp);
 	if (error)
 		goto fail;
-	error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = thaw_super(sb, who, freeze_owner);
 	if (!error)
 		return 0;
 
@@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
 	if (test_bit(SDF_FROZEN, &sdp->sd_flags))
 		goto freeze_failed;
 
-	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto freeze_failed;
 
 	gfs2_freeze_unlock(sdp);
 	set_bit(SDF_FROZEN, &sdp->sd_flags);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
 	if (error)
 		goto out;
 
@@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
  *
  */
 
-static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 	}
 
 	for (;;) {
-		error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sb, who, freeze_owner);
 		if (error) {
 			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
 				error);
@@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
 			break;
 		}
 
-		error = gfs2_do_thaw(sdp);
+		error = gfs2_do_thaw(sdp, who, freeze_owner);
 		if (error)
 			goto out;
 
@@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
  *
  */
 
-static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
+static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
+			   const void *freeze_owner)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	int error;
@@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
 	atomic_inc(&sb->s_active);
 	gfs2_freeze_unlock(sdp);
 
-	error = gfs2_do_thaw(sdp);
+	error = gfs2_do_thaw(sdp, who, freeze_owner);
 
 	if (!error) {
 		clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index ecc699f8d9fc..748125653d6c 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 
 	switch (n) {
 	case 0:
-		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	case 1:
-		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
+		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
 		break;
 	default:
 		return -EINVAL;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index c91fd2b46a77..bedc83fc2f20 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
 
 	/* Freeze */
 	if (sb->s_op->freeze_super)
-		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
-	return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_fsthaw(struct file *filp)
@@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
 
 	/* Thaw */
 	if (sb->s_op->thaw_super)
-		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
-	return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
+	return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int ioctl_file_dedupe_range(struct file *file,
diff --git a/fs/super.c b/fs/super.c
index 3c4a496d6438..3ddded4360c6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -39,7 +39,8 @@
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
 	if (IS_ENABLED(CONFIG_BLOCK))
 		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
 			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
-	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
 	return;
 }
 
@@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->freeze_super)
-		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 	else
-		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 
 	deactivate_super(sb);
 }
@@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->thaw_super)
-		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 	else
-		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
+		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
 
 	deactivate_super(sb);
 }
@@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
 
 	if (sb->s_op->freeze_super)
 		error = sb->s_op->freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = freeze_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	if (!error)
 		error = sync_blockdev(bdev);
 	deactivate_super(sb);
@@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
 
 	if (sb->s_op->thaw_super)
 		error = sb->s_op->thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	else
 		error = thaw_super(sb,
-				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
+				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
 	deactivate_super(sb);
 	return error;
 }
@@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
 }
 
 #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
-#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
+#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
 
 static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
 {
@@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+
+		return (sb->s_writers.freeze_kcount +
+			sb->s_writers.freeze_ucount) == 0;
+	}
+
+	/* This filesystem is already exclusively frozen. */
+	if (sb->s_writers.freeze_owner)
+		return false;
+
 	if (who & FREEZE_HOLDER_KERNEL)
 		return (who & FREEZE_MAY_NEST) ||
 		       sb->s_writers.freeze_kcount == 0;
@@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 	return false;
 }
 
+static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
+				const void *freeze_owner)
+{
+	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
+	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
+
+	if (who & FREEZE_EXCL) {
+		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
+			return false;
+		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
+			return false;
+		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+			return false;
+		return sb->s_writers.freeze_owner == freeze_owner;
+	}
+
+	return sb->s_writers.freeze_owner == NULL;
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs may return
@@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
  * Return: If the freeze was successful zero is returned. If the freeze
  *         failed a negative error code is returned.
  */
-int freeze_super(struct super_block *sb, enum freeze_holder who)
+int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
 {
 	int ret;
 
@@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	if (sb_rdonly(sb)) {
 		/* Nothing to do really... */
 		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+		sb->s_writers.freeze_owner = freeze_owner;
 		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 		wake_up_var(&sb->s_writers.frozen);
 		super_unlock_excl(sb);
@@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
+	sb->s_writers.freeze_owner = freeze_owner;
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 	wake_up_var(&sb->s_writers.frozen);
 	lockdep_sb_freeze_release(sb);
@@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
  * removes that state without releasing the other state or unlocking the
  * filesystem.
  */
-static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
+static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
+			     const void *freeze_owner)
 {
 	int error = -EINVAL;
 
 	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
 		goto out_unlock;
 
+	if (!may_unfreeze(sb, who, freeze_owner))
+		goto out_unlock;
+
 	/*
 	 * All freezers share a single active reference.
 	 * So just unlock in case there are any left.
@@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 
 	if (sb_rdonly(sb)) {
 		sb->s_writers.frozen = SB_UNFROZEN;
+		sb->s_writers.freeze_owner = NULL;
 		wake_up_var(&sb->s_writers.frozen);
 		goto out_deactivate;
 	}
@@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 	}
 
 	sb->s_writers.frozen = SB_UNFROZEN;
+	sb->s_writers.freeze_owner = NULL;
 	wake_up_var(&sb->s_writers.frozen);
 	sb_freeze_unlock(sb, SB_FREEZE_FS);
 out_deactivate:
@@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * thaw_super -- unlock filesystem
  * @sb: the super to thaw
  * @who: context that wants to freeze
+ * @freeze_owner: owner of the freeze
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super()
  * if there are no remaining freezes on the filesystem.
@@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
  * have been frozen through the block layer via multiple block devices.
  * The filesystem remains frozen until all block devices are unfrozen.
  */
-int thaw_super(struct super_block *sb, enum freeze_holder who)
+int thaw_super(struct super_block *sb, enum freeze_holder who,
+	       const void *freeze_owner)
 {
 	if (!super_lock_excl(sb)) {
 		WARN_ON_ONCE("Dying superblock while thawing!");
 		return -EINVAL;
 	}
-	return thaw_super_locked(sb, who);
+	return thaw_super_locked(sb, who, freeze_owner);
 }
 EXPORT_SYMBOL(thaw_super);
 
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e629663e460a..9b598c5790ad 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -123,7 +123,7 @@ xchk_fsfreeze(
 {
 	int			error;
 
-	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsfreeze(sc, error);
 	return error;
 }
@@ -135,7 +135,7 @@ xchk_fsthaw(
 	int			error;
 
 	/* This should always succeed, we have a kernel freeze */
-	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
+	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
 	trace_xchk_fsthaw(sc, error);
 	return error;
 }
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ed8d8ed42f0a..3545dc1d953c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
 	struct super_block	*sb = mp->m_super;
 	int			error;
 
-	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 	if (error)
 		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
 
@@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
 	int			error;
 
 	if (kernel_frozen) {
-		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
 		if (error)
 			xfs_emerg(mp, "still frozen after notify failure, err=%d",
 				error);
@@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
 	 * Also thaw userspace call anyway because the device is about to be
 	 * removed immediately.
 	 */
-	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
 }
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1aa578412f1b..b379a46b5576 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1307,6 +1307,7 @@ struct sb_writers {
 	unsigned short			frozen;		/* Is sb frozen? */
 	int				freeze_kcount;	/* How many kernel freeze requests? */
 	int				freeze_ucount;	/* How many userspace freeze requests? */
+	const void			*freeze_owner;	/* Owner of the freeze */
 	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };
 
@@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
+ * @FREEZE_EXCL: whether actual freezing must be done by the caller
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.
@@ -2283,6 +2285,7 @@ enum freeze_holder {
 	FREEZE_HOLDER_KERNEL	= (1U << 0),
 	FREEZE_HOLDER_USERSPACE	= (1U << 1),
 	FREEZE_MAY_NEST		= (1U << 2),
+	FREEZE_EXCL		= (1U << 3),
 };
 
 struct super_operations {
@@ -2296,9 +2299,9 @@ struct super_operations {
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
-	int (*freeze_super) (struct super_block *, enum freeze_holder who);
+	int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*freeze_fs) (struct super_block *);
-	int (*thaw_super) (struct super_block *, enum freeze_holder who);
+	int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
@@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
 extern int vfs_statfs(const struct path *, struct kstatfs *);
 extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
-int freeze_super(struct super_block *super, enum freeze_holder who);
-int thaw_super(struct super_block *super, enum freeze_holder who);
+int freeze_super(struct super_block *super, enum freeze_holder who,
+		 const void *freeze_owner);
+int thaw_super(struct super_block *super, enum freeze_holder who,
+	       const void *freeze_owner);
 extern __printf(2, 3)
 int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
 extern int super_setup_bdi(struct super_block *sb);

-- 
2.47.2



* [PATCH v2 2/4] fs: allow all writers to be frozen
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
  2025-04-02 14:07                     ` [PATCH v2 1/4] fs: add owner of freeze/thaw Christian Brauner
@ 2025-04-02 14:07                     ` Christian Brauner
  2025-04-02 15:32                       ` Christian Brauner
  2025-04-03 14:59                       ` Jan Kara
  2025-04-02 14:07                     ` [PATCH v2 3/4] power: freeze filesystems during suspend/resume Christian Brauner
                                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 14:07 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

During suspend/hibernate we need to be able to freeze all writers.
Otherwise tasks such as systemd-journald that mmap a file and write to
it cannot be frozen once we've already frozen the filesystem.

This has some risk of not being able to freeze processes in case a
process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
SB_FREEZE_INTERNAL under some other filesystem specific lock. If the
filesystem is frozen, a task can block on the frozen filesystem with
e.g., mmap_sem held. If some other task then blocks on grabbing that
mmap_sem, hibernation will fail because it is unable to hibernate a task
holding mmap_sem. This could be fixed by making a range of filesystem
related locks use freezable sleeping. That's impractical and not
warranted just for suspend/hibernate. Assume that this is an infrequent
problem and we've given userspace a way to skip filesystem freezing
through a sysfs file.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..1edcba3cd68e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
-	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-				   level == SB_FREEZE_WRITE);
+	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)

-- 
2.47.2



* [PATCH v2 3/4] power: freeze filesystems during suspend/resume
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
  2025-04-02 14:07                     ` [PATCH v2 1/4] fs: add owner of freeze/thaw Christian Brauner
  2025-04-02 14:07                     ` [PATCH v2 2/4] fs: allow all writers to be frozen Christian Brauner
@ 2025-04-02 14:07                     ` Christian Brauner
  2025-04-03 16:29                       ` Jan Kara
  2025-04-02 14:07                     ` [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw Christian Brauner
  2025-07-20 19:23                     ` [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume Askar Safin
  4 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 14:07 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.

We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.

What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.

We add a new sysfs knob /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.

Also it is technically possible that filesystem_freeze_enabled is true
and power freezes the filesystems but, before all processes are frozen,
another process disables filesystem_freeze_enabled. If power were to
place the filesystems_thaw() call under filesystem_freeze_enabled it
would fail to thaw the filesystems it froze. The exclusive holder
mechanism makes it possible to iterate through the list without any
concern, making sure that no filesystems are left frozen.
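
From userspace, opting in could look like the sketch below. It assumes
the attribute accepts the strings "0" and "1" as in the store handler of
this patch; write_attr() and set_freeze_filesystems() are hypothetical
helper names, and the path is passed in so the helper can be exercised
against an ordinary file.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a short string to a sysfs-style attribute; returns 0 or -errno. */
static int write_attr(const char *path, const char *val)
{
	size_t len = strlen(val);
	ssize_t n;
	int fd, ret = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;	/* short write: treat as an I/O error */

	close(fd);
	return ret;
}

/* Enable (1) or disable (0) filesystem freezing during suspend/hibernate. */
static int set_freeze_filesystems(const char *path, int enable)
{
	return write_attr(path, enable ? "1" : "0");
}
```

A caller with sufficient privileges would pass
"/sys/power/freeze_filesystems" as the path before initiating
suspend or hibernate.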

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c               | 14 ++++++++++----
 kernel/power/hibernate.c | 16 +++++++++++++++-
 kernel/power/main.c      | 31 +++++++++++++++++++++++++++++++
 kernel/power/power.h     |  4 ++++
 kernel/power/suspend.c   |  7 +++++++
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 3ddded4360c6..b4bdbc509dba 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
 	return active;
 }
 
+static const char *filesystems_freeze_ptr = "filesystems_freeze";
+
 static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 {
 	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
@@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->freeze_super)
-		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				       filesystems_freeze_ptr);
 	else
-		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			     filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
@@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
 		return;
 
 	if (sb->s_op->thaw_super)
-		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+				     filesystems_freeze_ptr);
 	else
-		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
+		thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
+			   filesystems_freeze_ptr);
 
 	deactivate_super(sb);
 }
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 50ec26ea696b..37d733945c59 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -777,6 +777,8 @@ int hibernate(void)
 		goto Restore;
 
 	ksys_sync_helper();
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 
 	error = freeze_processes();
 	if (error)
@@ -845,6 +847,7 @@ int hibernate(void)
 	/* Don't bother checking whether freezer_test_done is true */
 	freezer_test_done = false;
  Exit:
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
  Restore:
 	pm_restore_console();
@@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	if (error)
 		goto restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
+
 	error = freeze_processes();
 	if (error)
 		goto exit;
@@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
 	thaw_processes();
 
 exit:
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_HIBERNATION);
 
 restore:
@@ -1028,19 +1035,26 @@ static int software_resume(void)
 	if (error)
 		goto Restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
+
 	pm_pr_dbg("Preparing processes for hibernation restore.\n");
 	error = freeze_processes();
-	if (error)
+	if (error) {
+		filesystems_thaw();
 		goto Close_Finish;
+	}
 
 	error = freeze_kernel_threads();
 	if (error) {
 		thaw_processes();
+		filesystems_thaw();
 		goto Close_Finish;
 	}
 
 	error = load_image_and_restore();
 	thaw_processes();
+	filesystems_thaw();
  Finish:
 	pm_notifier_call_chain(PM_POST_RESTORE);
  Restore:
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 6254814d4817..0b0e76324c43 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
 
 #endif	/* CONFIG_FREEZER*/
 
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+bool filesystem_freeze_enabled = false;
+
+static ssize_t freeze_filesystems_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
+}
+
+static ssize_t freeze_filesystems_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t n)
+{
+	unsigned long val;
+
+	if (kstrtoul(buf, 10, &val))
+		return -EINVAL;
+
+	if (val > 1)
+		return -EINVAL;
+
+	filesystem_freeze_enabled = !!val;
+	return n;
+}
+
+power_attr(freeze_filesystems);
+#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
+
 static struct attribute * g[] = {
 	&state_attr.attr,
 #ifdef CONFIG_PM_TRACE
@@ -991,6 +1019,9 @@ static struct attribute * g[] = {
 #endif
 #ifdef CONFIG_FREEZER
 	&pm_freeze_timeout_attr.attr,
+#endif
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+	&freeze_filesystems_attr.attr,
 #endif
 	NULL,
 };
diff --git a/kernel/power/power.h b/kernel/power/power.h
index c352dea2f67b..2eb81662b8fa 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -18,6 +18,10 @@ struct swsusp_info {
 	unsigned long		size;
 } __aligned(PAGE_SIZE);
 
+#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
+extern bool filesystem_freeze_enabled;
+#endif
+
 #ifdef CONFIG_HIBERNATION
 /* kernel/power/snapshot.c */
 extern void __init hibernate_reserved_size_init(void);
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 8eaec4ab121d..76b141b9aac0 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -30,6 +30,7 @@
 #include <trace/events/power.h>
 #include <linux/compiler.h>
 #include <linux/moduleparam.h>
+#include <linux/fs.h>
 
 #include "power.h"
 
@@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
 	if (error)
 		goto Restore;
 
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 	trace_suspend_resume(TPS("freeze_processes"), 0, true);
 	error = suspend_freeze_processes();
 	trace_suspend_resume(TPS("freeze_processes"), 0, false);
@@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
 static void suspend_finish(void)
 {
 	suspend_thaw_processes();
+	filesystems_thaw();
 	pm_notifier_call_chain(PM_POST_SUSPEND);
 	pm_restore_console();
 }
@@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
 		ksys_sync_helper();
 		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
 	}
+	if (filesystem_freeze_enabled)
+		filesystems_freeze();
 
 	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
 	pm_suspend_clear_flags();
@@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
 	pm_pr_dbg("Finishing wakeup.\n");
 	suspend_finish();
  Unlock:
+	filesystems_thaw();
 	mutex_unlock(&system_transition_mutex);
 	return error;
 }

-- 
2.47.2



* [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
                                       ` (2 preceding siblings ...)
  2025-04-02 14:07                     ` [PATCH v2 3/4] power: freeze filesystems during suspend/resume Christian Brauner
@ 2025-04-02 14:07                     ` Christian Brauner
  2025-04-03 15:00                       ` Jan Kara
  2025-07-20 19:23                     ` [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume Askar Safin
  4 siblings, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 14:07 UTC (permalink / raw)
  To: linux-fsdevel, jack
  Cc: Christian Brauner, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

Sysfs is built on top of kernfs and sysfs provides the power management
infrastructure to support suspend/hibernate by writing to various files
in /sys/power/. As filesystems may be automatically frozen during
suspend/hibernate, implementing freeze/thaw support for kernfs
generically will cause deadlocks, as the task initiating
suspend/hibernation will hold a VFS lock that it will then wait upon to
be released. If freeze/thaw for kernfs is needed, talk to the VFS.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/kernfs/mount.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1..d2073bb2b633 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
 
 	.show_options	= kernfs_sop_show_options,
 	.show_path	= kernfs_sop_show_path,
+
+	/*
+	 * sysfs is built on top of kernfs and sysfs provides the power
+	 * management infrastructure to support suspend/hibernate by
+	 * writing to various files in /sys/power/. As filesystems may
+	 * be automatically frozen during suspend/hibernate implementing
+	 * freeze/thaw support for kernfs generically will cause
+	 * deadlocks as the suspending/hibernation initiating task will
+	 * hold a VFS lock that it will then wait upon to be released.
+	 * If freeze/thaw for kernfs is needed talk to the VFS.
+	 */
+	.freeze_fs	= NULL,
+	.unfreeze_fs	= NULL,
+	.freeze_super	= NULL,
+	.thaw_super	= NULL,
 };
 
 static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,

-- 
2.47.2



* Re: [PATCH v2 2/4] fs: allow all writers to be frozen
  2025-04-02 14:07                     ` [PATCH v2 2/4] fs: allow all writers to be frozen Christian Brauner
@ 2025-04-02 15:32                       ` Christian Brauner
  2025-04-02 16:03                         ` James Bottomley
  2025-04-03 14:59                       ` Jan Kara
  1 sibling, 1 reply; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 15:32 UTC (permalink / raw)
  To: jack, peterz
  Cc: Ard Biesheuvel, linux-efi, linux-kernel, James Bottomley, mcgrof,
	hch, david, rafael, djwong, pavel, mingo, will, boqun.feng,
	linux-fsdevel

On Wed, Apr 02, 2025 at 04:07:32PM +0200, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
> 
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesystem-specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation will fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/linux/fs.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
>  }

Jan, one more thought about freezability here. We know that there
can be at least one process during hibernation that ends up generating
page faults and that's systemd-journald. When systemd-sleep requests
writing a hibernation image via /sys/power/ files it will inevitably end
up freezing systemd-journald and it may be generating a page fault with
->mmap_lock held. systemd-journald is now sleeping with
SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know this can cause
hibernation to fail. That part is fine. What isn't is that we will very
likely always trigger:

#ifdef CONFIG_LOCKDEP
        /*
         * It's dangerous to freeze with locks held; there be dragons there.
         */
        if (!(state & __TASK_FREEZABLE_UNSAFE))
                WARN_ON_ONCE(debug_locks && p->lockdep_depth);
#endif

with lockdep enabled.

So we really actually need percpu_rwsem_read_freezable_unsafe(), i.e.,
TASK_FREEZABLE_UNSAFE.
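
To make the condition concrete, here is a minimal userspace sketch (not
kernel code) of the check quoted above. The struct, the flag values and
the name model_freeze_would_warn() are made up for illustration; only
the shape of the condition is taken from the snippet.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the real ones live in include/linux/sched.h. */
#define MODEL_TASK_FREEZABLE		0x1u
#define MODEL_TASK_FREEZABLE_UNSAFE	0x2u

/* Hypothetical stand-in for the task_struct fields the check looks at. */
struct model_task {
	unsigned int state;	/* sleep state flags */
	int lockdep_depth;	/* number of lockdep-tracked locks held */
};

/*
 * Mirrors the quoted condition: freezing a task that holds locks
 * warns, unless the task went to sleep with the _UNSAFE flag set.
 */
static bool model_freeze_would_warn(const struct model_task *p,
				    bool debug_locks)
{
	assert(p);

	if (!(p->state & MODEL_TASK_FREEZABLE_UNSAFE))
		return debug_locks && p->lockdep_depth > 0;
	return false;
}
```

In this model a task sleeping in __sb_start_write() with one lock held
(the ->mmap_lock case described above) warns when it sleeps with the
plain freezable flag and stays quiet with the unsafe one.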

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 2/4] fs: allow all writers to be frozen
  2025-04-02 15:32                       ` Christian Brauner
@ 2025-04-02 16:03                         ` James Bottomley
  2025-04-02 16:13                           ` Christian Brauner
  0 siblings, 1 reply; 120+ messages in thread
From: James Bottomley @ 2025-04-02 16:03 UTC (permalink / raw)
  To: Christian Brauner, jack, peterz
  Cc: Ard Biesheuvel, linux-efi, linux-kernel, mcgrof, hch, david,
	rafael, djwong, pavel, mingo, will, boqun.feng, linux-fsdevel

On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
[...]
> Jan, one more thought about freezability here. We know that there
> can be at least one process during hibernation that ends up
> generating page faults and that's systemd-journald. When systemd-
> sleep requests writing a hibernation image via /sys/power/ files it
> will inevitably end up freezing systemd-journald and it may be
> generating a page fault with ->mmap_lock held. systemd-journald is
> now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> this can cause hibernation to fail. That part is fine. What isn't is
> that we will very likely always trigger:
> 
> #ifdef CONFIG_LOCKDEP
>         /*
>          * It's dangerous to freeze with locks held; there be dragons there.
>          */
>         if (!(state & __TASK_FREEZABLE_UNSAFE))
>                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> #endif
> 
> with lockdep enabled.
> 
> So we really actually need percpu_rwsem_read_freezable_unsafe(),
> i.e., TASK_FREEZABLE_UNSAFE.

The sched people have pretty strong views about people not doing this,
expressed in the comment in sched.h and commit f5d39b020809
("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
variants got removed with prejudice.

If we do get into this situation the worst that can happen is that
another upper lock acquisition triggers a hibernate failure and we thaw
everything, thus we can never truly deadlock, which is the fear, so
perhaps they might be OK with this.  Note that Rafael's solution to
this was to disable lockdep around hibernate/suspend and resume, which
is another possibility.

Regards,

James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 2/4] fs: allow all writers to be frozen
  2025-04-02 16:03                         ` James Bottomley
@ 2025-04-02 16:13                           ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-02 16:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: jack, peterz, Ard Biesheuvel, linux-efi, linux-kernel, mcgrof,
	hch, david, rafael, djwong, pavel, mingo, will, boqun.feng,
	linux-fsdevel

On Wed, Apr 02, 2025 at 12:03:24PM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 17:32 +0200, Christian Brauner wrote:
> [...]
> > Jan, one more thought about freezability here. We know that there
> > can be at least one process during hibernation that ends up
> > generating page faults and that's systemd-journald. When systemd-
> > sleep requests writing a hibernation image via /sys/power/ files it
> > will inevitably end up freezing systemd-journald and it may be
> > generating a page fault with ->mmap_lock held. systemd-journald is
> > now sleeping with SB_FREEZE_PAGEFAULT and TASK_FREEZABLE. We know
> > this can cause hibernation to fail. That part is fine. What isn't is
> > that we will very likely always trigger:
> > 
> > #ifdef CONFIG_LOCKDEP
> >         /*
> >          * It's dangerous to freeze with locks held; there be dragons there.
> >          */
> >         if (!(state & __TASK_FREEZABLE_UNSAFE))
> >                 WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > #endif
> > 
> > with lockdep enabled.
> > 
> > So we really actually need percpu_rwsem_read_freezable_unsafe(),
> > i.e., TASK_FREEZABLE_UNSAFE.
> 
> The sched people have pretty strong views about people not doing this,
> expressed in the comment in sched.h and commit f5d39b020809
> ("freezer,sched: Rewrite core freezer logic") where most of the _unsafe
> variants got removed with prejudice.
> 
> If we do get into this situation the worst that can happen is that
> another upper lock acquisition triggers a hibernate failure and we thaw
> everything, thus we can never truly deadlock, which is the fear, so

Yes, I know that it's harmless, but we must not generate misleading
lockdep splats that confuse users when lockdep is turned on.

> perhaps they might be OK with this.  Note that Rafael's solution to
> this was to disable lockdep around hibernate/suspend and resume, which
> is another possibility.

It can be done as a follow-up. I'm just saying it needs treatment.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 1/4] fs: add owner of freeze/thaw
  2025-04-02 14:07                     ` [PATCH v2 1/4] fs: add owner of freeze/thaw Christian Brauner
@ 2025-04-03 14:56                       ` Jan Kara
  2025-04-03 19:33                         ` Christian Brauner
  2025-04-04 10:24                         ` [PATCH] fs: allow nesting with FREEZE_EXCL Christian Brauner
  0 siblings, 2 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-03 14:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Wed 02-04-25 16:07:31, Christian Brauner wrote:
> For some kernel subsystems it is paramount that they are guaranteed that
> they are the owner of the freeze to avoid any risk of deadlocks. This is
> the case for the power subsystem. Enable it to recognize whether it did
> actually freeze the filesystem.
> 
> If userspace has 10 filesystems and suspend/hibernate manages to freeze 5
> and then fails on the 6th for whatever odd reason (current or future)
> then power needs to undo the freeze of the first 5 filesystems. It can't
> just walk the list again because while it's unlikely that a new
> filesystem got added in the meantime it still cannot tell which
> filesystems the power subsystem actually managed to get a freeze
> reference count on that needs to be dropped during thaw.
> 
> There's various ways out of this ugliness. For example, record the
> filesystems the power subsystem managed to freeze on a temporary list in
> the callbacks and then walk that list backwards during thaw to undo the
> freezing or make sure that the power subsystem just actually exclusively
> freezes things it can freeze and marking such filesystems as being owned
> by power for the duration of the suspend or resume cycle. I opted for
> the latter as that seemed the clean thing to do even if it means more
> code changes.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

I have realized a slight catch with this approach: if hibernation races
with another filesystem freeze (e.g. DM reconfiguration), hibernation will
not freeze a filesystem because it's already frozen, but userspace may then
thaw the filesystem before hibernation actually happens (relatively
harmless). If the race happens the other way around, DM reconfiguration may
fail with EBUSY (rather unexpected). So somehow tracking which fs was
frozen by suspend while properly nesting with other freeze users may
actually be a better approach (maybe just a sb flag, even though it's
somewhat hacky?).
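
Both orderings can be replayed with a small userspace model (a sketch,
not kernel code). The struct and the model_*() helpers are hypothetical;
the checks follow may_freeze() and may_unfreeze() from the diff below,
and "frozen" is simplified to "some freeze count is non-zero".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum {	/* same bit layout as enum freeze_holder in the patch */
	M_HOLDER_KERNEL    = 1u << 0,
	M_HOLDER_USERSPACE = 1u << 1,
	M_MAY_NEST         = 1u << 2,
	M_EXCL             = 1u << 3,
};

struct model_sb {		/* stand-in for sb->s_writers */
	int kcount, ucount;
	const void *owner;
};

static bool model_is_frozen(const struct model_sb *sb)
{
	return sb->kcount + sb->ucount > 0;
}

/* Consulted only when the filesystem is already frozen. */
static bool model_may_freeze(const struct model_sb *sb, unsigned int who)
{
	if (who & M_EXCL) {
		if (who != (M_EXCL | M_HOLDER_KERNEL))
			return false;
		return !model_is_frozen(sb);
	}
	if (sb->owner)		/* already exclusively frozen */
		return false;
	if (who & M_HOLDER_KERNEL)
		return (who & M_MAY_NEST) || sb->kcount == 0;
	if (who & M_HOLDER_USERSPACE)
		return (who & M_MAY_NEST) || sb->ucount == 0;
	return false;
}

static int model_freeze(struct model_sb *sb, unsigned int who,
			const void *owner)
{
	bool frozen;

	assert(sb);
	frozen = model_is_frozen(sb);
	if (frozen && !model_may_freeze(sb, who))
		return -EBUSY;
	if (who & M_HOLDER_KERNEL)
		sb->kcount++;
	else
		sb->ucount++;
	if (!frozen)		/* only the first freezer records an owner */
		sb->owner = owner;
	return 0;
}

/* Simplified thaw: 0 on success, -EINVAL when the request is rejected. */
static int model_thaw(struct model_sb *sb, unsigned int who,
		      const void *owner)
{
	assert(sb);
	if (!model_is_frozen(sb))
		return -EINVAL;
	if (who & M_EXCL) {
		if (!sb->owner || sb->owner != owner)
			return -EINVAL;
	} else if (sb->owner) {
		return -EINVAL;
	}
	if ((who & M_HOLDER_KERNEL) && sb->kcount > 0)
		sb->kcount--;
	else if ((who & M_HOLDER_USERSPACE) && sb->ucount > 0)
		sb->ucount--;
	if (!model_is_frozen(sb))
		sb->owner = NULL;
	return 0;
}
```

In this model, a nesting userspace freeze (as issued by fs_bdev_freeze())
followed by an exclusive kernel freeze leaves the power side with -EBUSY
and no reference, so the later userspace thaw fully unfreezes the
filesystem. In the opposite order, the userspace freeze is the one that
gets -EBUSY.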

								Honza

> ---
>  fs/f2fs/gc.c                |  6 ++--
>  fs/gfs2/super.c             | 20 ++++++------
>  fs/gfs2/sys.c               |  4 +--
>  fs/ioctl.c                  |  8 ++---
>  fs/super.c                  | 76 ++++++++++++++++++++++++++++++++++++---------
>  fs/xfs/scrub/fscounters.c   |  4 +--
>  fs/xfs/xfs_notify_failure.c |  6 ++--
>  include/linux/fs.h          | 13 +++++---
>  8 files changed, 95 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 2b8f9239bede..3e8af62c9e15 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -2271,12 +2271,12 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
>  	if (err)
>  		return err;
>  
> -	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +	err = freeze_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (err)
>  		return err;
>  
>  	if (f2fs_readonly(sbi->sb)) {
> -		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +		err = thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  		if (err)
>  			return err;
>  		return -EROFS;
> @@ -2333,6 +2333,6 @@ int f2fs_resize_fs(struct file *filp, __u64 block_count)
>  out_err:
>  	f2fs_up_write(&sbi->cp_global_sem);
>  	f2fs_up_write(&sbi->gc_lock);
> -	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super(sbi->sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	return err;
>  }
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 44e5658b896c..519943189109 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -674,7 +674,7 @@ static int gfs2_sync_fs(struct super_block *sb, int wait)
>  	return sdp->sd_log_error;
>  }
>  
> -static int gfs2_do_thaw(struct gfs2_sbd *sdp)
> +static int gfs2_do_thaw(struct gfs2_sbd *sdp, enum freeze_holder who, const void *freeze_owner)
>  {
>  	struct super_block *sb = sdp->sd_vfs;
>  	int error;
> @@ -682,7 +682,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
>  	error = gfs2_freeze_lock_shared(sdp);
>  	if (error)
>  		goto fail;
> -	error = thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +	error = thaw_super(sb, who, freeze_owner);
>  	if (!error)
>  		return 0;
>  
> @@ -703,14 +703,14 @@ void gfs2_freeze_func(struct work_struct *work)
>  	if (test_bit(SDF_FROZEN, &sdp->sd_flags))
>  		goto freeze_failed;
>  
> -	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +	error = freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (error)
>  		goto freeze_failed;
>  
>  	gfs2_freeze_unlock(sdp);
>  	set_bit(SDF_FROZEN, &sdp->sd_flags);
>  
> -	error = gfs2_do_thaw(sdp);
> +	error = gfs2_do_thaw(sdp, FREEZE_HOLDER_USERSPACE, NULL);
>  	if (error)
>  		goto out;
>  
> @@ -731,7 +731,8 @@ void gfs2_freeze_func(struct work_struct *work)
>   *
>   */
>  
> -static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner)
>  {
>  	struct gfs2_sbd *sdp = sb->s_fs_info;
>  	int error;
> @@ -744,7 +745,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
>  	}
>  
>  	for (;;) {
> -		error = freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +		error = freeze_super(sb, who, freeze_owner);
>  		if (error) {
>  			fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
>  				error);
> @@ -758,7 +759,7 @@ static int gfs2_freeze_super(struct super_block *sb, enum freeze_holder who)
>  			break;
>  		}
>  
> -		error = gfs2_do_thaw(sdp);
> +		error = gfs2_do_thaw(sdp, who, freeze_owner);
>  		if (error)
>  			goto out;
>  
> @@ -799,7 +800,8 @@ static int gfs2_freeze_fs(struct super_block *sb)
>   *
>   */
>  
> -static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
> +static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who,
> +			   const void *freeze_owner)
>  {
>  	struct gfs2_sbd *sdp = sb->s_fs_info;
>  	int error;
> @@ -814,7 +816,7 @@ static int gfs2_thaw_super(struct super_block *sb, enum freeze_holder who)
>  	atomic_inc(&sb->s_active);
>  	gfs2_freeze_unlock(sdp);
>  
> -	error = gfs2_do_thaw(sdp);
> +	error = gfs2_do_thaw(sdp, who, freeze_owner);
>  
>  	if (!error) {
>  		clear_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
> diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
> index ecc699f8d9fc..748125653d6c 100644
> --- a/fs/gfs2/sys.c
> +++ b/fs/gfs2/sys.c
> @@ -174,10 +174,10 @@ static ssize_t freeze_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
>  
>  	switch (n) {
>  	case 0:
> -		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> +		error = thaw_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
>  		break;
>  	case 1:
> -		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE);
> +		error = freeze_super(sdp->sd_vfs, FREEZE_HOLDER_USERSPACE, NULL);
>  		break;
>  	default:
>  		return -EINVAL;
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index c91fd2b46a77..bedc83fc2f20 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -396,8 +396,8 @@ static int ioctl_fsfreeze(struct file *filp)
>  
>  	/* Freeze */
>  	if (sb->s_op->freeze_super)
> -		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> -	return freeze_super(sb, FREEZE_HOLDER_USERSPACE);
> +		return sb->s_op->freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> +	return freeze_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int ioctl_fsthaw(struct file *filp)
> @@ -409,8 +409,8 @@ static int ioctl_fsthaw(struct file *filp)
>  
>  	/* Thaw */
>  	if (sb->s_op->thaw_super)
> -		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> -	return thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +		return sb->s_op->thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
> +	return thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int ioctl_file_dedupe_range(struct file *file,
> diff --git a/fs/super.c b/fs/super.c
> index 3c4a496d6438..3ddded4360c6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -39,7 +39,8 @@
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>  
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who);
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner);
>  
>  static LIST_HEAD(super_blocks);
>  static DEFINE_SPINLOCK(sb_lock);
> @@ -1148,7 +1149,7 @@ static void do_thaw_all_callback(struct super_block *sb, void *unused)
>  	if (IS_ENABLED(CONFIG_BLOCK))
>  		while (sb->s_bdev && !bdev_thaw(sb->s_bdev))
>  			pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
> -	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super_locked(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  	return;
>  }
>  
> @@ -1195,9 +1196,9 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->freeze_super)
> -		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  	else
> -		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  
>  	deactivate_super(sb);
>  }
> @@ -1217,9 +1218,9 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->thaw_super)
> -		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  	else
> -		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL);
> +		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
>  
>  	deactivate_super(sb);
>  }
> @@ -1522,10 +1523,10 @@ static int fs_bdev_freeze(struct block_device *bdev)
>  
>  	if (sb->s_op->freeze_super)
>  		error = sb->s_op->freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	else
>  		error = freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	if (!error)
>  		error = sync_blockdev(bdev);
>  	deactivate_super(sb);
> @@ -1571,10 +1572,10 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  
>  	if (sb->s_op->thaw_super)
>  		error = sb->s_op->thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	else
>  		error = thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE);
> +				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
>  	deactivate_super(sb);
>  	return error;
>  }
> @@ -1946,7 +1947,7 @@ static int wait_for_partially_frozen(struct super_block *sb)
>  }
>  
>  #define FREEZE_HOLDERS (FREEZE_HOLDER_KERNEL | FREEZE_HOLDER_USERSPACE)
> -#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST)
> +#define FREEZE_FLAGS (FREEZE_HOLDERS | FREEZE_MAY_NEST | FREEZE_EXCL)
>  
>  static inline int freeze_inc(struct super_block *sb, enum freeze_holder who)
>  {
> @@ -1977,6 +1978,21 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
> +	if (who & FREEZE_EXCL) {
> +		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> +			return false;
> +
> +		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +			return false;
> +
> +		return (sb->s_writers.freeze_kcount +
> +			sb->s_writers.freeze_ucount) == 0;
> +	}
> +
> +	/* This filesystem is already exclusively frozen. */
> +	if (sb->s_writers.freeze_owner)
> +		return false;
> +
>  	if (who & FREEZE_HOLDER_KERNEL)
>  		return (who & FREEZE_MAY_NEST) ||
>  		       sb->s_writers.freeze_kcount == 0;
> @@ -1986,10 +2002,30 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  	return false;
>  }
>  
> +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> +				const void *freeze_owner)
> +{
> +	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> +	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> +
> +	if (who & FREEZE_EXCL) {
> +		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> +			return false;
> +		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
> +			return false;
> +		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +			return false;
> +		return sb->s_writers.freeze_owner == freeze_owner;
> +	}
> +
> +	return sb->s_writers.freeze_owner == NULL;
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
>   * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
>   *
>   * Syncs the super to make sure the filesystem is consistent and calls the fs's
>   * freeze_fs.  Subsequent calls to this without first thawing the fs may return
> @@ -2041,7 +2077,7 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>   * Return: If the freeze was successful zero is returned. If the freeze
>   *         failed a negative error code is returned.
>   */
> -int freeze_super(struct super_block *sb, enum freeze_holder who)
> +int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
>  {
>  	int ret;
>  
> @@ -2075,6 +2111,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	if (sb_rdonly(sb)) {
>  		/* Nothing to do really... */
>  		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> +		sb->s_writers.freeze_owner = freeze_owner;
>  		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  		wake_up_var(&sb->s_writers.frozen);
>  		super_unlock_excl(sb);
> @@ -2122,6 +2159,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
>  	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
>  	 */
>  	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
> +	sb->s_writers.freeze_owner = freeze_owner;
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>  	wake_up_var(&sb->s_writers.frozen);
>  	lockdep_sb_freeze_release(sb);
> @@ -2136,13 +2174,17 @@ EXPORT_SYMBOL(freeze_super);
>   * removes that state without releasing the other state or unlocking the
>   * filesystem.
>   */
> -static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
> +static int thaw_super_locked(struct super_block *sb, enum freeze_holder who,
> +			     const void *freeze_owner)
>  {
>  	int error = -EINVAL;
>  
>  	if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
>  		goto out_unlock;
>  
> +	if (!may_unfreeze(sb, who, freeze_owner))
> +		goto out_unlock;
> +
>  	/*
>  	 * All freezers share a single active reference.
>  	 * So just unlock in case there are any left.
> @@ -2152,6 +2194,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  
>  	if (sb_rdonly(sb)) {
>  		sb->s_writers.frozen = SB_UNFROZEN;
> +		sb->s_writers.freeze_owner = NULL;
>  		wake_up_var(&sb->s_writers.frozen);
>  		goto out_deactivate;
>  	}
> @@ -2169,6 +2212,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>  	}
>  
>  	sb->s_writers.frozen = SB_UNFROZEN;
> +	sb->s_writers.freeze_owner = NULL;
>  	wake_up_var(&sb->s_writers.frozen);
>  	sb_freeze_unlock(sb, SB_FREEZE_FS);
>  out_deactivate:
> @@ -2184,6 +2228,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>   * thaw_super -- unlock filesystem
>   * @sb: the super to thaw
>   * @who: context that wants to freeze
> + * @freeze_owner: owner of the freeze
>   *
>   * Unlocks the filesystem and marks it writeable again after freeze_super()
>   * if there are no remaining freezes on the filesystem.
> @@ -2197,13 +2242,14 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
>   * have been frozen through the block layer via multiple block devices.
>   * The filesystem remains frozen until all block devices are unfrozen.
>   */
> -int thaw_super(struct super_block *sb, enum freeze_holder who)
> +int thaw_super(struct super_block *sb, enum freeze_holder who,
> +	       const void *freeze_owner)
>  {
>  	if (!super_lock_excl(sb)) {
>  		WARN_ON_ONCE("Dying superblock while thawing!");
>  		return -EINVAL;
>  	}
> -	return thaw_super_locked(sb, who);
> +	return thaw_super_locked(sb, who, freeze_owner);
>  }
>  EXPORT_SYMBOL(thaw_super);
>  
> diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
> index e629663e460a..9b598c5790ad 100644
> --- a/fs/xfs/scrub/fscounters.c
> +++ b/fs/xfs/scrub/fscounters.c
> @@ -123,7 +123,7 @@ xchk_fsfreeze(
>  {
>  	int			error;
>  
> -	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> +	error = freeze_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
>  	trace_xchk_fsfreeze(sc, error);
>  	return error;
>  }
> @@ -135,7 +135,7 @@ xchk_fsthaw(
>  	int			error;
>  
>  	/* This should always succeed, we have a kernel freeze */
> -	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL);
> +	error = thaw_super(sc->mp->m_super, FREEZE_HOLDER_KERNEL, NULL);
>  	trace_xchk_fsthaw(sc, error);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index ed8d8ed42f0a..3545dc1d953c 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -127,7 +127,7 @@ xfs_dax_notify_failure_freeze(
>  	struct super_block	*sb = mp->m_super;
>  	int			error;
>  
> -	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL, NULL);
>  	if (error)
>  		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>  
> @@ -143,7 +143,7 @@ xfs_dax_notify_failure_thaw(
>  	int			error;
>  
>  	if (kernel_frozen) {
> -		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL, NULL);
>  		if (error)
>  			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>  				error);
> @@ -153,7 +153,7 @@ xfs_dax_notify_failure_thaw(
>  	 * Also thaw userspace call anyway because the device is about to be
>  	 * removed immediately.
>  	 */
> -	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE, NULL);
>  }
>  
>  static int
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1aa578412f1b..b379a46b5576 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1307,6 +1307,7 @@ struct sb_writers {
>  	unsigned short			frozen;		/* Is sb frozen? */
>  	int				freeze_kcount;	/* How many kernel freeze requests? */
>  	int				freeze_ucount;	/* How many userspace freeze requests? */
> +	const void			*freeze_owner;	/* Owner of the freeze */
>  	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
>  };
>  
> @@ -2270,6 +2271,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> + * @FREEZE_EXCL: whether actual freezing must be done by the caller
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> @@ -2283,6 +2285,7 @@ enum freeze_holder {
>  	FREEZE_HOLDER_KERNEL	= (1U << 0),
>  	FREEZE_HOLDER_USERSPACE	= (1U << 1),
>  	FREEZE_MAY_NEST		= (1U << 2),
> +	FREEZE_EXCL		= (1U << 3),
>  };
>  
>  struct super_operations {
> @@ -2296,9 +2299,9 @@ struct super_operations {
>  	void (*evict_inode) (struct inode *);
>  	void (*put_super) (struct super_block *);
>  	int (*sync_fs)(struct super_block *sb, int wait);
> -	int (*freeze_super) (struct super_block *, enum freeze_holder who);
> +	int (*freeze_super) (struct super_block *, enum freeze_holder who, const void *owner);
>  	int (*freeze_fs) (struct super_block *);
> -	int (*thaw_super) (struct super_block *, enum freeze_holder who);
> +	int (*thaw_super) (struct super_block *, enum freeze_holder who, const void *owner);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
> @@ -2706,8 +2709,10 @@ extern int unregister_filesystem(struct file_system_type *);
>  extern int vfs_statfs(const struct path *, struct kstatfs *);
>  extern int user_statfs(const char __user *, struct kstatfs *);
>  extern int fd_statfs(int, struct kstatfs *);
> -int freeze_super(struct super_block *super, enum freeze_holder who);
> -int thaw_super(struct super_block *super, enum freeze_holder who);
> +int freeze_super(struct super_block *super, enum freeze_holder who,
> +		 const void *freeze_owner);
> +int thaw_super(struct super_block *super, enum freeze_holder who,
> +	       const void *freeze_owner);
>  extern __printf(2, 3)
>  int super_setup_bdi_name(struct super_block *sb, char *fmt, ...);
>  extern int super_setup_bdi(struct super_block *sb);
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 2/4] fs: allow all writers to be frozen
  2025-04-02 14:07                     ` [PATCH v2 2/4] fs: allow all writers to be frozen Christian Brauner
  2025-04-02 15:32                       ` Christian Brauner
@ 2025-04-03 14:59                       ` Jan Kara
  1 sibling, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-03 14:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Wed 02-04-25 16:07:32, Christian Brauner wrote:
> During freeze/thaw we need to be able to freeze all writers during
> suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
> file and write to it will not be frozen after we've already frozen the
> filesystem.
> 
> This has some risk of not being able to freeze processes in case a
> process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
> SB_FREEZE_INTERNAL under some other filesystem-specific lock. If the
> filesystem is frozen, a task can block on the frozen filesystem with
> e.g., mmap_sem held. If some other task then blocks on grabbing that
> mmap_sem, hibernation will fail because it is unable to hibernate a task
> holding mmap_sem. This could be fixed by making a range of filesystem
> related locks use freezable sleeping. That's impractical and not
> warranted just for suspend/hibernate. Assume that this is an infrequent
> problem and we've given userspace a way to skip filesystem freezing
> through a sysfs file.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

BTW, I agree about the need to silence lockdep somehow.

								Honza

> ---
>  include/linux/fs.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..1edcba3cd68e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1781,8 +1781,7 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> -	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1, true);
>  }
>  
>  static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw
  2025-04-02 14:07                     ` [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw Christian Brauner
@ 2025-04-03 15:00                       ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-03 15:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Wed 02-04-25 16:07:34, Christian Brauner wrote:
> Sysfs is built on top of kernfs and sysfs provides the power management
> infrastructure to support suspend/hibernate by writing to various files
> in /sys/power/. As filesystems may be automatically frozen during
> suspend/hibernate implementing freeze/thaw support for kernfs
> generically will cause deadlocks as the suspending/hibernation
> initiating task will hold a VFS lock that it will then wait upon to be
> released. If freeze/thaw for kernfs is needed talk to the VFS.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Yeah, good idea. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/kernfs/mount.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 1358c21837f1..d2073bb2b633 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,21 @@ const struct super_operations kernfs_sops = {
>  
>  	.show_options	= kernfs_sop_show_options,
>  	.show_path	= kernfs_sop_show_path,
> +
> +	/*
> +	 * sysfs is built on top of kernfs and sysfs provides the power
> +	 * management infrastructure to support suspend/hibernate by
> +	 * writing to various files in /sys/power/. As filesystems may
> +	 * be automatically frozen during suspend/hibernate implementing
> +	 * freeze/thaw support for kernfs generically will cause
> +	 * deadlocks as the suspending/hibernation initiating task will
> +	 * hold a VFS lock that it will then wait upon to be released.
> +	 * If freeze/thaw for kernfs is needed talk to the VFS.
> +	 */
> +	.freeze_fs	= NULL,
> +	.unfreeze_fs	= NULL,
> +	.freeze_super	= NULL,
> +	.thaw_super	= NULL,
>  };
>  
>  static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 3/4] power: freeze filesystems during suspend/resume
  2025-04-02 14:07                     ` [PATCH v2 3/4] power: freeze filesystems during suspend/resume Christian Brauner
@ 2025-04-03 16:29                       ` Jan Kara
  0 siblings, 0 replies; 120+ messages in thread
From: Jan Kara @ 2025-04-03 16:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Wed 02-04-25 16:07:33, Christian Brauner wrote:
> Now all the pieces are in place to actually allow the power subsystem
> to freeze/thaw filesystems during suspend/resume. Filesystems are only
> frozen and thawed if the power subsystem does actually own the freeze.
> 
> We could bubble up errors and fail suspend/resume if the error isn't
> EBUSY (aka it's already frozen) but I don't think that this is worth it.
> Filesystem freezing during suspend/resume is best-effort. If the user
> has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
> reason then we simply skip them.
> 
> What we have now is already a big improvement and let's see how we fare
> with it before making our lives even harder (and uglier) than we have
> to.
> 
> We add a new sysfs knob /sys/power/freeze_filesystems that will allow
> userspace to freeze filesystems during suspend/hibernate. For now it
> defaults to off. The thaw logic doesn't require checking whether
> freezing is enabled because the power subsystem exclusively owns frozen
> filesystems for the duration of suspend/hibernate and is able to skip
> filesystems it doesn't need to freeze.
> 
> Also it is technically possible that filesystem_freeze_enabled is true
> and power freezes the filesystems but before freezing all processes
> another process disables filesystem_freeze_enabled. If power were to
> place the filesystems_thaw() call under filesystem_freeze_enabled it
> would fail to thaw the filesystems it froze. The exclusive holder
> mechanism makes it possible to iterate through the list without any
> concern, making sure that no filesystems are left frozen.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Looks good modulo the nesting issue I've mentioned in my comments to patch
1. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c               | 14 ++++++++++----
>  kernel/power/hibernate.c | 16 +++++++++++++++-
>  kernel/power/main.c      | 31 +++++++++++++++++++++++++++++++
>  kernel/power/power.h     |  4 ++++
>  kernel/power/suspend.c   |  7 +++++++
>  5 files changed, 67 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 3ddded4360c6..b4bdbc509dba 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1187,6 +1187,8 @@ static inline bool get_active_super(struct super_block *sb)
>  	return active;
>  }
>  
> +static const char *filesystems_freeze_ptr = "filesystems_freeze";
> +
>  static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  {
>  	if (!sb->s_op->freeze_fs && !sb->s_op->freeze_super)
> @@ -1196,9 +1198,11 @@ static void filesystems_freeze_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->freeze_super)
> -		sb->s_op->freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		sb->s_op->freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +				       filesystems_freeze_ptr);
>  	else
> -		freeze_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		freeze_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +			     filesystems_freeze_ptr);
>  
>  	deactivate_super(sb);
>  }
> @@ -1218,9 +1222,11 @@ static void filesystems_thaw_callback(struct super_block *sb, void *unused)
>  		return;
>  
>  	if (sb->s_op->thaw_super)
> -		sb->s_op->thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		sb->s_op->thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +				     filesystems_freeze_ptr);
>  	else
> -		thaw_super(sb, FREEZE_MAY_NEST | FREEZE_HOLDER_KERNEL, NULL);
> +		thaw_super(sb, FREEZE_EXCL | FREEZE_HOLDER_KERNEL,
> +			   filesystems_freeze_ptr);
>  
>  	deactivate_super(sb);
>  }
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 50ec26ea696b..37d733945c59 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -777,6 +777,8 @@ int hibernate(void)
>  		goto Restore;
>  
>  	ksys_sync_helper();
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  
>  	error = freeze_processes();
>  	if (error)
> @@ -845,6 +847,7 @@ int hibernate(void)
>  	/* Don't bother checking whether freezer_test_done is true */
>  	freezer_test_done = false;
>   Exit:
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>   Restore:
>  	pm_restore_console();
> @@ -881,6 +884,9 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  	if (error)
>  		goto restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
> +
>  	error = freeze_processes();
>  	if (error)
>  		goto exit;
> @@ -940,6 +946,7 @@ int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  	thaw_processes();
>  
>  exit:
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_HIBERNATION);
>  
>  restore:
> @@ -1028,19 +1035,26 @@ static int software_resume(void)
>  	if (error)
>  		goto Restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
> +
>  	pm_pr_dbg("Preparing processes for hibernation restore.\n");
>  	error = freeze_processes();
> -	if (error)
> +	if (error) {
> +		filesystems_thaw();
>  		goto Close_Finish;
> +	}
>  
>  	error = freeze_kernel_threads();
>  	if (error) {
>  		thaw_processes();
> +		filesystems_thaw();
>  		goto Close_Finish;
>  	}
>  
>  	error = load_image_and_restore();
>  	thaw_processes();
> +	filesystems_thaw();
>   Finish:
>  	pm_notifier_call_chain(PM_POST_RESTORE);
>   Restore:
> diff --git a/kernel/power/main.c b/kernel/power/main.c
> index 6254814d4817..0b0e76324c43 100644
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -962,6 +962,34 @@ power_attr(pm_freeze_timeout);
>  
>  #endif	/* CONFIG_FREEZER*/
>  
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +bool filesystem_freeze_enabled = false;
> +
> +static ssize_t freeze_filesystems_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", filesystem_freeze_enabled);
> +}
> +
> +static ssize_t freeze_filesystems_store(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					const char *buf, size_t n)
> +{
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 10, &val))
> +		return -EINVAL;
> +
> +	if (val > 1)
> +		return -EINVAL;
> +
> +	filesystem_freeze_enabled = !!val;
> +	return n;
> +}
> +
> +power_attr(freeze_filesystems);
> +#endif /* CONFIG_SUSPEND || CONFIG_HIBERNATION */
> +
>  static struct attribute * g[] = {
>  	&state_attr.attr,
>  #ifdef CONFIG_PM_TRACE
> @@ -991,6 +1019,9 @@ static struct attribute * g[] = {
>  #endif
>  #ifdef CONFIG_FREEZER
>  	&pm_freeze_timeout_attr.attr,
> +#endif
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +	&freeze_filesystems_attr.attr,
>  #endif
>  	NULL,
>  };
> diff --git a/kernel/power/power.h b/kernel/power/power.h
> index c352dea2f67b..2eb81662b8fa 100644
> --- a/kernel/power/power.h
> +++ b/kernel/power/power.h
> @@ -18,6 +18,10 @@ struct swsusp_info {
>  	unsigned long		size;
>  } __aligned(PAGE_SIZE);
>  
> +#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
> +extern bool filesystem_freeze_enabled;
> +#endif
> +
>  #ifdef CONFIG_HIBERNATION
>  /* kernel/power/snapshot.c */
>  extern void __init hibernate_reserved_size_init(void);
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 8eaec4ab121d..76b141b9aac0 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -30,6 +30,7 @@
>  #include <trace/events/power.h>
>  #include <linux/compiler.h>
>  #include <linux/moduleparam.h>
> +#include <linux/fs.h>
>  
>  #include "power.h"
>  
> @@ -374,6 +375,8 @@ static int suspend_prepare(suspend_state_t state)
>  	if (error)
>  		goto Restore;
>  
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  	trace_suspend_resume(TPS("freeze_processes"), 0, true);
>  	error = suspend_freeze_processes();
>  	trace_suspend_resume(TPS("freeze_processes"), 0, false);
> @@ -550,6 +553,7 @@ int suspend_devices_and_enter(suspend_state_t state)
>  static void suspend_finish(void)
>  {
>  	suspend_thaw_processes();
> +	filesystems_thaw();
>  	pm_notifier_call_chain(PM_POST_SUSPEND);
>  	pm_restore_console();
>  }
> @@ -588,6 +592,8 @@ static int enter_state(suspend_state_t state)
>  		ksys_sync_helper();
>  		trace_suspend_resume(TPS("sync_filesystems"), 0, false);
>  	}
> +	if (filesystem_freeze_enabled)
> +		filesystems_freeze();
>  
>  	pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
>  	pm_suspend_clear_flags();
> @@ -609,6 +615,7 @@ static int enter_state(suspend_state_t state)
>  	pm_pr_dbg("Finishing wakeup.\n");
>  	suspend_finish();
>   Unlock:
> +	filesystems_thaw();
>  	mutex_unlock(&system_transition_mutex);
>  	return error;
>  }
> 
> -- 
> 2.47.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 1/4] fs: add owner of freeze/thaw
  2025-04-03 14:56                       ` Jan Kara
@ 2025-04-03 19:33                         ` Christian Brauner
  2025-04-04 10:24                         ` [PATCH] fs: allow nesting with FREEZE_EXCL Christian Brauner
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-03 19:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Thu, Apr 03, 2025 at 04:56:57PM +0200, Jan Kara wrote:
> On Wed 02-04-25 16:07:31, Christian Brauner wrote:
> > For some kernel subsystems it is paramount that they are guaranteed that
> > they are the owner of the freeze to avoid any risk of deadlocks. This is
> > the case for the power subsystem. Enable it to recognize whether it did
> > actually freeze the filesystem.
> > 
> > > If userspace has 10 filesystems and suspend/hibernate manages to freeze 5
> > and then fails on the 6th for whatever odd reason (current or future)
> > then power needs to undo the freeze of the first 5 filesystems. It can't
> > just walk the list again because while it's unlikely that a new
> > filesystem got added in the meantime it still cannot tell which
> > filesystems the power subsystem actually managed to get a freeze
> > reference count on that needs to be dropped during thaw.
> > 
> > There's various ways out of this ugliness. For example, record the
> > filesystems the power subsystem managed to freeze on a temporary list in
> > the callbacks and then walk that list backwards during thaw to undo the
> > freezing or make sure that the power subsystem just actually exclusively
> > freezes things it can freeze and marking such filesystems as being owned
> > by power for the duration of the suspend or resume cycle. I opted for
> > the latter as that seemed the clean thing to do even if it means more
> > code changes.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> I have realized a slight catch with this approach that if hibernation races
> with filesystem freezing (e.g. DM reconfiguration), then hibernation need
> not freeze a filesystem because it's already frozen but userspace may thaw
> the filesystem before hibernation actually happens (relatively harmless).
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY (rather unexpected). So somehow tracking which
> fs was frozen by suspend while properly nesting with other freeze users may
> be actually a better approach (maybe just a sb flag even though it's
> somewhat hacky?).

The approach that I originally had was to add FREEZE_POWER which adds a
simple boolean into the sb_writers instead of a holder and then this
simply nests with the rest. I'll try to post that diff tomorrow.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH] fs: allow nesting with FREEZE_EXCL
  2025-04-03 14:56                       ` Jan Kara
  2025-04-03 19:33                         ` Christian Brauner
@ 2025-04-04 10:24                         ` Christian Brauner
  2025-04-07  9:08                           ` Christoph Hellwig
  2025-05-07 11:18                           ` Jan Kara
  1 sibling, 2 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-04 10:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Ard Biesheuvel, linux-efi,
	linux-kernel, James Bottomley, mcgrof, hch, david, rafael, djwong,
	pavel, peterz, mingo, will, boqun.feng

If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen but userspace may thaw the filesystem before hibernation actually
happens.

If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.

So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/super.c         | 71 ++++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |  2 +-
 2 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b4bdbc509dba..e2fee655fbed 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
 	return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
 }
 
-static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
+static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
+			      const void *freeze_owner)
 {
+	lockdep_assert_held(&sb->s_umount);
+
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
 	if (who & FREEZE_EXCL) {
 		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
 			return false;
-
-		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
 			return false;
-
-		return (sb->s_writers.freeze_kcount +
-			sb->s_writers.freeze_ucount) == 0;
+		if (WARN_ON_ONCE(!freeze_owner))
+			return false;
+		/* This freeze already has a specific owner. */
+		if (sb->s_writers.freeze_owner)
+			return false;
+		/*
+		 * This is already frozen multiple times so we're just
+		 * going to take a reference count and mark it as
+		 * belonging to us.
+		 */
+		if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
+			sb->s_writers.freeze_owner = freeze_owner;
+		return true;
 	}
 
-	/* This filesystem is already exclusively frozen. */
-	if (sb->s_writers.freeze_owner)
-		return false;
-
 	if (who & FREEZE_HOLDER_KERNEL)
 		return (who & FREEZE_MAY_NEST) ||
 		       sb->s_writers.freeze_kcount == 0;
@@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
 static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
 				const void *freeze_owner)
 {
+	lockdep_assert_held(&sb->s_umount);
+
 	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
 	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
 
 	if (who & FREEZE_EXCL) {
-		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
-			return false;
 		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
 			return false;
-		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
+		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
+			return false;
+		if (WARN_ON_ONCE(!freeze_owner))
+			return false;
+		if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
 			return false;
-		return sb->s_writers.freeze_owner == freeze_owner;
+		/* This isn't exclusively frozen. */
+		if (!sb->s_writers.freeze_owner)
+			return false;
+		/* This isn't exclusively frozen by us. */
+		if (sb->s_writers.freeze_owner != freeze_owner)
+			return false;
+		/*
+		 * This is still frozen multiple times so we're just
+		 * going to drop our reference count and undo our
+		 * exclusive freeze.
+		 */
+		if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
+			sb->s_writers.freeze_owner = NULL;
+		return true;
+	}
+
+	if (who & FREEZE_HOLDER_KERNEL) {
+		/*
+		 * Someone's trying to steal the reference belonging to
+		 * @sb->s_writers.freeze_owner.
+		 */
+		if (sb->s_writers.freeze_kcount == 1 &&
+		    sb->s_writers.freeze_owner)
+			return false;
+		return sb->s_writers.freeze_kcount > 0;
 	}
 
-	return sb->s_writers.freeze_owner == NULL;
+	if (who & FREEZE_HOLDER_USERSPACE)
+		return sb->s_writers.freeze_ucount > 0;
+
+	return false;
 }
 
 /**
@@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
 
 retry:
 	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
-		if (may_freeze(sb, who))
+		if (may_freeze(sb, who, freeze_owner))
 			ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
 		else
 			ret = -EBUSY;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edcba3cd68e..7a3f821d2723 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
  * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
  * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
  * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
- * @FREEZE_EXCL: whether actual freezing must be done by the caller
+ * @FREEZE_EXCL: a freeze that can only be undone by the owner
  *
  * Indicate who the owner of the freeze or thaw request is and whether
  * the freeze needs to be exclusive or can nest.

---
base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
change-id: 20250404-work-freeze-5eacb515f044
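
For anyone trying to follow the intended ownership rules without reading
the whole diff, here is a small userspace model of them. It is only a
sketch under stated assumptions: the model_* names, the flag values and
the error codes are invented for illustration and are not taken from
fs/super.c, and the model folds the may_freeze()/may_unfreeze() checks
and the reference counting into two functions with no locking at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names and values, chosen for this model only. */
enum model_flags {
	MODEL_HOLDER_KERNEL    = 1u << 0,
	MODEL_HOLDER_USERSPACE = 1u << 1,
	MODEL_MAY_NEST         = 1u << 2,
	MODEL_EXCL             = 1u << 3,
};

/* Stand-in for the freeze bookkeeping kept in sb->s_writers. */
struct model_sb {
	int kcount;		/* freezes taken by kernel holders */
	int ucount;		/* freezes taken by userspace holders */
	const void *owner;	/* cookie of the exclusive freezer, if any */
};

static bool model_frozen(const struct model_sb *sb)
{
	return sb->kcount + sb->ucount > 0;
}

static int model_freeze(struct model_sb *sb, unsigned int who, const void *owner)
{
	if (who & MODEL_EXCL) {
		/* An exclusive freeze is kernel-only and needs a cookie. */
		if (!(who & MODEL_HOLDER_KERNEL) || !owner)
			return -EINVAL;
		/* Only one exclusive owner at a time. */
		if (sb->owner)
			return -EBUSY;
		/* Nest with existing freezers and record who we are. */
		sb->owner = owner;
		sb->kcount++;
		return 0;
	}
	if (who & MODEL_HOLDER_KERNEL) {
		if (sb->kcount > 0 && !(who & MODEL_MAY_NEST))
			return -EBUSY;
		sb->kcount++;
		return 0;
	}
	if (who & MODEL_HOLDER_USERSPACE) {
		if (sb->ucount > 0 && !(who & MODEL_MAY_NEST))
			return -EBUSY;
		sb->ucount++;
		return 0;
	}
	return -EINVAL;
}

static int model_thaw(struct model_sb *sb, unsigned int who, const void *owner)
{
	if (who & MODEL_EXCL) {
		if (!(who & MODEL_HOLDER_KERNEL) || !owner)
			return -EINVAL;
		/* Not exclusively frozen, or not exclusively frozen by us. */
		if (sb->owner != owner)
			return -EINVAL;
		/* The owner always holds one kernel reference. */
		assert(sb->kcount > 0);
		sb->owner = NULL;
		sb->kcount--;
		return 0;
	}
	if (who & MODEL_HOLDER_KERNEL) {
		/* One kernel reference belongs to the exclusive owner. */
		int reserved = sb->owner ? 1 : 0;

		if (sb->kcount - reserved <= 0)
			return -EINVAL;
		sb->kcount--;
		return 0;
	}
	if (who & MODEL_HOLDER_USERSPACE) {
		if (sb->ucount <= 0)
			return -EINVAL;
		sb->ucount--;
		return 0;
	}
	return -EINVAL;
}
```

In this model a userspace freeze followed by an exclusive kernel freeze
leaves the filesystem frozen after userspace thaws, a plain kernel thaw
cannot take the owner's reference, and only a thaw carrying the same
cookie releases the exclusive freeze.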


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH] fs: allow nesting with FREEZE_EXCL
  2025-04-04 10:24                         ` [PATCH] fs: allow nesting with FREEZE_EXCL Christian Brauner
@ 2025-04-07  9:08                           ` Christoph Hellwig
  2025-05-07 11:18                           ` Jan Kara
  1 sibling, 0 replies; 120+ messages in thread
From: Christoph Hellwig @ 2025-04-07  9:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Fri, Apr 04, 2025 at 12:24:09PM +0200, Christian Brauner wrote:
> If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> then hibernation need not freeze a filesystem because it's already
> frozen but userspace may thaw the filesystem before hibernation actually
> happens.
> 
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY.
> 
> So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> cannot be undone by any of the other concurrent freezers.

What is FREEZE_EXCL?  I can't find it anywhere including in linux-next.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-02  7:46               ` Christian Brauner
@ 2025-04-08 15:43                 ` James Bottomley
  2025-04-08 17:09                   ` Luis Chamberlain
  2025-04-09  7:17                   ` Christian Brauner
  0 siblings, 2 replies; 120+ messages in thread
From: James Bottomley @ 2025-04-08 15:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, mcgrof, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > The whole shebang can also be found at:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > 
> > > I know nothing about power or hibernation. I've tested it as best
> > > as I could. Works for me (TM).
> > 
> > I'm testing the latest you have in work.freeze and it doesn't
> > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > filesystems during suspend/resume") doesn't set
> > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > this check 
> 
> I haven't pushed the new version there. Sorry about that. I only have
> it locally.
> 
> > 
> > > +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> > > +                               const void *freeze_owner)
> > > +{
> > > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > +
> > > +       if (who & FREEZE_EXCL) {
> > > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> > > +                       return false;
> > 
> > 
> > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > resume from hibernate.  Setting it to __builtin_return_address(0)
> > in filesystems_freeze() makes everything work as expected, so
> > that's what I'm testing now.
> 
> +1
> 
> I'll send the final version out in a bit.

I've now done some extensive testing on loop nested filesystems with
fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
Hibernate/Resume has currently worked on these without a hitch (and the
fio load burps a bit but then starts running at full speed within a few
seconds). What I'm doing is a single round of hibernate/resume followed
by a reboot. I'm relying on the fschecks to detect any filesystem
corruption. I've also tried doing a couple of fresh starts of the
hibernated image to check that we did correctly freeze the filesystems.

The problems I've noticed are:

   1. I'm using 9p to push host directories through and that
      completely hangs after a resume. This is expected because the
      virtio server is out of sync, but it does indicate a need to
      address Jeff's question of what we should be doing for network
      filesystems (and is also the reason I have to reboot after
      resuming).
   2. Top doesn't show any CPU activity after resume even though fio is
      definitely running.  This seems to be a suspend issue and
      unrelated to filesystems, but I'll continue investigating.

Regards,

James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-08 15:43                 ` James Bottomley
@ 2025-04-08 17:09                   ` Luis Chamberlain
  2025-04-08 17:20                     ` Luis Chamberlain
  2025-04-08 17:24                     ` James Bottomley
  2025-04-09  7:17                   ` Christian Brauner
  1 sibling, 2 replies; 120+ messages in thread
From: Luis Chamberlain @ 2025-04-08 17:09 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, linux-fsdevel, jack, rafael, Ard Biesheuvel,
	linux-efi, linux-kernel, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > > 
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > > 
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check 
> > 
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> > 
> > > 
> > > > +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> > > > +                               const void *freeze_owner)
> > > > +{
> > > > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > > +
> > > > +       if (who & FREEZE_EXCL) {
> > > > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> > > > +                       return false;
> > > 
> > > 
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate.  Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> > 
> > +1
> > 
> > I'll send the final version out in a bit.
> 
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
> 
> The problems I've noticed are:
> 
>    1. I'm using 9p to push host directories through and that
>       completely hangs after a resume. This is expected because the
>       virtio server is out of sync, but it does indicate a need to
>       address Jeff's question of what we should be doing for network
>       filesystems (and is also the reason I have to reboot after
>       resuming).
>    2. Top doesn't show any CPU activity after resume even though fio is
>       definitely running.  This seems to be a suspend issue and
>       unrelated to filesystems, but I'll continue investigating.

To be clear, on the fio run -- are you running fio *while* doing the
suspend/resume cycle on XFS? That used to stall / break suspend
resume. We may want to test dd against a drive too, that will use
the block device cache, and I forget if we have a freeze/thaw for it.

  Luis

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-08 17:09                   ` Luis Chamberlain
@ 2025-04-08 17:20                     ` Luis Chamberlain
  2025-04-08 17:26                       ` James Bottomley
  2025-04-08 17:24                     ` James Bottomley
  1 sibling, 1 reply; 120+ messages in thread
From: Luis Chamberlain @ 2025-04-08 17:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, linux-fsdevel, jack, rafael, Ard Biesheuvel,
	linux-efi, linux-kernel, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

And in case it's useful, to test this on a VM you'll need on libvirt:

</os>
<pm>
<suspend-to-mem enabled='yes'/>
<suspend-to-disk enabled='yes'/>
</pm>

The litmus test which used to stall without ever returning was:

while true; do
dd if=/dev/zero of=/path/to/xfs/foo bs=1M count=1024 &> /dev/null
done

To suspend you can use one of these two:

echo mem > /sys/power/state
systemctl suspend

Verify you can resume from suspend:
virsh qemu-monitor-command guest_foo 'query-current-machine'
{"return":{"wakeup-suspend-support":true},"id":"libvirt-415"}

To resume the guest:

virsh dompmwakeup guest_foo

  Luis

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-08 17:09                   ` Luis Chamberlain
  2025-04-08 17:20                     ` Luis Chamberlain
@ 2025-04-08 17:24                     ` James Bottomley
  1 sibling, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-04-08 17:24 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christian Brauner, linux-fsdevel, jack, rafael, Ard Biesheuvel,
	linux-efi, linux-kernel, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, 2025-04-08 at 10:09 -0700, Luis Chamberlain wrote:
> On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
[...]
> > I've now done some extensive testing on loop nested filesystems
> > with fio load on the upper. I've tried xfs on ext4 and ext4 on
> > ext4. Hibernate/Resume has currently worked on these without a
> > hitch (and the fio load burps a bit but then starts running at full
> > speed within a few seconds). What I'm doing is a single round of
> > hibernate/resume followed by a reboot. I'm relying on the fschecks
> > to detect any filesystem corruption. I've also tried doing a couple
> > of fresh starts of the hibernated image to check that we did
> > correctly freeze the filesystems.
> > 
> > The problems I've noticed are:
> > 
> >    1. I'm using 9p to push host directories through and that
> >       completely hangs after a resume. This is expected because the
> >       virtio server is out of sync, but it does indicate a need to
> >       address Jeff's question of what we should be doing for
> >       network filesystems (and is also the reason I have to reboot
> >       after resuming).
> >    2. Top doesn't show any CPU activity after resume even though
> >       fio is definitely running.  This seems to be a suspend issue
> >       and unrelated to filesystems, but I'll continue investigating.
> 
> To be clear, on the fio run -- are you running fio *while*
> suspend/resume cycle on XFS?

Yes, that's why I said "the fio load burps a bit" (as in after resume)
"but then starts running full speed after a few seconds".

>  That used to stall / break suspend resume. We may want to test dd
> against a drive too, that will use the block device cache, and I
> forget if we have a freeze/thaw for it.

fio is running a read/write test, but I think all my caches are write
through for safety (although I have verified that the device cache
flush is sent as the last sequence of hibernate).

Regards,

James



* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-08 17:20                     ` Luis Chamberlain
@ 2025-04-08 17:26                       ` James Bottomley
  0 siblings, 0 replies; 120+ messages in thread
From: James Bottomley @ 2025-04-08 17:26 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christian Brauner, linux-fsdevel, jack, rafael, Ard Biesheuvel,
	linux-efi, linux-kernel, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, 2025-04-08 at 10:20 -0700, Luis Chamberlain wrote:
> And in case it's useful, to test this on a VM you'll need on libvirt:

Just so we're clear, I'm only doing hibernate, not suspend tests.  I
figure they should be a bit harsher, but you never know.  The reason is
I set up my test rig for efivarfs, which only has a testable problem on
hibernate.

Regards,

James



* Re: [PATCH 0/6] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-08 15:43                 ` James Bottomley
  2025-04-08 17:09                   ` Luis Chamberlain
@ 2025-04-09  7:17                   ` Christian Brauner
  1 sibling, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-04-09  7:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, jack, rafael, Ard Biesheuvel, linux-efi,
	linux-kernel, mcgrof, hch, david, djwong, pavel, peterz, mingo,
	will, boqun.feng

On Tue, Apr 08, 2025 at 11:43:46AM -0400, James Bottomley wrote:
> On Wed, 2025-04-02 at 09:46 +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 01:02:07PM -0400, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 02:32 +0200, Christian Brauner wrote:
> > > > The whole shebang can also be found at:
> > > > https://web.git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.freeze
> > > > 
> > > > I know nothing about power or hibernation. I've tested it as best
> > > > as I could. Works for me (TM).
> > > 
> > > I'm testing the latest you have in work.freeze and it doesn't
> > > currently work for me.  Patch 7b315c39b67d ("power: freeze
> > > filesystems during suspend/resume") doesn't set
> > > filesystems_freeze_ptr so it ends up being NULL and tripping over
> > > this check 
> > 
> > I haven't pushed the new version there. Sorry about that. I only have
> > it locally.
> > 
> > > 
> > > > +static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
> > > > +                               const void *freeze_owner)
> > > > +{
> > > > +       WARN_ON_ONCE((who & ~FREEZE_FLAGS));
> > > > +       WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
> > > > +
> > > > +       if (who & FREEZE_EXCL) {
> > > > +               if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> > > > +                       return false;
> > > 
> > > 
> > > in f15a9ae05a71 ("fs: add owner of freeze/thaw") and failing to
> > > resume from hibernate.  Setting it to __builtin_return_address(0)
> > > in filesystems_freeze() makes everything work as expected, so
> > > that's what I'm testing now.
> > 
> > +1
> > 
> > I'll send the final version out in a bit.
> 
> I've now done some extensive testing on loop nested filesystems with
> fio load on the upper. I've tried xfs on ext4 and ext4 on ext4.
> Hibernate/Resume has currently worked on these without a hitch (and the
> fio load burps a bit but then starts running at full speed within a few
> seconds). What I'm doing is a single round of hibernate/resume followed
> by a reboot. I'm relying on the fschecks to detect any filesystem
> corruption. I've also tried doing a couple of fresh starts of the
> hibernated image to check that we did correctly freeze the filesystems.
> 
> The problems I've noticed are:
> 
>    1. I'm using 9p to push host directories through and that
>       completely hangs after a resume. This is expected because the
>       virtio server is out of sync, but it does indicate a need to
>       address Jeff's question of what we should be doing for network
>       filesystems (and is also the reason I have to reboot after
>       resuming).

No network filesystem supports freeze/thaw so they're not frozen/thawed
during hibernation.

virtiofs doesn't support filesystem freezing either and it warns about
virtio_driver based freezing not being implemented.


* Re: [PATCH] fs: allow nesting with FREEZE_EXCL
  2025-04-04 10:24                         ` [PATCH] fs: allow nesting with FREEZE_EXCL Christian Brauner
  2025-04-07  9:08                           ` Christoph Hellwig
@ 2025-05-07 11:18                           ` Jan Kara
  2025-05-09 10:38                             ` Christian Brauner
  1 sibling, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-05-07 11:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Fri 04-04-25 12:24:09, Christian Brauner wrote:
> If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> then hibernation need not freeze a filesystem because it's already
> frozen but userspace may thaw the filesystem before hibernation actually
> happens.
> 
> If the race happens the other way around, DM reconfiguration may
> unexpectedly fail with EBUSY.
> 
> So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> cannot be undone by any of the other concurrent freezers.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

This has fallen through the cracks in my inbox but the patch now looks good
to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not
have strange intermediate state in the series?

								Honza

> ---
>  fs/super.c         | 71 ++++++++++++++++++++++++++++++++++++++++++------------
>  include/linux/fs.h |  2 +-
>  2 files changed, 56 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b4bdbc509dba..e2fee655fbed 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1979,26 +1979,34 @@ static inline int freeze_dec(struct super_block *sb, enum freeze_holder who)
>  	return sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount;
>  }
>  
> -static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
> +static inline bool may_freeze(struct super_block *sb, enum freeze_holder who,
> +			      const void *freeze_owner)
>  {
> +	lockdep_assert_held(&sb->s_umount);
> +
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
>  	if (who & FREEZE_EXCL) {
>  		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
>  			return false;
> -
> -		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
>  			return false;
> -
> -		return (sb->s_writers.freeze_kcount +
> -			sb->s_writers.freeze_ucount) == 0;
> +		if (WARN_ON_ONCE(!freeze_owner))
> +			return false;
> +		/* This freeze already has a specific owner. */
> +		if (sb->s_writers.freeze_owner)
> +			return false;
> +		/*
> +		 * This is already frozen multiple times so we're just
> +		 * going to take a reference count and mark it as
> +		 * belonging to us.
> +		 */
> +		if (sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount)
> +			sb->s_writers.freeze_owner = freeze_owner;
> +		return true;
>  	}
>  
> -	/* This filesystem is already exclusively frozen. */
> -	if (sb->s_writers.freeze_owner)
> -		return false;
> -
>  	if (who & FREEZE_HOLDER_KERNEL)
>  		return (who & FREEZE_MAY_NEST) ||
>  		       sb->s_writers.freeze_kcount == 0;
> @@ -2011,20 +2019,51 @@ static inline bool may_freeze(struct super_block *sb, enum freeze_holder who)
>  static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
>  				const void *freeze_owner)
>  {
> +	lockdep_assert_held(&sb->s_umount);
> +
>  	WARN_ON_ONCE((who & ~FREEZE_FLAGS));
>  	WARN_ON_ONCE(hweight32(who & FREEZE_HOLDERS) > 1);
>  
>  	if (who & FREEZE_EXCL) {
> -		if (WARN_ON_ONCE(sb->s_writers.freeze_owner == NULL))
> -			return false;
>  		if (WARN_ON_ONCE(!(who & FREEZE_HOLDER_KERNEL)))
>  			return false;
> -		if (who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL))
> +		if (WARN_ON_ONCE(who & ~(FREEZE_EXCL | FREEZE_HOLDER_KERNEL)))
> +			return false;
> +		if (WARN_ON_ONCE(!freeze_owner))
> +			return false;
> +		if (WARN_ON_ONCE(sb->s_writers.freeze_kcount == 0))
>  			return false;
> -		return sb->s_writers.freeze_owner == freeze_owner;
> +		/* This isn't exclusively frozen. */
> +		if (!sb->s_writers.freeze_owner)
> +			return false;
> +		/* This isn't exclusively frozen by us. */
> +		if (sb->s_writers.freeze_owner != freeze_owner)
> +			return false;
> +		/*
> +		 * This is still frozen multiple times so we're just
> +		 * going to drop our reference count and undo our
> +		 * exclusive freeze.
> +		 */
> +		if ((sb->s_writers.freeze_kcount + sb->s_writers.freeze_ucount) > 1)
> +			sb->s_writers.freeze_owner = NULL;
> +		return true;
> +	}
> +
> +	if (who & FREEZE_HOLDER_KERNEL) {
> +		/*
> +		 * Someone's trying to steal the reference belonging to
> +		 * @sb->s_writers.freeze_owner.
> +		 */
> +		if (sb->s_writers.freeze_kcount == 1 &&
> +		    sb->s_writers.freeze_owner)
> +			return false;
> +		return sb->s_writers.freeze_kcount > 0;
>  	}
>  
> -	return sb->s_writers.freeze_owner == NULL;
> +	if (who & FREEZE_HOLDER_USERSPACE)
> +		return sb->s_writers.freeze_ucount > 0;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2095,7 +2134,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
>  
>  retry:
>  	if (sb->s_writers.frozen == SB_FREEZE_COMPLETE) {
> -		if (may_freeze(sb, who))
> +		if (may_freeze(sb, who, freeze_owner))
>  			ret = !!WARN_ON_ONCE(freeze_inc(sb, who) == 1);
>  		else
>  			ret = -EBUSY;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1edcba3cd68e..7a3f821d2723 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2270,7 +2270,7 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>   * @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
>   * @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
>   * @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
> - * @FREEZE_EXCL: whether actual freezing must be done by the caller
> + * @FREEZE_EXCL: a freeze that can only be undone by the owner
>   *
>   * Indicate who the owner of the freeze or thaw request is and whether
>   * the freeze needs to be exclusive or can nest.
> 
> ---
> base-commit: a83fe97e0d53f7d2b0fc62fd9a322a963cb30306
> change-id: 20250404-work-freeze-5eacb515f044
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
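
To make the nesting rules in the patch above easier to follow, here is a minimal userspace sketch of the counting logic. It is not the kernel code: struct model_sb, the MODEL_* flags, model_freeze() and model_thaw() are hypothetical stand-ins, the flag values are illustrative, and the s_umount locking and WARN_ON_ONCE checks are left out. The userspace branch of model_may_freeze() is not visible in the quoted hunk and is assumed to mirror the kernel-holder branch. The model also routes the very first freeze through the same check and clears the owner on the final thaw, both of which are modelling assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the FREEZE_* flags; values are illustrative. */
enum model_holder {
	MODEL_HOLDER_KERNEL    = 1 << 0,
	MODEL_HOLDER_USERSPACE = 1 << 1,
	MODEL_MAY_NEST         = 1 << 2,
	MODEL_EXCL             = 1 << 3,
};

/* Models only the freeze counters and the exclusive owner cookie. */
struct model_sb {
	int kcount;		/* kernel-held freeze references */
	int ucount;		/* userspace-held freeze references */
	const void *owner;	/* non-NULL while an exclusive freeze is held */
};

bool model_may_freeze(struct model_sb *sb, unsigned who, const void *owner)
{
	if (who & MODEL_EXCL) {
		/* Exclusive freezes are kernel-only and need an owner cookie. */
		if (who != (MODEL_EXCL | MODEL_HOLDER_KERNEL) || !owner)
			return false;
		/* Only one exclusive owner at a time. */
		if (sb->owner)
			return false;
		/* Nest on top of any existing holders and claim ownership. */
		sb->owner = owner;
		return true;
	}
	if (who & MODEL_HOLDER_KERNEL)
		return (who & MODEL_MAY_NEST) || sb->kcount == 0;
	if (who & MODEL_HOLDER_USERSPACE)	/* assumed, not in the hunk */
		return (who & MODEL_MAY_NEST) || sb->ucount == 0;
	return false;
}

bool model_may_unfreeze(struct model_sb *sb, unsigned who, const void *owner)
{
	if (who & MODEL_EXCL) {
		if (who != (MODEL_EXCL | MODEL_HOLDER_KERNEL) || !owner)
			return false;
		/* Must be exclusively frozen, and by this caller. */
		if (sb->kcount == 0 || sb->owner != owner)
			return false;
		/* Other holders remain: give up ownership, keep it frozen. */
		if (sb->kcount + sb->ucount > 1)
			sb->owner = NULL;
		return true;
	}
	if (who & MODEL_HOLDER_KERNEL) {
		/* The last kernel reference belongs to the exclusive owner. */
		if (sb->kcount == 1 && sb->owner)
			return false;
		return sb->kcount > 0;
	}
	if (who & MODEL_HOLDER_USERSPACE)
		return sb->ucount > 0;
	return false;
}

/* Returns 0 on success, -1 where the kernel would refuse the request. */
int model_freeze(struct model_sb *sb, unsigned who, const void *owner)
{
	if (!model_may_freeze(sb, who, owner))
		return -1;
	if (who & MODEL_HOLDER_KERNEL)
		sb->kcount++;
	else
		sb->ucount++;
	return 0;
}

int model_thaw(struct model_sb *sb, unsigned who, const void *owner)
{
	if (!model_may_unfreeze(sb, who, owner))
		return -1;
	if (who & MODEL_HOLDER_KERNEL)
		sb->kcount--;
	else
		sb->ucount--;
	assert(sb->kcount >= 0 && sb->ucount >= 0);
	if (sb->kcount + sb->ucount == 0)
		sb->owner = NULL;	/* fully thawed (modelling assumption) */
	return 0;
}
```

With this model, a userspace freeze followed by an exclusive kernel freeze leaves the filesystem frozen after the userspace thaw, and only a thaw carrying the same owner cookie can drop the last reference. That is the race the commit message describes.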


* Re: [PATCH] fs: allow nesting with FREEZE_EXCL
  2025-05-07 11:18                           ` Jan Kara
@ 2025-05-09 10:38                             ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-05-09 10:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ard Biesheuvel, linux-efi, linux-kernel,
	James Bottomley, mcgrof, hch, david, rafael, djwong, pavel,
	peterz, mingo, will, boqun.feng

On Wed, May 07, 2025 at 01:18:34PM +0200, Jan Kara wrote:
> On Fri 04-04-25 12:24:09, Christian Brauner wrote:
> > If hibernation races with filesystem freezing (e.g. DM reconfiguration),
> > then hibernation need not freeze a filesystem because it's already
> > frozen but userspace may thaw the filesystem before hibernation actually
> > happens.
> > 
> > If the race happens the other way around, DM reconfiguration may
> > unexpectedly fail with EBUSY.
> > 
> > So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
> > cannot be undone by any of the other concurrent freezers.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> This has fallen through the cracks in my inbox but the patch now looks good
> to me. Maybe we should fold it into "fs: add owner of freeze/thaw" to not
> have strange intermediate state in the series?

Done.


* Re: [PATCH v2 1/6] super: remove pointless s_root checks
  2025-03-29  8:42         ` [PATCH v2 1/6] super: remove pointless s_root checks Christian Brauner
  2025-03-31  9:57           ` Jan Kara
@ 2025-06-11 16:26           ` Darrick J. Wong
  2025-06-12 12:20             ` Christian Brauner
  1 sibling, 1 reply; 120+ messages in thread
From: Darrick J. Wong @ 2025-06-11 16:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, pavel, peterz, mingo, will, boqun.feng

On Sat, Mar 29, 2025 at 09:42:14AM +0100, Christian Brauner wrote:
> The locking guarantees that the superblock is alive and sb->s_root is
> still set. Remove the pointless check.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/super.c | 19 ++++++-------------
>  1 file changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 97a17f9d9023..dc14f4bf73a6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -930,8 +930,7 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
>  
>  		locked = super_lock_shared(sb);
>  		if (locked) {
> -			if (sb->s_root)
> -				f(sb, arg);
> +			f(sb, arg);
>  			super_unlock_shared(sb);
>  		}
>  
> @@ -967,11 +966,8 @@ void iterate_supers_type(struct file_system_type *type,
>  		spin_unlock(&sb_lock);
>  
>  		locked = super_lock_shared(sb);
> -		if (locked) {
> -			if (sb->s_root)
> -				f(sb, arg);
> -			super_unlock_shared(sb);
> -		}
> +		if (locked)
> +			f(sb, arg);

Hey Christian,

I might be trying to be the second(?) user of iterate_supers_type[1]. :)

This change removes the call to super_unlock_shared, which means that
iterate_supers_type returns with the super_lock(s) still held.  I'm
guessing that this is a bug and not an intentional change to require the
callback to call super_unlock_shared, right?

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=health-monitoring&id=3ae9b1d43dcdeaa38e93dc400d1871872ba0e27f

>  
>  		spin_lock(&sb_lock);
>  		if (p)
> @@ -991,18 +987,15 @@ struct super_block *user_get_super(dev_t dev, bool excl)
>  
>  	spin_lock(&sb_lock);
>  	list_for_each_entry(sb, &super_blocks, s_list) {
> -		if (sb->s_dev ==  dev) {
> +		if (sb->s_dev == dev) {
>  			bool locked;
>  
>  			sb->s_count++;
>  			spin_unlock(&sb_lock);
>  			/* still alive? */
>  			locked = super_lock(sb, excl);
> -			if (locked) {
> -				if (sb->s_root)
> -					return sb;
> -				super_unlock(sb, excl);
> -			}
> +			if (locked)
> +				return sb; /* caller will drop */
>  			/* nope, got unmounted */
>  			spin_lock(&sb_lock);
>  			__put_super(sb);
> 
> -- 
> 2.47.2
> 
> 


* Re: [PATCH v2 1/6] super: remove pointless s_root checks
  2025-06-11 16:26           ` Darrick J. Wong
@ 2025-06-12 12:20             ` Christian Brauner
  0 siblings, 0 replies; 120+ messages in thread
From: Christian Brauner @ 2025-06-12 12:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, jack, linux-kernel, James Bottomley, mcgrof, hch,
	david, rafael, pavel, peterz, mingo, will, boqun.feng

On Wed, Jun 11, 2025 at 09:26:29AM -0700, Darrick J. Wong wrote:
> On Sat, Mar 29, 2025 at 09:42:14AM +0100, Christian Brauner wrote:
> > The locking guarantees that the superblock is alive and sb->s_root is
> > still set. Remove the pointless check.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/super.c | 19 ++++++-------------
> >  1 file changed, 6 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 97a17f9d9023..dc14f4bf73a6 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -930,8 +930,7 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> >  
> >  		locked = super_lock_shared(sb);
> >  		if (locked) {
> > -			if (sb->s_root)
> > -				f(sb, arg);
> > +			f(sb, arg);
> >  			super_unlock_shared(sb);
> >  		}
> >  
> > @@ -967,11 +966,8 @@ void iterate_supers_type(struct file_system_type *type,
> >  		spin_unlock(&sb_lock);
> >  
> >  		locked = super_lock_shared(sb);
> > -		if (locked) {
> > -			if (sb->s_root)
> > -				f(sb, arg);
> > -			super_unlock_shared(sb);
> > -		}
> > +		if (locked)
> > +			f(sb, arg);
> 
> Hey Christian,
> 
> I might be trying to be the second(?) user of iterate_supers_type[1]. :)
> 
> This change removes the call to super_unlock_shared, which means that
> iterate_supers_type returns with the super_lock(s) still held.  I'm
> guessing that this is a bug and not an intentional change to require the
> callback to call super_unlock_shared, right?
> 
> --D
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=health-monitoring&id=3ae9b1d43dcdeaa38e93dc400d1871872ba0e27f

Yes, that's a bug. Can you send me a fix, please?
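
For readers following along, the invariant Darrick is pointing at can be sketched in userspace: every successful shared lock taken by the iterator must be released by the iterator before it moves on. Everything below is a hypothetical model, not fs/super.c: struct lock_model_sb, model_lock_shared(), model_unlock_shared() and model_iterate() are invented names, the list walk and sb_lock are replaced by a plain array, and the "dying superblocks fail to lock" behaviour is an assumption about super_lock_shared().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a superblock: only the lock state matters here. */
struct lock_model_sb {
	int shared_holds;	/* outstanding shared lock holders */
	bool dying;		/* a dying superblock cannot be locked */
};

/* Assumed behaviour: locking fails for a dying superblock. */
bool model_lock_shared(struct lock_model_sb *sb)
{
	if (sb->dying)
		return false;
	sb->shared_holds++;
	return true;
}

void model_unlock_shared(struct lock_model_sb *sb)
{
	assert(sb->shared_holds > 0);
	sb->shared_holds--;
}

/*
 * The iterator owns the lock for the duration of the callback. Dropping
 * the unlock from this loop is the bug discussed above: the function
 * would return with every visited superblock still locked.
 */
void model_iterate(struct lock_model_sb *sbs, size_t n,
		   void (*f)(struct lock_model_sb *, void *), void *arg)
{
	for (size_t i = 0; i < n; i++) {
		if (!model_lock_shared(&sbs[i]))
			continue;
		f(&sbs[i], arg);
		model_unlock_shared(&sbs[i]);
	}
}

/* Example callback: counts the superblocks it was called for. */
void model_count(struct lock_model_sb *sb, void *arg)
{
	(void)sb;
	++*(int *)arg;
}
```

Checking that shared_holds is zero for every entry after model_iterate() returns is the userspace analogue of the lock imbalance that the missing super_unlock_shared() call would cause.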


* Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-04-02 14:07                   ` [PATCH v2 0/4] " Christian Brauner
                                       ` (3 preceding siblings ...)
  2025-04-02 14:07                     ` [PATCH v2 4/4] kernfs: add warning about implementing freeze/thaw Christian Brauner
@ 2025-07-20 19:23                     ` Askar Safin
  2025-07-21 12:09                       ` Jan Kara
  4 siblings, 1 reply; 120+ messages in thread
From: Askar Safin @ 2025-07-20 19:23 UTC (permalink / raw)
  To: brauner
  Cc: James.Bottomley, ardb, boqun.feng, david, djwong, hch, jack,
	linux-efi, linux-fsdevel, linux-kernel, mcgrof, mingo, pavel,
	peterz, rafael, will

Hi, Christian Brauner, Jan Kara and other contributors of this patchset.

I did experiments on my laptop, and these experiments show that this patchset does not solve various longstanding problems related to suspend and filesystems. (Even if I enable /sys/power/freeze_filesystems )

Now let me describe problems I had in the past (and still have!) and then experiments I did and their results.

So, I had these 3 problems:

- Suspend doesn't work if fstrim in progress (note that I use btrfs as root file system)

- Suspend doesn't work if scrub in progress

- Suspend doesn't work if we try to read from fuse-sshfs filesystem while network is down

Let me describe third problem in more detail. To reproduce you need to do this:

- Mount remote filesystem using sshfs (it is based on ssh and fuse)

- Disable internet

- Run command "ls" in that sshfs filesystem (this command will, of course, hang, because network is down)

- Then suspend

Suspend will not work.

Is your patchset supposed to fix these problems?

Okay, so just now I was able to reproduce all 3 problems on latest mainline ( f4a40a4282f467ec99745c6ba62cb84346e42139 ), which (as far as I understand) has this patchset applied.

I reproduced them with /sys/power/freeze_filesystems set to both 0 and 1 (thus I did 3 * 2 = 6 experiments).

I'm available for further testing.

--
Askar Safin


* Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-07-20 19:23                     ` [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume Askar Safin
@ 2025-07-21 12:09                       ` Jan Kara
  2025-08-04  5:31                         ` Miklos Szeredi
  0 siblings, 1 reply; 120+ messages in thread
From: Jan Kara @ 2025-07-21 12:09 UTC (permalink / raw)
  To: Askar Safin
  Cc: brauner, James.Bottomley, ardb, boqun.feng, david, djwong, hch,
	jack, linux-efi, linux-fsdevel, linux-kernel, mcgrof, mingo,
	pavel, peterz, rafael, will

Hi!

On Sun 20-07-25 22:23:36, Askar Safin wrote:
> I did experiments on my laptop, and these experiments show that this
> patchset does not solve various longstanding problems related to suspend
> and filesystems. (Even if I enable /sys/power/freeze_filesystems )
> 
> Now let me describe problems I had in the past (and still have!) and then
> experiments I did and their results.
> 
> So, I had these 3 problems:
> 
> - Suspend doesn't work if fstrim in progress (note that I use btrfs as
> root file system)

Right, this is expected because the FITRIM ioctl (a syscall like any other)
likely takes too long and so the suspend code loses its patience. There's
nothing VFS can do about this. You can talk to btrfs developers to
periodically check for pending signal / freezing event like e.g. ext4 does
in ext4_trim_interrupted() to avoid suspend failures when FITRIM is
running.
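
As a rough illustration of the "periodically check and bail out early" pattern, here is a userspace sketch. It is not ext4 or btrfs code: model_stop_requested stands in for whatever condition the kernel helper tests, and model_trim() and model_trim_group() are hypothetical names. The point is only that the long-running loop polls a cheap condition between units of work and reports how far it got.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for "a signal or freeze request is pending". */
volatile sig_atomic_t model_stop_requested;

/*
 * Trim ngroups groups one at a time, polling the stop flag between
 * groups. Returns the number of groups actually trimmed and sets
 * *interrupted when the loop stopped early.
 */
size_t model_trim(size_t ngroups, void (*trim_one)(size_t), int *interrupted)
{
	size_t done = 0;

	assert(trim_one != NULL);
	*interrupted = 0;
	for (size_t group = 0; group < ngroups; group++) {
		if (model_stop_requested) {
			*interrupted = 1;
			break;
		}
		trim_one(group);
		done++;
	}
	return done;
}

/* Demo callback: pretends a suspend request arrives during group 2. */
void model_trim_group(size_t group)
{
	if (group == 2)
		model_stop_requested = 1;
}
```

The granularity of the check is the design choice that matters: the loop can only react as fast as one unit of work completes, so the unit has to be short compared with the time the suspend code is willing to wait.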

> - Suspend doesn't work if scrub in progress

Similar situation as with FITRIM. This is fully in control of the
filesystem and unless the filesystem adds checks and early abort paths, VFS
cannot do anything.

> - Suspend doesn't work if we try to read from fuse-sshfs filesystem while
> network is down

On the surface the problem is the same as the above two but the details
here are subtly different. Here I expect (although I'm not 100% sure) the
blocked process is inside the FUSE filesystem waiting for the FUSE daemon
to reply (a /proc/<pid>/stack of the blocked process would be useful here).
In theory, FUSE filesystem should be able to make the wait for reply in
TASK_FREEZABLE state which would fix your issue. In any case this is very
likely work for FUSE developers.

So I'm sorry but the patch set you speak about isn't supposed to fix any of
the above issues you hit.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-07-21 12:09                       ` Jan Kara
@ 2025-08-04  5:31                         ` Miklos Szeredi
  2025-08-04  6:02                           ` Askar Safin
  0 siblings, 1 reply; 120+ messages in thread
From: Miklos Szeredi @ 2025-08-04  5:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Askar Safin, brauner, James.Bottomley, ardb, boqun.feng, david,
	djwong, hch, linux-efi, linux-fsdevel, linux-kernel, mcgrof,
	mingo, pavel, peterz, rafael, will

On Mon, 21 Jul 2025 at 14:09, Jan Kara <jack@suse.cz> wrote:
>
> Hi!
>
> On Sun 20-07-25 22:23:36, Askar Safin wrote:

> > - Suspend doesn't work if we try to read from fuse-sshfs filesystem while
> > network is down
>
> On the surface the problem is the same as the above two but the details
> here are subtly different. Here I expect (although I'm not 100% sure) the
> blocked process is inside the FUSE filesystem waiting for the FUSE daemon
> to reply (a /proc/<pid>/stack of the blocked process would be useful here).
> In theory, FUSE filesystem should be able to make the wait for reply in
> TASK_FREEZABLE state which would fix your issue. In any case this is very
> likely work for FUSE developers.

This is a known problem with an unknown solution.

We can fix some of the cases, but changing all filesystem locks to be
freezable is likely not workable.

Thanks,
Miklos


* Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-08-04  5:31                         ` Miklos Szeredi
@ 2025-08-04  6:02                           ` Askar Safin
  2025-08-04  6:51                             ` Sergey Senozhatsky
  0 siblings, 1 reply; 120+ messages in thread
From: Askar Safin @ 2025-08-04  6:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, brauner, James.Bottomley, ardb, boqun.feng, david,
	djwong, hch, linux-efi, linux-fsdevel, linux-kernel, mcgrof,
	mingo, pavel, peterz, rafael, will, Joanne Koong, linux-pm,
	senozhatsky

 ---- On Mon, 04 Aug 2025 09:31:00 +0400  Miklos Szeredi <miklos@szeredi.hu> wrote --- 
 > This is a known problem with an unknown solution.
 > 
 > We can fix some of the cases, but changing all filesystem locks to be
 > freezable is likely not workable.

So what should we do in the case of networked FUSE, such as sshfs?
Should the workaround go somewhere else?
Where? In libfuse, or in each networked FUSE daemon, such as sshfs?
I hit this bug in the past, and I'm very angry.
A lot of other people hit this FUSE bug, too: https://github.com/systemd/systemd/issues/37590 .

What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ?
Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending.
So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds?

So, in short, if we cannot properly fix the kernel, then where does the fix belong? What should we do?
--
Askar Safin
https://types.pl/@safinaskar


* Re: [PATCH v2 0/4] power: wire-up filesystem freeze/thaw with suspend/resume
  2025-08-04  6:02                           ` Askar Safin
@ 2025-08-04  6:51                             ` Sergey Senozhatsky
  0 siblings, 0 replies; 120+ messages in thread
From: Sergey Senozhatsky @ 2025-08-04  6:51 UTC (permalink / raw)
  To: Askar Safin
  Cc: Miklos Szeredi, Jan Kara, brauner, James.Bottomley, ardb,
	boqun.feng, david, djwong, hch, linux-efi, linux-fsdevel,
	linux-kernel, mcgrof, mingo, pavel, peterz, rafael, will,
	Joanne Koong, linux-pm, senozhatsky

On (25/08/04 10:02), Askar Safin wrote:
> What about this timeout solution: https://lore.kernel.org/linux-fsdevel/20250122215528.1270478-1-joannelkoong@gmail.com/ ?
> Will it work? As well as I understand, currently kernel waits 20 seconds, when it tries to freeze processes when suspending.
> So, what if we use this Joanne Koong's timeout patch, and set timeout to 19 seconds?

I think the problem with this approach is that not all fuse connections
are remote (over the network).  One can use fuse to mount vfat/ntfs or
even archives.  Destroying those connections for suspend is unlikely to
be loved by users.


2025-08-04  6:02                           ` Askar Safin
2025-08-04  6:51                             ` Sergey Senozhatsky
2025-04-01 14:14             ` [PATCH 0/6] " Peter Zijlstra
2025-04-01 14:40               ` Christian Brauner
2025-04-01 14:59                 ` Peter Zijlstra
2025-04-01 17:02             ` James Bottomley
2025-04-02  7:46               ` Christian Brauner
2025-04-08 15:43                 ` James Bottomley
2025-04-08 17:09                   ` Luis Chamberlain
2025-04-08 17:20                     ` Luis Chamberlain
2025-04-08 17:26                       ` James Bottomley
2025-04-08 17:24                     ` James Bottomley
2025-04-09  7:17                   ` Christian Brauner
2025-03-27 14:06 ` [RFC PATCH 4/4] vfs: add filesystem freeze/thaw callbacks for power management James Bottomley
2025-03-27 18:20   ` Jan Kara
2025-03-28 14:21     ` James Bottomley
2025-03-28 14:36       ` James Bottomley
2025-03-28 10:08   ` Christian Brauner
2025-03-28 14:14     ` James Bottomley
2025-03-28 15:52       ` Christian Brauner
2025-03-28 16:15         ` James Bottomley
2025-03-29  8:23           ` Christian Brauner
2025-03-28 12:01   ` Christian Brauner
2025-03-28 14:40     ` James Bottomley
