Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume


* Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
       [not found]   ` <acae7a99f8acb0ebf408bb6fc82ab53fb687559c.camel@HansenPartnership.com>
@ 2025-03-21  5:23     ` Christoph Hellwig
  2025-03-21 12:34       ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-03-21  5:23 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-fsdevel, lsf-pc, Rafael J. Wysocki, Pavel Machek, Len Brown,
	linux-pm

On Thu, Mar 20, 2025 at 02:15:15PM -0400, James Bottomley wrote:
> On Thu, 2025-03-20 at 09:48 -0700, Christoph Hellwig wrote:
> [...]
> > We finally got hibernate to freeze file system on suspend,
> 
> I was looking for this to see if I could possibly plug something in for
> pseudo filesystems that don't have backing devices.  However, I can't
> find the path where suspend causes freeze (at least the bdev doesn't
> seem to register any power notifier like the scsi block device does),
> where is the code?

Looking again I can't find it either.  On the internet I find a patch
adding it from 2006:

https://groups.google.com/g/fa.linux.kernel/c/dtxsNJ7ks58/m/mqU8SIAbvLgJ

But I couldn't see if it got applied or disappeared again somehow.
Adding the relevant maintainers.



* Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-21  5:23     ` [LSF/MM/BPF TOPIC] Filesystem Suspend Resume Christoph Hellwig
@ 2025-03-21 12:34       ` James Bottomley
  2025-03-21 17:00         ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2025-03-21 12:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, lsf-pc, Rafael J. Wysocki, Pavel Machek, Len Brown,
	linux-pm

On Thu, 2025-03-20 at 22:23 -0700, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 02:15:15PM -0400, James Bottomley wrote:
> > On Thu, 2025-03-20 at 09:48 -0700, Christoph Hellwig wrote:
> > [...]
> > > We finally got hibernate to freeze file system on suspend,
> > 
> > I was looking for this to see if I could possibly plug something in
> > for pseudo filesystems that don't have backing devices.  However, I
> > can't find the path where suspend causes freeze (at least the bdev
> > doesn't seem to register any power notifier like the scsi block
> > device does), where is the code?
> 
> Looking again I can't find it either.  On the internet I find a patch
> adding it from 2006:
>  
> https://groups.google.com/g/fa.linux.kernel/c/dtxsNJ7ks58/m/mqU8SIAbvLgJ

Wow, Google has a terrible interface.  This is the lore link:

https://lore.kernel.org/all/200611011200.18438.rjw@sisk.pl/

So the patch indicates where to put direct hooks in the power
management code.  It operates via bdev_freeze/thaw(), which wouldn't
work for pseudo filesystems, but that could be replaced by a direct
hook into the vfs that iterates over superblocks calling
freeze_super/thaw_super().
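
To illustrate why a superblock walk reaches filesystems that a block-device walk misses, here is a toy user-space model in Python. It is only a sketch, not kernel code: the `SuperBlock` class, the `bdev` field and the mount table are hypothetical stand-ins, and setting `frozen = True` merely stands in for a call such as freeze_super().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuperBlock:
    """Toy stand-in for a mounted filesystem (not the kernel's struct super_block)."""
    name: str
    bdev: Optional[str] = None   # None models a pseudo filesystem with no backing device
    frozen: bool = False


def freeze_via_bdevs(supers):
    """Freeze by walking block devices: only supers that have a bdev are reached."""
    for sb in supers:
        if sb.bdev is not None:
            sb.frozen = True
    return [sb.name for sb in supers if sb.frozen]


def freeze_via_supers(supers):
    """Freeze by walking the superblock list directly: every super is reached."""
    for sb in supers:
        sb.frozen = True         # stands in for freeze_super()
    return [sb.name for sb in supers if sb.frozen]


def make_mounts():
    # Hypothetical mount table used only for this illustration.
    return [
        SuperBlock("ext4", bdev="sda1"),
        SuperBlock("efivarfs"),
        SuperBlock("xfs", bdev="sdb1"),
    ]


reached_by_bdev = freeze_via_bdevs(make_mounts())
reached_by_super = freeze_via_supers(make_mounts())
print(reached_by_bdev)
print(reached_by_super)
```

In this model the bdev walk never touches the efivarfs entry, which is the gap a direct vfs hook would close.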

> But I couldn't see if it got applied or disappeared again somehow.
> Adding the relevant maintainers.

It looks like it got reposted about 5 years later as well (in the
middle of a thread about xfs hibernate lockups):

https://lore.kernel.org/all/201108032315.06012.rjw__14254.1066081778$1312406161$gmane$org@sisk.pl/

Then again 6 months later:

https://lore.kernel.org/all/201201281445.49377.rjw@sisk.pl/

Everything kept foundering on deadlock problems between filesystems
needing threads to shrink and complete writeout and the freezing of
those threads.

Let me digest all that and see if we have more hope this time around.

Regards,

James



* Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-21 12:34       ` James Bottomley
@ 2025-03-21 17:00         ` James Bottomley
  2025-03-21 17:17           ` Lukas Wunner
  2025-03-24 11:38           ` [Lsf-pc] " Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: James Bottomley @ 2025-03-21 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, lsf-pc, Rafael J. Wysocki, Pavel Machek, Len Brown,
	linux-pm

On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
[...]
> Let me digest all that and see if we have more hope this time around.

OK, I think I've gone over it all.  The biggest problem with
resurrecting the patch was bugs in ext3, which isn't a problem now. 
Most of the suspend system has been rearchitected to separate
suspending user space processes from kernel ones.  The sync it
currently does occurs before even user processes are frozen.  I think
(as most of the original proposals did) that we should just freeze all
supers (using the reverse list) after user processes are frozen but
just before kernel threads are (this shouldn't perturb the image
allocation in hibernate, which was another source of bugs in xfs).

There's a final wrinkle in that if I plumb efivarfs into all this, it
needs to know whether it was a hibernate or suspend, but I can add that
as an extra freeze_holder flag.

This looked like such a tiny can of worms when I opened it; now it
seems to be a lot bigger on the inside than it was on the outside,
sigh.

Regards,

James


* Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-21 17:00         ` James Bottomley
@ 2025-03-21 17:17           ` Lukas Wunner
  2025-03-21 18:20             ` James Bottomley
  2025-03-24 11:38           ` [Lsf-pc] " Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Lukas Wunner @ 2025-03-21 17:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Fri, Mar 21, 2025 at 01:00:24PM -0400, James Bottomley wrote:
> There's a final wrinkle in that if I plumb efivarfs into all this, it
> needs to know whether it was a hibernate or suspend, but I can add that
> as an extra freeze_holder flag.

Perhaps system_entering_hibernation() does what you need?

Thanks,

Lukas


* Re: [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-21 17:17           ` Lukas Wunner
@ 2025-03-21 18:20             ` James Bottomley
  0 siblings, 0 replies; 19+ messages in thread
From: James Bottomley @ 2025-03-21 18:20 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Fri, 2025-03-21 at 18:17 +0100, Lukas Wunner wrote:
> On Fri, Mar 21, 2025 at 01:00:24PM -0400, James Bottomley wrote:
> > There's a final wrinkle in that if I plumb efivarfs into all this,
> > it needs to know whether it was a hibernate or suspend, but I can
> > add that as an extra freeze_holder flag.
> 
> Perhaps system_entering_hibernation() does what you need?

efivarfs needs to know on the resume path, unfortunately, which that
call doesn't seem to work for.  Also filesystems would have to suspend
before devices ... i.e. before this is set even in the suspend path,
but I suppose it would be possible to design a flag that has the width
of scope required (which would be about the same amount of work as
simply adding the extra flags to communicate what the freeze or thaw
are for).

Regards,

James


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-21 17:00         ` James Bottomley
  2025-03-21 17:17           ` Lukas Wunner
@ 2025-03-24 11:38           ` Jan Kara
  2025-03-24 14:34             ` James Bottomley
  2025-03-24 20:50             ` Dave Chinner
  1 sibling, 2 replies; 19+ messages in thread
From: Jan Kara @ 2025-03-24 11:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> [...]
> > Let me digest all that and see if we have more hope this time around.
> 
> OK, I think I've gone over it all.  The biggest problem with
> resurrecting the patch was bugs in ext3, which isn't a problem now. 
> Most of the suspend system has been rearchitected to separate
> suspending user space processes from kernel ones.  The sync it
> currently does occurs before even user processes are frozen.  I think
> (as most of the original proposals did) that we just do freeze all
> supers (using the reverse list) after user processes are frozen but
> just before kernel threads are (this shouldn't perturb the image
> allocation in hibernate, which was another source of bugs in xfs).

So as far as my memory serves the fundamental problem with this approach
was FUSE - once userspace is frozen, you cannot write to FUSE filesystems
so filesystem freezing of FUSE would block if userspace is already
suspended. You may even have a setup like:

bdev <- fs <- FUSE filesystem <- loopback file <- loop device <- another fs

So you really have to be careful to freeze this stack without causing
deadlocks. So you need to be freezing userspace after filesystems are
frozen but then you have to deal with the fact that parts of your userspace
will be blocked in the kernel (trying to do some write) waiting for the
filesystem to thaw. But it might be tractable these days since I have a
vague recollection that system suspend is now able to gracefully handle
even tasks in uninterruptible sleep.

> There's a final wrinkle in that if I plumb efivarfs into all this, it
> needs to know whether it was a hibernate or suspend, but I can add that
> as an extra freeze_holder flag.
> 
> This looked like such a tiny can of worms when I opened it; now it
> seems to be a lot bigger on the inside than it was on the outside,
> sigh.

Never underestimate the amount of worms in a can ;)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 11:38           ` [Lsf-pc] " Jan Kara
@ 2025-03-24 14:34             ` James Bottomley
  2025-03-24 19:28               ` Jan Kara
  2025-03-24 20:56               ` Dave Chinner
  2025-03-24 20:50             ` Dave Chinner
  1 sibling, 2 replies; 19+ messages in thread
From: James Bottomley @ 2025-03-24 14:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
> On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > [...]
> > > Let me digest all that and see if we have more hope this time
> > > around.
> > 
> > OK, I think I've gone over it all.  The biggest problem with
> > resurrecting the patch was bugs in ext3, which isn't a problem now.
> > Most of the suspend system has been rearchitected to separate
> > suspending user space processes from kernel ones.  The sync it
> > currently does occurs before even user processes are frozen.  I
> > think
> > (as most of the original proposals did) that we just do freeze all
> > supers (using the reverse list) after user processes are frozen but
> > just before kernel threads are (this shouldn't perturb the image
> > allocation in hibernate, which was another source of bugs in xfs).
> 
> So as far as my memory serves the fundamental problem with this
> approach was FUSE - once userspace is frozen, you cannot write to
> FUSE filesystems so filesystem freezing of FUSE would block if
> userspace is already suspended. You may even have a setup like:
> 
> bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> another fs
> 
> So you really have to be careful to freeze this stack without causing
> deadlocks.

Ah, so that explains why the sys_sync() sits in suspend resume *before*
freezing userspace ... that always appeared odd to me.

>  So you need to be freezing userspace after filesystems are
> frozen but then you have to deal with the fact that parts of your
> userspace will be blocked in the kernel (trying to do some write)
> waiting for the filesystem to thaw. But it might be tractable these
> days since I have a vague recollection that system suspend is now
> able to gracefully handle even tasks in uninterruptible sleep.

There is another thing I thought about: we don't actually have to
freeze across the sleep.  It might be possible simply to invoke
freeze/thaw where sys_sync() is now done to get a better
on-stable-storage image?  That should have fewer deadlock issues.

> > There's a final wrinkle in that if I plumb efivarfs into all this,
> > it needs to know whether it was a hibernate or suspend, but I can
> > add that as an extra freeze_holder flag.
> > 
> > This looked like such a tiny can of worms when I opened it; now it
> > seems to be a lot bigger on the inside than it was on the outside,
> > sigh.
> 
> Never underestimate the amount of worms in a can ;)

Tell me about it ...

Regards,

James



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 14:34             ` James Bottomley
@ 2025-03-24 19:28               ` Jan Kara
  2025-03-27 14:55                 ` Eric Sandeen
  2025-03-24 20:56               ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2025-03-24 19:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Mon 24-03-25 10:34:56, James Bottomley wrote:
> On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
> > On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > > [...]
> > > > Let me digest all that and see if we have more hope this time
> > > > around.
> > > 
> > > OK, I think I've gone over it all.  The biggest problem with
> > > resurrecting the patch was bugs in ext3, which isn't a problem now.
> > > Most of the suspend system has been rearchitected to separate
> > > suspending user space processes from kernel ones.  The sync it
> > > currently does occurs before even user processes are frozen.  I
> > > think
> > > (as most of the original proposals did) that we just do freeze all
> > > supers (using the reverse list) after user processes are frozen but
> > > just before kernel threads are (this shouldn't perturb the image
> > > allocation in hibernate, which was another source of bugs in xfs).
> > 
> > So as far as my memory serves the fundamental problem with this
> > approach was FUSE - once userspace is frozen, you cannot write to
> > FUSE filesystems so filesystem freezing of FUSE would block if
> > userspace is already suspended. You may even have a setup like:
> > 
> > bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> > another fs
> > 
> > So you really have to be careful to freeze this stack without causing
> > deadlocks.
> 
> Ah, so that explains why the sys_sync() sits in suspend resume *before*
> freezing userspace ... that always appeared odd to me.
> 
> >  So you need to be freezing userspace after filesystems are
> > frozen but then you have to deal with the fact that parts of your
> > userspace will be blocked in the kernel (trying to do some write)
> > waiting for the filesystem to thaw. But it might be tractable these
> > days since I have a vague recollection that system suspend is now
> > able to gracefully handle even tasks in uninterruptible sleep.
> 
> There is another thing I thought about: we don't actually have to
> freeze across the sleep.  It might be possible simply to invoke
> freeze/thaw where sys_sync() is now done to get a better on stable
> storage image?  That should have fewer deadlock issues.

Well, there's not going to be a huge difference between doing sync(2) and
doing freeze+thaw for each filesystem. After you thaw, filesystem drivers
usually mark the fs as being in an inconsistent state, and that triggers
journal replay / fsck on the next mount.
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 11:38           ` [Lsf-pc] " Jan Kara
  2025-03-24 14:34             ` James Bottomley
@ 2025-03-24 20:50             ` Dave Chinner
  2025-03-24 21:02               ` James Bottomley
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2025-03-24 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: James Bottomley, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Mon, Mar 24, 2025 at 12:38:20PM +0100, Jan Kara wrote:
> On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > [...]
> > > Let me digest all that and see if we have more hope this time around.
> > 
> > OK, I think I've gone over it all.  The biggest problem with
> > resurrecting the patch was bugs in ext3, which isn't a problem now. 
> > Most of the suspend system has been rearchitected to separate
> > suspending user space processes from kernel ones.  The sync it
> > currently does occurs before even user processes are frozen.  I think
> > (as most of the original proposals did) that we just do freeze all
> > supers (using the reverse list) after user processes are frozen but
> > just before kernel threads are (this shouldn't perturb the image
> > allocation in hibernate, which was another source of bugs in xfs).
> 
> So as far as my memory serves the fundamental problem with this approach
> was FUSE - once userspace is frozen, you cannot write to FUSE filesystems
> so filesystem freezing of FUSE would block if userspace is already
> suspended. You may even have a setup like:
> 
> bdev <- fs <- FUSE filesystem <- loopback file <- loop device <- another fs
> 
> So you really have to be careful to freeze this stack without causing
> deadlocks. So you need to be freezing userspace after filesystems are
> frozen but then you have to deal with the fact that parts of your userspace
> will be blocked in the kernel (trying to do some write) waiting for the
> filesystem to thaw. But it might be tractable these days since I have a
> vague recollection that system suspend is now able to gracefully handle
> even tasks in uninterruptible sleep.

I thought we largely solved this problem with userspace flusher
threads being able to call prctl(PR_IO_FLUSHER) to tell the kernel
they are part of the IO stack and so need to be considered
special from the POV of memory allocation and write (dirty page)
throttling.

Maybe hibernate needs to be aware of these userspace flusher
tasks and only suspend them after filesystems are frozen instead
of when userspace is initially halted?
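
For reference, this is roughly how a userspace daemon marks itself as part of the IO stack. It is a minimal sketch in Python using ctypes, with some assumptions spelled out: the uapi header names the options PR_SET_IO_FLUSHER (57) and PR_GET_IO_FLUSHER (58), they exist since Linux 5.6, and setting the flag needs CAP_SYS_RESOURCE. The helper name `mark_io_flusher` is made up for this example, and nothing here shows how hibernate would consume the flag.

```python
import ctypes
import errno
import sys

# Option numbers from <linux/prctl.h> (available since Linux 5.6).
PR_SET_IO_FLUSHER = 57
PR_GET_IO_FLUSHER = 58


def mark_io_flusher():
    """Try to mark the calling thread as an IO flusher.

    Returns 0 on success, the errno value on failure (for example EPERM
    without CAP_SYS_RESOURCE, EINVAL on older kernels), or None when not
    running on Linux.
    """
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    # prctl() is variadic and expects unsigned long arguments; the unused
    # ones must be zero for this option.
    rc = libc.prctl(
        PR_SET_IO_FLUSHER,
        ctypes.c_ulong(1),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    return 0 if rc == 0 else ctypes.get_errno()


status = mark_io_flusher()
if status is None:
    print("not on Linux, nothing done")
elif status == 0:
    print("thread is now marked as an IO flusher")
else:
    print("prctl failed:", errno.errorcode.get(status, status))
```

An unprivileged run is expected to fail with EPERM, which is why the helper reports the errno instead of raising.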

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 14:34             ` James Bottomley
  2025-03-24 19:28               ` Jan Kara
@ 2025-03-24 20:56               ` Dave Chinner
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2025-03-24 20:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Mon, Mar 24, 2025 at 10:34:56AM -0400, James Bottomley wrote:
> On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
> > On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > > [...]
> > > > Let me digest all that and see if we have more hope this time
> > > > around.
> > > 
> > > OK, I think I've gone over it all.  The biggest problem with
> > > resurrecting the patch was bugs in ext3, which isn't a problem now.
> > > Most of the suspend system has been rearchitected to separate
> > > suspending user space processes from kernel ones.  The sync it
> > > currently does occurs before even user processes are frozen.  I
> > > think
> > > (as most of the original proposals did) that we just do freeze all
> > > supers (using the reverse list) after user processes are frozen but
> > > just before kernel threads are (this shouldn't perturb the image
> > > allocation in hibernate, which was another source of bugs in xfs).
> > 
> > So as far as my memory serves the fundamental problem with this
> > approach was FUSE - once userspace is frozen, you cannot write to
> > FUSE filesystems so filesystem freezing of FUSE would block if
> > userspace is already suspended. You may even have a setup like:
> > 
> > bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> > another fs
> > 
> > So you really have to be careful to freeze this stack without causing
> > deadlocks.
> 
> Ah, so that explains why the sys_sync() sits in suspend resume *before*
> freezing userspace ... that always appeared odd to me.
> 
> >  So you need to be freezing userspace after filesystems are
> > frozen but then you have to deal with the fact that parts of your
> > userspace will be blocked in the kernel (trying to do some write)
> > waiting for the filesystem to thaw. But it might be tractable these
> > days since I have a vague recollection that system suspend is now
> > able to gracefully handle even tasks in uninterruptible sleep.
> 
> There is another thing I thought about: we don't actually have to
> freeze across the sleep.

Yes we do.

Filesystems have background workers that do stuff even when the
filesystem has been synced, and this can race with hibernate
shutting stuff down. This is the whole reason we needed to move to
filesystem freezing - to tell the filesystems to *temporarily stop
dirtying* new objects.

> It might be possible simply to invoke
> freeze/thaw where sys_sync() is now done to get a better on stable
> storage image?  That should have fewer deadlock issues.

A freeze/thaw cycle still allows the filesystems to dirty objects in
the background whilst hibernate continues onwards assuming the
filesystems are all clean. It took a long time to get all those worms
in the can, and we really don't want to let them back out....
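
The difference between the three options can be shown with a toy model in Python. This is a sketch under simplifying assumptions, not a description of any real filesystem: `ToyFs`, its `dirty` counter and the single `background_work()` call are all invented for illustration, and the worker is assumed to fire once between the preparation step and the point where hibernate takes its snapshot.

```python
class ToyFs:
    """Toy filesystem with a dirty-object counter and one background worker."""

    def __init__(self):
        self.dirty = 0
        self.frozen = False

    def background_work(self):
        # A background worker dirties one object unless the fs is frozen.
        if self.frozen:
            return False
        self.dirty += 1
        return True

    def sync(self):
        self.dirty = 0           # everything written back

    def freeze(self):
        self.sync()
        self.frozen = True       # no new dirtying until thaw

    def thaw(self):
        self.frozen = False


def dirty_at_snapshot(prepare):
    """Run a preparation step, let the worker fire once, report dirty objects."""
    fs = ToyFs()
    fs.dirty = 3                 # some pre-existing dirty state
    prepare(fs)
    fs.background_work()         # races with hibernate after preparation
    return fs.dirty


def sync_only(fs):
    fs.sync()


def freeze_then_thaw(fs):
    fs.freeze()
    fs.thaw()


def freeze_and_hold(fs):
    fs.freeze()


after_sync = dirty_at_snapshot(sync_only)
after_cycle = dirty_at_snapshot(freeze_then_thaw)
after_hold = dirty_at_snapshot(freeze_and_hold)
print(after_sync, after_cycle, after_hold)
```

In this model only the held freeze leaves nothing dirty at snapshot time; the sync and the freeze/thaw cycle both let the worker dirty an object afterwards.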

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 20:50             ` Dave Chinner
@ 2025-03-24 21:02               ` James Bottomley
  2025-03-24 21:07                 ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2025-03-24 21:02 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Tue, 2025-03-25 at 07:50 +1100, Dave Chinner wrote:
> On Mon, Mar 24, 2025 at 12:38:20PM +0100, Jan Kara wrote:
> > On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > > [...]
> > > > Let me digest all that and see if we have more hope this time
> > > > around.
> > > 
> > > OK, I think I've gone over it all.  The biggest problem with
> > > resurrecting the patch was bugs in ext3, which isn't a problem
> > > now.  Most of the suspend system has been rearchitected to
> > > separate suspending user space processes from kernel ones.  The
> > > sync it currently does occurs before even user processes are
> > > frozen.  I think (as most of the original proposals did) that we
> > > just do freeze all supers (using the reverse list) after user
> > > processes are frozen but just before kernel threads are (this
> > > shouldn't perturb the image allocation in hibernate, which was
> > > another source of bugs in xfs).
> > 
> > So as far as my memory serves the fundamental problem with this
> > approach was FUSE - once userspace is frozen, you cannot write to
> > FUSE filesystems so filesystem freezing of FUSE would block if
> > userspace is already suspended. You may even have a setup like:
> > 
> > bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> > another fs
> > 
> > So you really have to be careful to freeze this stack without
> > causing deadlocks. So you need to be freezing userspace after
> > filesystems are frozen but then you have to deal with the fact that
> > parts of your userspace will be blocked in the kernel (trying to do
> > some write) waiting for the filesystem to thaw. But it might be
> > tractable these days since I have a vague recollection that system
> > suspend is now able to gracefully handle even tasks in
> > uninterruptible sleep.
> 
> I thought we largely solved this problem with userspace flusher
> threads being able to call prctl(PR_IO_FLUSHER) to tell the kernel
> they are part of the IO stack and so need to be considered
> special from the POV of memory allocation and write (dirty page)
> throttling.
> 
> Maybe hibernate needs to be aware of these userspace flusher
> tasks and only suspend them after filesystems are frozen instead
> of when userspace is initially halted?

I can confirm it's not.  Its check for kernel thread is in
kernel/power/process.c:try_to_freeze_tasks().  It really only uses the
PF_KTHREAD flag in differentiating between user and kernel threads.
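
As a rough picture of that split, here is a toy Python model of the two freezing passes. It is a sketch only: the `Task` class and the `io_flusher` field are hypothetical, the real freezer does far more than partition a list, and the one thing taken from the kernel is the PF_KTHREAD bit value from include/linux/sched.h.

```python
from dataclasses import dataclass

PF_KTHREAD = 0x00200000  # "I am a kernel thread" bit in task flags


@dataclass
class Task:
    """Toy task; io_flusher is a hypothetical marker for PR_SET_IO_FLUSHER users."""
    name: str
    flags: int = 0
    io_flusher: bool = False


def freeze_passes(tasks):
    """Split tasks into the user pass and the kernel-thread pass.

    Only PF_KTHREAD is consulted, so an IO flusher daemon in userspace
    lands in the first pass like any other user task.
    """
    user_pass = [t.name for t in tasks if not t.flags & PF_KTHREAD]
    kernel_pass = [t.name for t in tasks if t.flags & PF_KTHREAD]
    return user_pass, kernel_pass


tasks = [
    Task("bash"),
    Task("fuse-daemon", io_flusher=True),
    Task("kworker/0:1", flags=PF_KTHREAD),
]
user_pass, kernel_pass = freeze_passes(tasks)
print(user_pass)
print(kernel_pass)
```

Deferring flusher tasks would mean adding a third group keyed on something like the hypothetical `io_flusher` field, which this model deliberately does not do.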

But what I heard in the session was that we should freeze filesystems
before any tasks because that means tasks touching the frozen fs freeze
themselves.

Regards,

James



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 21:02               ` James Bottomley
@ 2025-03-24 21:07                 ` Dave Chinner
  2025-03-25 13:42                   ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2025-03-24 21:07 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Mon, Mar 24, 2025 at 05:02:54PM -0400, James Bottomley wrote:
> On Tue, 2025-03-25 at 07:50 +1100, Dave Chinner wrote:
> > On Mon, Mar 24, 2025 at 12:38:20PM +0100, Jan Kara wrote:
> > > On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > > > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > > > [...]
> > > > > Let me digest all that and see if we have more hope this time
> > > > > around.
> > > > 
> > > > OK, I think I've gone over it all.  The biggest problem with
> > > > resurrecting the patch was bugs in ext3, which isn't a problem
> > > > now.  Most of the suspend system has been rearchitected to
> > > > separate suspending user space processes from kernel ones.  The
> > > > sync it currently does occurs before even user processes are
> > > > frozen.  I think (as most of the original proposals did) that we
> > > > just do freeze all supers (using the reverse list) after user
> > > > processes are frozen but just before kernel threads are (this
> > > > shouldn't perturb the image allocation in hibernate, which was
> > > > another source of bugs in xfs).
> > > 
> > > So as far as my memory serves the fundamental problem with this
> > > approach was FUSE - once userspace is frozen, you cannot write to
> > > FUSE filesystems so filesystem freezing of FUSE would block if
> > > userspace is already suspended. You may even have a setup like:
> > > 
> > > bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> > > another fs
> > > 
> > > So you really have to be careful to freeze this stack without
> > > causing deadlocks. So you need to be freezing userspace after
> > > filesystems are frozen but then you have to deal with the fact that
> > > parts of your userspace will be blocked in the kernel (trying to do
> > > some write) waiting for the filesystem to thaw. But it might be
> > > tractable these days since I have a vague recollection that system
> > > suspend is now able to gracefully handle even tasks in
> > > uninterruptible sleep.
> > 
> > I thought we largely solved this problem with userspace flusher
> > threads being able to call prctl(PR_IO_FLUSHER) to tell the kernel
> > they are part of the IO stack and so need to be considered
> > special from the POV of memory allocation and write (dirty page)
> > throttling.
> > 
> > Maybe hibernate needs to be aware of these userspace flusher
> > tasks and only suspend them after filesystems are frozen instead
> > of when userspace is initially halted?
> 
> I can confirm it's not.  Its check for kernel thread is in
> kernel/power/process.c:try_to_freeze_tasks().  It really only uses the
> PF_KTHREAD flag in differentiating between user and kernel threads.
> 
> But what I heard in the session was that we should freeze filesystems
> before any tasks because that means tasks touching the frozen fs freeze
> themselves.

But that's exactly the behaviour that leads to FUSE based deadlocks,
is it not? i.e. freeze the backing fs, then try to freeze the FUSE
filesystem and the freeze blocks forever trying to write to the
frozen backing fs....

What am I missing here?

-Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 21:07                 ` Dave Chinner
@ 2025-03-25 13:42                   ` Jan Kara
  2025-03-26  2:36                     ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2025-03-25 13:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: James Bottomley, Jan Kara, Christoph Hellwig, linux-fsdevel,
	lsf-pc, Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Tue 25-03-25 08:07:52, Dave Chinner wrote:
> On Mon, Mar 24, 2025 at 05:02:54PM -0400, James Bottomley wrote:
> > On Tue, 2025-03-25 at 07:50 +1100, Dave Chinner wrote:
> > > On Mon, Mar 24, 2025 at 12:38:20PM +0100, Jan Kara wrote:
> > > > On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> > > > > On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> > > > > [...]
> > > > > > Let me digest all that and see if we have more hope this time
> > > > > > around.
> > > > > 
> > > > > OK, I think I've gone over it all.  The biggest problem with
> > > > > resurrecting the patch was bugs in ext3, which isn't a problem
> > > > > now.  Most of the suspend system has been rearchitected to
> > > > > separate suspending user space processes from kernel ones.  The
> > > > > sync it currently does occurs before even user processes are
> > > > > frozen.  I think (as most of the original proposals did) that we
> > > > > just do freeze all supers (using the reverse list) after user
> > > > > processes are frozen but just before kernel threads are (this
> > > > > shouldn't perturb the image allocation in hibernate, which was
> > > > > another source of bugs in xfs).
> > > > 
> > > > So as far as my memory serves the fundamental problem with this
> > > > approach was FUSE - once userspace is frozen, you cannot write to
> > > > FUSE filesystems so filesystem freezing of FUSE would block if
> > > > userspace is already suspended. You may even have a setup like:
> > > > 
> > > > bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> > > > another fs
> > > > 
> > > > So you really have to be careful to freeze this stack without
> > > > causing deadlocks. So you need to be freezing userspace after
> > > > filesystems are frozen but then you have to deal with the fact that
> > > > parts of your userspace will be blocked in the kernel (trying to do
> > > > some write) waiting for the filesystem to thaw. But it might be
> > > > tractable these days since I have a vague recollection that system
> > > > suspend is now able to gracefully handle even tasks in
> > > > uninterruptible sleep.
> > > 
> > > I thought we largely solved this problem with userspace flusher
> > > threads being able to call prctl(PR_IO_FLUSHER) to tell the kernel
> > > they are part of the IO stack and so need to be considered
> > > special from the POV of memory allocation and write (dirty page)
> > > throttling.
> > > 
> > > Maybe hibernate needs to be aware of these userspace flusher
> > > tasks and only suspend them after filesystems are frozen instead
> > > of when userspace is initially halted?
> > 
> > I can confirm it's not.  Its check for kernel thread is in
> > kernel/power/process.c:try_to_freeze_tasks().  It really only uses the
> > PF_KTHREAD flag in differentiating between user and kernel threads.
> > 
> > But what I heard in the session was that we should freeze filesystems
> > before any tasks because that means tasks touching the frozen fs freeze
> > themselves.
> 
> But that's exactly the behaviour that leads to FUSE based deadlocks,
> is it not? i.e. freeze the backing fs, then try to freeze the FUSE
> filesystem and the freeze blocks forever trying to write to the
> frozen backing fs....
> 
> What am I missing here?

I don't think that creates FUSE based deadlocks. What you describe is
generally a problem with the order in which filesystems are frozen and can
happen with loop devices as well. If you leave userspace running and freeze
filesystems in the proper order (which happens to be the reverse ordering
of the superblock list), then you should freeze all filesystems without
deadlocking.
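
The ordering argument can be sketched with a toy userspace model (this is
not kernel code). It assumes that a stacked filesystem always appears later
in the list than the filesystem holding its backing file, and that freezing
needs to flush to that backing filesystem. The names toy_sb, toy_freeze()
and toy_freeze_all() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a mounted filesystem; not the kernel's struct super_block. */
struct toy_sb {
	const char *name;
	int backing;	/* index of the fs holding our backing file, -1 if none */
	bool frozen;
};

/*
 * Assumption of this model: freezing flushes dirty data to the backing
 * store, so it can only complete while the backing fs still takes writes.
 */
static bool toy_freeze(struct toy_sb *sbs, int i)
{
	int b = sbs[i].backing;

	if (b >= 0 && sbs[b].frozen)
		return false;	/* a real freeze would block here */
	sbs[i].frozen = true;
	return true;
}

/* Walk the list in either direction; return how many freezes would block. */
static int toy_freeze_all(struct toy_sb *sbs, int n, bool reverse)
{
	int blocked = 0;

	for (int k = 0; k < n; k++) {
		int i = reverse ? n - 1 - k : k;

		/* Model assumption: lower layers come first in the list. */
		assert(sbs[i].backing < i);
		if (!toy_freeze(sbs, i))
			blocked++;
	}
	return blocked;
}
```

With a three-entry list (a disk filesystem, a FUSE filesystem on top of it,
and a loop-mounted filesystem on top of that), the reverse walk freezes
every entry, while the forward walk gets stuck on the middle layer.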

If I remember correctly, the problem in the past was that if you leave
userspace running while freezing filesystems, some processes may enter
uninterruptible sleep waiting for the fs to be thawed, and the suspend
code was not able to hibernate such processes. But I think this obstacle
was removed a couple of years ago, as we could now use the TASK_FREEZABLE
flag in sb_start_write() -> percpu_rwsem_wait() and thus allow tasks
blocked on a frozen filesystem to be hibernated.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-25 13:42                   ` Jan Kara
@ 2025-03-26  2:36                     ` James Bottomley
  2025-03-26 14:59                       ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2025-03-26  2:36 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On Tue, 2025-03-25 at 14:42 +0100, Jan Kara wrote:
[...]
> If I remember correctly, the problem in the past was that if you
> leave userspace running while freezing filesystems, some processes
> may enter uninterruptible sleep waiting for the fs to be thawed, and
> the suspend code was not able to hibernate such processes. But I
> think this obstacle was removed a couple of years ago, as we could
> now use the TASK_FREEZABLE flag in sb_start_write() ->
> percpu_rwsem_wait() and thus allow tasks blocked on a frozen
> filesystem to be hibernated.

I tested this and we do indeed deadlock hibernation on the processes
touching the filesystem (systemd-journald actually).   But if I make
this change:

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 6083883c4fe0..720418720bbc 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -156,7 +156,7 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	spin_unlock_irq(&sem->waiters.lock);
 
 	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 		if (!smp_load_acquire(&wq_entry.private))
 			break;
 		schedule();

Then everything will work, with no lockdep problems (thanks,
Christian).  Is that the change you want me to make or should
sb_start_write be using a special freezable version of
percpu_rwsem_wait()?

Regards,

James



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-26  2:36                     ` James Bottomley
@ 2025-03-26 14:59                       ` Jan Kara
  2025-03-26 15:25                         ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2025-03-26 14:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Tue 25-03-25 22:36:56, James Bottomley wrote:
> On Tue, 2025-03-25 at 14:42 +0100, Jan Kara wrote:
> [...]
> > If I remember correctly, the problem in the past was that if you
> > leave userspace running while freezing filesystems, some processes
> > may enter uninterruptible sleep waiting for the fs to be thawed, and
> > the suspend code was not able to hibernate such processes. But I
> > think this obstacle was removed a couple of years ago, as we could
> > now use the TASK_FREEZABLE flag in sb_start_write() ->
> > percpu_rwsem_wait() and thus allow tasks blocked on a frozen
> > filesystem to be hibernated.
> 
> I tested this and we do indeed deadlock hibernation on the processes
> touching the filesystem (systemd-journald actually).   But if I make
> this change:
> 
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index 6083883c4fe0..720418720bbc 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -156,7 +156,7 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
>  	spin_unlock_irq(&sem->waiters.lock);
>  
>  	while (wait) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
>  		if (!smp_load_acquire(&wq_entry.private))
>  			break;
>  		schedule();
> 
> Then everything will work, with no lockdep problems (thanks,
> Christian).  Is that the change you want me to make or should
> sb_start_write be using a special freezable version of
> percpu_rwsem_wait()?

I was thinking about this. The possible problem is that a task waiting in
percpu_rwsem_wait() gets hibernated while it holds another lock (e.g. some
mutex); if another task is waiting for that mutex, hibernation fails
because that other task cannot be hibernated. With sb_start_write()
specifically, this is usually not a problem because it is the outermost
lock we take. The only catch would be a process blocked in a write page
fault for a frozen filesystem. Then we are holding mmap_lock for the
process, so hibernation could fail this way. But I'd guess this is rare
enough that we could live with that possibility.
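
A toy userspace model of that failure mode, as a rough sketch only: it
assumes the freezer can handle a task that is either running or sleeping in
a freezable wait, and nothing else. The toy_* names are hypothetical and
the real freezer logic is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* What a toy task is currently doing. */
enum toy_wait { TOY_RUNNING, TOY_WAIT_SB_FROZEN, TOY_WAIT_MUTEX };

struct toy_task {
	enum toy_wait wait;
	bool freezable_sleep;	/* did the sleeper opt in to being frozen? */
};

/* Model assumption: running tasks freeze themselves; sleepers must opt in. */
static bool toy_task_can_freeze(const struct toy_task *t)
{
	return t->wait == TOY_RUNNING || t->freezable_sleep;
}

/* Hibernation proceeds only if every task can be frozen. */
static bool toy_hibernate_ok(const struct toy_task *tasks, int n)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (!toy_task_can_freeze(&tasks[i]))
			return false;
	return true;
}
```

A lone writer sleeping freezably on the frozen superblock is fine in this
model. Add a second task sleeping non-freezably on a mutex that the writer
holds, and toy_hibernate_ok() returns false.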

So to summarize, I think we may need to introduce a freezable variant of
percpu_down_read() and use it in sb_start_write().

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-26 14:59                       ` Jan Kara
@ 2025-03-26 15:25                         ` James Bottomley
  2025-03-27 14:28                           ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2025-03-26 15:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Wed, 2025-03-26 at 15:59 +0100, Jan Kara wrote:
[...]
> So to summarize, I think we may need to introduce a freezable variant
> of percpu_down_read() and use it in sb_start_write().

Aye, aye, sir! and thanks for making the can of worms bigger ...

This is what I came up with for a freezable variant of
sb_start_write().  I'm still building the kernel (laptop only ...) so
I'll let you know in an hour or so if it actually works.

Regards,

James

---

diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd84d1c3b8af..ce21d81c6e34 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
-	percpu_down_read(sb->s_writers.rw_sem + level - 1);
+	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
+				   level == SB_FREEZE_WRITE);
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c012df33a9f0..a55fe709b832 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -42,9 +42,10 @@ is_static struct percpu_rw_semaphore name = {				\
 #define DEFINE_STATIC_PERCPU_RWSEM(name)	\
 	__DEFINE_PERCPU_RWSEM(name, static)
 
-extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool);
+extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool, bool);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_read_internal(struct percpu_rw_semaphore *sem,
+					     bool freezable)
 {
 	might_sleep();
 
@@ -62,7 +63,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		__percpu_down_read(sem, false); /* Unconditional memory barrier */
+		__percpu_down_read(sem, false, freezable); /* Unconditional memory barrier */
 	/*
 	 * The preempt_enable() prevents the compiler from
 	 * bleeding the critical section out.
@@ -70,6 +71,17 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	preempt_enable();
 }
 
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	percpu_down_read_internal(sem, false);
+}
+
+static inline void percpu_down_read_freezable(struct percpu_rw_semaphore *sem,
+					      bool freeze)
+{
+	percpu_down_read_internal(sem, freeze);
+}
+
 static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;
@@ -81,7 +93,7 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+		ret = __percpu_down_read(sem, true, false); /* Unconditional memory barrier */
 	preempt_enable();
 	/*
 	 * The barrier() from preempt_enable() prevents the compiler from
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 6083883c4fe0..890837b73476 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -138,7 +138,8 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 	return !reader; /* wake (readers until) 1 writer */
 }
 
-static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
+static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
+			      bool freeze)
 {
 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 	bool wait;
@@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	spin_unlock_irq(&sem->waiters.lock);
 
 	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE |
+				  (freeze ? TASK_FREEZABLE : 0));
 		if (!smp_load_acquire(&wq_entry.private))
 			break;
 		schedule();
@@ -164,7 +166,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	__set_current_state(TASK_RUNNING);
 }
 
-bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
+bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try,
+				bool freeze)
 {
 	if (__percpu_down_read_trylock(sem))
 		return true;
@@ -174,7 +177,7 @@ bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 
 	trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_READ);
 	preempt_enable();
-	percpu_rwsem_wait(sem, /* .reader = */ true);
+	percpu_rwsem_wait(sem, /* .reader = */ true, freeze);
 	preempt_disable();
 	trace_contention_end(sem, 0);
 
@@ -237,7 +240,7 @@ void __sched percpu_down_write(struct percpu_rw_semaphore *sem)
 	 */
 	if (!__percpu_down_write_trylock(sem)) {
 		trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_WRITE);
-		percpu_rwsem_wait(sem, /* .reader = */ false);
+		percpu_rwsem_wait(sem, /* .reader = */ false, false);
 		contended = true;
 	}
 
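
A note on the task state expression in percpu_rwsem_wait(): in C, `|` binds
tighter than `?:`, so a conditional flag OR-ed into a mask needs its own
parentheses. The sketch below shows the two groupings side by side. The bit
values are placeholders chosen for illustration, not copied from the kernel
headers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration; not taken from kernel headers. */
#define TOY_UNINTERRUPTIBLE	0x0002u
#define TOY_FREEZABLE		0x2000u

static_assert((TOY_UNINTERRUPTIBLE & TOY_FREEZABLE) == 0,
	      "flags must not overlap");

/*
 * How "TOY_UNINTERRUPTIBLE | freeze ? TOY_FREEZABLE : 0" is parsed when
 * written without inner parentheses: the OR is the condition, and since
 * it is never zero the result is always TOY_FREEZABLE alone.
 */
static unsigned int state_as_parsed(bool freeze)
{
	return (TOY_UNINTERRUPTIBLE | freeze) ? TOY_FREEZABLE : 0;
}

/* The intended grouping: always uninterruptible, optionally freezable. */
static unsigned int state_parenthesized(bool freeze)
{
	return TOY_UNINTERRUPTIBLE | (freeze ? TOY_FREEZABLE : 0);
}
```

In this sketch state_as_parsed() returns the same value for both inputs and
never contains the uninterruptible bit, while state_parenthesized() yields
the bare uninterruptible state for false and both bits for true.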

		


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-26 15:25                         ` James Bottomley
@ 2025-03-27 14:28                           ` James Bottomley
  0 siblings, 0 replies; 19+ messages in thread
From: James Bottomley @ 2025-03-27 14:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, lsf-pc,
	Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Wed, 2025-03-26 at 11:25 -0400, James Bottomley wrote:
> On Wed, 2025-03-26 at 15:59 +0100, Jan Kara wrote:
> [...]
> > So to summarize, I think we may need to introduce a freezable
> > variant of percpu_down_read() and use it in sb_start_write().
> 
> Aye, aye, sir! and thanks for making the can of worms bigger ...
> 
> This is what I came up with for a freezable variant of
> sb_start_write().  I'm still building the kernel (laptop only ...) so
> I'll let you know in an hour or so if it actually works.

Slightly longer than an hour, but I can confirm this all works.  I've
also tested it with filesystem on loop on filesystem (with ext4 as
upper and lower) and it hibernates just fine running some fio stress.

I've posted what I'm currently working with here:

https://lore.kernel.org/all/20250327140613.25178-1-James.Bottomley@HansenPartnership.com/

So people can see what I'm currently playing with.

Regards,

James



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-24 19:28               ` Jan Kara
@ 2025-03-27 14:55                 ` Eric Sandeen
  2025-03-27 17:30                   ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2025-03-27 14:55 UTC (permalink / raw)
  To: Jan Kara, James Bottomley
  Cc: Christoph Hellwig, linux-fsdevel, lsf-pc, Rafael J. Wysocki,
	Pavel Machek, Len Brown, linux-pm

On 3/24/25 2:28 PM, Jan Kara wrote:
> On Mon 24-03-25 10:34:56, James Bottomley wrote:
>> On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
>>> On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
>>>> On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
>>>> [...]
>>>>> Let me digest all that and see if we have more hope this time
>>>>> around.
>>>>
>>>> OK, I think I've gone over it all.  The biggest problem with
>>>> resurrecting the patch was bugs in ext3, which isn't a problem now.
>>>> Most of the suspend system has been rearchitected to separate
>>>> suspending user space processes from kernel ones.  The sync it
>>>> currently does occurs before even user processes are frozen.  I
>>>> think
>>>> (as most of the original proposals did) that we just do freeze all
>>>> supers (using the reverse list) after user processes are frozen but
>>>> just before kernel threads are (this shouldn't perturb the image
>>>> allocation in hibernate, which was another source of bugs in xfs).
>>>
>>> So as far as my memory serves the fundamental problem with this
>>> approach was FUSE - once userspace is frozen, you cannot write to
>>> FUSE filesystems so filesystem freezing of FUSE would block if
>>> userspace is already suspended. You may even have a setup like:
>>>
>>> bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
>>> another fs
>>>
>>> So you really have to be careful to freeze this stack without causing
>>> deadlocks.
>>
>> Ah, so that explains why the sys_sync() sits in suspend resume *before*
>> freezing userspace ... that always appeared odd to me.
>>
>>>  So you need to be freezing userspace after filesystems are
>>> frozen but then you have to deal with the fact that parts of your
>>> userspace will be blocked in the kernel (trying to do some write)
>>> waiting for the filesystem to thaw. But it might be tractable these
>>> days since I have a vague recollection that system suspend is now
>>> able to gracefully handle even tasks in uninterruptible sleep.
>>
>> There is another thing I thought about: we don't actually have to
>> freeze across the sleep.  It might be possible simply to invoke
>> freeze/thaw where sys_sync() is now done to get a better on stable
>> storage image?  That should have fewer deadlock issues.
> 
> Well, there's not going to be a huge difference between doing sync(2) and
> doing freeze+thaw for each filesystem. After you thaw the filesystem
> drivers usually mark that the fs is in inconsistent state and that triggers
> journal replay / fsck on next mount.

For XFS, IIRC we only do that (mark the log dirty) so that we will process
orphan inodes if we crash while frozen, which today happens only during log
replay. I tried to remove that behavior long ago but didn't get very far.
(Since then maybe we have grown other reasons to mark dirty, not sure.)

https://lore.kernel.org/linux-xfs/83696ce6-4054-0e77-b4b8-e82a1a9fbbc3@redhat.com/

Does ext4 mark it dirty too? I actually thought it left a clean journal when
freezing.

Thanks,
-Eric
 
> 								Honza



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Filesystem Suspend Resume
  2025-03-27 14:55                 ` Eric Sandeen
@ 2025-03-27 17:30                   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2025-03-27 17:30 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, James Bottomley, Christoph Hellwig, linux-fsdevel,
	lsf-pc, Rafael J. Wysocki, Pavel Machek, Len Brown, linux-pm

On Thu 27-03-25 09:55:21, Eric Sandeen wrote:
> On 3/24/25 2:28 PM, Jan Kara wrote:
> > On Mon 24-03-25 10:34:56, James Bottomley wrote:
> >> On Mon, 2025-03-24 at 12:38 +0100, Jan Kara wrote:
> >>> On Fri 21-03-25 13:00:24, James Bottomley via Lsf-pc wrote:
> >>>> On Fri, 2025-03-21 at 08:34 -0400, James Bottomley wrote:
> >>>> [...]
> >>>>> Let me digest all that and see if we have more hope this time
> >>>>> around.
> >>>>
> >>>> OK, I think I've gone over it all.  The biggest problem with
> >>>> resurrecting the patch was bugs in ext3, which isn't a problem now.
> >>>> Most of the suspend system has been rearchitected to separate
> >>>> suspending user space processes from kernel ones.  The sync it
> >>>> currently does occurs before even user processes are frozen.  I
> >>>> think
> >>>> (as most of the original proposals did) that we just do freeze all
> >>>> supers (using the reverse list) after user processes are frozen but
> >>>> just before kernel threads are (this shouldn't perturb the image
> >>>> allocation in hibernate, which was another source of bugs in xfs).
> >>>
> >>> So as far as my memory serves the fundamental problem with this
> >>> approach was FUSE - once userspace is frozen, you cannot write to
> >>> FUSE filesystems so filesystem freezing of FUSE would block if
> >>> userspace is already suspended. You may even have a setup like:
> >>>
> >>> bdev <- fs <- FUSE filesystem <- loopback file <- loop device <-
> >>> another fs
> >>>
> >>> So you really have to be careful to freeze this stack without causing
> >>> deadlocks.
> >>
> >> Ah, so that explains why the sys_sync() sits in suspend resume *before*
> >> freezing userspace ... that always appeared odd to me.
> >>
> >>>  So you need to be freezing userspace after filesystems are
> >>> frozen but then you have to deal with the fact that parts of your
> >>> userspace will be blocked in the kernel (trying to do some write)
> >>> waiting for the filesystem to thaw. But it might be tractable these
> >>> days since I have a vague recollection that system suspend is now
> >>> able to gracefully handle even tasks in uninterruptible sleep.
> >>
> >> There is another thing I thought about: we don't actually have to
> >> freeze across the sleep.  It might be possible simply to invoke
> >> freeze/thaw where sys_sync() is now done to get a better on stable
> >> storage image?  That should have fewer deadlock issues.
> > 
> > Well, there's not going to be a huge difference between doing sync(2) and
> > doing freeze+thaw for each filesystem. After you thaw the filesystem
> > drivers usually mark that the fs is in inconsistent state and that triggers
> > journal replay / fsck on next mount.
> 
> For XFS, IIRC we only do that (mark the log dirty) so that we will process
> orphan inodes if we crash while frozen, which today happens only during log
> replay. I tried to remove that behavior long ago but didn't get very far.
> (Since then maybe we have grown other reasons to mark dirty, not sure.)
> 
> https://lore.kernel.org/linux-xfs/83696ce6-4054-0e77-b4b8-e82a1a9fbbc3@redhat.com/
> 
> Does ext4 mark it dirty too? I actually thought it left a clean journal when
> freezing.

The journal is completely checkpointed (thus emptied) while freezing but
thawing marks the superblock as requiring replay again and also background
filesystem threads (like lazy init, periodic superblock stats update, etc.)
can start creating transactions in the journal.
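
For anyone who wants to observe the freeze+thaw behaviour from userspace,
a minimal sketch using the FIFREEZE and FITHAW ioctls from <linux/fs.h>
follows. It assumes CAP_SYS_ADMIN and a filesystem that supports freezing;
the helper name freeze_then_thaw() is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */

/*
 * Freeze the filesystem containing @path and thaw it again right away.
 * Returns 0 on success, -1 on failure with errno from the failing call.
 */
static int freeze_then_thaw(const char *path)
{
	int fd, ret, saved_errno;

	assert(path);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FIFREEZE, 0);
	if (ret == 0)
		ret = ioctl(fd, FITHAW, 0);

	/* close() may overwrite errno, so preserve the ioctl's value. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret ? -1 : 0;
}
```

Calling it on a mount point and then inspecting the on-disk superblock or
journal state is one way to check what a given filesystem leaves behind
after the thaw.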

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2025-03-27 17:31 UTC | newest]

Thread overview: 19+ messages
     [not found] <0a76e074ef262ca857c61175dd3d0dc06b67ec42.camel@HansenPartnership.com>
     [not found] ` <Z9xG2l8lm7ha3Pf2@infradead.org>
     [not found]   ` <acae7a99f8acb0ebf408bb6fc82ab53fb687559c.camel@HansenPartnership.com>
2025-03-21  5:23     ` [LSF/MM/BPF TOPIC] Filesystem Suspend Resume Christoph Hellwig
2025-03-21 12:34       ` James Bottomley
2025-03-21 17:00         ` James Bottomley
2025-03-21 17:17           ` Lukas Wunner
2025-03-21 18:20             ` James Bottomley
2025-03-24 11:38           ` [Lsf-pc] " Jan Kara
2025-03-24 14:34             ` James Bottomley
2025-03-24 19:28               ` Jan Kara
2025-03-27 14:55                 ` Eric Sandeen
2025-03-27 17:30                   ` Jan Kara
2025-03-24 20:56               ` Dave Chinner
2025-03-24 20:50             ` Dave Chinner
2025-03-24 21:02               ` James Bottomley
2025-03-24 21:07                 ` Dave Chinner
2025-03-25 13:42                   ` Jan Kara
2025-03-26  2:36                     ` James Bottomley
2025-03-26 14:59                       ` Jan Kara
2025-03-26 15:25                         ` James Bottomley
2025-03-27 14:28                           ` James Bottomley
