squashfs can starve/block apps

linux-fsdevel.vger.kernel.org archive mirror

* squashfs can starve/block apps
@ 2025-06-26  8:09 Joakim Tjernlund (Nokia)
  2025-06-26 14:27 ` Joakim Tjernlund (Nokia)
  0 siblings, 1 reply; 4+ messages in thread
From: Joakim Tjernlund (Nokia) @ 2025-06-26  8:09 UTC
  To: linux-fsdevel, Phillip Lougher

We have an app running on a squashfs RFS (XZ compressed) and an appfs that is also on squashfs.
Whenever we validate a SW update image (stream an image.xz, uncompress it and send the output to /dev/null),
the apps are starved/blocked and make almost no progress, system time in top goes up to 99+%
and the console also becomes unresponsive.

This feels like the kernel is stuck/busy in a loop and does not let apps execute.

Kernel 5.15.185

Any ideas/pointers ?

 Jocke

* Re: squashfs can starve/block apps
  2025-06-26  8:09 squashfs can starve/block apps Joakim Tjernlund (Nokia)
@ 2025-06-26 14:27 ` Joakim Tjernlund (Nokia)
  2025-07-04 19:51   ` Phillip Lougher
  0 siblings, 1 reply; 4+ messages in thread
From: Joakim Tjernlund (Nokia) @ 2025-06-26 14:27 UTC
  To: linux-fsdevel, Phillip Lougher

On Thu, 2025-06-26 at 10:09 +0200, Joakim Tjernlund wrote:
> We have an app running on a squashfs RFS(XZ compressed) and a appfs also on squashfs.
> Whenever we validate an SW update image(stream a image.xz, uncompress it and on to /dev/null), 
> the apps are starved/blocked and make almost no progress, system time in top goes up to 99+%
> and the console also becomes unresponsive.
> 
> This feels like kernel is stuck/busy in a loop and does not let apps execute.
> 
> Kernel 5.15.185
> 
> Any ideas/pointers ?
> 
>  Jocke

This will reproduce the stuck behaviour we see:
 > cd /tmp (/tmp is a tmpfs)
 > wget https://fullImage.xz

So just downloading it to tmpfs will confuse squashfs. It seems to
me that squashfs somehow sees the xz compressed pages in page cache/VFS and
tries to do something with them.

kernel 5.15.185 (aarch64)
user space: ARM 32 bit with thumb

* Re: squashfs can starve/block apps
  2025-06-26 14:27 ` Joakim Tjernlund (Nokia)
@ 2025-07-04 19:51   ` Phillip Lougher
  2025-07-06  9:46     ` Joakim Tjernlund (Nokia)
  0 siblings, 1 reply; 4+ messages in thread
From: Phillip Lougher @ 2025-07-04 19:51 UTC
  To: Joakim Tjernlund (Nokia), linux-fsdevel; +Cc: phillip.lougher@gmail.com

> On 26/06/2025 15:27 BST Joakim Tjernlund (Nokia) <joakim.tjernlund@nokia.com> wrote:
> 
>  
> On Thu, 2025-06-26 at 10:09 +0200, Joakim Tjernlund wrote:
> > We have an app running on a squashfs RFS(XZ compressed) and a appfs also on squashfs.
> > Whenever we validate an SW update image(stream a image.xz, uncompress it and on to /dev/null), 
> > the apps are starved/blocked and make almost no progress, system time in top goes up to 99+%
> > and the console also becomes unresponsive.
> > 
> > This feels like kernel is stuck/busy in a loop and does not let apps execute.
> > 

I have been away at the Glastonbury festival, hence the delay in replying. But
this isn't really anything to do with Squashfs per se, and basic computer
science theory explains what is going on here.  So I'm surprised no one else has
responded.

> > Kernel 5.15.185
> > 
> > Any ideas/pointers ?

Yes,

> > 
> >  Jocke
> 
> This will reproduce the stuck behaviour we see:
>  > cd /tmp (/tmp is an tmpfs)
>  > wget https://fullImage.xz

You've identified the cause here.

> 
> So just downloading it to tmpfs will confuse squashfs, seems to
> me that squashfs somehow see the xz compressed pages in page cache/VFS and
> tried to do something with them.

But this is the completely wrong conclusion.  Squashfs doesn't "magically"
see files downloaded into a different filesystem and try to do something
with them.

What is happening is the system is thrashing, because the page cache doesn't
have enough remaining space to contain the working set of the running
application(s).

See Wikipedia article https://en.wikipedia.org/wiki/Thrashing_(computer_science)

Tmpfs filesystems (/tmp here) are not backed by physical media, and their
contents are stored in the page cache.  So in effect if fullImage.xz takes
most of the page cache (system RAM), then there is not much space left to store
the pages of the applications that are running, and they constantly replace
each other's pages.
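
One way to see how much RAM a tmpfs is consuming is to look at the Shmem and
MemAvailable fields of /proc/meminfo.  The following is only a minimal sketch:
the helper names and the sample numbers are made up for illustration, and on a
real system the text would come from reading /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (KiB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def tmpfs_pressure(text):
    """Return (fraction of RAM held by shmem/tmpfs, MemAvailable in KiB)."""
    info = parse_meminfo(text)
    return info["Shmem"] / info["MemTotal"], info["MemAvailable"]


# Hypothetical sample, NOT taken from the system discussed in this thread.
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:           20000 kB
MemAvailable:      50000 kB
Shmem:            800000 kB
"""

share, available_kib = tmpfs_pressure(SAMPLE_MEMINFO)
print(f"tmpfs/shmem share of RAM: {share:.0%}, MemAvailable: {available_kib} KiB")
```

Shmem also counts other shared memory, so it is an upper bound on tmpfs usage
rather than an exact figure.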

To make it easy, imagine we have two processes A and B, and the page cache
doesn't have enough space to store both the pages for processes A and B.

Now:

1. Process A starts and demand-pages pages into the page cache from the
   Squashfs root filesystem.  This takes CPU resources to decompress the pages.
   Process A runs for a while and then gets descheduled.

2. Process B starts and demand-pages pages into the page cache, replacing
   Process A's pages.  It runs for a while and then gets descheduled.

3. Process A restarts and finds all its pages have gone from the page cache, and so
   it has to re-demand-page the pages back.  This replaces Process B's pages.

4. Process B restarts and finds all its pages have gone from the page cache ...

In effect the system spends all its time reading pages from the
Squashfs root filesystem, and doesn't do anything else, and hence it looks
like it has hung.

This is not a fault with Squashfs, and it will happen with any filesystem
(ext4 etc) when system memory is too small to contain the working set of
pages.
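
The A/B scenario above can be reproduced with a toy model.  This is a sketch
under strong assumptions: a strict LRU cache, two processes that sweep their
pages in order, and fixed time slices.  The real kernel reclaim policy is more
elaborate, and all names and sizes here are invented for illustration.

```python
from collections import OrderedDict


def simulate_lru(capacity, working_sets, rounds, passes_per_slice=3):
    """Count page cache hits and misses for processes sharing one LRU cache.

    working_sets maps a process name to its number of pages.  In each time
    slice a process sweeps all of its pages passes_per_slice times.
    """
    cache = OrderedDict()  # key: (process, page), ordered oldest -> newest
    hits = misses = 0
    for _ in range(rounds):
        for name, num_pages in working_sets.items():
            for _ in range(passes_per_slice):
                for page in range(num_pages):
                    key = (name, page)
                    if key in cache:
                        cache.move_to_end(key)  # mark as most recently used
                        hits += 1
                    else:
                        misses += 1  # would mean a read + decompress
                        cache[key] = True
                        if len(cache) > capacity:
                            cache.popitem(last=False)  # evict the oldest page
    return hits, misses


procs = {"A": 100, "B": 100}
for capacity in (200, 120, 80):
    hits, misses = simulate_lru(capacity, procs, rounds=10)
    print(f"capacity={capacity:3d} pages: hits={hits} misses={misses}")
```

In this model a cache that holds both working sets only pays the initial cold
misses, a cache that holds one working set re-reads every page on each context
switch, and a cache smaller than one working set misses on every access.
Filling a tmpfs has the same effect as shrinking the capacity.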

Now, to repeat, what has caused this is the download of that fullImage.xz
which has filled most of the page cache (system RAM).  To prevent that
from happening, there are two obvious solutions:

1. Split fullImage.xz into pieces and only download one piece at a time.  This
   will avoid filling up the page cache and the system thrashing.

2. Kill all unnecessary applications and processes before downloading
   fullImage.xz.  In doing that you reduce the working set to fit the RAM available,
   which will again prevent thrashing.
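
In the spirit of solution 1, an image can also be checked piece by piece as it
arrives, without ever storing it in tmpfs.  The sketch below assumes the image
is a single .xz stream and uses Python's standard lzma module.  The function
name, the chunk size and the optional memory limit are illustrative choices,
not something taken from this thread.

```python
import io
import lzma


def validate_xz_stream(stream, chunk_size=64 * 1024, memlimit=None):
    """Decompress an .xz stream chunk by chunk and discard the output.

    Returns the number of decompressed bytes.  Raises lzma.LZMAError on
    corrupt data, a failed integrity check or an exceeded memlimit, and
    EOFError if the input ends before the stream is complete.
    """
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=memlimit)
    total = 0
    while not decomp.eof:
        if decomp.needs_input:
            chunk = stream.read(chunk_size)
            if not chunk:
                raise EOFError("truncated xz stream")
        else:
            chunk = b""  # drain output still buffered inside the decoder
        # max_length bounds how much decompressed data is held at once.
        total += len(decomp.decompress(chunk, max_length=chunk_size))
    return total


# Self-contained demo on an in-memory image instead of a real download.
payload = b"firmware" * 50000
image = lzma.compress(payload, format=lzma.FORMAT_XZ,
                      check=lzma.CHECK_CRC32, preset=0)
print(validate_xz_stream(io.BytesIO(image)), "bytes validated")
```

On a target the same function could be fed from sys.stdin.buffer, with the
download piped in.  Passing a memlimit makes the decoder fail early instead of
allocating a large dictionary.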

Hope that helps.

Phillip

* Re: squashfs can starve/block apps
  2025-07-04 19:51   ` Phillip Lougher
@ 2025-07-06  9:46     ` Joakim Tjernlund (Nokia)
  0 siblings, 0 replies; 4+ messages in thread
From: Joakim Tjernlund (Nokia) @ 2025-07-06  9:46 UTC
  To: Phillip Lougher, linux-fsdevel; +Cc: phillip.lougher@gmail.com

On Fri, 2025-07-04 at 20:51 +0100, Phillip Lougher wrote:
>
>
> > On 26/06/2025 15:27 BST Joakim Tjernlund (Nokia) <joakim.tjernlund@nokia.com> wrote:
> >
> >
> > On Thu, 2025-06-26 at 10:09 +0200, Joakim Tjernlund wrote:
> > > We have an app running on a squashfs RFS(XZ compressed) and a appfs also on squashfs.
> > > Whenever we validate an SW update image(stream a image.xz, uncompress it and on to /dev/null),
> > > the apps are starved/blocked and make almost no progress, system time in top goes up to 99+%
> > > and the console also becomes unresponsive.
> > >
> > > This feels like kernel is stuck/busy in a loop and does not let apps execute.
> > >
>
> I have been away at the Glastonbury festival, hence the delay in replying. But
> this isn't really anything to do with Squashfs per se, and basic computer
> science theory explains what is going on here.  So I'm surprised no-else has
> responded.
>
> > > Kernel 5.15.185
> > >
> > > Any ideas/pointers ?
>
> Yes,
>
> > >
> > >  Jocke
> >
> > This will reproduce the stuck behaviour we see:
> >  > cd /tmp (/tmp is an tmpfs)
> >  > wget
> > https://fullimage.xz/
>
> You've identified the cause here.
>
> >
> > So just downloading it to tmpfs will confuse squashfs, seems to
> > me that squashfs somehow see the xz compressed pages in page cache/VFS and
> > tried to do something with them.
>
> But this is the completely wrong conclusion.  Squashfs doesn't "magically"
> see files downloaded into a different filesystem and try to do something
> with them.
>
> What is happening is the system is thrashing, because the page cache doesn't
> have enough remaining space to contain the working set of the running
> application(s).
>
> See Wikipedia article
> https://en.wikipedia.org/wiki/Thrashing_(computer_science)
>
> Tmpfs filesystems (/tmp here) are not backed by physical media, and their
> content are stored in the page cache.  So in effect if fullImage.xz takes
> most of the page cache (system RAM), then there is no much space left to store
> the pages of the applications that are running, and they constantly replace
> each others pages.
>
> To make it easy, imagine we have two processes A and B, and the page cache
> doesn't have enough space to store both the pages for processes A and B.
>
> Now:
>
> 1. Process A starts and demand-pages pages into the page cache from the
>    Squashfs root filesystem.  This takes CPU resources to decompress the pages.
>    Process A runs for a while and then gets descheduled.
>
> 2. Process B starts and demand-pages pages into the page cache, replacing
>    Process A's pages.  It runs for a while and then gets descheduled.
>
> 3 Process A restarts and finds all its pages have gone from page cache, and so
>   it has to re-demand-page the pages back.  This replaces Process B's pages.
>
> 4. Process B restarts and finds all its pages have gone from the page cache ...
>
> In effect the system spends all it's time reading pages from the
> Squashfs root filesystem, and doesn't do anything else, and hence it looks
> like it has hung.
>
> This is not a fault with Squashfs, and it will happen with any filesystem
> (ext4 etc) when system memory is too small to contain the working set of
> pages.
>
> Now, to repeat what has caused this is the download of that fullImage.xz
> which has filled most of the page cache (system RAM).  To prevent that
> from happening, there are two obvious solutions:
>
> 1. Split fullImage.xz into pieces and only download one piece at a time.  This
>    will avoid filling up the page cache and the system trashing.
>
> 2. Kill all unnecessary applications and processes before downloading
>    fullImage.xz.  In doing that you reduce the working set to RAM available,
>    which will again prevent thrashing.
>
> Hope that helps.
>
> Phillip

You are absolutely right, the case above was low RAM caused by filling the tmpfs.
But what threw me off was that I observed the same behaviour when streaming XZ to /dev/null.

After some digging I found out why: some XZ options do not respect the "-0" preset
w.r.t. dict size and reset it back to the default. Once I changed from
  "-0 --check=crc32 --arm --lzma2=lp=2,lc=2"
to
  "-0 --check=crc32 --lzma2=dict=128KiB"
I got a stable system.

Perhaps xz -l could be improved to include dict size to make this more obvious?

 Jocke
