* Compressed files & the page cache
@ 2025-07-15 20:40 Matthew Wilcox
2025-07-15 21:22 ` Boris Burkov
` (3 more replies)
0 siblings, 4 replies; 19+ messages in thread
From: Matthew Wilcox @ 2025-07-15 20:40 UTC (permalink / raw)
To: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs, Jaegeuk Kim,
linux-f2fs-devel, Jan Kara, linux-fsdevel, David Woodhouse,
Richard Weinberger, linux-mtd, David Howells, netfs,
Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
I've started looking at how the page cache can help filesystems handle
compressed data better. Feedback would be appreciated! I'll probably
say a few things which are obvious to anyone who knows how compressed
files work, but I'm trying to be explicit about my assumptions.
First, I believe that all filesystems work by compressing fixed-size
plaintext into variable-sized compressed blocks. This would be a good
point to stop reading and tell me about counterexamples.
From what I've been reading, all your filesystems want to
allocate extra pages in the page cache in order to store the excess data
retrieved along with the page that you're actually trying to read. That's
because compressing in larger chunks leads to better compression.
There's some discrepancy between filesystems whether you need scratch
space for decompression. Some filesystems read the compressed data into
the pagecache and decompress in-place, while other filesystems read the
compressed data into scratch pages and decompress into the page cache.
There also seems to be some discrepancy between filesystems whether the
decompression involves vmap() of all the memory allocated or whether the
decompression routines can handle doing kmap_local() on individual pages.
So, my proposal is that filesystems tell the page cache that their minimum
folio size is the compression block size. That seems to be around 64k,
so not an unreasonable minimum allocation size. That removes all the
extra code in filesystems to allocate extra memory in the page cache.
It means we don't attempt to track dirtiness at a sub-folio granularity
(there's no point, we have to write back the entire compressed block
at once). We also get a single virtually contiguous block ... if you're
willing to ditch HIGHMEM support. Or there's a proposal to introduce a
vmap_file() which would give us a virtually contiguous chunk of memory
(and could be trivially turned into a noop for the case of trying to
vmap a single large folio).
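As a rough sketch of what I mean (purely illustrative: assuming 4k pages,
a 64k compression block, and the mapping_set_folio_min_order() helper in
<linux/pagemap.h>):

/*
 * Illustrative only: pin this inode's page cache to >= 64k folios so a
 * read or writeback always covers a whole compression block.
 */
static void example_set_compression_folio_order(struct inode *inode)
{
	/* order 4 == 16 pages == 64k with 4k pages */
	mapping_set_folio_min_order(inode->i_mapping, 4);
}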
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 20:40 Compressed files & the page cache Matthew Wilcox
@ 2025-07-15 21:22 ` Boris Burkov
2025-07-15 23:32 ` Gao Xiang
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Boris Burkov @ 2025-07-15 21:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs, Jaegeuk Kim,
linux-f2fs-devel, Jan Kara, linux-fsdevel, David Woodhouse,
Richard Weinberger, linux-mtd, David Howells, netfs,
Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
On Tue, Jul 15, 2025 at 09:40:42PM +0100, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
As far as I know, btrfs with zstd does not use fixed-size plaintext. I
am going off the btrfs logic itself, not the zstd internals, of which I
am sadly ignorant. We are using the streaming interface, for whatever
that is worth.
Through the following call path, the len is piped from the async_chunk
through to zstd via the slightly weirdly named total_out parameter:
compress_file_range()
  btrfs_compress_folios()
    compression_compress_pages()
      zstd_compress_folios()
        zstd_get_btrfs_parameters()  // passes len
        zstd_init_cstream()          // passes len
        for-each-folio:
          zstd_compress_stream()     // last folio is truncated if short
# bpftrace to check the size in the zstd callsite
$ sudo bpftrace -e 'fentry:zstd_init_cstream {printf("%llu\n", args.pledged_src_size);}'
Attaching 1 probe...
76800
# in a different terminal, write a compressed extent with a weird source size
$ sudo dd if=/dev/zero of=/mnt/lol/foo bs=75k count=1
We do operate in terms of folios for calling zstd_compress_stream, so
that can be thought of as a fixed size plaintext block, but even so, we
pass in a short block for the last one:
$ sudo bpftrace -e 'fentry:zstd_compress_stream {printf("%llu\n", args.input->size);}'
Attaching 1 probe...
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
3072
>
> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
btrfs has a max uncompressed extent size of 128K, for what it's worth.
In practice, many compressed files consist of a large number of
compressed extents, each representing 128K of plaintext.
Not sure if that is exactly the constant you are concerned with here, or
if it refutes your idea in any way, just figured I would mention it as
well.
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 20:40 Compressed files & the page cache Matthew Wilcox
2025-07-15 21:22 ` Boris Burkov
@ 2025-07-15 23:32 ` Gao Xiang
2025-07-16 0:28 ` Gao Xiang
2025-07-21 0:43 ` Barry Song
2025-07-16 0:57 ` Qu Wenruo
2025-07-16 22:37 ` Phillip Lougher
3 siblings, 2 replies; 19+ messages in thread
From: Gao Xiang @ 2025-07-15 23:32 UTC (permalink / raw)
To: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu, Barry Song
Hi Matthew,
On 2025/7/16 04:40, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
At least typical EROFS compresses variable-sized plaintext (at least
one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized compressed
blocks for efficient I/O. That is really useful for small compression
granularity (e.g. 4KiB, 8KiB), because use cases like Android are usually
under memory pressure, so large compression granularity is almost
unacceptable in low-memory scenarios; see:
https://erofs.docs.kernel.org/en/latest/design.html
Currently EROFS works pretty well on these devices and has been
successfully deployed in billions of real devices.
>
> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
I don't see how this will work for EROFS, because EROFS always supports
variable uncompressed extent lengths, and this would break typical
EROFS use cases and on-disk formats.
The other thing is that large-order (physically consecutive) folios can
cause "increase the latency on UX task with filemap_fault()" because
of high-order direct reclaim, see:
https://android-review.googlesource.com/c/kernel/common/+/3692333
so EROFS will not set a min order and will always support order-0 folios.
I think EROFS will not use this new approach; the vmap() interface is
what we will keep using.
Thanks,
Gao Xiang
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 23:32 ` Gao Xiang
@ 2025-07-16 0:28 ` Gao Xiang
2025-07-21 1:02 ` Barry Song
2025-07-21 0:43 ` Barry Song
1 sibling, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2025-07-16 0:28 UTC (permalink / raw)
To: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu, Barry Song, Qu Wenruo
On 2025/7/16 07:32, Gao Xiang wrote:
> Hi Matthew,
>
> On 2025/7/16 04:40, Matthew Wilcox wrote:
>> I've started looking at how the page cache can help filesystems handle
>> compressed data better. Feedback would be appreciated! I'll probably
>> say a few things which are obvious to anyone who knows how compressed
>> files work, but I'm trying to be explicit about my assumptions.
>>
>> First, I believe that all filesystems work by compressing fixed-size
>> plaintext into variable-sized compressed blocks. This would be a good
>> point to stop reading and tell me about counterexamples.
>
> At least the typical EROFS compresses variable-sized plaintext (at least
> one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized compressed
> blocks for efficient I/Os, which is really useful for small compression
> granularity (e.g. 4KiB, 8KiB) because use cases like Android are usually
> under memory pressure so large compression granularity is almost
> unacceptable in the low memory scenarios, see:
> https://erofs.docs.kernel.org/en/latest/design.html
>
> Currently EROFS works pretty well on these devices and has been
> successfully deployed in billions of real devices.
>
>>
>> From what I've been reading in all your filesystems is that you want to
>> allocate extra pages in the page cache in order to store the excess data
>> retrieved along with the page that you're actually trying to read. That's
>> because compressing in larger chunks leads to better compression.
>>
>> There's some discrepancy between filesystems whether you need scratch
>> space for decompression. Some filesystems read the compressed data into
>> the pagecache and decompress in-place, while other filesystems read the
>> compressed data into scratch pages and decompress into the page cache.
>>
>> There also seems to be some discrepancy between filesystems whether the
>> decompression involves vmap() of all the memory allocated or whether the
>> decompression routines can handle doing kmap_local() on individual pages.
>>
>> So, my proposal is that filesystems tell the page cache that their minimum
>> folio size is the compression block size. That seems to be around 64k,
>> so not an unreasonable minimum allocation size. That removes all the
>> extra code in filesystems to allocate extra memory in the page cache.
>> It means we don't attempt to track dirtiness at a sub-folio granularity
>> (there's no point, we have to write back the entire compressed block
>> at once). We also get a single virtually contiguous block ... if you're
>> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
>> vmap_file() which would give us a virtually contiguous chunk of memory
>> (and could be trivially turned into a noop for the case of trying to
>> vmap a single large folio).
>
> I don't see this will work for EROFS because EROFS always supports
> variable uncompressed extent lengths and that will break typical
> EROFS use cases and on-disk formats.
>
> Other thing is that large order folios (physical consecutive) will
> caused "increase the latency on UX task with filemap_fault()"
> because of high-order direct reclaims, see:
> https://android-review.googlesource.com/c/kernel/common/+/3692333
> so EROFS will not set min-order and always support order-0 folios.
>
> I think EROFS will not use this new approach, vmap() interface is
> always the case for us.
... high-order folios can cause side effects on embedded devices
like routers and IoT devices, which still have only MiBs of memory (and I
believe this won't change, given their use cases) but have also used the
Linux kernel for quite a long time. In short, I don't think enabling
large folios for those devices is very useful, let alone limiting
the minimum folio order for them (it would make the filesystem no longer
suitable for those users, which is something I never want to do). And I
believe this is different from the current LBS support, which exists
to match hardware characteristics or the LBS atomic write
requirement.
BTW, AFAIK, there are also compression optimization tricks related
to COW (like what Btrfs currently does) or write optimizations,
which would also break this.
For example, recompressing an entire compressed extent when a user
updates just one specific file block (consider random data updates)
is inefficient. Filesystems may write the block as uncompressed data
initially (since recompressing the whole extent would be CPU-intensive
and cause write amplification) and then consider recompressing it
during background garbage collection, or once enough blocks have been
written to justify recompressing the original extent.
The Btrfs COW case was also pointed out by Wenruo in the previous
thread:
https://lore.kernel.org/r/62f5f68d-7e3f-9238-5417-c64d8dcf2214@gmx.com
Thanks,
Gao Xiang
>
> Thanks,
> Gao Xiang
>
>>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 20:40 Compressed files & the page cache Matthew Wilcox
2025-07-15 21:22 ` Boris Burkov
2025-07-15 23:32 ` Gao Xiang
@ 2025-07-16 0:57 ` Qu Wenruo
2025-07-16 1:16 ` Gao Xiang
2025-07-16 22:37 ` Phillip Lougher
3 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2025-07-16 0:57 UTC (permalink / raw)
To: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
在 2025/7/16 06:10, Matthew Wilcox 写道:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
I don't think it's the case for btrfs, unless your "fixed-size" means
block size, and in that case, a single block won't be compressed at all...
In btrfs, we support compressing plaintext ranging from 2 blocks up to
128KiB (the 128KiB limit is an artificial one).
>
> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
We don't. We just grab dirty pages up to 128KiB, and we can handle
smaller ranges, as small as two blocks.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
Btrfs goes the scratch-pages way. Decompression in-place looks a little
tricky to me. E.g. what if there is only one compressed page, and it
decompresses to 4 pages?
Won't the plaintext overwrite the compressed data halfway through?
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
Btrfs is the latter case.
All the decompression/compression routines support swapping the
input/output buffer when one of them is full.
So kmap_local() is completely feasible.
Thanks,
Qu
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-16 0:57 ` Qu Wenruo
@ 2025-07-16 1:16 ` Gao Xiang
2025-07-16 4:54 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2025-07-16 1:16 UTC (permalink / raw)
To: Qu Wenruo, Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
...
>
>>
>> There's some discrepancy between filesystems whether you need scratch
>> space for decompression. Some filesystems read the compressed data into
>> the pagecache and decompress in-place, while other filesystems read the
>> compressed data into scratch pages and decompress into the page cache.
>
> Btrfs goes the scratch pages way. Decompression in-place looks a little tricky to me. E.g. what if there is only one compressed page, and it decompressed to 4 pages.
Decompression in-place mainly optimizes full decompression (so that CPU
cache lines won't be polluted by temporary buffers either); in fact,
EROFS supports a hybrid approach.
>
> Won't the plaintext over-write the compressed data halfway?
Personally I'm very familiar with LZ4, LZMA, and DEFLATE
algorithm internals, and I also have experience building LZMA and
DEFLATE compressors.
It's totally workable for LZ4: in short, the compressed data is read
into the end of the decompressed buffer, and a proper safety margin
makes this almost always succeed. In practice, many Android
devices have been using EROFS for almost 7 years, and it works very
well to reduce extra memory overhead and help overall runtime performance.
In short, I don't think EROFS will change, since it's already
optimal and gaining more and more users.
>
>>
>> There also seems to be some discrepancy between filesystems whether the
>> decompression involves vmap() of all the memory allocated or whether the
>> decompression routines can handle doing kmap_local() on individual pages.
>
> Btrfs is the later case.
>
> All the decompression/compression routines all support swapping input/output buffer when one of them is full.
> So kmap_local() is completely feasible.
I think one of the btrfs-supported algorithms, LZO, is not, because
the fastest LZ77-family algorithms like LZ4 and LZO just operate on
virtually consecutive buffers and treat the decompressed buffer as the
LZ77 sliding window.
So either you need to allocate another temporary consecutive
buffer (I believe that is what btrfs does) or use the vmap() approach;
EROFS is interested in the vmap() one.
Thanks,
Gao Xiang
>
> Thanks,
> Qu
>
>>
>> So, my proposal is that filesystems tell the page cache that their minimum
>> folio size is the compression block size. That seems to be around 64k,
>> so not an unreasonable minimum allocation size. That removes all the
>> extra code in filesystems to allocate extra memory in the page cache.
>> It means we don't attempt to track dirtiness at a sub-folio granularity
>> (there's no point, we have to write back the entire compressed block
>> at once). We also get a single virtually contiguous block ... if you're
>> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
>> vmap_file() which would give us a virtually contiguous chunk of memory
>> (and could be trivially turned into a noop for the case of trying to
>> vmap a single large folio).
>>
>>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-16 1:16 ` Gao Xiang
@ 2025-07-16 4:54 ` Qu Wenruo
2025-07-16 5:40 ` Gao Xiang
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2025-07-16 4:54 UTC (permalink / raw)
To: Gao Xiang, Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
在 2025/7/16 10:46, Gao Xiang 写道:
> ...
>
>>
>>>
>>> There's some discrepancy between filesystems whether you need scratch
>>> space for decompression. Some filesystems read the compressed data into
>>> the pagecache and decompress in-place, while other filesystems read the
>>> compressed data into scratch pages and decompress into the page cache.
>>
>> Btrfs goes the scratch pages way. Decompression in-place looks a
>> little tricky to me. E.g. what if there is only one compressed page,
>> and it decompressed to 4 pages.
>
> Decompression in-place mainly optimizes full decompression (so that CPU
> cache line won't be polluted by temporary buffers either), in fact,
> EROFS supports the hybird way.
>
>>
>> Won't the plaintext over-write the compressed data halfway?
>
> Personally I'm very familiar with LZ4, LZMA, and DEFLATE
> algorithm internals, and I also have experience to build LZMA,
> DEFLATE compressors.
>
> It's totally workable for LZ4, in short it will read the compressed
> data at the end of the decompressed buffers, and the proper margin
> can make this almost always succeed.
I guess that's why btrfs cannot go that way.
Due to data COW, it's entirely possible to hit a case where we only want
to read out one single plaintext block from a compressed data extent
(whose compressed size can even be larger than one block).
In that case such in-place decompression will definitely not work.
[...]
>> All the decompression/compression routines all support swapping input/
>> output buffer when one of them is full.
>> So kmap_local() is completely feasible.
>
> I think one of the btrfs supported algorithm LZO is not,
It is; the tricky part is that btrfs implements its own TLV structure
for LZO compression.
And btrfs adds extra padding to ensure no TLV (compressed data + header)
structure crosses a block boundary.
So btrfs LZO compression is still able to swap the input/output buffers
halfway through, mostly thanks to this btrfs-specific design.
Thanks,
Qu
> because the
> fastest LZ77-family algorithms like LZ4, LZO just operates on virtual
> consecutive buffers and treat the decompressed buffer as LZ77 sliding
> window.
>
> So that either you need to allocate another temporary consecutive
> buffer (I believe that is what btrfs does) or use vmap() approach,
> EROFS is interested in the vmap() one.
>
> Thanks,
> Gao Xiang
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> So, my proposal is that filesystems tell the page cache that their
>>> minimum
>>> folio size is the compression block size. That seems to be around 64k,
>>> so not an unreasonable minimum allocation size. That removes all the
>>> extra code in filesystems to allocate extra memory in the page cache.
>>> It means we don't attempt to track dirtiness at a sub-folio granularity
>>> (there's no point, we have to write back the entire compressed block
>>> at once). We also get a single virtually contiguous block ... if you're
>>> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
>>> vmap_file() which would give us a virtually contiguous chunk of memory
>>> (and could be trivially turned into a noop for the case of trying to
>>> vmap a single large folio).
>>>
>>>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-16 4:54 ` Qu Wenruo
@ 2025-07-16 5:40 ` Gao Xiang
0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2025-07-16 5:40 UTC (permalink / raw)
To: Qu Wenruo, Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher
On 2025/7/16 12:54, Qu Wenruo wrote:
>
>
> 在 2025/7/16 10:46, Gao Xiang 写道:
>> ...
>>
>>>
>>>>
>>>> There's some discrepancy between filesystems whether you need scratch
>>>> space for decompression. Some filesystems read the compressed data into
>>>> the pagecache and decompress in-place, while other filesystems read the
>>>> compressed data into scratch pages and decompress into the page cache.
>>>
>>> Btrfs goes the scratch pages way. Decompression in-place looks a little tricky to me. E.g. what if there is only one compressed page, and it decompressed to 4 pages.
>>
>> Decompression in-place mainly optimizes full decompression (so that CPU
>> cache line won't be polluted by temporary buffers either), in fact,
>> EROFS supports the hybird way.
>>
>>>
>>> Won't the plaintext over-write the compressed data halfway?
>>
>> Personally I'm very familiar with LZ4, LZMA, and DEFLATE
>> algorithm internals, and I also have experience to build LZMA,
>> DEFLATE compressors.
>>
>> It's totally workable for LZ4, in short it will read the compressed
>> data at the end of the decompressed buffers, and the proper margin
>> can make this almost always succeed.
>
> I guess that's why btrfs can not go that way.
>
> Due to data COW, we're totally possible to hit a case that we only want to read out one single plaintext block from a compressed data extent (the compressed size can even be larger than one block).
>
> In that case such in-place decompression will definitely not work.
Ok, I think it's mainly due to the btrfs compression design. Another
point is that decompression in-place can also be used with multi-shot
interfaces (as you said, "swapping input/output buffer when one of them
is full") like deflate, lzma and zstd. Because the APIs are multi-shot,
you can tell when the decompressed and compressed buffers overlap, and
only copy the overlapping compressed data to some additional temporary
buffers (which can be shared among multiple compressed extents).
That has less overhead than allocating temporary buffers to keep the
compressed data around during the whole I/O process (again, because it
only uses a very small number of buffers during decompression),
especially for slow (even network) storage devices.
I do understand Btrfs may not consider this because of different target
users, but one of the main EROFS use cases is low-overhead decompression
under memory pressure (maybe plus cheap storage), where LZ4 + in-place
decompression is useful.
Anyway, I'm not advocating in-place decompression in every case. I think,
unlike plain text, encoded data has various ways of being organized
on disk and of utilizing the page cache. Due to different on-disk designs
and target users, there will be different usage modes.
As for EROFS, we have already natively supported compressed large folios
since 6.11, and order-0 folios are always among our use cases, so I don't
think this will give extra benefits to our users.
>
> [...]
>
>>> All the decompression/compression routines all support swapping input/ output buffer when one of them is full.
>>> So kmap_local() is completely feasible.
>>
>> I think one of the btrfs supported algorithm LZO is not,
>
> It is, the tricky part is btrfs is implementing its own TLV structure for LZO compression.
>
> And btrfs does extra padding to ensure no TLV (compressed data + header) structure will cross block boundary.
>
> So btrfs LZO compression is still able to swap out input/output halfway, mostly due to the btrfs' specific design.
Ok, it seems much like a btrfs-specific design, because it's essentially
per-block compression for LZO instead, and it will increase
the compressed size. I know btrfs may not care, but it's not the
EROFS case anyway.
Thanks,
Gao Xiang
>
> Thanks,
> Qu
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 20:40 Compressed files & the page cache Matthew Wilcox
` (2 preceding siblings ...)
2025-07-16 0:57 ` Qu Wenruo
@ 2025-07-16 22:37 ` Phillip Lougher
2025-07-17 2:49 ` Eric Biggers
3 siblings, 1 reply; 19+ messages in thread
From: Phillip Lougher @ 2025-07-16 22:37 UTC (permalink / raw)
To: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs
On 15/07/2025 21:40, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
For Squashfs, yes.
>
> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
Yes.
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
Squashfs uses scratch pages.
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
Squashfs does both, and this depends on whether the decompression
algorithm implementation in the kernel is multi-shot or single-shot.
The zlib/xz/zstd decompressors are multi-shot, in that you can call them
multiple times, giving them an extra input or output buffer whenever one
runs out. This means you can get them to output into a 4K page at a time,
without requiring the pages to be contiguous. kmap_local() can be called
on each page before passing it to the decompressor.
The lzo/lz4 decompressors are single-shot: they expect to be called once,
with a single contiguous input buffer containing the data to be
decompressed, and a single contiguous output buffer large enough to hold
all the uncompressed data.
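A rough sketch of the multi-shot pattern (zlib as the example; the
refill_input() callback is made up for illustration, and error handling
is trimmed):

/*
 * Sketch only: feed a multi-shot decompressor one 4K page at a time,
 * so the output pages never need to be virtually contiguous.
 * 'refill_input' is a hypothetical callback that points
 * strm->next_in/avail_in at the next chunk of compressed data.
 */
static int multishot_into_pages(z_stream *strm, struct page **pages,
				unsigned int nr_pages,
				int (*refill_input)(z_stream *strm))
{
	unsigned int i;
	int zerr = Z_OK;

	for (i = 0; i < nr_pages && zerr != Z_STREAM_END; i++) {
		void *out = kmap_local_page(pages[i]);

		strm->next_out = out;
		strm->avail_out = PAGE_SIZE;
		do {
			if (!strm->avail_in && refill_input(strm))
				break;	/* no more compressed input */
			zerr = zlib_inflate(strm, Z_SYNC_FLUSH);
		} while (strm->avail_out && zerr == Z_OK);
		kunmap_local(out);

		if (zerr != Z_OK && zerr != Z_STREAM_END)
			return -EIO;
	}
	return 0;
}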
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> > (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
>
The compression block size in Squashfs can range from 4K to 1M.
Phillip
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-16 22:37 ` Phillip Lougher
@ 2025-07-17 2:49 ` Eric Biggers
2025-07-17 3:18 ` Gao Xiang
0 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2025-07-17 2:49 UTC (permalink / raw)
To: Phillip Lougher
Cc: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs
On Wed, Jul 16, 2025 at 11:37:28PM +0100, Phillip Lougher wrote:
> > There also seems to be some discrepancy between filesystems whether the
> > decompression involves vmap() of all the memory allocated or whether the
> > decompression routines can handle doing kmap_local() on individual pages.
> >
>
> Squashfs does both, and this depends on whether the decompression
> algorithm implementation in the kernel is multi-shot or single-shot.
>
> The zlib/xz/zstd decompressors are multi-shot, in that you can call them
> multiply, giving them an extra input or output buffer when it runs out.
> This means you can get them to output into a 4K page at a time, without
> requiring the pages to be contiguous. kmap_local() can be called on each
> page before passing it to the decompressor.
While those compression libraries do provide streaming APIs, it's sort
of an illusion. They still need the uncompressed data in a virtually
contiguous buffer for the LZ77 match finding and copying to work. So,
internally they copy the uncompressed data into a virtually contiguous
buffer. I suspect that vmap() (or vm_map_ram() which is what f2fs uses)
is actually more efficient than these streaming APIs, since it avoids
the internal copy. But it would need to be measured.
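For comparison, the vm_map_ram() path is conceptually just this (a
sketch, not the actual f2fs code):

/*
 * Sketch: map the scratch (compressed) pages and the destination pages
 * virtually contiguous, do a single-shot LZ4 decompress, then unmap.
 * No internal copy into a streaming window is needed.
 */
static int mapped_single_shot(struct page **in, unsigned int in_count,
			      unsigned int in_len,
			      struct page **out, unsigned int out_count)
{
	void *src = vm_map_ram(in, in_count, NUMA_NO_NODE);
	void *dst = vm_map_ram(out, out_count, NUMA_NO_NODE);
	int ret = -ENOMEM;

	if (src && dst)
		ret = LZ4_decompress_safe(src, dst, in_len,
					  out_count * PAGE_SIZE) < 0 ?
			-EIO : 0;
	if (dst)
		vm_unmap_ram(dst, out_count);
	if (src)
		vm_unmap_ram(src, in_count);
	return ret;
}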
> > So, my proposal is that filesystems tell the page cache that their minimum
> > folio size is the compression block size. That seems to be around 64k,
> > so not an unreasonable minimum allocation size. That removes all the
> > extra code in filesystems to allocate extra memory in the page cache.
> > It means we don't attempt to track dirtiness at a sub-folio granularity
> > (there's no point, we have to write back the entire compressed block
> > at once). We also get a single virtually contiguous block ... if you're
> > willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> > vmap_file() which would give us a virtually contiguous chunk of memory
> > (and could be trivially turned into a noop for the case of trying to
> > vmap a single large folio).
... but of course, if we could get a virtually contiguous buffer
"for free" (at least in the !HIGHMEM case) as in the above proposal,
that would clearly be the best option.
- Eric
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-17 2:49 ` Eric Biggers
@ 2025-07-17 3:18 ` Gao Xiang
0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2025-07-17 3:18 UTC (permalink / raw)
To: Eric Biggers, Phillip Lougher
Cc: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs
On 2025/7/17 10:49, Eric Biggers wrote:
> On Wed, Jul 16, 2025 at 11:37:28PM +0100, Phillip Lougher wrote:
...
> buffer. I suspect that vmap() (or vm_map_ram() which is what f2fs uses)
> is actually more efficient than these streaming APIs, since it avoids
> the internal copy. But it would need to be measured.
Of course vm_map_ram() (which is what erofs has relied on for in-tree
decompression since 2018, followed later by f2fs) is efficient
for decompression and avoids polluting caches unnecessarily
(considering typical PIPT or VIPT caches).
Especially for large compressed extents such as 1MiB, an extra
memcpy() adds significant overhead relative to lz4 itself.
But as for gzip, xz and zstd, they just implement internal LZ77
dictionaries and then memcpy() for the streaming APIs. Since those
algorithms are relatively slow (for example, Zstd still relies
on Huffman and FSE), I don't think avoiding memcpy() in the whole
I/O path makes much difference (because the Huffman tree
and FSE table are already slow), but for lz4 it matters.
Thanks,
Gao Xiang
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-15 23:32 ` Gao Xiang
2025-07-16 0:28 ` Gao Xiang
@ 2025-07-21 0:43 ` Barry Song
1 sibling, 0 replies; 19+ messages in thread
From: Barry Song @ 2025-07-21 0:43 UTC (permalink / raw)
To: Gao Xiang
Cc: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu
On Wed, Jul 16, 2025 at 7:32 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
[...]
>
> I don't see this will work for EROFS because EROFS always supports
> variable uncompressed extent lengths and that will break typical
> EROFS use cases and on-disk formats.
>
> Other thing is that large order folios (physical consecutive) will
> caused "increase the latency on UX task with filemap_fault()"
> because of high-order direct reclaims, see:
> https://android-review.googlesource.com/c/kernel/common/+/3692333
> so EROFS will not set min-order and always support order-0 folios.
Regarding Hailong's Android hook, it's essentially a complaint about
the GFP mask used to allocate large folios for files. I'm wondering
why the page cache hasn't adopted the same approach that's used for
anon large folios:
gfp = vma_thp_gfp_mask(vma);
Another concern might be that the allocation order is too large,
which could lead to memory fragmentation and waste. Ideally, we'd
have "small" large folios—say, with order <= 4—to strike a better
balance.
>
> I think EROFS will not use this new approach, vmap() interface is
> always the case for us.
>
> Thanks,
> Gao Xiang
>
> >
>
Thanks
Barry
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-16 0:28 ` Gao Xiang
@ 2025-07-21 1:02 ` Barry Song
2025-07-21 3:14 ` Gao Xiang
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2025-07-21 1:02 UTC (permalink / raw)
To: Gao Xiang
Cc: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu, Qu Wenruo
On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/16 07:32, Gao Xiang wrote:
> > Hi Matthew,
> >
> > On 2025/7/16 04:40, Matthew Wilcox wrote:
> >> I've started looking at how the page cache can help filesystems handle
> >> compressed data better. Feedback would be appreciated! I'll probably
> >> say a few things which are obvious to anyone who knows how compressed
> >> files work, but I'm trying to be explicit about my assumptions.
> >>
> >> First, I believe that all filesystems work by compressing fixed-size
> >> plaintext into variable-sized compressed blocks. This would be a good
> >> point to stop reading and tell me about counterexamples.
> >
> > At least the typical EROFS compresses variable-sized plaintext (at least
> > one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized compressed
> > blocks for efficient I/Os, which is really useful for small compression
> > granularity (e.g. 4KiB, 8KiB) because use cases like Android are usually
> > under memory pressure so large compression granularity is almost
> > unacceptable in the low memory scenarios, see:
> > https://erofs.docs.kernel.org/en/latest/design.html
> >
> > Currently EROFS works pretty well on these devices and has been
> > successfully deployed in billions of real devices.
> >
> >>
> >> From what I've been reading in all your filesystems is that you want to
> >> allocate extra pages in the page cache in order to store the excess data
> >> retrieved along with the page that you're actually trying to read. That's
> >> because compressing in larger chunks leads to better compression.
> >>
> >> There's some discrepancy between filesystems whether you need scratch
> >> space for decompression. Some filesystems read the compressed data into
> >> the pagecache and decompress in-place, while other filesystems read the
> >> compressed data into scratch pages and decompress into the page cache.
> >>
> >> There also seems to be some discrepancy between filesystems whether the
> >> decompression involves vmap() of all the memory allocated or whether the
> >> decompression routines can handle doing kmap_local() on individual pages.
> >>
> >> So, my proposal is that filesystems tell the page cache that their minimum
> >> folio size is the compression block size. That seems to be around 64k,
> >> so not an unreasonable minimum allocation size. That removes all the
> >> extra code in filesystems to allocate extra memory in the page cache.
> >> It means we don't attempt to track dirtiness at a sub-folio granularity
> >> (there's no point, we have to write back the entire compressed block
> >> at once). We also get a single virtually contiguous block ... if you're
> >> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> >> vmap_file() which would give us a virtually contiguous chunk of memory
> >> (and could be trivially turned into a noop for the case of trying to
> >> vmap a single large folio).
> >
> > I don't see this will work for EROFS because EROFS always supports
> > variable uncompressed extent lengths and that will break typical
> > EROFS use cases and on-disk formats.
> >
> > Other thing is that large order folios (physical consecutive) will
> > caused "increase the latency on UX task with filemap_fault()"
> > because of high-order direct reclaims, see:
> > https://android-review.googlesource.com/c/kernel/common/+/3692333
> > so EROFS will not set min-order and always support order-0 folios.
> >
> > I think EROFS will not use this new approach, vmap() interface is
> > always the case for us.
>
> ... high-order folios can cause side effects on embedded devices
> like routers and IoT devices, which still have MiBs of memory (and I
> believe this won't change due to their use cases) but they also use
> Linux kernel for quite long time. In short, I don't think enabling
> large folios for those devices is very useful, let alone limiting
> the minimum folio order for them (It would make the filesystem not
> suitable any more for those users. At least that is what I never
> want to do). And I believe this is different from the current LBS
> support to match hardware characteristics or LBS atomic write
> requirement.
Given the difficulty of allocating large folios, it's always a good
idea to have order-0 as a fallback. While I agree with your point,
I have a slightly different perspective — enabling large folios for
those devices might be beneficial, but the maximum order should
remain small. I'm referring to "small" large folios.
Still, even with those, allocation can be difficult — especially
since so many other allocations (which aren't large folios) can cause
fragmentation. So having order-0 as a fallback remains important.
It seems we're missing a mechanism to enable "small" large folios
for files. For anon large folios, we do have sysfs knobs—though they
don’t seem to be universally appreciated. :-)
Thanks
Barry
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 1:02 ` Barry Song
@ 2025-07-21 3:14 ` Gao Xiang
2025-07-21 10:25 ` Jan Kara
0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2025-07-21 3:14 UTC (permalink / raw)
To: Barry Song
Cc: Matthew Wilcox, Chris Mason, Josef Bacik, David Sterba,
linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu, linux-erofs,
Jaegeuk Kim, linux-f2fs-devel, Jan Kara, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu, Qu Wenruo
Hi Barry,
On 2025/7/21 09:02, Barry Song wrote:
> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
...
>>
>> ... high-order folios can cause side effects on embedded devices
>> like routers and IoT devices, which still have MiBs of memory (and I
>> believe this won't change due to their use cases) but they also use
>> Linux kernel for quite long time. In short, I don't think enabling
>> large folios for those devices is very useful, let alone limiting
>> the minimum folio order for them (It would make the filesystem not
>> suitable any more for those users. At least that is what I never
>> want to do). And I believe this is different from the current LBS
>> support to match hardware characteristics or LBS atomic write
>> requirement.
>
> Given the difficulty of allocating large folios, it's always a good
> idea to have order-0 as a fallback. While I agree with your point,
> I have a slightly different perspective — enabling large folios for
> those devices might be beneficial, but the maximum order should
> remain small. I'm referring to "small" large folios.
Yeah, agreed. Having a way to limit the maximum order for those small
devices (rather than disabling it completely) would be helpful. At
least "small" large folios could still provide benefits when memory
pressure is light.
Thanks,
Gao Xiang
>
> Still, even with those, allocation can be difficult — especially
> since so many other allocations (which aren't large folios) can cause
> fragmentation. So having order-0 as a fallback remains important.
>
> It seems we're missing a mechanism to enable "small" large folios
> for files. For anon large folios, we do have sysfs knobs—though they
> don’t seem to be universally appreciated. :-)
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 3:14 ` Gao Xiang
@ 2025-07-21 10:25 ` Jan Kara
2025-07-21 11:36 ` Qu Wenruo
2025-07-21 11:40 ` Gao Xiang
0 siblings, 2 replies; 19+ messages in thread
From: Jan Kara @ 2025-07-21 10:25 UTC (permalink / raw)
To: Gao Xiang
Cc: Barry Song, Matthew Wilcox, Chris Mason, Josef Bacik,
David Sterba, linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu,
linux-erofs, Jaegeuk Kim, linux-f2fs-devel, Jan Kara,
linux-fsdevel, David Woodhouse, Richard Weinberger, linux-mtd,
David Howells, netfs, Paulo Alcantara, Konstantin Komarov, ntfs3,
Steve French, linux-cifs, Phillip Lougher, Hailong Liu, Qu Wenruo
On Mon 21-07-25 11:14:02, Gao Xiang wrote:
> Hi Barry,
>
> On 2025/7/21 09:02, Barry Song wrote:
> > On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > >
>
> ...
>
> > >
> > > ... high-order folios can cause side effects on embedded devices
> > > like routers and IoT devices, which still have MiBs of memory (and I
> > > believe this won't change due to their use cases) but they also use
> > > Linux kernel for quite long time. In short, I don't think enabling
> > > large folios for those devices is very useful, let alone limiting
> > > the minimum folio order for them (It would make the filesystem not
> > > suitable any more for those users. At least that is what I never
> > > want to do). And I believe this is different from the current LBS
> > > support to match hardware characteristics or LBS atomic write
> > > requirement.
> >
> > Given the difficulty of allocating large folios, it's always a good
> > idea to have order-0 as a fallback. While I agree with your point,
> > I have a slightly different perspective — enabling large folios for
> > those devices might be beneficial, but the maximum order should
> > remain small. I'm referring to "small" large folios.
>
> Yeah, agreed. Having a way to limit the maximum order for those small
> devices (rather than disabling it completely) would be helpful. At
> least "small" large folios could still provide benefits when memory
> pressure is light.
Well, in the page cache you can tune not only the minimum but also the
maximum order of a folio being allocated for each inode. Btrfs and ext4
already use this functionality. So in principle the functionality is there,
it is "just" a question of proper user interfaces or automatic logic to
tune this limit.
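(For reference, the page cache side looks roughly like the following;
the orders here are just an example.)

	/*
	 * Example only: let the page cache use folios between order 0
	 * (4K) and order 5 (128K) for this inode's mapping.
	 */
	mapping_set_folio_order_range(inode->i_mapping, 0, 5);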
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 10:25 ` Jan Kara
@ 2025-07-21 11:36 ` Qu Wenruo
2025-07-21 11:52 ` Gao Xiang
2025-07-22 3:54 ` Barry Song
2025-07-21 11:40 ` Gao Xiang
1 sibling, 2 replies; 19+ messages in thread
From: Qu Wenruo @ 2025-07-21 11:36 UTC (permalink / raw)
To: Jan Kara, Gao Xiang
Cc: Barry Song, Matthew Wilcox, Chris Mason, Josef Bacik,
David Sterba, linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu,
linux-erofs, Jaegeuk Kim, linux-f2fs-devel, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu
在 2025/7/21 19:55, Jan Kara 写道:
> On Mon 21-07-25 11:14:02, Gao Xiang wrote:
>> Hi Barry,
>>
>> On 2025/7/21 09:02, Barry Song wrote:
>>> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
[...]
>>> Given the difficulty of allocating large folios, it's always a good
>>> idea to have order-0 as a fallback. While I agree with your point,
>>> I have a slightly different perspective — enabling large folios for
>>> those devices might be beneficial, but the maximum order should
>>> remain small. I'm referring to "small" large folios.
>>
>> Yeah, agreed. Having a way to limit the maximum order for those small
>> devices (rather than disabling it completely) would be helpful. At
>> least "small" large folios could still provide benefits when memory
>> pressure is light.
>
> Well, in the page cache you can tune not only the minimum but also the
> maximum order of a folio being allocated for each inode. Btrfs and ext4
> already use this functionality. So in principle the functionality is there,
> it is "just" a question of proper user interfaces or automatic logic to
> tune this limit.
>
> Honza
And enabling large folios doesn't mean all fs operations will grab an
unnecessarily large folio.
For buffered writes, all those filesystems will only try to get folios as
large as necessary, not overly large.
This means that if the user-space program is always doing buffered IO in a
power-of-two unit (and at aligned offsets, of course), the folio size will
match the buffer size perfectly (if we have enough memory).
So for properly aligned buffered writes, large folios won't really cause
unnecessarily large allocations, while bringing all the benefits.
Although I'm not familiar enough with filemap to comment on folio reads
and readahead...
Thanks,
Qu
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 10:25 ` Jan Kara
2025-07-21 11:36 ` Qu Wenruo
@ 2025-07-21 11:40 ` Gao Xiang
1 sibling, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2025-07-21 11:40 UTC (permalink / raw)
To: Jan Kara
Cc: Barry Song, Matthew Wilcox, Chris Mason, Josef Bacik,
David Sterba, linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu,
linux-erofs, Jaegeuk Kim, linux-f2fs-devel, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu, Qu Wenruo
Hi Jan,
On 2025/7/21 18:25, Jan Kara wrote:
> On Mon 21-07-25 11:14:02, Gao Xiang wrote:
>> Hi Barry,
>>
>> On 2025/7/21 09:02, Barry Song wrote:
>>> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>
>>
>> ...
>>
>>>>
>>>> ... high-order folios can cause side effects on embedded devices
>>>> like routers and IoT devices, which still have MiBs of memory (and I
>>>> believe this won't change due to their use cases) but they also use
>>>> Linux kernel for quite long time. In short, I don't think enabling
>>>> large folios for those devices is very useful, let alone limiting
>>>> the minimum folio order for them (It would make the filesystem not
>>>> suitable any more for those users. At least that is what I never
>>>> want to do). And I believe this is different from the current LBS
>>>> support to match hardware characteristics or LBS atomic write
>>>> requirement.
>>>
>>> Given the difficulty of allocating large folios, it's always a good
>>> idea to have order-0 as a fallback. While I agree with your point,
>>> I have a slightly different perspective — enabling large folios for
>>> those devices might be beneficial, but the maximum order should
>>> remain small. I'm referring to "small" large folios.
>>
>> Yeah, agreed. Having a way to limit the maximum order for those small
>> devices (rather than disabling it completely) would be helpful. At
>> least "small" large folios could still provide benefits when memory
>> pressure is light.
>
> Well, in the page cache you can tune not only the minimum but also the
> maximum order of a folio being allocated for each inode. Btrfs and ext4
> already use this functionality. So in principle the functionality is there,
> it is "just" a question of proper user interfaces or automatic logic to
> tune this limit.
Yes, I took a quick glance at the current ext4 and btrfs cases a few
weeks ago, which use this to fulfill the journal reservation, for
example.
But considering the specific memory-overhead use case here (limiting
the maximum large folio order for small devices), it sounds more like
a generic page cache user interface for all filesystems instead, and
the effective maximum order should combine these two maximum numbers.
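Something like the following is what I have in mind (purely a
hypothetical sketch: sysctl_pagecache_max_folio_order does not exist
today, while mapping_max_folio_order() is the existing per-inode limit
set by the filesystem):

#include <linux/pagemap.h>

/* Hypothetical system-wide cap for small-memory devices. */
extern unsigned int sysctl_pagecache_max_folio_order;

/* Hypothetical: the order actually used would be the smaller of the
 * per-inode maximum requested by the filesystem and the global cap. */
static unsigned int effective_max_folio_order(struct address_space *mapping)
{
        unsigned int fs_max = mapping_max_folio_order(mapping);

        return min(fs_max, sysctl_pagecache_max_folio_order);
}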
Thanks,
Gao Xiang
>
> Honza
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 11:36 ` Qu Wenruo
@ 2025-07-21 11:52 ` Gao Xiang
2025-07-22 3:54 ` Barry Song
1 sibling, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2025-07-21 11:52 UTC (permalink / raw)
To: Qu Wenruo, Jan Kara
Cc: Barry Song, Matthew Wilcox, Chris Mason, Josef Bacik,
David Sterba, linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu,
linux-erofs, Jaegeuk Kim, linux-f2fs-devel, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu
On 2025/7/21 19:36, Qu Wenruo wrote:
>
>
> On 2025/7/21 19:55, Jan Kara wrote:
>> On Mon 21-07-25 11:14:02, Gao Xiang wrote:
>>> Hi Barry,
>>>
>>> On 2025/7/21 09:02, Barry Song wrote:
>>>> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
>>>> Given the difficulty of allocating large folios, it's always a good
>>>> idea to have order-0 as a fallback. While I agree with your point,
>>>> I have a slightly different perspective — enabling large folios for
>>>> those devices might be beneficial, but the maximum order should
>>>> remain small. I'm referring to "small" large folios.
>>>
>>> Yeah, agreed. Having a way to limit the maximum order for those small
>>> devices (rather than disabling it completely) would be helpful. At
>>> least "small" large folios could still provide benefits when memory
>>> pressure is light.
>>
>> Well, in the page cache you can tune not only the minimum but also the
>> maximum order of a folio being allocated for each inode. Btrfs and ext4
>> already use this functionality. So in principle the functionality is there,
>> it is "just" a question of proper user interfaces or automatic logic to
>> tune this limit.
>>
>> Honza
>
> And enabling large folios doesn't mean all fs operations will grab an unnecessarily large folio.
>
> For buffered writes, all those filesystems will only try to get folios as large as necessary, not overly large.
>
> This means that if the user space program always does buffered IO in power-of-two units (and at aligned offsets, of course), the folio size will match the buffer size perfectly (if we have enough memory).
>
> So for properly aligned buffered writes, large folios won't really cause unnecessarily large allocations, while still bringing all the benefits.
That really depends on the user behavior and I/O pattern, and
could cause unexpected spikes.
Anyway, IMHO, a way to limit the maximum order may be useful
for small devices if large folios are enabled. Once direct
reclaim becomes the common case, it might be too late.
Thanks,
Gao Xiang
>
>
> Although I'm not familiar enough with filemap to comment on folio read and readahead...
>
> Thanks,
> Qu
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Compressed files & the page cache
2025-07-21 11:36 ` Qu Wenruo
2025-07-21 11:52 ` Gao Xiang
@ 2025-07-22 3:54 ` Barry Song
1 sibling, 0 replies; 19+ messages in thread
From: Barry Song @ 2025-07-22 3:54 UTC (permalink / raw)
To: Qu Wenruo
Cc: Jan Kara, Gao Xiang, Matthew Wilcox, Chris Mason, Josef Bacik,
David Sterba, linux-btrfs, Nicolas Pitre, Gao Xiang, Chao Yu,
linux-erofs, Jaegeuk Kim, linux-f2fs-devel, linux-fsdevel,
David Woodhouse, Richard Weinberger, linux-mtd, David Howells,
netfs, Paulo Alcantara, Konstantin Komarov, ntfs3, Steve French,
linux-cifs, Phillip Lougher, Hailong Liu
On Mon, Jul 21, 2025 at 7:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2025/7/21 19:55, Jan Kara wrote:
> > On Mon 21-07-25 11:14:02, Gao Xiang wrote:
> >> Hi Barry,
> >>
> >> On 2025/7/21 09:02, Barry Song wrote:
> >>> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> [...]
> >>> Given the difficulty of allocating large folios, it's always a good
> >>> idea to have order-0 as a fallback. While I agree with your point,
> >>> I have a slightly different perspective — enabling large folios for
> >>> those devices might be beneficial, but the maximum order should
> >>> remain small. I'm referring to "small" large folios.
> >>
> >> Yeah, agreed. Having a way to limit the maximum order for those small
> >> devices (rather than disabling it completely) would be helpful. At
> >> least "small" large folios could still provide benefits when memory
> >> pressure is light.
> >
> > Well, in the page cache you can tune not only the minimum but also the
> > maximum order of a folio being allocated for each inode. Btrfs and ext4
> > already use this functionality. So in principle the functionality is there,
> > it is "just" a question of proper user interfaces or automatic logic to
> > tune this limit.
> >
> > Honza
>
> And enabling large folios doesn't mean all fs operations will grab an
> unnecessarily large folio.
>
> For buffered writes, all those filesystems will only try to get folios as
> large as necessary, not overly large.
>
> This means that if the user space program always does buffered IO in
> power-of-two units (and at aligned offsets, of course), the folio size
> will match the buffer size perfectly (if we have enough memory).
>
> So for properly aligned buffered writes, large folios won't really cause
> unnecessarily large allocations, while still bringing all the benefits.
I don't think this captures the full picture. For example, in memory
reclamation, if any single subpage is hot, the entire large folio is
treated as hot and cannot be reclaimed. So I’m not convinced that
"filesystems will only try to get folios as large as necessary" is the
right policy.
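To illustrate the point (a simplified sketch only, not the actual
folio_check_references() logic; folio_is_reclaimable() is just an
illustrative name):

#include <linux/rmap.h>

/* The "recently accessed" test is made per-folio, not per-page, so one
 * hot 4K subpage is enough to keep the whole 2^order folio resident. */
static bool folio_is_reclaimable(struct folio *folio)
{
        unsigned long vm_flags;

        /* folio_referenced() sums the young PTE bits of every page
         * table entry mapping any page of the folio. */
        if (folio_referenced(folio, 1, NULL, &vm_flags) > 0)
                return false;

        return true;
}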
Large folios are a good idea, but the lack of control over their maximum
size limits their practical applicability. When an embedded device enables
large folios and only observes performance regressions, the immediate
reaction is often to disable the feature entirely. This, in turn, harms the
adoption and development of large folios.
>
> Although I'm not familiar enough with filemap to comment on folio read
> and readahead...
>
> Thanks,
> Qu
Best Regards
Barry
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2025-07-22 3:54 UTC | newest]
Thread overview: 19+ messages
2025-07-15 20:40 Compressed files & the page cache Matthew Wilcox
2025-07-15 21:22 ` Boris Burkov
2025-07-15 23:32 ` Gao Xiang
2025-07-16 0:28 ` Gao Xiang
2025-07-21 1:02 ` Barry Song
2025-07-21 3:14 ` Gao Xiang
2025-07-21 10:25 ` Jan Kara
2025-07-21 11:36 ` Qu Wenruo
2025-07-21 11:52 ` Gao Xiang
2025-07-22 3:54 ` Barry Song
2025-07-21 11:40 ` Gao Xiang
2025-07-21 0:43 ` Barry Song
2025-07-16 0:57 ` Qu Wenruo
2025-07-16 1:16 ` Gao Xiang
2025-07-16 4:54 ` Qu Wenruo
2025-07-16 5:40 ` Gao Xiang
2025-07-16 22:37 ` Phillip Lougher
2025-07-17 2:49 ` Eric Biggers
2025-07-17 3:18 ` Gao Xiang