From: Boris Burkov <boris@bur.io>
To: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>,
linux-btrfs@vger.kernel.org, Nicolas Pitre <nico@fluxnic.net>,
Gao Xiang <xiang@kernel.org>, Chao Yu <chao@kernel.org>,
linux-erofs@lists.ozlabs.org, Jaegeuk Kim <jaegeuk@kernel.org>,
linux-f2fs-devel@lists.sourceforge.net, Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org,
David Woodhouse <dwmw2@infradead.org>,
Richard Weinberger <richard@nod.at>,
linux-mtd@lists.infradead.org,
David Howells <dhowells@redhat.com>,
netfs@lists.linux.dev, Paulo Alcantara <pc@manguebit.org>,
Konstantin Komarov <almaz.alexandrovich@paragon-software.com>,
ntfs3@lists.linux.dev, Steve French <sfrench@samba.org>,
linux-cifs@vger.kernel.org,
Phillip Lougher <phillip@squashfs.org.uk>
Subject: Re: Compressed files & the page cache
Date: Tue, 15 Jul 2025 14:22:33 -0700
Message-ID: <20250715212233.GA1680311@zen.localdomain>
In-Reply-To: <aHa8ylTh0DGEQklt@casper.infradead.org>
On Tue, Jul 15, 2025 at 09:40:42PM +0100, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
As far as I know, btrfs with zstd does not use fixed-size plaintext. I
am going off the btrfs logic itself, not the zstd internals, which I am
sadly ignorant of. We are using the streaming interface, for whatever
that is worth.

Through the following call path, the len is piped from the async_chunk
through to zstd via the slightly weirdly named total_out parameter:

compress_file_range()
  btrfs_compress_folios()
    compression_compress_pages()
      zstd_compress_folios()
        zstd_get_btrfs_parameters() // passes len
        zstd_init_cstream() // passes len
        for-each-folio:
          zstd_compress_stream() // last folio is truncated if short

# bpftrace to check the size in the zstd callsite
$ sudo bpftrace -e 'fentry:zstd_init_cstream {printf("%llu\n", args.pledged_src_size);}'
Attaching 1 probe...
76800
# in a different terminal, write a compressed extent with a weird source size
$ sudo dd if=/dev/zero of=/mnt/lol/foo bs=75k count=1
We do operate in terms of folios when calling zstd_compress_stream, so
each folio can be thought of as a fixed-size plaintext block, but even
so, we pass in a short block for the last one:
$ sudo bpftrace -e 'fentry:zstd_compress_stream {printf("%llu\n", args.input->size);}'
Attaching 1 probe...
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
3072
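
For what it's worth, those sizes are just what falls out of slicing the
76800-byte dd write into 4096-byte pieces. A quick userspace sketch of
that arithmetic, assuming 4 KiB folios (this is not btrfs code, and
chunk_sizes() is a name made up for illustration):

```python
def chunk_sizes(total_len, folio_size=4096):
    """Split total_len bytes into folio-sized inputs; the last may be short.

    folio_size=4096 is an assumption matching the trace above.
    """
    sizes = []
    remaining = total_len
    while remaining > 0:
        n = min(folio_size, remaining)
        sizes.append(n)
        remaining -= n
    return sizes


sizes = chunk_sizes(75 * 1024)  # same length as the bs=75k count=1 write
print(len(sizes), sizes[-1])    # prints "19 3072"
```

That is 18 full 4096-byte inputs plus one 3072-byte tail, matching the
19 lines of output.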
>
> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
btrfs has a max uncompressed extent size of 128K, for what it's worth.
In practice, many compressed files are composed of a large number of
compressed extents, each representing a 128K plaintext extent.

Not sure if that is exactly the constant you are concerned with here, or
if it refutes your idea in any way; I just figured I would mention it as
well.
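
To put rough numbers on that, here is a sketch of how a file would split
into plaintext extents at the 128K limit, and how many 64K folios one
such extent would span. It assumes every extent reaches the maximum size
(e.g. a sequentially written, compressible file) and takes 64K as the
minimum folio size from your proposal; the function names are invented
for the example and are not btrfs APIs:

```python
MAX_UNCOMPRESSED_EXTENT = 128 * 1024  # btrfs limit mentioned above
MIN_FOLIO_SIZE = 64 * 1024            # the "around 64k" figure, an assumption


def plaintext_extent_lengths(file_size, max_extent=MAX_UNCOMPRESSED_EXTENT):
    """Plaintext length of each extent if every extent is filled to the max."""
    full, tail = divmod(file_size, max_extent)
    return [max_extent] * full + ([tail] if tail else [])


def folios_spanned(extent_len, folio_size=MIN_FOLIO_SIZE):
    """Number of folio_size folios needed to cover extent_len bytes."""
    return -(-extent_len // folio_size)  # ceiling division


extents = plaintext_extent_lengths(1024 * 1024 + 75 * 1024)
print(len(extents), extents[-1], folios_spanned(MAX_UNCOMPRESSED_EXTENT))
# prints "9 76800 2"
```

So under those assumptions a full-size btrfs extent would cover two
minimum-size folios rather than one.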
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
>