Linux CIFS filesystem development
* Re: [f2fs-dev] Compressed files & the page cache
@ 2025-07-16  3:21 Nanzhe Zhao
  2025-07-17  1:04 ` Nanzhe Zhao
  0 siblings, 1 reply; 2+ messages in thread
From: Nanzhe Zhao @ 2025-07-16  3:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: almaz.alexandrovich, Chao Yu, clm, dhowells, dsterba, dwmw2, jack,
	jaegeuk@kernel.org, josef, linux-btrfs, linux-cifs, linux-erofs,
	linux-f2fs-devel@lists.sourceforge.net, linux-fsdevel, linux-mtd,
	netfs, nico, ntfs3, pc, phillip, richard, sfrench, xiang

Dear Matthew and other filesystem developers,

I've been experimenting with implementing large folio support for
compressed files in F2FS locally, and I'd like to describe the
situation from the F2FS perspective.

> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks.

Well, yes. F2FS's current compression implementation does compress
fixed-size plaintext into variable-sized blocks. However, F2FS operates
on a fixed-size unit called a "cluster." A file is logically divided
into these clusters, and each cluster covers a fixed number of
contiguous page indices. The cluster size is 4 << n pages, with n
typically defaulting to 0 (i.e., a 4-page cluster).

F2FS can only perform compression on a per-cluster basis; it cannot
operate on a unit larger than a cluster. So, for a 16-page folio with a
4-page cluster size, we would have to split the folio into four separate
clusters, compress each cluster individually, and write back each
compressed result to disk separately. We cannot compress the whole large
folio as one chunk. Indeed, the fact that a large folio can span
multiple clusters was the main headache in my attempt to implement large
folio support for F2FS compression.

Why is this the case? It's due to F2FS's current on-disk layout for
compressed data. Each cluster is prefixed by a special block address,
COMPRESS_ADDR, which separates one cluster from the next on disk.
Furthermore, after F2FS compresses the data in a cluster, the space
freed within that cluster remains reserved on disk; it is not released
for other files to use. You may have heard that F2FS compression doesn't
actually save space for the user; this is the reason. The F2FS model is
not what one might intuitively expect, i.e., a large chunk of data
compressed into a series of tightly packed data blocks on disk (which I
assume is the model other filesystems adopt).

So, regarding:

> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
> so not an unreasonable minimum allocation size.


F2FS doesn't have a uniform "compression block size." It purely
depends on the configured cluster size, and the resulting compressed
size is determined by the compression ratio. For example, a 4-page
cluster could be compressed down to a single block.

Regarding the folio order, perhaps we could set its maximum order to
match the cluster size, while keeping the minimum order at 0. However,
for smaller cluster sizes, this would severely limit the potential of
using larger folios. My own current implementation makes no assumptions
about the maximum folio order. As I am a student, I lack extensive
experience, so it's difficult for me to evaluate the pros and cons of
these two approaches. I believe Mr. Chao Yu could provide a more
constructive suggestion on this point.

Thinking about a possible implementation of your proposal of a 64KB
size and in-place compression in the context of F2FS, I think one
approach may be to cap the folio size at 4 pages (order 2), aligning
with the default cluster size; this is especially relevant as F2FS
moves to support 16K pages and blocks, where a 4-page cluster is
exactly 64KB. We could then perform compression in place, eliminating
the need for scratch pages (which are the compressed pages/folios in
the F2FS context), and also disable per-page dirty tracking for that
folio.

However, F2FS has fallback logic for when compression fails during
writeback. The original F2FS logic still relies on per-page dirty
tracking for writes. If we were to completely remove per-page tracking
for the folio, then in the compression-failure case we would bear the
cost of write amplification.

These are just my personal thoughts on your proposal. I believe Mr
Chao Yu can provide more detailed insights into the specifics of F2FS.

Best regards


* Re: [f2fs-dev] Compressed files & the page cache
  2025-07-16  3:21 [f2fs-dev] Compressed files & the page cache Nanzhe Zhao
@ 2025-07-17  1:04 ` Nanzhe Zhao
  0 siblings, 0 replies; 2+ messages in thread
From: Nanzhe Zhao @ 2025-07-17  1:04 UTC (permalink / raw)
  To: nzzhao.sigma
  Cc: almaz.alexandrovich, clm, dhowells, dsterba, dwmw2, jack, jaegeuk,
	josef, linux-btrfs, linux-cifs, linux-erofs, linux-f2fs-devel,
	linux-fsdevel, linux-mtd, netfs, nico, ntfs3, pc, phillip,
	richard, sfrench, willy, xiang

Dear Matthew and other fs developers:
I'm very sorry; my Gmail account may have been blocked for reasons I
don't know, so I have to change my email domain.
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size.  That seems to be around 64k,
> so not an unreasonable minimum allocation size.
Excuse me, but could you please clarify the meaning of "compression
block size"? If you mean the minimum buffer window size that a
filesystem requires to perform one whole compressed-write/decompressed-read
I/O (we could also call it the granularity), which in the F2FS context
we can interpret as the cluster size, then does that mean that for
compressed files we could not fall back to order-0 folios under memory
pressure once the folio's minimum order is set to the "compression
block size"?

If that is the case, then once F2FS's cluster size is configured, the
minimum order is determined (and it may be beyond 64KiB, depending on
how the cluster size is set). If the cluster size is set to a large
value, we would run much more risk under memory pressure.

As for the 64KiB minimum granularity: because Android is now switching
the page size to 16KiB, the minimum possible granularity for the
current F2FS compression implementation does indeed equal exactly
64KiB. But I hold the opinion that this may not be a very good fit for
F2FS. As far as I know, there are lots of small random writes on
Android, so instead of a 64KiB minimum granularity, I would hope that
future F2FS compression implementations support smaller cluster sizes.
As far as I know, storage engineers from vivo are experimenting with a
dynamic cluster compression implementation that can adjust the cluster
size within a file adaptively (larger in some parts, smaller in
others). They haven't published the code yet, but this design may be
more suitable for cooperating with folios, given their variable-order
feature.

> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once).
That DOES have a point for F2FS, because we cannot control the order of
the folios that readahead gives us if we don't set a maximum order,
and, as I mentioned, a large folio can cross multiple clusters in F2FS.
Since F2FS has no buffer heads or a concept of subpages, as we have
discussed previously, it must rely on iomap_folio_state or a similar
per-folio structure to distinguish which cluster ranges of the folio
are dirty. And it must identify a partially dirtied cluster to avoid a
compressed write.
Besides, I do think a large folio can cross multiple compressed extents
in btrfs too, if I haven't misunderstood. May I ask how btrfs deals
with the possible write amplification?


