From: Benjamin Fischer <benjamin.fischer@cern.ch>
To: David Howells <dhowells@redhat.com>
Cc: netfs@lists.linux.dev
Subject: Re: Cachefiles slowdown caused by SEEK_HOLE
Date: Tue, 24 Jun 2025 15:36:30 +0200
Message-ID: <619bca7a-d89c-4262-8c05-1ac536db0a6e@cern.ch>
In-Reply-To: <963373.1750426130@warthog.procyon.org.uk>
Hi David,
thanks for the detailed explanations.
> [...] Unfortunately, the ext4 and xfs maintainers think it's not
> a good idea to rely on the backing filesystem metadata as the backing fs is at
> liberty to punch out blocks of zeros or insert bridging blocks in order to
> better optimise the fragments list. The former would give a false negative,
> causing us to have to go get the block again and, worse, the latter would give
> a false positive, making us think we have data that we don't.
That is truly unfortunate, since their allocation bitmaps/extent trees
cover exactly what the cache needs for bookkeeping, without having to
implement and tune it a second time.
> [...] I might be able to use
> FIEMAP, and certainly caching the result of the seeks ought to be fine... but
> for the small matter of the aforementioned possibility of the backing fs
> screwing with things.
This seems like a reasonable compromise until the underlying problem can
be addressed in full - which seems quite challenging.
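To illustrate what I mean by caching the seek results, here is a rough
userspace sketch (Python only for brevity; data_extents() and is_cached()
are made-up names, not anything from cachefiles). It walks a file once
with SEEK_DATA/SEEK_HOLE and keeps the resulting extent list around, so
later lookups need no further seeks. It inherits exactly the caveat you
describe: the list is only as trustworthy as what the backing fs reports.

```python
import errno
import os

def data_extents(fd):
    """Yield (start, end) byte ranges the backing fs reports as data."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # no data at or after pos
                return
            raise
        # There is an implicit hole at EOF, so this always succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        pos = end

def is_cached(extents, pos):
    """Look up a position in a previously collected extent list."""
    return any(start <= pos < end for start, end in extents)
```

One would collect list(data_extents(fd)) once per cache file and update
it on every write into the cache, rather than seeking again on each read.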
> I have a solution that used to work (give or take the odd bug in it) until I
> was persuaded to shift to creating netfslib and still have the code hanging
> around somewhere (it needs updating).
>
> Here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter-dir
>
> and, in particular, this patch:
>
> cachefiles: Implement a content-present indicator and bitmap
>
> The way it works is that you define a 'fragment size' for the cache file, say
> 256KiB or 2MiB, and then divide the file up into blocks of that size. A
> bitmap is stored in an xattr that indicates the occupancy of those blocks. If
> a file is completely stored locally, then we can dispense with the bitmap.
Using bitmaps does seem like a fairly straightforward approach, but I
think it is worthwhile to consider the details and consequences before
committing to a particular solution, especially since bitmaps have
shown disadvantages of their own, at least in filesystem contexts.
On the use of bitmaps: choosing a large block size can drastically
reduce the efficiency of the caching mechanism. For example, one of our
use-cases relies on the fine (4KiB) granularity, since one of our file
formats is internally heavily fragmented and only sparsely read. In
particular, the fine granularity of the current FSCache implementation
was the leading reason why we chose it over all of the other options -
so it is quite a distinguishing feature.
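As a back-of-the-envelope illustration (the access pattern is invented
for the example and is not a measurement of our workload; bytes_fetched()
is a made-up helper), this counts how much data has to be pulled in when
whole fragments must be cached for each small read:

```python
def bytes_fetched(read_offsets, read_len, fragment_size):
    """Bytes pulled from the server if only whole fragments are cached."""
    fragments = set()
    for off in read_offsets:
        first = off // fragment_size
        last = (off + read_len - 1) // fragment_size
        fragments.update(range(first, last + 1))
    return len(fragments) * fragment_size

# Hypothetical pattern: one 4KiB read every 1MiB across a 1GiB file.
reads = range(0, 1 << 30, 1 << 20)
for frag in (4 << 10, 256 << 10, 2 << 20):
    fetched = bytes_fetched(reads, 4 << 10, frag)
    print(frag >> 10, "KiB fragments:", fetched >> 20, "MiB fetched")
```

For this pattern the 4MiB actually read turns into 256MiB with 256KiB
fragments, and into the entire 1GiB file with 2MiB fragments.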
> This has a number of limitations:
>
> (1) We limit the xattr size to 512 bytes to hold down memory usage and, also,
> we can't read/modify parts of an xattr, only whole xattrs.
>
> (2) Because the xattr is limited in size, there is a limit on the number of
> blocks we can map. For 256KiB blocks, with 4096 bits in the map, we are
> limited to a maximum of 1GiB.
>
> (3) We could have multiple xattrs, each covering a different part of the
> file.
>
> (4) Setting xattrs is slow as each one is a synchronous journalled metadata
> operation.
This could be resolved by avoiding xattrs altogether and instead using
the (already) required support for sparse files: one could store the
contents of the cached file at a (fixed) offset into the backing file
(e.g. 1GiB) and use the space before this to store the bitmap (or
whatever other information is needed for the cache). Even with a block
size of 4KiB, an offset of 1GiB would still accommodate a bitmap for
files up to 32TiB, which likely covers the vast majority of use cases.
Would this be feasible? Or has this approach already been rejected?
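To make the arithmetic concrete, here is a rough sketch of the layout I
have in mind (Python only for brevity; DATA_OFFSET, BLOCK_SIZE and the
helper names are made up for illustration and are not taken from
cachefiles or from your patch):

```python
# Assumed layout: bitmap in [0, DATA_OFFSET), cached content from
# DATA_OFFSET onwards. Both constants are illustrative choices.
DATA_OFFSET = 1 << 30   # 1GiB reserved in front of the data
BLOCK_SIZE = 4 << 10    # 4KiB cache granularity

def max_cached_size(bitmap_bytes=DATA_OFFSET, block_size=BLOCK_SIZE):
    """Largest file whose occupancy bitmap fits into bitmap_bytes."""
    return bitmap_bytes * 8 * block_size

def bit_location(file_pos, block_size=BLOCK_SIZE):
    """Return (byte offset inside the bitmap, bit mask) for file_pos."""
    block = file_pos // block_size
    return block // 8, 1 << (block % 8)

def backing_offset(file_pos, data_offset=DATA_OFFSET):
    """Where a byte of the network file lands in the backing file."""
    return data_offset + file_pos

print(max_cached_size() >> 40, "TiB")            # prints "32 TiB"
print(max_cached_size(512, 256 << 10) >> 30, "GiB")  # prints "1 GiB"
```

The second call reproduces the 1GiB limit you quote for a 512-byte xattr
with 256KiB blocks, so the only thing that changes is where the bitmap
lives and how large it may grow.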
> (5) We have metadata integrity issues if we want to evict an bitmap for
> memory reclaim. We need to flush (and maybe sync) data from a separate
> filesystem before writing the xattr. This might not be so bad as the
> main metadata xattr on a cachefile has a flag in it that says the object
> is under modification, so we only need to flush, sync and write back all
> the bitmaps before altering that flag - and this can be done in the
> background.
I don't quite understand the intricacies of the race conditions you
refer to here, but I imagine this might also be simplified by storing
the bitmap/metadata within the file itself.
> One reason I haven't progressed much with it is that is at the point of
> turning into its own journalling filesystem... and we already have a bunch of
> those. Further, the point of cachefiles is that it uses files on an already
> mounted filesystem so that you don't have to have a dedicated blockdev for it.
Oh yes, reimplementing something that has already been done several
times over seems like busywork best avoided.
> An alternative method that may prove fruitful is to explore the way OpenAFS
> does caching: by having a bunch of, say, 256KiB cache files and an index that
> says which part of what network file is stored in what cache file.
Such an implementation would probably be useful, but just like using an
extent list/tree instead of a bitmap, it is far from straightforward
and as such prone to implementation & tuning difficulties.
> But if you're up for having a crack at forward porting the bitmap idea, I can
> give you some guidance.
Thank you for the offer, I may take you up on it in the future, but I'm
currently short on time for such an elaborate project.
Cheers
Benjamin