* Cachefiles slowdown caused by SEEK_HOLE
From: Benjamin Fischer @ 2025-05-30 9:03 UTC
To: David Howells; +Cc: netfs
Dear Cachefiles Maintainers,
I've observed that when using cachefiles there is extreme performance
degradation when the cache backing file (i.e. in /var/cache/fscache) is
severely fragmented.
For example, we have a 40GiB file on an NFS 4.2 mount (rsize of 1 MiB)
that fully resides in the local (fs)cache, and reading starts at ~16MB/s.
Reading the backing file directly can be done at ~500MB/s - the hardware
limit. The backing file has ~200k extents and resides on an ext4
formatted SSD.
Using perf record, I found the culprit to be iomap_seek_hole (caused by
cachefiles_prepare_read) and its descendants, which account for 98% of
the CPU time - in turn almost 100% of the wall time. So the root cause
is that SEEK_HOLE is too slow when it has to traverse lots of extents,
at least on ext4.
I've verified the time it takes to SEEK_HOLE manually and found
consistent results (66ms) that behave as expected: the further into the
file one starts, the faster the seek gets. This is also reflected in
the read rate through cachefiles, which grows in a 1-over-"remaining
file size" manner until it reaches the hardware-limited speed near the
end of the file.
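For concreteness, this is roughly how such a measurement can be done (a
minimal sketch - just a timed lseek() on the backing file, nothing
cachefiles-specific):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        off_t hole = lseek(fd, 0, SEEK_HOLE);  /* first hole at/after offset 0 */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (hole < 0) {
                perror("lseek");
                return 1;
        }
        printf("first hole at %lld, seek took %.2f ms\n", (long long)hole,
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        close(fd);
        return 0;
}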
This should also affect all other filesystems that search for holes in
such a linear fashion - which I imagine is most, if not all, of them.
These slowdowns will mostly affect fully cached files, exactly the case
where one would expect/need the best performance. They are also
exacerbated by a smaller rsize or by cache read/fill patterns that
induce a lot of fragmentation. Therefore I think it is sensible to
address this issue.
My naive impression is that using fiemap should help mitigate the
impact. One could still fall back on the existing SEEK_DATA/SEEK_HOLE
behavior in case fiemap is unavailable.
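For illustration, a minimal userspace sketch of a fiemap-based probe via
the FS_IOC_FIEMAP ioctl (in-kernel code would presumably go through the
->fiemap inode operation instead, so take this purely as the shape of
the idea):

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        unsigned int count = 32;  /* fetch the first batch of extents only */
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   count * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;           /* map the whole file */
        fm->fm_extent_count = count;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FS_IOC_FIEMAP");
                return 1;
        }
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                printf("extent %u: logical %llu, len %llu, flags %#x\n", i,
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_length,
                       fm->fm_extents[i].fe_flags);
        free(fm);
        close(fd);
        return 0;
}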
While I wouldn't mind attempting to contribute the necessary code, I'm
not too sure that my non-existent kernel development skills would
actually be helpful.
In any case, I wanted to bring this to your attention so that you can at
least ponder it.
Cheers
Benjamin
* Re: Cachefiles slowdown caused by SEEK_HOLE
From: David Howells @ 2025-06-20 13:28 UTC
To: Benjamin Fischer; +Cc: dhowells, netfs
Hi Benjamin,
> I've observed that when using cachefiles there is extreme performance
> degradation when the cache backing file (i.e. in /var/cache/fscache) is
> severely fragmented.
Yeah, I can imagine. One of the many things on my TODO list is to replace
this in some way. Unfortunately, the ext4 and xfs maintainers think it's not
a good idea to rely on the backing filesystem metadata as the backing fs is at
liberty to punch out blocks of zeros or insert bridging blocks in order to
better optimise the fragments list. The former would give a false negative,
causing us to have to go get the block again and, worse, the latter would give
a false positive, making us think we have data that we don't.
> For example, we have a 40GiB file on an NFS 4.2 mount (rsize of 1 MiB) that
> fully resides in the local (fs)cache, and reading starts at ~16MB/s. Reading
> the backing file directly can be done at ~500MB/s - the hardware limit. The
> backing file has ~200k extents and resides on an ext4 formatted SSD.
>
> Using perf record, I found the culprit to be iomap_seek_hole (caused by
> cachefiles_prepare_read) and its descendants, which account for 98% of the
> CPU time - in turn almost 100% of the wall time. So the root cause is that
> SEEK_HOLE is too slow when it has to traverse lots of extents, at least on
> ext4.
Ouch. Yeah, we have to do a SEEK_HOLE *and* a SEEK_DATA to define the limits
on a piece of occupied filespace - and that sucks. I might be able to use
FIEMAP, and certainly caching the result of the seeks ought to be fine... but
for the small matter of the aforementioned possibility of the backing fs
screwing with things.
> I've verified the time it takes to SEEK_HOLE manually and found consistent
> results (66ms) that behave as expected: the further into the file one starts,
> the faster the seek gets. This is also reflected in the read rate through
> cachefiles, which grows in a 1-over-"remaining file size" manner until it
> reaches the hardware-limited speed near the end of the file.
That probably reflects the way the extent list is stored.
> This should also affect all other filesystems that search for holes in such a
> linear fashion - which I imagine is most, if not all, of them. These slowdowns
> will mostly affect fully cached files, exactly the case where one would
> expect/need the best performance. They are also exacerbated by a smaller rsize
> or by cache read/fill patterns that induce a lot of fragmentation. Therefore I
> think it is sensible to address this issue.
>
> My naive impression is that using fiemap should help mitigate the impact. One
> could still fall back on the existing SEEK_DATA/SEEK_HOLE behavior in case
> fiemap is unavailable.
>
> While I wouldn't mind attempting to contribute the necessary code, I'm not
> too sure that my non-existent kernel development skills would actually be
> helpful.
>
> In any case, I wanted to bring this to your attention so that you can at
> least ponder it.
I have a solution that used to work (give or take the odd bug in it) until I
was persuaded to shift to creating netfslib, and I still have the code hanging
around somewhere (it needs updating).
Here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter-dir
and, in particular, this patch:
cachefiles: Implement a content-present indicator and bitmap
The way it works is that you define a 'fragment size' for the cache file, say
256KiB or 2MiB, and then divide the file up into blocks of that size. A
bitmap is stored in an xattr that indicates the occupancy of those blocks. If
a file is completely stored locally, then we can dispense with the bitmap.
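As a sketch of the lookup and the size arithmetic (illustrative only,
not the code from the branch above):

#include <stdbool.h>
#include <stdint.h>

#define FRAG_SIZE   (256 * 1024)  /* the chosen 'fragment size' */
#define BITMAP_LEN  512           /* xattr payload in bytes */

/* 512 bytes = 4096 bits; 4096 x 256KiB fragments = 1GiB of file covered,
 * which is where the limit in (2) below comes from. */
static bool block_present(const uint8_t bitmap[BITMAP_LEN], uint64_t file_pos)
{
        uint64_t block = file_pos / FRAG_SIZE;          /* fragment index */
        return bitmap[block / 8] & (1u << (block % 8));
}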
This has a number of limitations:
(1) We limit the xattr size to 512 bytes to hold down memory usage and, also,
we can't read/modify parts of an xattr, only whole xattrs.
(2) Because the xattr is limited in size, there is a limit on the number of
blocks we can map. For 256KiB blocks, with 4096 bits in the map, we are
limited to a maximum of 1GiB.
(3) We could have multiple xattrs, each covering a different part of the
file.
(4) Setting xattrs is slow as each one is a synchronous journalled metadata
operation.
(5) We have metadata integrity issues if we want to evict a bitmap for
memory reclaim. We need to flush (and maybe sync) data from a separate
filesystem before writing the xattr. This might not be so bad as the
main metadata xattr on a cachefile has a flag in it that says the object
is under modification, so we only need to flush, sync and write back all
the bitmaps before altering that flag - and this can be done in the
background.
(6) When it comes to the data itself, we have to create or download an entire
block in one go in order to cache it (which actually improves the
performance in some circumstances). This ought to be easier with
multipage folios - but the new readahead algorithm adds some irritations
of its own, and we can end up with competing readaheads that cause
caching to fail at the point where they meet if they don't align.
We don't necessarily have to write back the entire block if we only
changed, say, one byte.
One reason I haven't progressed much with it is that it is at the point of
turning into its own journalling filesystem... and we already have a bunch of
those. Further, the point of cachefiles is that it uses files on an already
mounted filesystem so that you don't have to have a dedicated blockdev for it.
An alternative method that may prove fruitful is to explore the way OpenAFS
does caching: by having a bunch of, say, 256KiB cache files and an index that
says which part of what network file is stored in what cache file.
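Roughly this shape of per-chunk index record, say (field names invented
purely for illustration):

#include <stdint.h>

#define CHUNK_SIZE (256 * 1024)

struct chunk_index_entry {
        uint64_t netfs_object_id;  /* which network file the chunk belongs to */
        uint64_t chunk_no;         /* chunk_no * CHUNK_SIZE = offset in it */
        uint32_t cache_file_id;    /* which local cache file holds the data */
        uint32_t valid_bytes;      /* may be short at EOF */
};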
But if you're up for having a crack at forward porting the bitmap idea, I can
give you some guidance.
David
* Re: Cachefiles slowdown caused by SEEK_HOLE
From: Benjamin Fischer @ 2025-06-24 13:36 UTC
To: David Howells; +Cc: netfs
Hi David,
thanks for the detailed explanations.
> [...] Unfortunately, the ext4 and xfs maintainers think it's not
> a good idea to rely on the backing filesystem metadata as the backing fs is at
> liberty to punch out blocks of zeros or insert bridging blocks in order to
> better optimise the fragments list. The former would give a false negative,
> causing us to have to go get the block again and, worse, the latter would give
> a false positive, making us think we have data that we don't.
That is truly unfortunate, since their allocation bitmaps/extent trees
would exactly cover the cache's bookkeeping needs without having to
implement and tune the same thing twice.
> [...] I might be able to use
> FIEMAP, and certainly caching the result of the seeks ought to be fine... but
> for the small matter of the aforementioned possibility of the backing fs
> screwing with things.
This seems like a reasonable compromise until the underlying problem can
be addressed in full - which seems quite challenging.
> I have a solution that used to work (give or take the odd bug in it) until I
> was persuaded to shift to creating netfslib and still have the code hanging
> around somewhere (it needs updating).
>
> Here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter-dir
>
> and, in particular, this patch:
>
> cachefiles: Implement a content-present indicator and bitmap
>
> The way it works is that you define a 'fragment size' for the cache file, say
> 256KiB or 2MiB, and then divide the file up into blocks of that size. A
> bitmap is stored in an xattr that indicates the occupancy of those blocks. If
> a file is completely stored locally, then we can dispense with the bitmap.
Using bitmaps does seem like a fairly straightforward approach, but I
think it is worthwhile to consider the details and consequences before
committing to a particular solution - especially since bitmaps have been
shown to have disadvantages, at least in filesystem contexts.
On the use of bitmaps: choosing a large block size can drastically
reduce the efficiency of the caching mechanism. For example, one of our
use-cases relies on the fine (4KiB) granularity, since one of our file
formats is internally heavily fragmented and only sparsely read. In
particular, the fine granularity of the current FSCache implementation
was the leading reason why we chose it over all of the other options -
so it's quite a distinguishing feature.
> This has a number of limitations:
>
> (1) We limit the xattr size to 512 bytes to hold down memory usage and, also,
> we can't read/modify parts of an xattr, only whole xattrs.
>
> (2) Because the xattr is limited in size, there is a limit on the number of
> blocks we can map. For 256KiB blocks, with 4096 bits in the map, we are
> limited to a maximum of 1GiB.
>
> (3) We could have multiple xattrs, each covering a different part of the
> file.
>
> (4) Setting xattrs is slow as each one is a synchronous journalled metadata
> operation.
This could be resolved by avoiding xattrs altogether and instead using
the (already required) support for sparse files: one could store the
contents of the file at a fixed offset into the cache file (e.g. 1GiB)
and use the space before that to store the bitmap (or whatever other
information the cache needs). Even with a block size of 4KiB, an offset
of 1GiB would still accommodate a bitmap for files up to 32TiB, which
likely covers the vast majority of use cases. Would this be feasible?
Or has this approach already been rejected?
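As a sketch of the layout I have in mind (all names and constants here
are my assumptions):

#include <stdint.h>

#define CACHE_BLOCK_SIZE  4096ULL       /* 4KiB caching granularity */
#define HEADER_SIZE       (1ULL << 30)  /* first 1GiB reserved for metadata */

/* 1GiB of header = 2^33 bits; each bit covers 4KiB = 2^12 bytes, so the
 * bitmap can describe up to 2^45 bytes = 32TiB of cached file. */
static uint64_t data_offset(uint64_t netfs_pos)
{
        return HEADER_SIZE + netfs_pos;  /* where a cached byte lands */
}

static uint64_t bitmap_byte(uint64_t netfs_pos)
{
        return netfs_pos / CACHE_BLOCK_SIZE / 8;  /* where its bit lives */
}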
> (5) We have metadata integrity issues if we want to evict a bitmap for
> memory reclaim. We need to flush (and maybe sync) data from a separate
> filesystem before writing the xattr. This might not be so bad as the
> main metadata xattr on a cachefile has a flag in it that says the object
> is under modification, so we only need to flush, sync and write back all
> the bitmaps before altering that flag - and this can be done in the
> background.
I don't quite understand the intricacies of the race conditions you
refer to here, but I imagine this might also be simplified by storing
the bitmap/metadata within the file itself.
> One reason I haven't progressed much with it is that is at the point of
> turning into its own journalling filesystem... and we already have a bunch of
> those. Further, the point of cachefiles is that it uses files on an already
> mounted filesystem so that you don't have to have a dedicated blockdev for it.
Oh yes, reimplementing something that has already been done several
times over seems like busy work best avoided.
> An alternative method that may prove fruitful is to explore the way OpenAFS
> does caching: by having a bunch of, say, 256KiB cache files and an index that
> says which part of what network file is stored in what cache file.
Such an implementation would probably be useful, but just like using an
extent list/tree instead of a bitmap, it is far from straightforward and
as such prone to implementation & tuning difficulties.
> But if you're up for having a crack at forward porting the bitmap idea, I can
> give you some guidance.
Thank you for the offer - I may take you up on it in the future, but I'm
currently short on time for such an elaborate project.
Cheers
Benjamin
* Re: Cachefiles slowdown caused by SEEK_HOLE
From: David Howells @ 2025-06-24 15:19 UTC
To: Benjamin Fischer; +Cc: dhowells, netfs
Benjamin Fischer <benjamin.fischer@cern.ch> wrote:
> > (4) Setting xattrs is slow as each one is a synchronous journalled
> > metadata operation.
> This could be resolved by avoiding xattrs altogether and instead using the
> (already required) support for sparse files: one could store the contents of
> the file at a fixed offset into the cache file (e.g. 1GiB) and use the space
> before that to store the bitmap (or whatever other information the cache
> needs). Even with a block size of 4KiB, an offset of 1GiB would still
> accommodate a bitmap for files up to 32TiB, which likely covers the vast
> majority of use cases. Would this be feasible? Or has this approach already
> been rejected?
Not completely rejected. It's just that there's a bunch of different design
compromises one can make. One of the original design decisions I made was
that I didn't want to restrict the amount of filespace one could cache; with
your suggestion, you lose the last 32TiB of the 8EiB you can access. I don't
know if this is a problem.
Also, at the time I first did this, you couldn't do DIO to/from kernel space.
That said, probably the single most effective change would be to dispense with
mapping the contents of a file if I know I have all of it in the cache. By
far and away, the most common write pattern is { open(O_TRUNC); write();
write(); write(); close(); }. Noting this with a flag in the main xattr on a
file would allow all the seeking to be skipped.
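Hypothetically, something of this shape (invented names; not the actual
cachefiles code):

enum read_from { READ_FROM_CACHE, PROBE_BACKING_FILE };

struct cache_object {
        int fully_populated;  /* would mirror a flag in the main xattr */
};

static enum read_from prepare_read(const struct cache_object *obj)
{
        if (obj->fully_populated)
                return READ_FROM_CACHE;   /* skip SEEK_HOLE/SEEK_DATA entirely */
        return PROBE_BACKING_FILE;        /* fall back to seeking, as today */
}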
> > (5) We have metadata integrity issues if we want to evict an bitmap for
> > memory reclaim. We need to flush (and maybe sync) data from a
> > separate filesystem before writing the xattr. This might not be so
> > bad as the main metadata xattr on a cachefile has a flag in it that
> > says the object is under modification, so we only need to flush,
> > sync and write back all the bitmaps before altering that flag - and
> > this can be done in the background.
> I don't quite understand the intricacies of the race conditions you refer to
> here, but I imagine this might also be simplified by storing the
> bitmap/metadata within the file itself.
The problem is that when we reconnect a network fs file with a cache object,
we have to know that the cache object is in a good state. Imagine a scenario
in which the machine crashes while I/O is in progress to the netfs and the
cache. We could just blow the cache away entirely, but situations have been
encountered where that really sucks.
What I currently do is use an xattr to mark a cache object as being "dirty".
When that object is opened, the dirty flag is set before we do any other
modifications to it. When the object is closed, I have to flush all
outstanding data and metadata changes to it before I can remove the dirty
mark.
If I don't do that, and the system crashes, say, between the dirty mark being
set and the changes hitting the disk, we can end up with an unknowingly
corrupt cache object.
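As a userspace analogue of that ordering (the xattr name is made up, and
this shows the mark per-operation rather than per open/close as
described above):

#include <sys/xattr.h>
#include <unistd.h>

static int modify_cache_object(int fd, const void *buf, size_t len, off_t pos)
{
        /* 1. Persist the dirty mark before touching anything else. */
        if (fsetxattr(fd, "user.cache.dirty", "1", 1, 0) < 0)
                return -1;
        if (fsync(fd) < 0)  /* make sure the mark hits disk first */
                return -1;

        /* 2. Make the modifications. */
        if (pwrite(fd, buf, len, pos) < 0)
                return -1;

        /* 3. Flush data and metadata; only then remove the mark. */
        if (fsync(fd) < 0)
                return -1;
        return fremovexattr(fd, "user.cache.dirty");
}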
The xattr with the dirty mark also holds coherency info that needs updating.
For AFS, that's the data version number; for NFS, that's the ctime, mtime and
change attribute.
Storing the bitmap in the file does simplify things in a couple of ways: I can
trivially read/write parts of it (with DIO, even) and fdatasync() will flush
the bitmap content in addition to the data.
> > An alternative method that may prove fruitful is to explore the way
> > OpenAFS does caching: by having a bunch of, say, 256KiB cache files and an
> > index that says which part of what network file is stored in what cache
> > file.
>
> Such an implementation would probably be useful, but just like using an
> extent list/tree instead of a bitmap, it is far from straightforward and as
> such prone to implementation & tuning difficulties.
Yeah... there's no single right answer.