From: David Howells <dhowells@redhat.com>
To: Benjamin Fischer <benjamin.fischer@cern.ch>
Cc: dhowells@redhat.com, netfs@lists.linux.dev
Subject: Re: Cachefiles slowdown caused by SEEK_HOLE
Date: Fri, 20 Jun 2025 14:28:50 +0100
Message-ID: <963373.1750426130@warthog.procyon.org.uk>
In-Reply-To: <31cd8f34-1b37-4062-925a-baedec8f2f79@cern.ch>

Hi Benjamin,

> I've observed that when using cachefiles there is extreme performance
> degradation when the cache backing file (i.e. in /var/cache/fscache) is
> severely fragmented.

Yeah, I can imagine.  One of the many things on my TODO list is to replace
this mechanism in some way.  Unfortunately, the ext4 and xfs maintainers
think it's not a good idea to rely on the backing filesystem's metadata, as
the backing fs is at liberty to punch out blocks of zeros or to insert
bridging blocks in order to better optimise its extent list.  The former
would give a false negative, forcing us to go and fetch the block again;
worse, the latter would give a false positive, making us think we have data
that we don't.
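
To make the false-negative case concrete: if the backing fs decides a block
of zeros needn't occupy space, a SEEK_DATA-based presence check sees a hole.
A purely hypothetical userspace sketch, using an explicit hole-punch to stand
in for the fs doing it by itself:

/* Hypothetical sketch of the false-negative case: data that was
 * written and then punched out reads back as a hole, so a
 * SEEK_DATA-based presence check concludes it was never cached.  The
 * explicit fallocate() stands in for the backing fs optimising its
 * own extent list.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	char zeros[65536] = {0};
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0600);
	off_t data;

	if (fd < 0) { perror("open"); return 1; }

	pwrite(fd, zeros, sizeof(zeros), 0);	/* "cache" a block of zeros */
	fsync(fd);

	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  0, sizeof(zeros));

	/* The presence check now reports no data at all: */
	data = lseek(fd, 0, SEEK_DATA);
	printf("SEEK_DATA from 0 -> %lld (-1/ENXIO means 'no data')\n",
	       (long long)data);
	return 0;
}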

> For example, we have a 40GiB file in an NFS 4.2 mount (rsize of 1MiB) that
> fully resides in the local (fs)cache, and reading starts at ~16MB/s. Reading
> the backing file directly can be done at ~500MB/s - the hardware limit. The
> backing file has ~200k extents and resides on an ext4-formatted SSD.
> 
> Using perf record, I found the culprit to be iomap_seek_hole (called from
> cachefiles_prepare_read) and its descendants, which account for 98% of CPU
> time, which in turn is almost 100% of the wall time. So the root cause is
> that SEEK_HOLE is too slow when it has to traverse lots of extents, at
> least on ext4.

Ouch.  Yeah, we have to do a SEEK_HOLE *and* a SEEK_DATA to define the limits
on a piece of occupied filespace - and that sucks.  I might be able to use
FIEMAP, and certainly caching the result of the seeks ought to be fine... but
for the small matter of the aforementioned possibility of the backing fs
screwing with things.
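
In userspace terms, that pair of seeks is roughly the following (illustrative
only; in-kernel it goes through vfs_llseek() rather than the syscall):

/* Minimal sketch of bounding an occupied region with one SEEK_DATA
 * plus one SEEK_HOLE.  Each call may have to walk the backing file's
 * extent list, which is what gets slow at ~200k extents.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	off_t data, hole;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	data = lseek(fd, 0, SEEK_DATA);		/* start of first extent */
	if (data < 0) { perror("SEEK_DATA"); return 1; }
	hole = lseek(fd, data, SEEK_HOLE);	/* hole that terminates it */
	if (hole < 0) { perror("SEEK_HOLE"); return 1; }

	printf("data %lld..%lld\n", (long long)data, (long long)hole);
	return 0;
}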

> I've verified the time it takes to SEEK_HOLE manually and found consistent
> results (66ms) that behave as expected: the further into the file one
> starts, the faster the seek gets.  This is also reflected in the read rate
> through cachefiles, which grows in a one-over-"remaining file size" manner
> until it reaches the hardware-limited speed near the end of the file.

That probably reflects the way the extent list is stored.

> This should also affect all other filesystems that search for holes in such
> a linear fashion - which I imagine is most, if not all, of them.  These
> slowdowns will mostly affect fully cached files, exactly the case where one
> would expect/need the best performance.  They are also exacerbated by a
> smaller rsize or by cache read/fill patterns that induce a lot of
> fragmentation.  Therefore I think it sensible to address this issue.
> 
> My naive impression is that using fiemap should help mitigate the impact.
> One could still fall back on the existing SEEK_DATA/SEEK_HOLE behavior in
> case fiemap is unavailable.
> 
> While I wouldn't mind attempting to contribute the necessary code, I'm not
> too sure that my non-existent kernel development skills would actually be
> helpful.
> 
> In any case, I wanted to bring this to your attention so that you can at
> least ponder it.

I have a solution that used to work (give or take the odd bug in it) until I
was persuaded to shift to creating netfslib, and I still have the code
hanging around somewhere (it needs updating).

Here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter-dir

and, in particular, this patch:

	cachefiles: Implement a content-present indicator and bitmap

The way it works is that you define a 'fragment size' for the cache file, say
256KiB or 2MiB, and then divide the file up into blocks of that size.  A
bitmap stored in an xattr indicates the occupancy of those blocks.  If a file
is completely stored locally, then we can dispense with the bitmap.
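
In outline, presence testing then becomes simple block arithmetic over a
small bitmap.  A hypothetical sketch (the names are made up, not the ones in
the branch):

/* Hypothetical sketch of the content bitmap (made-up names, not the
 * ones in the fscache-iter-dir branch).  One bit per fragment; the
 * whole map lives in a single xattr on the backing file.
 */
#include <stdbool.h>
#include <stddef.h>

#define CACHEFILES_FRAG_SHIFT	18		/* 256KiB fragments */
#define CACHEFILES_FRAG_SIZE	(1UL << CACHEFILES_FRAG_SHIFT)
#define CACHEFILES_MAP_BYTES	512		/* xattr size cap */

struct cachefiles_content_map {
	unsigned char bits[CACHEFILES_MAP_BYTES];	/* 4096 bits */
};

static bool frag_is_present(const struct cachefiles_content_map *map,
			    unsigned long long pos)
{
	size_t frag = pos >> CACHEFILES_FRAG_SHIFT;

	return map->bits[frag / 8] & (1 << (frag % 8));
}

static void frag_mark_present(struct cachefiles_content_map *map,
			      unsigned long long pos)
{
	size_t frag = pos >> CACHEFILES_FRAG_SHIFT;

	map->bits[frag / 8] |= 1 << (frag % 8);
}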

This has a number of limitations:

 (1) We limit the xattr size to 512 bytes to hold down memory usage; also, we
     can't read/modify parts of an xattr, only whole xattrs.

 (2) Because the xattr is limited in size, there is a limit on the number of
     blocks we can map.  For 256KiB blocks, with 4096 bits in the map, we are
     limited to a maximum of 1GiB.

 (3) We could have multiple xattrs, each covering a different part of the
     file (see the sketch after this list).

 (4) Setting xattrs is slow as each one is a synchronous journalled metadata
     operation.

 (5) We have metadata integrity issues if we want to evict a bitmap for
     memory reclaim.  We need to flush (and maybe sync) data from a separate
     filesystem before writing the xattr.  This might not be so bad, as the
     main metadata xattr on a cachefile has a flag in it that says the object
     is under modification, so we only need to flush, sync and write back all
     the bitmaps before altering that flag - and this can be done in the
     background.

 (6) When it comes to the data itself, we have to create or download an entire
     block in one go in order to cache it (which actually improves the
     performance in some circumstances).  This ought to be easier with
     multipage folios - but the new readahead algorithm adds some irritations
     of its own, and we can end up with competing readaheads that cause
     caching to fail at the point where they meet if they don't align.

     We don't necessarily have to write back the entire block if we've only
     changed, say, one byte.
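
To put numbers on (2) and (3): 512 bytes of map is 4096 bits, and at 256KiB
per bit that's 4096 * 256KiB = 1GiB per xattr, so a larger file would need
one map per 1GiB slice.  Continuing the hypothetical sketch above (the xattr
name is invented for illustration):

/* Continuing the sketch above: pick the xattr covering a given file
 * position.  Each 512-byte map covers 4096 * 256KiB = 1GiB, so slice
 * N of the file gets its own xattr.
 */
#include <stdio.h>
#include <sys/xattr.h>

#define FRAGS_PER_MAP	(CACHEFILES_MAP_BYTES * 8)	/* 4096 */
#define BYTES_PER_MAP	((unsigned long long)FRAGS_PER_MAP << CACHEFILES_FRAG_SHIFT)

static ssize_t read_map_for(int fd, unsigned long long pos,
			    struct cachefiles_content_map *map)
{
	char name[40];

	snprintf(name, sizeof(name), "user.cachefiles.map.%llu",
		 pos / BYTES_PER_MAP);
	return fgetxattr(fd, name, map->bits, sizeof(map->bits));
}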

One reason I haven't progressed much with it is that it's at the point of
turning into its own journalling filesystem... and we already have a bunch of
those.  Further, the point of cachefiles is that it uses files on an
already-mounted filesystem so that you don't have to have a dedicated
blockdev for it.

An alternative method that may prove fruitful is to explore the way OpenAFS
does caching: by having a bunch of, say, 256KiB cache files and an index that
says which part of what network file is stored in what cache file.
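
A minimal sketch of what such an index might hold (entirely hypothetical;
OpenAFS's actual on-disk format differs):

/* Entirely hypothetical sketch of an OpenAFS-style chunk index:
 * fixed-size cache files plus a table mapping (network file, chunk
 * number) to whichever small cache file holds that 256KiB chunk.
 */
#include <stdint.h>

#define CHUNK_SHIFT	18	/* 256KiB chunks */

struct chunk_index_entry {
	uint64_t cookie;	/* identifies the network file */
	uint64_t chunk;		/* file offset >> CHUNK_SHIFT */
	uint32_t cache_file;	/* number of the cache file holding it */
	uint32_t len;		/* valid bytes within the chunk */
};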

But if you're up for having a crack at forward porting the bitmap idea, I can
give you some guidance.

David

