From: Boris Burkov <boris@bur.io>
To: Christoph Hellwig <hch@infradead.org>
Cc: Dimitrios Apostolou <jimis@gmx.net>,
	linux-btrfs@vger.kernel.org, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Anand Jain <anand.jain@oracle.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: Sequential read(8K) from compressed files are very slow
Date: Wed, 4 Jun 2025 11:03:03 -0700
Message-ID: <20250604180303.GA978719@zen.localdomain>
In-Reply-To: <aD_mE1n1fmQ09klP@infradead.org>

On Tue, Jun 03, 2025 at 11:22:11PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 06:36:11PM -0700, Boris Burkov wrote:
> > However, I do observe the huge delta between bs=8k and bs=128k for
> > compressed which is interesting, even if I am doing something dumb and
> > failing to reproduce the fast uncompressed reads.
> > 
> > I also observe that the performance rapidly drops off below bs=32k.
> > Using the highly compressible file, I get 1.4GB/s with 128k, 64k, 32k
> > and then 200-400MB/s for 4k-16k.
> > 
> > IN THEORY, add_ra_bio_pages is supposed to be doing our own readahead
> > caching for compressed pages that we have read, so I think any overhead
> > we incur is not going to be making tons more IO. It will probably be in
> > that readahead caching or in some interaction with VFS readahead.
> 
> > However, I do see that in the 8k case, we are repeatedly calling
> > btrfs_readahead() while in the 128k case, we only call btrfs_readahead
> > ~2500 times, and the rest of the time we loop inside btrfs_readahead
> > calling btrfs_do_readpage.
> 
> Btw, I've always found the way add_ra_bio_pages works in btrfs a little
> odd.  The core readahead code provides a readahead_expand() that should
> do something similar, but more efficiently.  The difference is that it
> only works for actual readahead calls and not ->read_folio, but the
> latter is pretty much a last resort these days.
> 

Some more evidence that our add_ra_bio_pages is at least a big part of
where things are going slow:
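
The readahead.bt script itself isn't quoted here; as a rough idea of what
it measures, a minimal sketch could look like the below (this is a sketch,
not the exact script: the probe points, reading rac->_nr_pages for the
request size, and skipping the ms/s rollups are all simplifications):

#!/usr/bin/env bpftrace
/*
 * readahead.bt (sketch) -- time spent in add_ra_bio_pages plus the size
 * of each btrfs_readahead() request in pages.
 *
 * Assumes add_ra_bio_pages is not inlined (it is static in
 * fs/btrfs/compression.c) and that kernel BTF is available for the kfunc
 * probe.  If btrfs is built as a module, the kfunc probe may need to be
 * spelled kfunc:btrfs:btrfs_readahead instead.
 */

/* one hit per btrfs_readahead() call, bucketed by request size in pages */
kfunc:btrfs_readahead
{
	$nr = args->rac->_nr_pages;
	@ra_sz_freq[$nr] = count();
	@ra_sz_hist = hist($nr);
}

/* wall time spent inside add_ra_bio_pages() */
kprobe:add_ra_bio_pages
{
	@start[tid] = nsecs;
}

kretprobe:add_ra_bio_pages
/@start[tid]/
{
	@add_ra_delay_ns = sum(nsecs - @start[tid]);
	delete(@start[tid]);
}

END
{
	/* don't dump the per-thread start timestamps at exit */
	clear(@start);
}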

stats from an 8K run:
$ sudo bpftrace readahead.bt
Attaching 4 probes...

@add_ra_delay_ms: 19450
@add_ra_delay_ns: 19450937640
@add_ra_delay_s: 19

@ra_sz_freq[8]: 81920
@ra_sz_hist:
[8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


stats from a 128K run:
$ sudo bpftrace readahead.bt
Attaching 4 probes...

@add_ra_delay_ms: 15
@add_ra_delay_ns: 15333301
@add_ra_delay_s: 0

@ra_sz_freq[512]: 1
@ra_sz_freq[256]: 1
@ra_sz_freq[128]: 2
@ra_sz_freq[1024]: 2559
@ra_sz_hist:
[128, 256)             2 |                                                    |
[256, 512)             1 |                                                    |
[512, 1K)              1 |                                                    |
[1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


So we are spending ~19 seconds (vs. ~15ms) in add_ra_bio_pages, and calling
btrfs_readahead() 81920 times with 8-page requests vs 2559 times with
1024-page requests.

The total time difference is ~30s on my setup, so there are still ~10
seconds unaccounted for in my analysis here, though.

> > I removed all the extent locking as an experiment, as it is not really
> > needed for safety in this single threaded test and did see an
> > improvement but not full parity between 8k and 128k for the compressed
> > file. I'll keep poking at the other sources of overhead in the builtin
> > readahead logic and in calling btrfs_readahead more often vs. looping
> > inside it.
> > 
> > I'll keep trying that to see if I can get a full reproduction and try to
> > actually explain the difference.
> > 
> > If you are able to use some kind of tracing to see if these findings
> > hold on your system, I think that would be interesting.
> > 
> > Also, if someone more knowledgeable about how the generic readahead
> > works, like Christoph, could give me a hint on how to hack up the 8k case
> > to make fewer calls to btrfs_readahead, I think that could help bolster the
> > theory.
> 
> I'm not really a readahead expert, but I added someone who I suspect is
> (willy).  But my guess is that using readahead_expand is the right answer
> here; another thing to look at might be whether the compressed extent
> handling in btrfs somehow messes up the readahead window calculations.
> 

Thread overview: 8+ messages
2025-06-03 19:56 Sequential read(8K) from compressed files are very slow Dimitrios Apostolou
2025-06-04  1:36 ` Boris Burkov
2025-06-04  6:22   ` Christoph Hellwig
2025-06-04 18:03     ` Boris Burkov [this message]
2025-06-04 21:49       ` Boris Burkov
2025-06-05  4:35         ` Christoph Hellwig
2025-06-05 17:09       ` Dimitrios Apostolou
2025-06-07  0:37         ` Boris Burkov
