public inbox for linux-btrfs@vger.kernel.org
From: Boris Burkov <boris@bur.io>
To: Dimitrios Apostolou <jimis@gmx.net>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-btrfs@vger.kernel.org, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Anand Jain <anand.jain@oracle.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: Sequential read(8K) from compressed files are very slow
Date: Fri, 6 Jun 2025 17:37:43 -0700	[thread overview]
Message-ID: <20250607003743.GA4182169@zen.localdomain> (raw)
In-Reply-To: <d934d1ea-4e3e-71ef-8b42-698ccd747799@gmx.net>

On Thu, Jun 05, 2025 at 07:09:07PM +0200, Dimitrios Apostolou wrote:
> Hi Boris, thank you for investigating! I've been chasing this for years and
> I was hitting a wall, the bottleneck was not obvious at all when looking
> from outside the kernel. I've started a few threads before but they were
> fruitless.

Happy to help, it's an interesting problem!

> 
> On Wed, 4 Jun 2025, Boris Burkov wrote:
> 
> > 
> > stats from an 8K run:
> > $ sudo bpftrace readahead.bt
> > Attaching 4 probes...
> > 
> > @add_ra_delay_ms: 19450
> > @add_ra_delay_ns: 19450937640
> > @add_ra_delay_s: 19
> > 
> > @ra_sz_freq[8]: 81920
> > @ra_sz_hist:
> > [8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > 
> > stats from a 128K run:
> > $ sudo bpftrace readahead.bt
> > Attaching 4 probes...
> > 
> > @add_ra_delay_ms: 15
> > @add_ra_delay_ns: 15333301
> > @add_ra_delay_s: 0
> > 
> > @ra_sz_freq[512]: 1
> > @ra_sz_freq[256]: 1
> > @ra_sz_freq[128]: 2
> > @ra_sz_freq[1024]: 2559
> > @ra_sz_hist:
> > [128, 256)             2 |                                                    |
> > [256, 512)             1 |                                                    |
> > [512, 1K)              1 |                                                    |
> > [1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > 
> > so we are spending 19 seconds (vs 0) in add_ra_bio_pages and calling
> > btrfs_readahead() 81920 times with 8 pages vs 2559 times with 1024
> > pages.
> 
> I specifically like the bpftrace utility you are using, it opens up new
> possibilities without custom kernel compiles, so I want to experiment. Could
> you please include the script you used for this histogram?
> 

Unfortunately, I modified the script a bunch since using it. So I don't
have that exact one lying around.

But the features necessary are basically:

fentry:YOUR_FUNC {
        $val = args->YOUR_ARG->YOUR_FIELD->ANOTHER_FIELD; // whatever is relevant
        @h = hist($val);
}

And a bpftrace and kernel built with enough debugging features like BTF
to support it.
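For reference, a minimal reconstruction of what such a script could look
like (note: the function and field names here are my guesses pieced
together from the stat names above, not the original script, so
double-check them against your kernel's BTF):

// Histogram of readahead sizes as requests enter btrfs_readahead(),
// plus total time spent in add_ra_bio_pages() via an fentry/fexit pair.
fentry:btrfs_readahead {
        @ra_sz_freq[args->rac->_nr_pages] = count();
        @ra_sz_hist = hist(args->rac->_nr_pages);
}

fentry:add_ra_bio_pages {
        @start[tid] = nsecs;
}

fexit:add_ra_bio_pages {
        if (@start[tid]) {
                @add_ra_delay_ns = sum(nsecs - @start[tid]);
                delete(@start[tid]);
        }
}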

The quickstart oneliners here:
https://github.com/bpftrace/bpftrace/blob/master/docs/tutorial_one_liners.md
and the full manual:
https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc
I generally find both to be quite useful in practice while hacking on
scripts. You can also refer to a bunch of my examples here if you like,
but caveat emptor on quality :)
This one is a good recent example using fentry and args:
https://github.com/boryas/scripts/blob/main/bt/compr-leak.bt
but that directory has many others from over the years.

If you are reading older scripts, they will often use kprobe/kretprobe
instead of fentry/fexit, FYI.

> > 
> > The total time difference is ~30s on my setup, so there are still ~10
> > seconds unaccounted for in my analysis here, though.
> 
> This is outstanding. I expect such improvement will give a *huge* boost to
> postgresql workloads on compressed filesystems. By huge I mean 5-10x for
> sequential table scans.

As long as it doesn't regress other workloads too much! Fingers crossed,
and working on further perf testing :)

> 
> I'm also wondering, in the past I was trying to see if it makes any
> difference to tweak the setting /sys/block/sdX/queue/read_ahead_kb but
> couldn't see any substantial change. Do you see it affecting your results,
> with your patch applied? Or is btrfs following different code paths and
> completely ignoring that?
> 

Sorry, haven't gotten to testing this yet.

> > 
> > > > I removed all the extent locking as an experiment, as it is not really
> > > > needed for safety in this single threaded test and did see an
> > > > improvement but not full parity between 8k and 128k for the compressed
> > > > file. I'll keep poking at the other sources of overhead in the builtin
> > > > readahead logic, and in calling btrfs_readahead more often vs. looping inside it.
> 
> Since your findings indicate that the issue is probably lock contention, you
> might want to try /proc/lock_stat. It requires a kernel built with
> CONFIG_LOCK_STAT, which is what blocks me at the moment, but it might be
> easier for you if you already compile it for developing btrfs. Docs at:
> 
> https://docs.kernel.org/locking/lockstat.html

I think it might be more the cost of running the extent range locking
algorithms (basically an rb tree storing a set of bits on ranges) than
contention on the lock itself. And the "lock" is more like waiting for an
event from this data structure, so I don't think it would show up in
lockstat out of the box. But appreciate the tip!
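
That said, if you wanted to measure the waiting itself, the same
fentry/fexit trick works on the wait path directly. Something like this
(lock_extent is my guess at the relevant entry point; verify it exists
and isn't inlined on your build):

// Total time spent acquiring extent range locks, including waiting
// for conflicting ranges to be unlocked.
fentry:lock_extent {
        @lk[tid] = nsecs;
}
fexit:lock_extent {
        if (@lk[tid]) {
                @extent_lock_wait_ns = sum(nsecs - @lk[tid]);
                delete(@lk[tid]);
        }
}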

Boris

> 
> 
> Thank you,
> Dimitris
> 


Thread overview: 8+ messages
2025-06-03 19:56 Sequential read(8K) from compressed files are very slow Dimitrios Apostolou
2025-06-04  1:36 ` Boris Burkov
2025-06-04  6:22   ` Christoph Hellwig
2025-06-04 18:03     ` Boris Burkov
2025-06-04 21:49       ` Boris Burkov
2025-06-05  4:35         ` Christoph Hellwig
2025-06-05 17:09       ` Dimitrios Apostolou
2025-06-07  0:37         ` Boris Burkov [this message]
