public inbox for linux-btrfs@vger.kernel.org
From: Boris Burkov <boris@bur.io>
To: Dimitrios Apostolou <jimis@gmx.net>
Cc: linux-btrfs@vger.kernel.org, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Christoph Hellwig <hch@infradead.org>,
	Anand Jain <anand.jain@oracle.com>
Subject: Re: Sequential read(8K) from compressed files are very slow
Date: Tue, 3 Jun 2025 18:36:11 -0700	[thread overview]
Message-ID: <20250604013611.GA485082@zen.localdomain> (raw)
In-Reply-To: <34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net>

On Tue, Jun 03, 2025 at 09:56:22PM +0200, Dimitrios Apostolou wrote:
> Hello list,
> 
> I notice consistently that sequential reads from compressed files are slow
> when the block size is small.
> 
> It's extremely easy to reproduce, but the problem is I can't find the
> bottleneck. I suspect it is an issue in Btrfs, which is not doing read-ahead
> like it does for non-compressed files.
> 
> Can you reproduce it? I'd also appreciate instructions for debugging it
> further, or for tweaking my system to improve this case.

I'm actually having trouble reproducing the fast performance for
uncompressed files with a small read blocksize. I get 1.2GB/s for
compressed reads with a large bs setting (32k+), but only 600MB/s for a
10G uncompressed file with 80 extents just like yours, with both bs=8k
and bs=1M...

However, I do observe the huge delta between bs=8k and bs=128k for
compressed which is interesting, even if I am doing something dumb and
failing to reproduce the fast uncompressed reads.

I also observe that performance rapidly drops off below bs=32k. Using
the highly compressible file, I get 1.4GB/s with 128k, 64k, and 32k,
then 200-400MB/s for 4k-16k.

IN THEORY, add_ra_bio_pages is supposed to do our own readahead caching
for compressed pages we have read, so I don't think any overhead we
incur comes from issuing lots of extra IO. It is probably in that
readahead caching or in some interaction with VFS readahead.

To confirm this, I ran the following bpftrace one-liner while running
the 8k and 128k reproducers:
bpftrace -e 'fentry:btrfs_submit_compressed_read {@[kstack]=count();}'

results:
128K:
@[
        _sub_D_65535_0+21531255
        _sub_D_65535_0+21531255
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 2562
@[
        _sub_D_65535_0+21531255
        _sub_D_65535_0+21531255
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_do_readpage+2751
        btrfs_readahead+760
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 79354

8K:
@[
        _sub_D_65535_0+21531247
        _sub_D_65535_0+21531247
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_get_pages+717
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 40960
@[
        _sub_D_65535_0+21531247
        _sub_D_65535_0+21531247
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 40960

So both cases issue the same total number of IOs (81920), which is 10G
divided into 128K extents. Therefore, btrfs does not appear to be
issuing more compressed IOs in the bs=8K case. I also checked metadata
IOs and nothing unusual stuck out to me.

However, I do see that in the 8k case we call btrfs_readahead()
repeatedly, while in the 128k case we only call btrfs_readahead()
~2500 times and spend the rest of the time looping inside it, calling
btrfs_do_readpage().

This incurs two possible sources of overhead: locking/unlocking the
extent, and possibly failing to cache the extent_map. Tracing
btrfs_get_extent shows that both sizes call it 81920 times, so it's not
extent_map caching.

But the slow variant does call the lock/unlock extent functions ~79k
more times.

I then instrumented those functions to measure their overhead and saw
that the slow variant spent roughly 20s more wall time on this
locking/unlocking than the fast one:

8K:
@lock_delay_ns: 5623087671

@lookup_delay_ns: 29422788

@unlock_delay_ns: 19067071333

128K:
@lock_delay_ns: 91195771

@lookup_delay_ns: 1302691

@unlock_delay_ns: 4911877331
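Just to spell out the arithmetic, the delta implied by the totals
above works out as follows (nanoseconds, truncated to whole seconds):

```shell
# Extra time the bs=8k run spends in extent lock/unlock relative to
# bs=128k, from the bpftrace delay totals above.
slow=$(( 5623087671 + 19067071333 ))   # 8K lock + unlock, ns
fast=$((   91195771 +  4911877331 ))   # 128K lock + unlock, ns
echo $(( (slow - fast) / 1000000000 )) # whole seconds
```

That is ~19.7s of extra wall time attributable to extent
locking/unlocking in the slow case.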

As an experiment, I removed all the extent locking, which is not really
needed for safety in this single-threaded test, and did see an
improvement, but not full parity between 8k and 128k for the compressed
file. I'll keep poking at the other sources of overhead in the builtin
readahead logic, and at the cost of calling btrfs_readahead repeatedly
versus looping inside it.

I'll keep trying that to see if I can get a full reproduction and try to
actually explain the difference.

If you are able to use some kind of tracing to see if these findings
hold on your system, I think that would be interesting.
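For anyone who wants to try reproducing the latency measurements, a
sketch of the kind of instrumentation used above (requires root, a
BTF-enabled kernel, and bpftrace; lock_extent/unlock_extent are my
assumption for the extent-locking symbol names, which may differ on
your kernel version):

```shell
# Accumulate total time spent in the extent lock/unlock paths while
# running one of the dd reproducers in another terminal. Function
# names are assumptions; check them against /proc/kallsyms first.
bpftrace -e '
fentry:lock_extent    { @ts[tid] = nsecs; }
fexit:lock_extent    /@ts[tid]/ { @lock_delay_ns   += nsecs - @ts[tid]; delete(@ts[tid]); }
fentry:unlock_extent  { @ts[tid] = nsecs; }
fexit:unlock_extent  /@ts[tid]/ { @unlock_delay_ns += nsecs - @ts[tid]; delete(@ts[tid]); }
'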

Also, if someone more knowledgeable about how the generic readahead
works, like Christoph, could give me a hint on how to hack up the 8k
case to make fewer calls to btrfs_readahead, I think that could help
bolster the theory.

Thanks,
Boris

> 
> I'm CC'ing people from previous related discussion at
> https://www.spinics.net/lists/linux-btrfs/msg137840.html
> 
> 
> ** How to reproduce:
> 
> Filesystem is newly created with btrfs-progs v6.6.3, on NVMe drive with
> 1.5TB of space, mounted with compress=zstd:3. Kernel version is 6.11.
> 
> 
> 1. create 10GB compressible and incompressible files
> 
> head -c 10G /dev/zero          > highly-compressible
> head -c 10G <(od /dev/urandom) > compressible
> head -c 10G /dev/urandom       > incompressible
> 
> 
> 2. Verify they are indeed as promised
> 
> $ sudo compsize highly-compressible
> Processed 1 file, 81920 regular extents (81920 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL        3%      320M          10G          10G
> $ sudo compsize compressible
> Processed 1 file, 81965 regular extents (81965 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       43%      4.3G          10G          10G
> zstd        43%      4.3G          10G          10G
> $ sudo compsize incompressible
> Processed 1 file, 80 regular extents (80 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       10G          10G          10G
> none       100%       10G          10G          10G
> 
> 
> 3. Measure read() speed with 8KB block size
> 
> $ sync
> $ echo 1 | sudo tee /proc/sys/vm/drop_caches
> 
> $ dd if=highly-compressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 12.9102 s, 832 MB/s
> 
> $ dd if=compressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 30.68 s, 350 MB/s
> 
> $ dd if=incompressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 3.85749 s, 2.8 GB/s
> 
> 
> The above results are repeatable. The device can deliver several GB/s, as in
> the last case, so I would expect similarly high numbers, or at least as much
> as the CPU can decompress. But CPU core utilisation is low, so I have ruled
> it out as the bottleneck.
> 
> 
> What do you think?
> Dimitris
> 


Thread overview: 8+ messages
2025-06-03 19:56 Sequential read(8K) from compressed files are very slow Dimitrios Apostolou
2025-06-04  1:36 ` Boris Burkov [this message]
2025-06-04  6:22   ` Christoph Hellwig
2025-06-04 18:03     ` Boris Burkov
2025-06-04 21:49       ` Boris Burkov
2025-06-05  4:35         ` Christoph Hellwig
2025-06-05 17:09       ` Dimitrios Apostolou
2025-06-07  0:37         ` Boris Burkov
