public inbox for linux-btrfs@vger.kernel.org
* Sequential read(8K) from compressed files are very slow
@ 2025-06-03 19:56 Dimitrios Apostolou
  2025-06-04  1:36 ` Boris Burkov
  0 siblings, 1 reply; 8+ messages in thread
From: Dimitrios Apostolou @ 2025-06-03 19:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Qu Wenruo, Christoph Hellwig, Anand Jain

Hello list,

I notice consistently that sequential reads from compressed files are slow 
when the block size is small.

It's extremely easy to reproduce, but the problem is I can't find the 
bottleneck. I suspect it is an issue of Btrfs, not doing read-ahead like 
it does with non-compressed files.

Can you reproduce it? I'd also appreciate instructions for further 
debugging it. Or tweaking my system to improve the case.

I'm CC'ing people from previous related discussion at
https://www.spinics.net/lists/linux-btrfs/msg137840.html


** How to reproduce:

Filesystem is newly created with btrfs-progs v6.6.3, on NVMe drive with 
1.5TB of space, mounted with compress=zstd:3. Kernel version is 6.11.


1. create 10GB compressible and incompressible files

head -c 10G /dev/zero          > highly-compressible
head -c 10G <(od /dev/urandom) > compressible
head -c 10G /dev/urandom       > incompressible


2. Verify they are indeed as promised

$ sudo compsize highly-compressible
Processed 1 file, 81920 regular extents (81920 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL        3%      320M          10G          10G
$ sudo compsize compressible
Processed 1 file, 81965 regular extents (81965 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       43%      4.3G          10G          10G
zstd        43%      4.3G          10G          10G
$ sudo compsize incompressible
Processed 1 file, 80 regular extents (80 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       10G          10G          10G
none       100%       10G          10G          10G


3. Measure read() speed with 8KB block size

$ sync
$ echo 1 | sudo tee /proc/sys/vm/drop_caches

$ dd if=highly-compressible   of=/dev/null bs=8k
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 12.9102 s, 832 MB/s

$ dd if=compressible   of=/dev/null bs=8k
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 30.68 s, 350 MB/s

$ dd if=incompressible   of=/dev/null bs=8k
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.85749 s, 2.8 GB/s


The above results are repeatable. The device can deliver several GB/s, as in
the last case, so I would expect similarly high numbers, or at least as much
as the CPU can decompress. But CPU core utilisation is low, so I have ruled
it out as the bottleneck.
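The figures are internally consistent: btrfs stores compressed data in extents of at most 128 KiB, which is where the 81920 extent count for the compressible files comes from. A quick arithmetic check:

```shell
# Sanity-check the dd and compsize figures above.
# Assumes btrfs's 128 KiB upper bound on compressed extent size.
size=$((10 * 1024 * 1024 * 1024))              # 10 GiB test file

echo "dd records at bs=8k: $((size / 8192))"   # 1310720, as dd reported
echo "compressed extents:  $((size / 131072))" # 81920, matching compsize
```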


What do you think?
Dimitris


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-03 19:56 Sequential read(8K) from compressed files are very slow Dimitrios Apostolou
@ 2025-06-04  1:36 ` Boris Burkov
  2025-06-04  6:22   ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Boris Burkov @ 2025-06-04  1:36 UTC (permalink / raw)
  To: Dimitrios Apostolou; +Cc: linux-btrfs, Qu Wenruo, Christoph Hellwig, Anand Jain

On Tue, Jun 03, 2025 at 09:56:22PM +0200, Dimitrios Apostolou wrote:
> Hello list,
> 
> I notice consistently that sequential reads from compressed files are slow
> when the block size is small.
> 
> It's extremely easy to reproduce, but the problem is I can't find the
> bottleneck. I suspect it is an issue of Btrfs, not doing read-ahead like it
> does with non-compressed files.
> 
> Can you reproduce it? I'd also appreciate instructions for further debugging
> it. Or tweaking my system to improve the case.

I'm actually having trouble reproducing the fast performance for
un-compressed files but with small read blocksize. I get 1.2GB/s for
compressed reads with a large bs setting (32k+) but only 600MB/s for a
10G file with 80 uncompressed extents just like yours with bs=8k and
bs=1M...

However, I do observe the huge delta between bs=8k and bs=128k for
compressed which is interesting, even if I am doing something dumb and
failing to reproduce the fast uncompressed reads.

I also observe that the performance rapidly drops off below bs=32k.
Using the highly compressible file, I get 1.4GB/s with 128k, 64k, 32k
and then 200-400MB/s for 4k-16k.

IN THEORY, add_ra_bio_pages is supposed to be doing our own readahead
caching for compressed pages that we have read, so I think any overhead
we incur is not going to be making tons more IO. It will probably be in
that readahead caching or in some interaction with VFS readahead.

To confirm this I invoked the following bpftrace script while doing the
8k and 128k reproducers:
bpftrace -e 'fentry:btrfs_submit_compressed_read {@[kstack]=count();}'

results:
128K:
@[
        _sub_D_65535_0+21531255
        _sub_D_65535_0+21531255
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 2562
@[
        _sub_D_65535_0+21531255
        _sub_D_65535_0+21531255
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_do_readpage+2751
        btrfs_readahead+760
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 79354

8K:
@[
        _sub_D_65535_0+21531247
        _sub_D_65535_0+21531247
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_get_pages+717
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 40960
@[
        _sub_D_65535_0+21531247
        _sub_D_65535_0+21531247
        bpf_trampoline_225485939039+67
        btrfs_submit_compressed_read+5
        submit_one_bio+261
        btrfs_readahead+999
        read_pages+381
        page_cache_ra_unbounded+841
        filemap_readahead.isra.0+231
        filemap_get_pages+1507
        filemap_read+726
        vfs_read+1643
        ksys_read+239
        do_syscall_64+76
        entry_SYSCALL_64_after_hwframe+118
]: 40960

So that is the same total number of IOs issued (81920) which is 10G
divided into 128K extents. Therefore, it doesn't appear that btrfs is
issuing more compressed IOs in the bs=8K case. I also checked metadata
IOs and nothing crazy stuck out to me.

However, I do see that in the 8k case, we are repeatedly calling
btrfs_readahead() while in the 128k case, we only call btrfs_readahead
~2500 times, and the rest of the time we loop inside btrfs_readahead
calling btrfs_do_readpage.

This incurs two bits of overhead:
locking/unlocking the extent and possibly not caching the extent_map.
Tracing btrfs_get_extent shows that both sizes call btrfs_get_extent
81920 times, so it's not extent caching.

But the slow variant does call the lock/unlock extent functions 79k times
more.

I then instrumented them to measure the overhead of those functions and
saw that the slow variant spent ~20s of wall time on this
locking/unlocking:

8K:
@lock_delay_ns: 5623087671

@lookup_delay_ns: 29422788

@unlock_delay_ns: 19067071333

128K:
@lock_delay_ns: 91195771

@lookup_delay_ns: 1302691

@unlock_delay_ns: 4911877331
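The instrumentation behind these numbers isn't shown; a minimal bpftrace sketch of the kind of script that produces them might look like this (hypothetical — the lock_extent/unlock_extent symbol names are assumed and vary across kernel versions, e.g. btrfs_lock_extent/btrfs_unlock_extent on newer trees):

```bpftrace
// Hypothetical sketch: sum wall time spent in the extent lock/unlock
// paths during a read. Adjust symbol names to match your kernel.
fentry:lock_extent   { @lts[tid] = nsecs; }
fexit:lock_extent    { @lock_delay_ns = sum(nsecs - @lts[tid]); delete(@lts[tid]); }
fentry:unlock_extent { @uts[tid] = nsecs; }
fexit:unlock_extent  { @unlock_delay_ns = sum(nsecs - @uts[tid]); delete(@uts[tid]); }
END { clear(@lts); clear(@uts); }
```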

I removed all the extent locking as an experiment, as it is not really
needed for safety in this single threaded test and did see an
improvement but not full parity between 8k and 128k for the compressed
file. I'll keep poking at the other sources of overhead in the builtin
readahead logic, and at calling btrfs_readahead more often versus looping
inside it.

I'll keep trying that to see if I can get a full reproduction and try to
actually explain the difference.

If you are able to use some kind of tracing to see if these findings
hold on your system, I think that would be interesting.

Also, if someone more knowledgeable in how the generic readahead works
like Christoph could give me a hint to how to hack up the 8k case to
make fewer calls to btrfs_readahead, I think that could help bolster the
theory.

Thanks,
Boris

> 
> I'm CC'ing people from previous related discussion at
> https://www.spinics.net/lists/linux-btrfs/msg137840.html
> 
> 
> ** How to reproduce:
> 
> Filesystem is newly created with btrfs-progs v6.6.3, on NVMe drive with
> 1.5TB of space, mounted with compress=zstd:3. Kernel version is 6.11.
> 
> 
> 1. create 10GB compressible and incompressible files
> 
> head -c 10G /dev/zero          > highly-compressible
> head -c 10G <(od /dev/urandom) > compressible
> head -c 10G /dev/urandom       > incompressible
> 
> 
> 2. Verify they are indeed as promised
> 
> $ sudo compsize highly-compressible
> Processed 1 file, 81920 regular extents (81920 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL        3%      320M          10G          10G
> $ sudo compsize compressible
> Processed 1 file, 81965 regular extents (81965 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       43%      4.3G          10G          10G
> zstd        43%      4.3G          10G          10G
> $ sudo compsize incompressible
> Processed 1 file, 80 regular extents (80 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       10G          10G          10G
> none       100%       10G          10G          10G
> 
> 
> 3. Measure read() speed with 8KB block size
> 
> $ sync
> $ echo 1 | sudo tee /proc/sys/vm/drop_caches
> 
> $ dd if=highly-compressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 12.9102 s, 832 MB/s
> 
> $ dd if=compressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 30.68 s, 350 MB/s
> 
> $ dd if=incompressible   of=/dev/null bs=8k
> 1310720+0 records in
> 1310720+0 records out
> 10737418240 bytes (11 GB, 10 GiB) copied, 3.85749 s, 2.8 GB/s
> 
> 
> The above results are repeatable. The device can deliver several GB/s, as in
> the last case, so I would expect similarly high numbers, or at least as much
> as the CPU can decompress. But CPU core utilisation is low, so I have ruled
> it out as the bottleneck.
> 
> 
> What do you think?
> Dimitris
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-04  1:36 ` Boris Burkov
@ 2025-06-04  6:22   ` Christoph Hellwig
  2025-06-04 18:03     ` Boris Burkov
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-04  6:22 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Dimitrios Apostolou, linux-btrfs, Qu Wenruo, Christoph Hellwig,
	Anand Jain, Matthew Wilcox

On Tue, Jun 03, 2025 at 06:36:11PM -0700, Boris Burkov wrote:
> However, I do observe the huge delta between bs=8k and bs=128k for
> compressed which is interesting, even if I am doing something dumb and
> failing to reproduce the fast uncompressed reads.
> 
> I also observe that the performance rapidly drops off below bs=32k.
> Using the highly compressible file, I get 1.4GB/s with 128k, 64k, 32k
> and then 200-400MB/s for 4k-16k.
> 
> IN THEORY, add_ra_bio_pages is supposed to be doing our own readahead
> caching for compressed pages that we have read, so I think any overhead
> we incur is not going to be making tons more IO. It will probably be in
> that readahead caching or in some interaction with VFS readahead.

> However, I do see that in the 8k case, we are repeatedly calling
> btrfs_readahead() while in the 128k case, we only call btrfs_readahead
> ~2500 times, and the rest of the time we loop inside btrfs_readahead
> calling btrfs_do_readpage.

Btw, I've always found the way add_ra_bio_pages works in btrfs a little
odd.  The core readahead code provides a readahead_expand() that should
do something similar, but more efficiently.  The difference is that it
only works for actual readahead calls and not ->read_folio, but the
latter is pretty much a last resort these days.

> I removed all the extent locking as an experiment, as it is not really
> needed for safety in this single threaded test and did see an
> improvement but not full parity between 8k and 128k for the compressed
> file. I'll keep poking at the other sources of overhead in the builtin
> readahead logic and in calling btrfs_readahead more looping inside it.
> 
> I'll keep trying that to see if I can get a full reproduction and try to
> actually explain the difference.
> 
> If you are able to use some kind of tracing to see if these findings
> hold on your system, I think that would be interesting.
> 
> Also, if someone more knowledgeable in how the generic readahead works
> like Christoph could give me a hint to how to hack up the 8k case to
> make fewer calls to btrfs_readahead, I think that could help bolster the
> theory.

I'm not really a readahead expert, but I added who I suspect is (willy).
But my guess is using readahead_expand is the right answer here, but
another thing to look at might be whether the compressed extent handling in
btrfs somehow messes up the read ahead window calculations.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-04  6:22   ` Christoph Hellwig
@ 2025-06-04 18:03     ` Boris Burkov
  2025-06-04 21:49       ` Boris Burkov
  2025-06-05 17:09       ` Dimitrios Apostolou
  0 siblings, 2 replies; 8+ messages in thread
From: Boris Burkov @ 2025-06-04 18:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dimitrios Apostolou, linux-btrfs, Qu Wenruo, Anand Jain,
	Matthew Wilcox

On Tue, Jun 03, 2025 at 11:22:11PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 06:36:11PM -0700, Boris Burkov wrote:
> > However, I do observe the huge delta between bs=8k and bs=128k for
> > compressed which is interesting, even if I am doing something dumb and
> > failing to reproduce the fast uncompressed reads.
> > 
> > I also observe that the performance rapidly drops off below bs=32k.
> > Using the highly compressible file, I get 1.4GB/s with 128k, 64k, 32k
> > and then 200-400MB/s for 4k-16k.
> > 
> > IN THEORY, add_ra_bio_pages is supposed to be doing our own readahead
> > caching for compressed pages that we have read, so I think any overhead
> > we incur is not going to be making tons more IO. It will probably be in
> > that readahead caching or in some interaction with VFS readahead.
> 
> > However, I do see that in the 8k case, we are repeatedly calling
> > btrfs_readahead() while in the 128k case, we only call btrfs_readahead
> > ~2500 times, and the rest of the time we loop inside btrfs_readahead
> > calling btrfs_do_readpage.
> 
> Btw, I found the way add_ra_bio_pages in btrfs always a little
> odd.  The core readahead code provides a readahead_expand() that should
> do something similar, but more efficiently.  The difference is that it
> only works for actual readahead calls and not ->read_folio, but the
> latter is pretty much a last resort these days.
> 

Some more evidence that our add_ra_bio_pages is at least a big part of
where things are going slow:

stats from an 8K run:
$ sudo bpftrace readahead.bt
Attaching 4 probes...

@add_ra_delay_ms: 19450
@add_ra_delay_ns: 19450937640
@add_ra_delay_s: 19

@ra_sz_freq[8]: 81920
@ra_sz_hist:
[8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


stats from a 128K run:
$ sudo bpftrace readahead.bt
Attaching 4 probes...

@add_ra_delay_ms: 15
@add_ra_delay_ns: 15333301
@add_ra_delay_s: 0

@ra_sz_freq[512]: 1
@ra_sz_freq[256]: 1
@ra_sz_freq[128]: 2
@ra_sz_freq[1024]: 2559
@ra_sz_hist:
[128, 256)             2 |                                                    |
[256, 512)             1 |                                                    |
[512, 1K)              1 |                                                    |
[1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


so we are spending 19 seconds (vs 0) in add_ra_bio_pages and calling
btrfs_readahead() 81920 times with 8 pages vs 2559 times with 1024
pages.

The total time difference is ~30s on my setup, so there are still ~10
seconds unaccounted for in my analysis here, though.

> > I removed all the extent locking as an experiment, as it is not really
> > needed for safety in this single threaded test and did see an
> > improvement but not full parity between 8k and 128k for the compressed
> > file. I'll keep poking at the other sources of overhead in the builtin
> > readahead logic and in calling btrfs_readahead more looping inside it.
> > 
> > I'll keep trying that to see if I can get a full reproduction and try to
> > actually explain the difference.
> > 
> > If you are able to use some kind of tracing to see if these findings
> > hold on your system, I think that would be interesting.
> > 
> > Also, if someone more knowledgeable in how the generic readahead works
> > like Christoph could give me a hint to how to hack up the 8k case to
> > make fewer calls to btrfs_readahead, I think that could help bolster the
> > theory.
> 
> I'm not really a readahead expert, but I added who I suspect is (willy).
> But my guess is using readahead_expand is the right answer here, but
> another thing to look at might if the compressed extent handling in
> btrfs somehow messes up the read ahead window calculations.
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-04 18:03     ` Boris Burkov
@ 2025-06-04 21:49       ` Boris Burkov
  2025-06-05  4:35         ` Christoph Hellwig
  2025-06-05 17:09       ` Dimitrios Apostolou
  1 sibling, 1 reply; 8+ messages in thread
From: Boris Burkov @ 2025-06-04 21:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dimitrios Apostolou, linux-btrfs, Qu Wenruo, Anand Jain,
	Matthew Wilcox

On Wed, Jun 04, 2025 at 11:03:03AM -0700, Boris Burkov wrote:
> On Tue, Jun 03, 2025 at 11:22:11PM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 06:36:11PM -0700, Boris Burkov wrote:
> > > However, I do observe the huge delta between bs=8k and bs=128k for
> > > compressed which is interesting, even if I am doing something dumb and
> > > failing to reproduce the fast uncompressed reads.
> > > 
> > > I also observe that the performance rapidly drops off below bs=32k.
> > > Using the highly compressible file, I get 1.4GB/s with 128k, 64k, 32k
> > > and then 200-400MB/s for 4k-16k.
> > > 
> > > IN THEORY, add_ra_bio_pages is supposed to be doing our own readahead
> > > caching for compressed pages that we have read, so I think any overhead
> > > we incur is not going to be making tons more IO. It will probably be in
> > > that readahead caching or in some interaction with VFS readahead.
> > 
> > > However, I do see that in the 8k case, we are repeatedly calling
> > > btrfs_readahead() while in the 128k case, we only call btrfs_readahead
> > > ~2500 times, and the rest of the time we loop inside btrfs_readahead
> > > calling btrfs_do_readpage.
> > 
> > Btw, I found the way add_ra_bio_pages in btrfs always a little
> > odd.  The core readahead code provides a readahead_expand() that should
> > do something similar, but more efficiently.  The difference is that it
> > only works for actual readahead calls and not ->read_folio, but the
> > latter is pretty much a last resort these days.
> > 
> 
> Some more evidence that our add_ra_bio_pages is at least a big part of
> where things are going slow:
> 
> stats from an 8K run:
> $ sudo bpftrace readahead.bt
> Attaching 4 probes...
> 
> @add_ra_delay_ms: 19450
> @add_ra_delay_ns: 19450937640
> @add_ra_delay_s: 19
> 
> @ra_sz_freq[8]: 81920
> @ra_sz_hist:
> [8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> 
> stats from a 128K run:
> $ sudo bpftrace readahead.bt
> Attaching 4 probes...
> 
> @add_ra_delay_ms: 15
> @add_ra_delay_ns: 15333301
> @add_ra_delay_s: 0
> 
> @ra_sz_freq[512]: 1
> @ra_sz_freq[256]: 1
> @ra_sz_freq[128]: 2
> @ra_sz_freq[1024]: 2559
> @ra_sz_hist:
> [128, 256)             2 |                                                    |
> [256, 512)             1 |                                                    |
> [512, 1K)              1 |                                                    |
> [1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> 
> so we are spending 19 seconds (vs 0) in add_ra_bio_pages and calling
> btrfs_readahead() 81920 times with 8 pages vs 2559 times with 1024
> pages.
> 
> The total time difference is ~30s on my setup, so there are still ~10
> seconds unaccounted for in my analysis here, though.

This gets accounted to the generic VFS code, presumably also due to the
inefficient readahead size.

> 
> > > I removed all the extent locking as an experiment, as it is not really
> > > needed for safety in this single threaded test and did see an
> > > improvement but not full parity between 8k and 128k for the compressed
> > > file. I'll keep poking at the other sources of overhead in the builtin
> > > readahead logic and in calling btrfs_readahead more looping inside it.
> > > 
> > > I'll keep trying that to see if I can get a full reproduction and try to
> > > actually explain the difference.
> > > 
> > > If you are able to use some kind of tracing to see if these findings
> > > hold on your system, I think that would be interesting.
> > > 
> > > Also, if someone more knowledgeable in how the generic readahead works
> > > like Christoph could give me a hint to how to hack up the 8k case to
> > > make fewer calls to btrfs_readahead, I think that could help bolster the
> > > theory.
> > 
> > I'm not really a readahead expert, but I added who I suspect is (willy).
> > But my guess is using readahead_expand is the right answer here, but
> > another thing to look at might if the compressed extent handling in
> > btrfs somehow messes up the read ahead window calculations.
> > 

So uhhh this braindead "stick it in the first place it fits" patch:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c562b589876e..f77e3b849786 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2553,7 +2553,11 @@ void btrfs_readahead(struct readahead_control *rac)
 	lock_extents_for_read(inode, start, end, &cached_state);

 	while ((folio = readahead_folio(rac)) != NULL)
+	{
 		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+		if (em_cached)
+			readahead_expand(rac, start, em_cached->ram_bytes);
+	}

 	btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);



unlocked the perf on the compressed 8k (and 4k) case without regressing
128k or bs=1M for a file with enormous extents. I get 1.4GB/s on both.
It's still less on the incompressible file, though higher than without
the patch (900MB/s vs 600MB/s).

I don't think this quite makes sense as a properly finished version, as
readahead_folio() will advance the index and we will keep "expanding" it
to be the size of the compressed extent_map. On the other hand, if it is
stupid and it works, is it stupid?

Very promising... I'm going to study this API further and try to make
a real version. Very open to ideas/advice on how the API is best meant
to be used and if this result might be "cheating" in some way.

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-04 21:49       ` Boris Burkov
@ 2025-06-05  4:35         ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-05  4:35 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Christoph Hellwig, Dimitrios Apostolou, linux-btrfs, Qu Wenruo,
	Anand Jain, Matthew Wilcox

On Wed, Jun 04, 2025 at 02:49:19PM -0700, Boris Burkov wrote:
> unlocked the perf on the compressed 8k (and 4k) case without regressing
> 128k or bs=1M for a file with enormous extents. I get 1.4GB/s on both.
> Still less on the incompressible file (though higher than without the
> patch. 900MB/s vs 600MB/s)
> 
> I don't think this really perfectly makes sense for an actually proper
> version, as readahead_folio() will advance the index and we will keep
> "expanding" it to be the size of the compressed extent_map. On the other
> hand, if it is stupid and it works, is it stupid?

The way readahead_expand works is that it should basically be no-op
in that case as the two while loops doing the actual work are skipped.
But it might be worth double checking that this is really the case.

You'll probably also want to remove add_ra_bio_pages entirely.

> Very promising... I'm going to study this API further and try to make
> a real version. Very open to ideas/advice on how the API is best meant
> to be used and if this result might be "cheating" in some way.

As far as I can tell this is how it is intended to be used.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-04 18:03     ` Boris Burkov
  2025-06-04 21:49       ` Boris Burkov
@ 2025-06-05 17:09       ` Dimitrios Apostolou
  2025-06-07  0:37         ` Boris Burkov
  1 sibling, 1 reply; 8+ messages in thread
From: Dimitrios Apostolou @ 2025-06-05 17:09 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Christoph Hellwig, linux-btrfs, Qu Wenruo, Anand Jain,
	Matthew Wilcox

Hi Boris, thank you for investigating! I've been chasing this for years
and kept hitting a wall; the bottleneck was not obvious at all when
looking from outside the kernel. I've started a few threads before, but
they were fruitless.

On Wed, 4 Jun 2025, Boris Burkov wrote:

>
> stats from an 8K run:
> $ sudo bpftrace readahead.bt
> Attaching 4 probes...
>
> @add_ra_delay_ms: 19450
> @add_ra_delay_ns: 19450937640
> @add_ra_delay_s: 19
>
> @ra_sz_freq[8]: 81920
> @ra_sz_hist:
> [8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
>
> stats from a 128K run:
> $ sudo bpftrace readahead.bt
> Attaching 4 probes...
>
> @add_ra_delay_ms: 15
> @add_ra_delay_ns: 15333301
> @add_ra_delay_s: 0
>
> @ra_sz_freq[512]: 1
> @ra_sz_freq[256]: 1
> @ra_sz_freq[128]: 2
> @ra_sz_freq[1024]: 2559
> @ra_sz_hist:
> [128, 256)             2 |                                                    |
> [256, 512)             1 |                                                    |
> [512, 1K)              1 |                                                    |
> [1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
>
> so we are spending 19 seconds (vs 0) in add_ra_bio_pages and calling
> btrfs_readahead() 81920 times with 8 pages vs 2559 times with 1024
> pages.

I specifically like the bpftrace utility you are using, it opens up new 
possibilities without custom kernel compiles, so I want to experiment. 
Could you please include the script you used for this histogram?

>
> The total time difference is ~30s on my setup, so there are still ~10
> seconds unaccounted for in my analysis here, though.

This is outstanding. I expect such improvement will give a *huge* boost to 
postgresql workloads on compressed filesystems. By huge I mean 5-10x for 
sequential table scans.

I'm also wondering, in the past I was trying to see if it makes any 
difference to tweak the setting /sys/block/sdX/queue/read_ahead_kb but 
couldn't see any substantial change. Do you see it affecting your results, 
with your patch applied? Or is btrfs following different code paths and 
completely ignoring that?

>
>>> I removed all the extent locking as an experiment, as it is not really
>>> needed for safety in this single threaded test and did see an
>>> improvement but not full parity between 8k and 128k for the compressed
>>> file. I'll keep poking at the other sources of overhead in the builtin
>>> readahead logic and in calling btrfs_readahead more looping inside it.

Since your findings indicate that the issue is probably lock contention, 
you might want to try /proc/lock_stat. It requires a kernel built with 
CONFIG_LOCK_STAT, which is what blocks me at the moment, but it might be 
easier for you if you already compile it for developing btrfs. Docs at:

https://docs.kernel.org/locking/lockstat.html
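For what it's worth, a lock_stat session typically looks something like the sketch below (assuming CONFIG_LOCK_STAT=y; the guard makes it a no-op on kernels without it, and the grep pattern is just a guess at how the btrfs extent locks would be named):

```shell
# Sketch of a /proc/lock_stat session; requires CONFIG_LOCK_STAT=y.
if [ -w /proc/sys/kernel/lock_stat ]; then
    echo 0 | sudo tee /proc/lock_stat > /dev/null            # clear old counters
    echo 1 | sudo tee /proc/sys/kernel/lock_stat > /dev/null # start collecting
    dd if=compressible of=/dev/null bs=8k                    # the slow workload
    echo 0 | sudo tee /proc/sys/kernel/lock_stat > /dev/null # stop collecting
    grep -B1 -A5 extent /proc/lock_stat | head -40           # inspect btrfs locks
else
    echo "lock_stat not available: kernel lacks CONFIG_LOCK_STAT"
fi
```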


Thank you,
Dimitris


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Sequential read(8K) from compressed files are very slow
  2025-06-05 17:09       ` Dimitrios Apostolou
@ 2025-06-07  0:37         ` Boris Burkov
  0 siblings, 0 replies; 8+ messages in thread
From: Boris Burkov @ 2025-06-07  0:37 UTC (permalink / raw)
  To: Dimitrios Apostolou
  Cc: Christoph Hellwig, linux-btrfs, Qu Wenruo, Anand Jain,
	Matthew Wilcox

On Thu, Jun 05, 2025 at 07:09:07PM +0200, Dimitrios Apostolou wrote:
> Hi Boris, thank you for investigating! I've been chasing this for years and
> I was hitting a wall, the bottleneck was not obvious at all when looking
> from outside the kernel. I've started a few threads before but they were
> fruitless.

Happy to help, it's an interesting problem!

> 
> On Wed, 4 Jun 2025, Boris Burkov wrote:
> 
> > 
> > stats from an 8K run:
> > $ sudo bpftrace readahead.bt
> > Attaching 4 probes...
> > 
> > @add_ra_delay_ms: 19450
> > @add_ra_delay_ns: 19450937640
> > @add_ra_delay_s: 19
> > 
> > @ra_sz_freq[8]: 81920
> > @ra_sz_hist:
> > [8, 16)            81920 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > 
> > stats from a 128K run:
> > $ sudo bpftrace readahead.bt
> > Attaching 4 probes...
> > 
> > @add_ra_delay_ms: 15
> > @add_ra_delay_ns: 15333301
> > @add_ra_delay_s: 0
> > 
> > @ra_sz_freq[512]: 1
> > @ra_sz_freq[256]: 1
> > @ra_sz_freq[128]: 2
> > @ra_sz_freq[1024]: 2559
> > @ra_sz_hist:
> > [128, 256)             2 |                                                    |
> > [256, 512)             1 |                                                    |
> > [512, 1K)              1 |                                                    |
> > [1K, 2K)            2559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > 
> > 
> > so we are spending 19 seconds (vs 0) in add_ra_bio_pages and calling
> > btrfs_readahead() 81920 times with 8 pages vs 2559 times with 1024
> > pages.
> 
> I specifically like the bpftrace utility you are using, it opens up new
> possibilities without custom kernel compiles, so I want to experiment. Could
> you please include the script you used for this histogram?
> 

Unfortunately, I modified the script a bunch since using it. So I don't
have that exact one lying around.

But the features necessary are basically:

fentry:YOUR_FUNC {
        $val = args->YOUR_ARG->YOUR_FIELD->ANOTHER_FIELD; // whatever is relevant
        @h = hist($val);
}

And a bpftrace and kernel built with enough debugging features like BTF
to support it.

The quickstart oneliners here:
https://github.com/bpftrace/bpftrace/blob/master/docs/tutorial_one_liners.md
and the full manual:
https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc
I generally find to be quite useful in practice while hacking with my
scripts. You can also refer to a bunch of my examples here if you like,
but caveat emptor on quality :)
This one is a good recent example using fentry and args:
https://github.com/boryas/scripts/blob/main/bt/compr-leak.bt
but that directory has many others from over the years.

If you are reading older scripts, they will often use kprobe/kretprobe
instead of fentry/fexit, FYI.

> > 
> > The total time difference is ~30s on my setup, so there are still ~10
> > seconds unaccounted for in my analysis here, though.
> 
> This is outstanding. I expect such improvement will give a *huge* boost to
> postgresql workloads on compressed filesystems. By huge I mean 5-10x for
> sequential table scans.

As long as it doesn't regress other workloads too much! Fingers crossed,
and working on further perf testing :)

> 
> I'm also wondering, in the past I was trying to see if it makes any
> difference to tweak the setting /sys/block/sdX/queue/read_ahead_kb but
> couldn't see any substantial change. Do you see it affecting your results,
> with your patch applied? Or is btrfs following different code paths and
> completely ignoring that?
> 

Sorry, haven't gotten to testing this yet.

> > 
> > > > I removed all the extent locking as an experiment, as it is not really
> > > > needed for safety in this single threaded test and did see an
> > > > improvement but not full parity between 8k and 128k for the compressed
> > > > file. I'll keep poking at the other sources of overhead in the builtin
> > > > readahead logic and in calling btrfs_readahead more looping inside it.
> 
> Since your findings indicate that the issue is probably lock contention, you
> might want to try /proc/lock_stat. It requires a kernel built with
> CONFIG_LOCK_STAT, which is what blocks me at the moment, but it might be
> easier for you if you already compile it for developing btrfs. Docs at:
> 
> https://docs.kernel.org/locking/lockstat.html

I think it might be more the cost of running the extent range locking
algorithms (basically an rb-tree storing a set of bits on ranges) rather
than contention on the lock itself. And the "lock" is more like waiting
for an event from this data structure, so I don't think it would show up
in lockstat out of the box. But appreciate the tip!

Boris

> 
> 
> Thank you,
> Dimitris
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-06-07  0:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-03 19:56 Sequential read(8K) from compressed files are very slow Dimitrios Apostolou
2025-06-04  1:36 ` Boris Burkov
2025-06-04  6:22   ` Christoph Hellwig
2025-06-04 18:03     ` Boris Burkov
2025-06-04 21:49       ` Boris Burkov
2025-06-05  4:35         ` Christoph Hellwig
2025-06-05 17:09       ` Dimitrios Apostolou
2025-06-07  0:37         ` Boris Burkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox