* btrfs sequential 8K read()s from compressed files are not merging
@ 2023-07-10 18:56 Dimitrios Apostolou
From: Dimitrios Apostolou @ 2023-07-10 18:56 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

I discovered this issue because of very slow sequential read speed in
PostgreSQL, which performs all reads using blocking pread() calls of 8192
bytes (postgres' default page size). I verified that reads are similarly
slow when I read the files using dd bs=8k. Here are my measurements:

Reading a 1GB postgres file using dd (which uses read() internally) in 8K
and 32K chunks:

     # dd if=4156889.4 of=/dev/null bs=8k
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.18829 s, 174 MB/s

     # dd if=4156889.4 of=/dev/null bs=8k    # 2nd run, data is cached
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.287623 s, 3.7 GB/s

     # dd if=4156889.8 of=/dev/null bs=32k
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02688 s, 1.0 GB/s

     # dd if=4156889.8 of=/dev/null bs=32k    # 2nd run, data is cached
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.264049 s, 4.1 GB/s

Notice that the read rate (after transparent decompression) with bs=8k is
174 MB/s (I see ~20 MB/s on the device), slow and similar to what
PostgreSQL achieves. With bs=32k the rate increases to 1 GB/s (I see
~80 MB/s on the device, but the run is too short for the rate to register
properly). The device limit is 1 GB/s; of course I'm not expecting to
reach that while decompressing. The cached reads are fast in both cases;
I'm guessing the kernel page cache contains the decompressed blocks.
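
To get reliable cold-cache numbers, the page cache can be dropped
explicitly between runs (requires root; writing 3 drops the page cache
plus the dentry and inode caches):

     # echo 3 > /proc/sys/vm/drop_caches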

The above results have been verified with multiple runs. The kernel is
5.15 (Ubuntu LTS) and the block device is an LVM logical volume on a
high-performance DAS system, but I have verified the same behaviour on a
separate system with kernel 6.3.9 and btrfs directly on a local spinning
disk. The btrfs filesystem is mounted with compress=zstd:3, and the files
were defragmented before running the commands.
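
For completeness, the setup is roughly equivalent to the following
(device and paths here are placeholders, not my actual ones):

     # mount -o compress=zstd:3 /dev/mapper/vg-data /mnt/data
     # btrfs filesystem defragment -r -czstd /mnt/data/pgdata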

Focusing on the cold-cache cases, iostat gives interesting insight: for
both postgres doing a sequential scan and for dd with bs=8k, the kernel
block layer does not appear to merge the I/O requests. `iostat -x` shows
an average read request size of ~16 KB (the rareq-sz column is in
kilobytes), 0 merged requests, and a very high reads/s IOPS number.

The dd commands with bs=32k show fewer IOPS in `iostat -x`, higher
throughput, a larger average request size and a high number of merged
requests. To me it appears as if btrfs does read-ahead only when the
read block size is large.
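
In case it matters, the block device's own readahead setting can be
inspected as follows (sdc is the device in the snapshots below; the
value is reported in 512-byte sectors):

     # blockdev --getra /dev/sdc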

Example output for a random one-second interval during dd bs=8k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc           1313.00     20.93     2.00   0.15    0.53    16.32

with dd bs=32k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc            290.00     76.44  4528.00  93.98    1.71   269.92
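
These one-second snapshots can be reproduced by running iostat alongside
dd in another terminal; -x prints extended statistics and the trailing 1
repeats them every second:

     # iostat -x sdc 1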

*On the same filesystem, doing dd bs=8k reads from a file that has not
been compressed by the filesystem, I get 1 GB/s throughput, which is the
limit of my device. This is what makes me believe it's an issue with
btrfs compression.*
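
Whether a particular file is actually stored compressed can be
double-checked with the compsize tool, assuming it is installed (it
reports per-algorithm disk usage versus uncompressed size):

     # compsize 4156889.4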

Is this a bug or known behaviour?

Thanks in advance,
Dimitris


