public inbox for linux-bcachefs@vger.kernel.org
* [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
@ 2024-09-06 15:43 David Wang
  2024-09-06 17:38 ` Kent Overstreet
  0 siblings, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-06 15:43 UTC (permalink / raw)
  To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel


Hi,

I noticed a very strange performance issue:
When running a `fio` direct randread test on a freshly created bcachefs, the performance is very bad:
	fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=4k --iodepth=64 --size=1G --readwrite=randread  --runtime=600 --numjobs=8 --time_based=1
	...
	Run status group 0 (all jobs):
	   READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec

But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB/s or more,
almost 10 times better!

This means that if I copy a file from somewhere else and only read it afterwards, I get really bad performance.
(I copied files from another filesystem and ran the fio read test on them; the performance was indeed bad.)
Copying prepared files and then using them read-only is quite a normal scenario for lots of apps, I think.


I did some profiling. When reading the file without any overwrites to it:

	io_submit_one(98.339% 2635814/2680333)
	    aio_read(96.756% 2550297/2635814)
		bch2_read_iter(98.190% 2504125/2550297)
		    __bch2_read(70.217% 1758320/2504125)
			__bch2_read_extent(74.571% 1311194/1758320)
			    bch2_bio_alloc_pages_pool(72.933% 956297/1311194)  <-----This stands out
			    submit_bio_noacct_nocheck(11.074% 145207/1311194)
			    bio_alloc_bioset(3.823% 50126/1311194)
			    bch2_bkey_pick_read_device(2.157% 28281/1311194)
			    bio_associate_blkg(1.668% 21877/1311194)
				...

And when the file has been thoroughly overwritten by a previous readwrite fio session, the profile is:

	io_submit_one(97.596% 12373330/12678072)
	    aio_read(94.856% 11736821/12373330)
		bch2_read_iter(94.817% 11128518/11736821)
		    __bch2_read(70.841% 7883577/11128518)
			__bch2_read_extent(35.572% 2804346/7883577)
			    submit_bio_noacct_nocheck(46.356% 1299974/2804346)
			    bch2_bkey_pick_read_device(8.972% 251601/2804346)
			    bio_associate_blkg(8.067% 226227/2804346)
			    submit_bio_noacct(7.005% 196432/2804346)
			    bch2_trans_unlock(6.241% 175020/2804346)
			    bch2_can_narrow_extent_crcs(3.714% 104157/2804346)
			    local_clock(1.873% 52513/2804346)
			    submit_bio(1.355% 37997/2804346)
				...

Both profiles cover the same 10-minute duration at the same sample frequency.
Based on the difference in total sample counts, 2680333 vs 12678072,
I would suspect bch2_bio_alloc_pages_pool incurs lots of lock contention.

Here is more detail for bch2_bio_alloc_pages_pool:

	bch2_bio_alloc_pages_pool(72.933% 956297/1311194)
	    alloc_pages_mpol_noprof(82.644% 790323/956297)
		__alloc_pages_noprof(89.562% 707833/790323)
		    get_page_from_freelist(79.801% 564855/707833)
			__rmqueue_pcplist(24.713% 139593/564855)
			post_alloc_hook(15.045% 84983/564855)
				...
		    __next_zones_zonelist(3.578% 25323/707833)
			...
		policy_nodemask(3.352% 26495/790323)
			...
	    bio_add_page(10.740% 102710/956297)



Thanks~
David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-06 15:43 [BUG?] bcachefs performance: read is way too slow when a file has no overwrite David Wang
@ 2024-09-06 17:38 ` Kent Overstreet
  2024-09-07 10:34   ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Kent Overstreet @ 2024-09-06 17:38 UTC (permalink / raw)
  To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Fri, Sep 06, 2024 at 11:43:54PM GMT, David Wang wrote:
> 
> Hi,
> 
> I notice a very strange performance issue:
> When run `fio direct randread` test on a fresh new bcachefs, the performance is very bad:
> 	fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=4k --iodepth=64 --size=1G --readwrite=randread  --runtime=600 --numjobs=8 --time_based=1
> 	...
> 	Run status group 0 (all jobs):
> 	   READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec
> 
> But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB+/s,
> almost 10-times better!
> 
> This means, if I copy some file from somewhere else, and make read access only afterwards, I would get really bad performance.
> (I copy files from other filesystem, and run fio read test on those files, the performance is indeed bad.)
> Copy some prepared files, and make readonly usage afterwards, this usage scenario is quite normal for lots of apps, I think.

That's because checksums are at extent granularity, not block: if you're
doing O_DIRECT reads that are smaller than the writes the data was
written with, performance will be bad because we have to read the entire
extent to verify the checksum.

block granular checksums will come at some point, as an optional feature
(most of the time you don't want them, and you'd prefer more compact
metadata)


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-06 17:38 ` Kent Overstreet
@ 2024-09-07 10:34   ` David Wang
  2024-09-09 13:37     ` Kent Overstreet
  2024-09-24 11:08     ` David Wang
  0 siblings, 2 replies; 14+ messages in thread
From: David Wang @ 2024-09-07 10:34 UTC (permalink / raw)
  To: kent.overstreet; +Cc: 00107082, linux-bcachefs, linux-kernel

At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Fri, Sep 06, 2024 at 11:43:54PM GMT, David Wang wrote:
>> 
>> Hi,
>> 
>> I notice a very strange performance issue:
>> When run `fio direct randread` test on a fresh new bcachefs, the performance is very bad:
>> 	fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=4k --iodepth=64 --size=1G --readwrite=randread  --runtime=600 --numjobs=8 --time_based=1
>> 	...
>> 	Run status group 0 (all jobs):
>> 	   READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec
>> 
>> But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB+/s,
>> almost 10-times better!
>> 
>> This means, if I copy some file from somewhere else, and make read access only afterwards, I would get really bad performance.
>> (I copy files from other filesystem, and run fio read test on those files, the performance is indeed bad.)
>> Copy some prepared files, and make readonly usage afterwards, this usage scenario is quite normal for lots of apps, I think.
>
>That's because checksums are at extent granularity, not block: if you're
>doing O_DIRECT reads that are smaller than the writes the data was
>written with, performance will be bad because we have to read the entire
>extent to verify the checksum.


>
>block granular checksums will come at some point, as an optional feature
>(most of the time you don't want them, and you'd prefer more compact
>metadata)

Hi, I ran further tests combining different write and read sizes; the results
do not confirm the O_DIRECT explanation.

Without O_DIRECT (fio  --direct=0....), the average read bandwidth
is improved, but with a very big standard deviation:
+--------------------+----------+----------+----------+----------+
| prepare-write\read |    1k    |    4k    |    8K    |   16K    |
+--------------------+----------+----------+----------+----------+
|         1K         | 328MiB/s | 395MiB/s | 465MiB/s |          |
|         4K         | 193MiB/s | 219MiB/s | 274MiB/s | 392MiB/s |
|         8K         | 251MiB/s | 280MiB/s | 368MiB/s | 435MiB/s |
|        16K         | 302MiB/s | 380MiB/s | 464MiB/s | 577MiB/s |
+--------------------+----------+----------+----------+----------+
(Rows are write size when preparing the test files, and columns are read size for fio test.)

And with O_DIRECT, the result is:
+--------------------+-----------+-----------+----------+----------+
| prepare-write\read |     1k    |     4k    |    8K    |   16K    |
+--------------------+-----------+-----------+----------+----------+
|         1K         | 24.1MiB/s | 96.5MiB/s | 193MiB/s |          |
|         4K         | 14.4MiB/s | 57.6MiB/s | 116MiB/s | 230MiB/s |
|         8K         | 24.6MiB/s | 97.6MiB/s | 192MiB/s | 309MiB/s |
|        16K         | 26.4MiB/s |  104MiB/s | 206MiB/s | 402MiB/s |
+--------------------+-----------+-----------+----------+----------+

code to prepare the test files (fixed up so it compiles; note O_DIRECT
needs an aligned buffer and O_CREAT needs a mode argument):
	#define _GNU_SOURCE /* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#define KN 8 /* <- adjust this for each row (prepare-write size in KiB) */
	char name[32];
	static char buf[1024*KN] __attribute__((aligned(4096))); /* O_DIRECT alignment */
	int main() {
		int i, m = 1024*1024/KN, k, fd;
		for (i=0; i<8; i++) { /* 8 files of 1GiB each */
			sprintf(name, "test.%d.0", i);
			fd = open(name, O_CREAT|O_DIRECT|O_SYNC|O_TRUNC|O_WRONLY, 0644);
			for (k=0; k<m; k++) write(fd, buf, sizeof(buf));
			close(fd);
		}
		return 0;
	}

Based on the results:
1. The row with prepare-write size 4K stands out: when files were prepared
with a 4K write size, the subsequent read performance is worse.
(I did double-check the result, but it is possible I missed some affecting factor.)
2. Without O_DIRECT, read performance seems correlated with the difference
between read size and prepare-write size, but with O_DIRECT the correlation is not obvious.

And, to mention it again, if I overwrite the files **thoroughly** with a fio write test
(using the same size), the read performance afterwards is very good:

	# overwrite the files with randwrite, block size 8k
	$ fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=8k --iodepth=64 --size=1G --readwrite=randwrite  --runtime=300 --numjobs=8 --time_based=1
	# test the read performance with randread, block size 8k
	$ fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=8k --iodepth=64 --size=1G --readwrite=randread  --runtime=300 --numjobs=8 --time_based=1
	...
	Run status group 0 (all jobs):
	   READ: bw=964MiB/s (1011MB/s), 116MiB/s-123MiB/s (121MB/s-129MB/s), io=283GiB (303GB), run=300004-300005msec



FYI
David



* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-07 10:34   ` David Wang
@ 2024-09-09 13:37     ` Kent Overstreet
  2024-09-12  2:39       ` David Wang
  2024-09-21 16:02       ` David Wang
  2024-09-24 11:08     ` David Wang
  1 sibling, 2 replies; 14+ messages in thread
From: Kent Overstreet @ 2024-09-09 13:37 UTC (permalink / raw)
  To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
> At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >On Fri, Sep 06, 2024 at 11:43:54PM GMT, David Wang wrote:
> >> 
> >> Hi,
> >> 
> >> I notice a very strange performance issue:
> >> When run `fio direct randread` test on a fresh new bcachefs, the performance is very bad:
> >> 	fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test  --bs=4k --iodepth=64 --size=1G --readwrite=randread  --runtime=600 --numjobs=8 --time_based=1
> >> 	...
> >> 	Run status group 0 (all jobs):
> >> 	   READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec
> >> 
> >> But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB+/s,
> >> almost 10-times better!
> >> 
> >> This means, if I copy some file from somewhere else, and make read access only afterwards, I would get really bad performance.
> >> (I copy files from other filesystem, and run fio read test on those files, the performance is indeed bad.)
> >> Copy some prepared files, and make readonly usage afterwards, this usage scenario is quite normal for lots of apps, I think.
> >
> >That's because checksums are at extent granularity, not block: if you're
> >doing O_DIRECT reads that are smaller than the writes the data was
> >written with, performance will be bad because we have to read the entire
> >extent to verify the checksum.
> 
> 
> >
> >block granular checksums will come at some point, as an optional feature
> >(most of the time you don't want them, and you'd prefer more compact
> >metadata)
> 
> Hi, I made further tests combining different write and read size, the results
> are not confirming the explanation for O_DIRECT.
> 
> Without O_DIRECT (fio  --direct=0....), the average read bandwidth
> is improved, but with a very big standard deviation:
> +--------------------+----------+----------+----------+----------+
> | prepare-write\read |    1k    |    4k    |    8K    |   16K    |
> +--------------------+----------+----------+----------+----------+
> |         1K         | 328MiB/s | 395MiB/s | 465MiB/s |          |
> |         4K         | 193MiB/s | 219MiB/s | 274MiB/s | 392MiB/s |
> |         8K         | 251MiB/s | 280MiB/s | 368MiB/s | 435MiB/s |
> |        16K         | 302MiB/s | 380MiB/s | 464MiB/s | 577MiB/s |
> +--------------------+----------+----------+----------+----------+
> (Rows are write size when preparing the test files, and columns are read size for fio test.)
> 
> And with O_DIRECT, the result is:
> +--------------------+-----------+-----------+----------+----------+
> | prepare-write\read |     1k    |     4k    |    8K    |   16K    |
> +--------------------+-----------+-----------+----------+----------+
> |         1K         | 24.1MiB/s | 96.5MiB/s | 193MiB/s |          |
> |         4K         | 14.4MiB/s | 57.6MiB/s | 116MiB/s | 230MiB/s |
> |         8K         | 24.6MiB/s | 97.6MiB/s | 192MiB/s | 309MiB/s |
> |        16K         | 26.4MiB/s |  104MiB/s | 206MiB/s | 402MiB/s |
> +--------------------+-----------+-----------+----------+----------+
> 
> code to prepare the test files:
> 	#define KN 8 //<- adjust this for each row
> 	char name[32];
> 	char buf[1024*KN];
> 	int main() {
> 		int i, m = 1024*1024/KN, k, df;
> 		for (i=0; i<8; i++) {
> 			sprintf(name, "test.%d.0", i);
> 			fd = open(name, O_CREAT|O_DIRECT|O_SYNC|O_TRUNC|O_WRONLY);
> 			for (k=0; k<m; k++) write(fd, buf, sizeof(buf));
> 			close(fd);
> 		}
> 		return 0;
> 	}
> 
> Based on the result:
> 1. The row with prepare-write size 4K stands out, here.
> When files were prepared with write size 4K, the afterwards
>  read performance is worse.  (I did double check the result,
> but it is possible that I miss some affecting factors.);

On small blocksize tests you should be looking at IOPS, not MB/s.

Prepare-write size is the column?

Another factor is that we do merge extents (including checksums); so if
the prepare-write is done sequentially we won't actually end up
with extents of the same size as what we wrote.

I believe there's a knob somewhere to turn off extent merging (module
parameter? it's intended for debugging).

> 2. Without O_DIRECT, read performance seems correlated with the difference
>  between read size and prepare write size, but with O_DIRECT, correlation is not obvious.

So the O_DIRECT and buffered IO paths are very different (in every
filesystem) - you're looking at very different things. They are both
subject to the checksum granularity issue, but in buffered mode we round
up reads to extent size, when filling into the page cache.

Big standard deviation (high tail latency?) is something we'd want to
track down. There's a bunch of time_stats in sysfs, but they're mostly
for the write paths. If you're trying to identify where the latencies
are coming from, we can look at adding some new time stats to isolate.


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-09 13:37     ` Kent Overstreet
@ 2024-09-12  2:39       ` David Wang
  2024-09-12  7:52         ` David Wang
  2024-09-21 16:02       ` David Wang
  1 sibling, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-12  2:39 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel


Hi, 
At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:

>> 
>> Based on the result:
>> 1. The row with prepare-write size 4K stands out, here.
>> When files were prepared with write size 4K, the afterwards
>>  read performance is worse.  (I did double check the result,
>> but it is possible that I miss some affecting factors.);
>
>On small blocksize tests you should be looking at IOPS, not MB/s.
>
>Prepare-write size is the column?
Each row is for a specific prepare-write size, indicated by the first column.

>
>Another factor is that we do merge extents (including checksums); so if
>the prepare-write is done sequentially we won't actually be ending up
>with extents of the same size as what we wrote.
>
>I believe there's a knob somewhere to turn off extent merging (module
>parameter? it's intended for debugging).

I did some debugging: when performance is bad, the conditions
bvec_iter_sectors(iter) != pick.crc.uncompressed_size and
bvec_iter_sectors(iter) != pick.crc.live_size are almost always both true,
while when performance is good (after a "thorough" write), they are true only
a small fraction of the time (~350 out of 1000000).

And when those conditions are true, "bounce" gets set and the code seems to take
a time-consuming path.

I suspect merely reading can never change those conditions, but writing can?

>
>> 2. Without O_DIRECT, read performance seems correlated with the difference
>>  between read size and prepare write size, but with O_DIRECT, correlation is not obvious.
>
>So the O_DIRECT and buffered IO paths are very different (in every
>filesystem) - you're looking at very different things. They are both
>subject to the checksum granularity issue, but in buffered mode we round
>up reads to extent size, when filling into the page cache.
>
>Big standard deviation (high tail latency?) is something we'd want to
>track down. There's a bunch of time_stats in sysfs, but they're mostly
>for the write paths. If you're trying to identify where the latencies
>are coming from, we can look at adding some new time stats to isolate.


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-12  2:39       ` David Wang
@ 2024-09-12  7:52         ` David Wang
  0 siblings, 0 replies; 14+ messages in thread
From: David Wang @ 2024-09-12  7:52 UTC (permalink / raw)
  To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel


Hi, 

> I made some debug, when performance is bad, the conditions
> bvec_iter_sectors(iter) != pick.crc.uncompressed_size and 
> bvec_iter_sectors(iter) != pick.crc.live_size are "almost" always both "true",
> while when performance is good (after "thorough" write), they are only little
> percent (~350 out of 1000000)  to be true.
> 
> And if those conditions are "true", "bounce" would be set and code seems to run
> on a time consuming path.
> 
> I suspect "merely read" could never change those conditions, but "write" can?
> 

More updates:

1. Without a "thorough" write, it seems that no matter what the prepare-write size is,
crc.compressed_size is always 128 sectors = 64K?
2. With a "thorough" write with a 4K block size, crc.compressed_size mostly decreases to 4K,
with only a few extents left at 8/12/16/20K...
3. If a 4K thorough-write is followed by a 40K thorough-write, crc.compressed_size then
increases to 40K, and 4K direct reads suffer again....
4. A 40K thorough-write followed by a 256K thorough-write only increases
crc.compressed_size to 64K; I guess 64K is the maximum crc.compressed_size.


So I think the current conclusions are:
1. The initial crc.compressed_size is always 64K when a file is created/prepared.
2. Subsequent writes can change the crc size based on the write size. (Optimized for writes?)
3. Direct read performance is sensitive to this crc size; more test results:
	+-----------+--------+----------+
	| rand read |  IOPS  |    BW    |
	+-----------+--------+----------+
	|   4K !E   | 24.7K  | 101MB/s  |
	|   16K !E  | 24.7K  | 404MB/s  |
	|   64K !E  | 24.7K  | 1617MB/s |
	|    4K E   | ~220K  | ~900MB/s |
	|   16K E   |  ~55K  | ~900MB/s |
	|   64K E   | ~13.8K | ~900MB/s |
	+-----------+--------+----------+
E stands for the event that a "thorough" 4K write happened before the test.
Or to put it more specifically:
E: lots of random 4K writes, crc.compressed_size = 4K
!E: file was just created, crc.compressed_size = 64K


The behavior seems reasonable from the write path's point of view, but for reads it
does not sound good.... With an mmapped read-only file, if a fault pages in fewer than
16 pages, the extra data fetched wastes lots of disk bandwidth.


David



* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-09 13:37     ` Kent Overstreet
  2024-09-12  2:39       ` David Wang
@ 2024-09-21 16:02       ` David Wang
  2024-09-21 16:12         ` Kent Overstreet
  1 sibling, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-21 16:02 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

Hi, 

At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:

>
>Big standard deviation (high tail latency?) is something we'd want to
>track down. There's a bunch of time_stats in sysfs, but they're mostly
>for the write paths. If you're trying to identify where the latencies
>are coming from, we can look at adding some new time stats to isolate.

About performance, I have a theory based on some observations I made recently:
when a user-space app makes a 4K (8-sector) direct write,
bcachefs initiates a write request of ~11 sectors, including the checksum data, right?
This may not be a good offset+size pattern for block-layer performance.
(I did get very bad performance on ext4 when writing with a 5K size.)

So I think: would it be feasible to put checksum sectors on a 4/8-sector boundary?
This would waste more disk space, but might make the block layer happy?


Thanks
David  


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-21 16:02       ` David Wang
@ 2024-09-21 16:12         ` Kent Overstreet
  2024-09-22  1:39           ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Kent Overstreet @ 2024-09-21 16:12 UTC (permalink / raw)
  To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
> Hi, 
> 
> At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
> 
> >
> >Big standard deviation (high tail latency?) is something we'd want to
> >track down. There's a bunch of time_stats in sysfs, but they're mostly
> >for the write paths. If you're trying to identify where the latencies
> >are coming from, we can look at adding some new time stats to isolate.
> 
> About performance, I have a theory based on some observation I made recently:
> When user space app make a 4k(8 sectors) direct write, 
> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
> This may not be a good offset+size pattern of block layer for performance.  
> (I did get a very-very bad performance on ext4 if write with 5K size.)

The checksum isn't inline with the data, it's stored with the pointer -
so if you're seeing 11 sector writes, something really odd is going
on...

I would suggest doing some testing with data checksums off first, to
isolate the issue; then it sounds like that IO pattern needs to be
looked at.

Check the extents btree in debugfs as well, to make sure the extents are
getting written out as you think they are.


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-21 16:12         ` Kent Overstreet
@ 2024-09-22  1:39           ` David Wang
  2024-09-22  8:31             ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-22  1:39 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

Hi, 

At 2024-09-22 00:12:01, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>> Hi, 
>> 
>> At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>> 
>> >
>> >Big standard deviation (high tail latency?) is something we'd want to
>> >track down. There's a bunch of time_stats in sysfs, but they're mostly
>> >for the write paths. If you're trying to identify where the latencies
>> >are coming from, we can look at adding some new time stats to isolate.
>> 
>> About performance, I have a theory based on some observation I made recently:
>> When user space app make a 4k(8 sectors) direct write, 
>> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
>> This may not be a good offset+size pattern of block layer for performance.  
>> (I did get a very-very bad performance on ext4 if write with 5K size.)
>
>The checksum isn't inline with the data, it's stored with the pointer -
>so if you're seeing 11 sector writes, something really odd is going
>on...
>

.... This really contradicts my observations:
1. fio stats show an average of 50K IOPS for a 400-second random direct-write test.
2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per second is also 50K.
(From this I conclude the performance issue is not caused by extra IOPS for checksums.)
3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk "busy" time per second is about ~0.9 seconds, similar to the result of the ext4 test.
(From this I conclude the performance issue is not caused by failing to push the disk hard enough.)
4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes completed) over a 5-minute interval is 11 sectors/write.
(This is why I formed the theory that the checksum is stored with the raw data...... I thought it was reasonable...)

I will write some debug code to collect sector-number patterns.

 


>I would suggest doing some testing with data checksums off first, to
>isolate the issue; then it sounds like that IO pattern needs to be
>looked at.

I will try it. 
 
>
>Check the extents btree in debugfs as well, to make sure the extents are
>getting written out as you think they are.



Thanks
David


* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite
  2024-09-22  1:39           ` David Wang
@ 2024-09-22  8:31             ` David Wang
  2024-09-22  8:47               ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-22  8:31 UTC (permalink / raw)
  To: 00107082, kent.overstreet; +Cc: linux-bcachefs, linux-kernel

>Hi, 
>
>At 2024-09-22 00:12:01, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>>> Hi, 
>>> 
>>> At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>> >On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>>> 
>>> >
>>> >Big standard deviation (high tail latency?) is something we'd want to
>>> >track down. There's a bunch of time_stats in sysfs, but they're mostly
>>> >for the write paths. If you're trying to identify where the latencies
>>> >are coming from, we can look at adding some new time stats to isolate.
>>> 
>>> About performance, I have a theory based on some observation I made recently:
>>> When user space app make a 4k(8 sectors) direct write, 
>>> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
>>> This may not be a good offset+size pattern of block layer for performance.  
>>> (I did get a very-very bad performance on ext4 if write with 5K size.)
>>
>>The checksum isn't inline with the data, it's stored with the pointer -
>>so if you're seeing 11 sector writes, something really odd is going
>>on...
>>
>
>.... This really contradicts my observations:
>1. fio stats show an average of 50K IOPS for a 400-second random direct-write test.
>2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per second is also 50K.
>(Here I conclude the performance issue is not caused by extra IOPS for checksums.)
>3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk "busy" time per second is about ~0.9 seconds, similar to the result of the ext4 test.
>(Here I conclude the performance issue is not caused by failing to push the disk hard enough.)
>4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes completed) over a 5-minute interval is 11 sectors/write.
>(This is why I formed the theory that the checksum is stored with the raw data...... I thought it was reasonable...)
>
>I will make some debug code to collect sector number patterns.
>

I collected sector counts at the beginning of submit_bio in block/blk-core.c.
It turns out my guess was totally wrong: the user data is a clean 8 sectors, and the ~11 sectors
I observed was just the average sectors per write. Sorry, I assumed too much; I thought each user write
would be accompanied by a checksum write.....
During a stress direct-4K-write test, the top-20 write sector-count pattern is:
	+---------+------------+
	| sectors | percentage |
	+---------+------------+
	|    8    |  97.637%   |
	|    1    |   0.813%   |   
	|   510   |   0.315%   |  <== large <--journal_write_submit
	|    4    |   0.123%   |
	|    3    |   0.118%   |
	|    2    |   0.117%   |
	|   508   |   0.113%   |  <==
	|   509   |   0.094%   |  <==
	|    5    |   0.075%   |
	|    6    |   0.037%   |
	|   507   |   0.032%   |  <==
	|    14   |   0.024%   |
	|    13   |   0.020%   |
	|    11   |   0.020%   |
	|    15   |   0.020%   |
	|    10   |   0.020%   |
	|    16   |   0.018%   |
	|    12   |   0.018%   |
	|    7    |   0.017%   |
	|    20   |   0.017%   |
	+---------+------------+

The btree_io write pattern, collected from btree_node_write_endio,
is distributed kind of uniformly/flat, not on block-friendly size
boundaries (I think):
	+---------+------------+
	| sectors | percentage |
	+---------+------------+
	|    1    |   9.021%   |
	|    3    |   1.440%   |
	|    4    |   1.249%   |
	|    2    |   1.157%   |
	|    5    |   0.804%   |
	|    6    |   0.409%   |
	|    14   |   0.259%   |
	|    15   |   0.253%   |
	|    16   |   0.228%   |
	|    7    |   0.226%   |
	|    11   |   0.223%   |
	|    10   |   0.223%   |
	|    13   |   0.222%   |
	|    9    |   0.213%   |
	|    12   |   0.202%   |
	|    41   |   0.194%   |
	|    17   |   0.183%   |
	|    8    |   0.182%   |
	|    18   |   0.167%   |
	|    20   |   0.167%   |
	|    19   |   0.163%   |
	|    21   |   0.160%   |
	|   205   |   0.158%   |
	|    22   |   0.145%   |
	|    23   |   0.117%   |
	|    24   |   0.093%   |
	|    51   |   0.089%   |
	|    25   |   0.080%   |
	|   204   |   0.079%   |
	+---------+------------+


Now, it seems that journal_io's big chunks of IO and btree_io's
irregular IO sizes are the main factors halving direct-4K-write
user-IO bandwidth compared with ext4.


Maybe btree_io's irregular IO size could be regularized?

> 
>
>
>>I would suggest doing some testing with data checksums off first, to
>>isolate the issue; then it sounds like that IO pattern needs to be
>>looked at.
>
>I will try it. 

I format  partition with
`sudo bcachefs format --metadata_checksum=none --data_checksum=none /dev/nvme0n1p1`

It doesn't help write performance significantly:
"IOPS=53.3k, BW=208MiB/s" --> "IOPS=55.3k, BW=216MiB/s",
and the btree writes' irregular IO size pattern still shows up.

But it does improve direct-4K-read performance significantly; I guess that
is expected, considering no extra data needs to be fetched for each read.

> 
>>
>>Check the extents btree in debugfs as well, to make sure the extents are
>>getting written out as you think they are.


Thanks
David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite
  2024-09-22  8:31             ` David Wang
@ 2024-09-22  8:47               ` David Wang
  0 siblings, 0 replies; 14+ messages in thread
From: David Wang @ 2024-09-22  8:47 UTC (permalink / raw)
  To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel


At 2024-09-22 16:31:48, "David Wang" <00107082@163.com> wrote:
>>Hi, 
>>

>The btree_io write pattern, collected from btree_node_write_endio,
>is fairly uniform/flat, and the sizes do not fall on block-friendly
>boundaries (I think):
>	+---------+------------+
>	| sectors | percentage |
>	+---------+------------+
>	|    1    |   9.021%   |
>	|    3    |   1.440%   |
>	|    4    |   1.249%   |
>	|    2    |   1.157%   |
>	|    5    |   0.804%   |
>	|    6    |   0.409%   |
>	|    14   |   0.259%   |
>	|    15   |   0.253%   |
>	|    16   |   0.228%   |
>	|    7    |   0.226%   |
>	|    11   |   0.223%   |
>	|    10   |   0.223%   |
>	|    13   |   0.222%   |
>	|    9    |   0.213%   |
>	|    12   |   0.202%   |
>	|    41   |   0.194%   |
>	|    17   |   0.183%   |
>	|    8    |   0.182%   |
>	|    18   |   0.167%   |
>	|    20   |   0.167%   |
>	|    19   |   0.163%   |
>	|    21   |   0.160%   |
>	|   205   |   0.158%   |
>	|    22   |   0.145%   |
>	|    23   |   0.117%   |
>	|    24   |   0.093%   |
>	|    51   |   0.089%   |
>	|    25   |   0.080%   |
>	|   204   |   0.079%   |
>	+---------+------------+
>

Oops... I used the wrong weight to calculate the percentages; it should be:
+---------+------------+
| sectors | percentage |
+---------+------------+
|    1    |  45.105%   |
|    3    |   7.200%   |
|    4    |   6.244%   |
|    2    |   5.785%   |
|    5    |   4.018%   |
|    6    |   2.045%   |
|    14   |   1.296%   |
|    15   |   1.264%   |
|    16   |   1.141%   |
|    7    |   1.129%   |
|    11   |   1.117%   |
|    10   |   1.113%   |
|    13   |   1.111%   |
|    9    |   1.065%   |
|    12   |   1.011%   |
|    41   |   0.971%   |
|    17   |   0.913%   |
|    8    |   0.912%   |
|    18   |   0.836%   |
|    20   |   0.835%   |
|    19   |   0.812%   |
|    21   |   0.799%   |
|   205   |   0.791%   |
|    22   |   0.724%   |
|    23   |   0.587%   |
|    24   |   0.465%   |
|    51   |   0.443%   |
|    25   |   0.398%   |
|   204   |   0.396%   |
+---------+------------+


David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-07 10:34   ` David Wang
  2024-09-09 13:37     ` Kent Overstreet
@ 2024-09-24 11:08     ` David Wang
  2024-09-24 11:30       ` Kent Overstreet
  1 sibling, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-24 11:08 UTC (permalink / raw)
  To: kent.overstreet; +Cc: 00107082, linux-bcachefs, linux-kernel

Hi, 

At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
>At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>That's because checksums are at extent granularity, not block: if you're
>>doing O_DIRECT reads that are smaller than the writes the data was
>>written with, performance will be bad because we have to read the entire
>>extent to verify the checksum.
>
>

>Based on the result:
>1. The row with prepare-write size 4K stands out, here.
>When files were prepared with write size 4K, the afterwards
> read performance is worse.  (I did double check the result,
>but it is possible that I miss some affecting factors.);
>2. Without O_DIRECT, read performance seems correlated with the difference
> between read size and prepare write size, but with O_DIRECT, correlation is not obvious.
>
>And, to mention it again, if I overwrite the files **thoroughly** with fio write test
>(using same size), the read performance afterwards would be very good:
>

An update on some IO patterns (bio start address and size, in sectors,
bucketed with address &= -address), between bcachefs and the block layer:

4K-Direct-Read a file created by loop of `write(fd, buf, 1024*4)`:
+--------------------------+--------+--------+--------+--------+---------+
|       offset\size        |   1    |   6    |   7    |   8    |   128   |
+--------------------------+--------+--------+--------+--------+---------+
|                        1 | 0.015% | 0.003% |   -    |   -    |    -    |
|                       10 | 0.008% | 0.001% |   -    | 0.000% |    -    |
|                      100 | 0.003% | 0.001% | 0.000% |   -    |    -    |
|                     1000 | 0.002% | 0.000% |   -    |   -    |    -    |
|                    10000 | 0.001% | 0.000% |   -    |   -    |    -    |
|                   100000 | 0.000% |   -    |   -    |   -    |    -    |
|                  1000000 | 0.000% |   -    |   -    |   -    |    -    |
|                 10000000 | 0.000% |   -    |   -    |   -    | 49.989% |
|                100000000 | 0.001% |   -    |   -    |   -    | 24.994% |
|               1000000000 |   -    |   -    |   -    |   -    | 12.486% |
|              10000000000 |   -    |   -    |   -    |   -    |  6.253% |
|             100000000000 |   -    |   -    |   -    |   -    |  3.120% |
|            1000000000000 |   -    | 0.000% |   -    |   -    |  1.561% |
|           10000000000000 |   -    |   -    |   -    |   -    |  0.781% |
|          100000000000000 |   -    |   -    |   -    |   -    |  0.391% |
|         1000000000000000 |   -    |   -    |   -    |   -    |  0.195% |
|        10000000000000000 |   -    |   -    |   -    |   -    |  0.098% |
|       100000000000000000 |   -    |   -    |   -    |   -    |  0.049% |
|      1000000000000000000 |   -    |   -    |   -    |   -    |  0.024% |
|     10000000000000000000 |   -    |   -    |   -    |   -    |  0.013% |
|    100000000000000000000 |   -    |   -    |   -    |   -    |  0.006% |
|  10000000000000000000000 |   -    |   -    |   -    |   -    |  0.006% |
+--------------------------+--------+--------+--------+--------+---------+

4K-Direct-Read a file created by `dd if=/dev/urandom ...`
+--------------------------+---------+
|       offset\size        |   128   |
+--------------------------+---------+
|                 10000000 | 50.003% |
|                100000000 | 24.993% |
|               1000000000 | 12.508% |
|              10000000000 |  6.252% |
|             100000000000 |  3.118% |
|            1000000000000 |  1.561% |
|           10000000000000 |  0.782% |
|          100000000000000 |  0.391% |
|         1000000000000000 |  0.196% |
|        10000000000000000 |  0.098% |
|       100000000000000000 |  0.049% |
|      1000000000000000000 |  0.025% |
|     10000000000000000000 |  0.012% |
|    100000000000000000000 |  0.006% |
|   1000000000000000000000 |  0.006% |
+--------------------------+---------+

4K-Direct-Read a file which is *overwritten* by random fio 4k-direct-write for 10 minutes
+--------------------------+---------+--------+--------+
|       offset\size        |    8    |   16   |   24   |
+--------------------------+---------+--------+--------+
|                     1000 | 49.912% | 0.028% | 0.004% |
|                    10000 | 25.024% | 0.018% | 0.001% |
|                   100000 | 12.507% | 0.012% | 0.001% |
|                  1000000 |  6.273% | 0.002% | 0.001% |
|                 10000000 |  3.121% | 0.002% |   -    |
|                100000000 |  1.548% |   -    |   -    |
|               1000000000 |  0.778% | 0.001% |   -    |
|              10000000000 |  0.386% |   -    |   -    |
|             100000000000 |  0.194% |   -    |   -    |
|            1000000000000 |  0.098% |   -    |   -    |
|           10000000000000 |  0.046% |   -    |   -    |
|          100000000000000 |  0.023% |   -    |   -    |
|         1000000000000000 |  0.011% |   -    |   -    |
|        10000000000000000 |  0.006% |   -    |   -    |
|       100000000000000000 |  0.003% |   -    |   -    |
|      1000000000000000000 |  0.002% |   -    |   -    |
|     10000000000000000000 |  0.001% |   -    |   -    |
|  10000000000000000000000 |  0.000% |   -    |   -    |
+--------------------------+---------+--------+--------+


Those 1-sector-size reads in the first IO pattern may need attention? (@Kent)
(The file was created via the following code:
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define KN 4
	char name[32];
	/* O_DIRECT requires an aligned buffer */
	char buf[1024*KN] __attribute__((aligned(4096)));
	int main() {
		int i, m = 1024*1024/KN, k, fd;
		for (i=0; i<1; i++) {
			sprintf(name, "test.%d.0", i);
			/* O_CREAT requires a mode argument */
			fd = open(name, O_CREAT|O_DIRECT|O_SYNC|O_TRUNC|O_WRONLY, 0644);
			if (fd < 0) { perror("open"); return 1; }
			for (k=0; k<m; k++)
				if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
					perror("write");
					break;
				}
			close(fd);
		}
		return 0;
	}
)

I also collected latency between FS and BIO (submit_bio --> bio_endio),
and did not observe a difference between bcachefs and ext4 when the extent size is mostly 4K.
On my SSD, one 4K-direct-read test even shows bcachefs is better:
average 171086ns for ext4, 133304ns for bcachefs.

But for overall performance, from fio's point of view,
bcachefs reaches only half of ext4's, and its CPU usage is much lower
than ext4's: under 60% vs over 90%.
(The bottleneck should be within bcachefs, I guess? But I don't have
any idea how to measure it.)

Glad to hear about those new patches for 6.12,
https://lore.kernel.org/lkml/CAHk-=wh+atcBWa34mDdG1bFGRc28eJas3tP+9QrYXX6C7BX0JQ@mail.gmail.com/T/#m27c78e1f04c556ab064bec06520b8d7fcf4518c5
They really look promising; looking forward to testing them next week!


Thanks
David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-24 11:08     ` David Wang
@ 2024-09-24 11:30       ` Kent Overstreet
  2024-09-24 12:38         ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Kent Overstreet @ 2024-09-24 11:30 UTC (permalink / raw)
  To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Tue, Sep 24, 2024 at 07:08:07PM GMT, David Wang wrote:
> Hi, 
> 
> At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
> >At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >>That's because checksums are at extent granularity, not block: if you're
> >>doing O_DIRECT reads that are smaller than the writes the data was
> >>written with, performance will be bad because we have to read the entire
> >>extent to verify the checksum.
> >
> >
> 
> >Based on the result:
> >1. The row with prepare-write size 4K stands out, here.
> >When files were prepared with write size 4K, the afterwards
> > read performance is worse.  (I did double check the result,
> >but it is possible that I miss some affecting factors.);
> >2. Without O_DIRECT, read performance seems correlated with the difference
> > between read size and prepare write size, but with O_DIRECT, correlation is not obvious.
> >
> >And, to mention it again, if I overwrite the files **thoroughly** with fio write test
> >(using same size), the read performance afterwards would be very good:
> >
> 
> Update some IO pattern (bio start address and size, in sectors, address&=-address),
> between bcachefs and block layer:
> 
> 4K-Direct-Read a file created by loop of `write(fd, buf, 1024*4)`:

You're still testing small reads to big extents. Flip off data
checksumming if you want to test that, or wait for block granular
checksums to land.

I already explained what's going on, so this isn't very helpful.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-24 11:30       ` Kent Overstreet
@ 2024-09-24 12:38         ` David Wang
  0 siblings, 0 replies; 14+ messages in thread
From: David Wang @ 2024-09-24 12:38 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel



At 2024-09-24 19:30:44, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Tue, Sep 24, 2024 at 07:08:07PM GMT, David Wang wrote:
>> Hi, 
>> 
>> At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
>> >At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >>That's because checksums are at extent granularity, not block: if you're
>> >>doing O_DIRECT reads that are smaller than the writes the data was
>> >>written with, performance will be bad because we have to read the entire
>> >>extent to verify the checksum.
>> >
>> >
>> 
>> >Based on the result:
>> >1. The row with prepare-write size 4K stands out, here.
>> >When files were prepared with write size 4K, the afterwards
>> > read performance is worse.  (I did double check the result,
>> >but it is possible that I miss some affecting factors.);
>> >2. Without O_DIRECT, read performance seems correlated with the difference
>> > between read size and prepare write size, but with O_DIRECT, correlation is not obvious.
>> >
>> >And, to mention it again, if I overwrite the files **thoroughly** with fio write test
>> >(using same size), the read performance afterwards would be very good:
>> >
>> 
>> Update some IO pattern (bio start address and size, in sectors, address&=-address),
>> between bcachefs and block layer:
>> 
>> 4K-Direct-Read a file created by loop of `write(fd, buf, 1024*4)`:
>
>You're still testing small reads to big extents. Flip off data
>checksumming if you want to test that, or wait for block granular
>checksums to land.
>
>I already explained what's going on, so this isn't very helpful.

Hi, 

I do understand it now, sorry for the bother.
Mostly I wanted to explain to myself the reason for the difference....

Besides that, I just want to mention that some IOs are only 1 sector in size, which feels strange to me...


David

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-09-24 12:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-06 15:43 [BUG?] bcachefs performance: read is way too slow when a file has no overwrite David Wang
2024-09-06 17:38 ` Kent Overstreet
2024-09-07 10:34   ` David Wang
2024-09-09 13:37     ` Kent Overstreet
2024-09-12  2:39       ` David Wang
2024-09-12  7:52         ` David Wang
2024-09-21 16:02       ` David Wang
2024-09-21 16:12         ` Kent Overstreet
2024-09-22  1:39           ` David Wang
2024-09-22  8:31             ` David Wang
2024-09-22  8:47               ` David Wang
2024-09-24 11:08     ` David Wang
2024-09-24 11:30       ` Kent Overstreet
2024-09-24 12:38         ` David Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox