* [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
@ 2024-09-06 15:43 David Wang
  2024-09-06 17:38 ` Kent Overstreet
  0 siblings, 1 reply; 14+ messages in thread

From: David Wang @ 2024-09-06 15:43 UTC (permalink / raw)
To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel

Hi,

I noticed a very strange performance issue: when running an `fio` direct randread test on a freshly created bcachefs, performance is very bad:

fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randread --runtime=600 --numjobs=8 --time_based=1
...
Run status group 0 (all jobs):
   READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec

But if the files already exist and have already been thoroughly overwritten, read performance is about 850MB+/s, almost 10 times better!

This means that if I copy a file from somewhere else and only make read access afterwards, I get really bad performance. (I copied files from another filesystem and ran the fio read test on them; performance was indeed bad.) Copying prepared files and then using them read-only is quite a normal usage scenario for lots of apps, I think.

I did some profiling. When reading a file that has never been overwritten:

io_submit_one(98.339% 2635814/2680333)
    aio_read(96.756% 2550297/2635814)
        bch2_read_iter(98.190% 2504125/2550297)
            __bch2_read(70.217% 1758320/2504125)
                __bch2_read_extent(74.571% 1311194/1758320)
                    bch2_bio_alloc_pages_pool(72.933% 956297/1311194)  <----- This stands out
                    submit_bio_noacct_nocheck(11.074% 145207/1311194)
                    bio_alloc_bioset(3.823% 50126/1311194)
                    bch2_bkey_pick_read_device(2.157% 28281/1311194)
                    bio_associate_blkg(1.668% 21877/1311194)
                    ...
And when the file has been thoroughly overwritten by a previous readwrite fio session, the profile is:

io_submit_one(97.596% 12373330/12678072)
    aio_read(94.856% 11736821/12373330)
        bch2_read_iter(94.817% 11128518/11736821)
            __bch2_read(70.841% 7883577/11128518)
                __bch2_read_extent(35.572% 2804346/7883577)
                    submit_bio_noacct_nocheck(46.356% 1299974/2804346)
                    bch2_bkey_pick_read_device(8.972% 251601/2804346)
                    bio_associate_blkg(8.067% 226227/2804346)
                    submit_bio_noacct(7.005% 196432/2804346)
                    bch2_trans_unlock(6.241% 175020/2804346)
                    bch2_can_narrow_extent_crcs(3.714% 104157/2804346)
                    local_clock(1.873% 52513/2804346)
                    submit_bio(1.355% 37997/2804346)
                    ...

Both profiles cover the same 10-minute duration at the same sample frequency. Based on the difference between the total sample counts, 2680333 vs 12678072, I suspect bch2_bio_alloc_pages_pool incurs a lot of locking. Here is more detail for bch2_bio_alloc_pages_pool:

bch2_bio_alloc_pages_pool(72.933% 956297/1311194)
    alloc_pages_mpol_noprof(82.644% 790323/956297)
        __alloc_pages_noprof(89.562% 707833/790323)
            get_page_from_freelist(79.801% 564855/707833)
                __rmqueue_pcplist(24.713% 139593/564855)
                post_alloc_hook(15.045% 84983/564855)
                ...
            __next_zones_zonelist(3.578% 25323/707833)
            ...
        policy_nodemask(3.352% 26495/790323)
        ...
    bio_add_page(10.740% 102710/956297)

Thanks~
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-06 15:43 [BUG?] bcachefs performance: read is way too slow when a file has no overwrite David Wang
@ 2024-09-06 17:38 ` Kent Overstreet
  2024-09-07 10:34   ` David Wang
  0 siblings, 1 reply; 14+ messages in thread

From: Kent Overstreet @ 2024-09-06 17:38 UTC (permalink / raw)
To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Fri, Sep 06, 2024 at 11:43:54PM GMT, David Wang wrote:
> Hi,
>
> I notice a very strange performance issue:
> When running `fio direct randread` test on a fresh new bcachefs, the performance is very bad:
> fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=4k --iodepth=64 --size=1G --readwrite=randread --runtime=600 --numjobs=8 --time_based=1
> ...
> Run status group 0 (all jobs):
>    READ: bw=87.0MiB/s (91.2MB/s), 239B/s-14.2MiB/s (239B/s-14.9MB/s), io=1485MiB (1557MB), run=15593-17073msec
>
> But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB+/s,
> almost 10-times better!
>
> This means, if I copy some file from somewhere else, and make read access only afterwards, I would get really bad performance.
> (I copy files from other filesystem, and run fio read test on those files, the performance is indeed bad.)
> Copy some prepared files, and make readonly usage afterwards, this usage scenario is quite normal for lots of apps, I think.

That's because checksums are at extent granularity, not block: if you're
doing O_DIRECT reads that are smaller than the writes the data was
written with, performance will be bad because we have to read the entire
extent to verify the checksum.

Block-granular checksums will come at some point, as an optional feature
(most of the time you don't want them, and you'd prefer more compact
metadata).

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-06 17:38 ` Kent Overstreet
@ 2024-09-07 10:34   ` David Wang
  2024-09-09 13:37     ` Kent Overstreet
  2024-09-24 11:08     ` David Wang
  1 sibling, 2 replies; 14+ messages in thread

From: David Wang @ 2024-09-07 10:34 UTC (permalink / raw)
To: kent.overstreet; +Cc: 00107082, linux-bcachefs, linux-kernel

At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Fri, Sep 06, 2024 at 11:43:54PM GMT, David Wang wrote:
>> [...]
>> But if the files already exist and have already been thoroughly overwritten, the read performance is about 850MB+/s,
>> almost 10-times better!
>> [...]
>
>That's because checksums are at extent granularity, not block: if you're
>doing O_DIRECT reads that are smaller than the writes the data was
>written with, performance will be bad because we have to read the entire
>extent to verify the checksum.
>
>block granular checksums will come at some point, as an optional feature
>(most of the time you don't want them, and you'd prefer more compact
>metadata)

Hi, I made further tests combining different write and read sizes; the results do not confirm the O_DIRECT explanation.

Without O_DIRECT (fio --direct=0 ...), the average read bandwidth is improved, but with a very big standard deviation:

+--------------------+----------+----------+----------+----------+
| prepare-write\read |    1k    |    4k    |    8K    |   16K    |
+--------------------+----------+----------+----------+----------+
|         1K         | 328MiB/s | 395MiB/s | 465MiB/s |          |
|         4K         | 193MiB/s | 219MiB/s | 274MiB/s | 392MiB/s |
|         8K         | 251MiB/s | 280MiB/s | 368MiB/s | 435MiB/s |
|        16K         | 302MiB/s | 380MiB/s | 464MiB/s | 577MiB/s |
+--------------------+----------+----------+----------+----------+

(Rows are the write size used when preparing the test files; columns are the read size for the fio test.)

And with O_DIRECT, the result is:

+--------------------+-----------+-----------+----------+----------+
| prepare-write\read |    1k     |    4k     |    8K    |   16K    |
+--------------------+-----------+-----------+----------+----------+
|         1K         | 24.1MiB/s | 96.5MiB/s | 193MiB/s |          |
|         4K         | 14.4MiB/s | 57.6MiB/s | 116MiB/s | 230MiB/s |
|         8K         | 24.6MiB/s | 97.6MiB/s | 192MiB/s | 309MiB/s |
|        16K         | 26.4MiB/s | 104MiB/s  | 206MiB/s | 402MiB/s |
+--------------------+-----------+-----------+----------+----------+

Code to prepare the test files (fixed up so it compiles: `fd` declared, a mode argument added for O_CREAT, and the buffer aligned for O_DIRECT):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define KN 8  /* <- adjust this for each row (prepare-write size in KiB) */
char name[32];
char buf[1024*KN] __attribute__((aligned(4096)));  /* O_DIRECT needs an aligned buffer */
int main() {
    int i, m = 1024*1024/KN, k, fd;
    for (i=0; i<8; i++) {
        sprintf(name, "test.%d.0", i);
        fd = open(name, O_CREAT|O_DIRECT|O_SYNC|O_TRUNC|O_WRONLY, 0644);
        for (k=0; k<m; k++) write(fd, buf, sizeof(buf));
        close(fd);
    }
    return 0;
}

Based on the results:
1. The row with prepare-write size 4K stands out: when files were prepared with a 4K write size, the subsequent read performance is worse.
(I did double-check the result, but it is possible that I missed some affecting factors.)
2. Without O_DIRECT, read performance seems correlated with the difference between read size and prepare-write size; with O_DIRECT, the correlation is not obvious.

And, to mention it again, if I overwrite the files **thoroughly** with an fio write test (using the same size), the read performance afterwards is very good:

# overwrite the files with randwrite, block size 8k
$ fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=8k --iodepth=64 --size=1G --readwrite=randwrite --runtime=300 --numjobs=8 --time_based=1

# test the read performance with randread, block size 8k
$ fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --bs=8k --iodepth=64 --size=1G --readwrite=randread --runtime=300 --numjobs=8 --time_based=1
...
Run status group 0 (all jobs):
   READ: bw=964MiB/s (1011MB/s), 116MiB/s-123MiB/s (121MB/s-129MB/s), io=283GiB (303GB), run=300004-300005msec

FYI
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-07 10:34   ` David Wang
@ 2024-09-09 13:37     ` Kent Overstreet
  2024-09-12  2:39       ` David Wang
  2024-09-21 16:02       ` David Wang
  1 sibling, 2 replies; 14+ messages in thread

From: Kent Overstreet @ 2024-09-09 13:37 UTC (permalink / raw)
To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
> At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >That's because checksums are at extent granularity, not block: if you're
> >doing O_DIRECT reads that are smaller than the writes the data was
> >written with, performance will be bad because we have to read the entire
> >extent to verify the checksum.
> >
> >block granular checksums will come at some point, as an optional feature
> >(most of the time you don't want them, and you'd prefer more compact
> >metadata)
>
> Hi, I made further tests combining different write and read size, the results
> are not confirming the explanation for O_DIRECT.
>
> [test tables and file-preparation code trimmed]
>
> Based on the result:
> 1. The row with prepare-write size 4K stands out, here.
> When files were prepared with write size 4K, the afterwards
> read performance is worse. (I did double check the result,
> but it is possible that I miss some affecting factors.);

On small blocksize tests you should be looking at IOPS, not MB/s.

Prepare-write size is the column?

Another factor is that we do merge extents (including checksums); so if
the prepare-write is done sequentially we won't actually be ending up
with extents of the same size as what we wrote.

I believe there's a knob somewhere to turn off extent merging (module
parameter? it's intended for debugging).

> 2. Without O_DIRECT, read performance seems correlated with the difference
> between read size and prepare write size, but with O_DIRECT, correlation is not obvious.

So the O_DIRECT and buffered IO paths are very different (in every
filesystem) - you're looking at very different things. They are both
subject to the checksum granularity issue, but in buffered mode we round
up reads to extent size, when filling into the page cache.

Big standard deviation (high tail latency?) is something we'd want to
track down. There's a bunch of time_stats in sysfs, but they're mostly
for the write paths. If you're trying to identify where the latencies
are coming from, we can look at adding some new time stats to isolate.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-09 13:37     ` Kent Overstreet
@ 2024-09-12  2:39       ` David Wang
  2024-09-12  7:52         ` David Wang
  0 siblings, 1 reply; 14+ messages in thread

From: David Wang @ 2024-09-12 2:39 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

Hi,

At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>>
>> Based on the result:
>> 1. The row with prepare-write size 4K stands out.
>> When files were prepared with write size 4K, the afterwards
>> read performance is worse. (I did double check the result,
>> but it is possible that I miss some affecting factors.);
>
>On small blocksize tests you should be looking at IOPS, not MB/s.
>
>Prepare-write size is the column?

Each row is for a specific prepare-write size, indicated by the first column.

>
>Another factor is that we do merge extents (including checksums); so if
>the prepare-write is done sequentially we won't actually be ending up
>with extents of the same size as what we wrote.
>
>I believe there's a knob somewhere to turn off extent merging (module
>parameter? it's intended for debugging).

I did some debugging. When performance is bad, the conditions
bvec_iter_sectors(iter) != pick.crc.uncompressed_size and
bvec_iter_sectors(iter) != pick.crc.live_size are "almost" always both true,
while when performance is good (after a "thorough" write), they are true
only a tiny fraction of the time (~350 out of 1000000).

When those conditions are true, "bounce" is set and the code seems to run
on a time-consuming path.

I suspect a mere read can never change those conditions, but a write can?

>
>> 2. Without O_DIRECT, read performance seems correlated with the difference
>> between read size and prepare write size, but with O_DIRECT, correlation is not obvious.
>
>So the O_DIRECT and buffered IO paths are very different (in every
>filesystem) - you're looking at very different things. They are both
>subject to the checksum granularity issue, but in buffered mode we round
>up reads to extent size, when filling into the page cache.
>
>Big standard deviation (high tail latency?) is something we'd want to
>track down. There's a bunch of time_stats in sysfs, but they're mostly
>for the write paths. If you're trying to identify where the latencies
>are coming from, we can look at adding some new time stats to isolate.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-12  2:39       ` David Wang
@ 2024-09-12  7:52         ` David Wang
  0 siblings, 0 replies; 14+ messages in thread

From: David Wang @ 2024-09-12 7:52 UTC (permalink / raw)
To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel

Hi,

> I did some debugging. When performance is bad, the conditions
> bvec_iter_sectors(iter) != pick.crc.uncompressed_size and
> bvec_iter_sectors(iter) != pick.crc.live_size are "almost" always both true,
> while when performance is good (after a "thorough" write), they are true
> only a tiny fraction of the time (~350 out of 1000000).
>
> When those conditions are true, "bounce" is set and the code seems to run
> on a time-consuming path.
>
> I suspect a mere read can never change those conditions, but a write can?

More updates:
1. Without a "thorough" write, no matter what the prepare-write size is, crc.compressed_size is always 128 sectors = 64K.
2. After a "thorough" write with 4K block size, crc.compressed_size mostly decreases to 4K; only a few extents are left with crc.compressed_size of 8/12/16/20K...
3. If a 4K thorough write is followed by a 40K thorough write, crc.compressed_size then increases to 40K, and 4K direct reads suffer again...
4. A 40K thorough write followed by a 256K thorough write only increases crc.compressed_size to 64K; I guess 64K is the maximum crc.compressed_size.

So I think the current conclusion is:
1. The initial crc.compressed_size is always 64K when a file is created/prepared.
2. Later writes can change the crc size based on the write size (optimized for write?).
3.
Direct read performance is sensitive to this crc size. More test results:

+-----------+--------+----------+
| rand read |  IOPS  |    BW    |
+-----------+--------+----------+
|  4K   !E  | 24.7K  | 101MB/s  |
| 16K   !E  | 24.7K  | 404MB/s  |
| 64K   !E  | 24.7K  | 1617MB/s |
|  4K    E  | ~220K  | ~900MB/s |
| 16K    E  | ~55K   | ~900MB/s |
| 64K    E  | ~13.8K | ~900MB/s |
+-----------+--------+----------+

E stands for the event that a "thorough" 4K write happened before the test. Or, more specifically:
E:  lots of random 4K writes first, so crc.compressed_size = 4K
!E: the file was just created, so crc.compressed_size = 64K

The behavior seems reasonable from the write side, but for reads it does not sound good... For a read-only mmapped file paging in fewer than 16 pages at a time, the extra data wastes a lot of disk bandwidth.

David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-09 13:37     ` Kent Overstreet
  2024-09-12  2:39       ` David Wang
@ 2024-09-21 16:02       ` David Wang
  2024-09-21 16:12         ` Kent Overstreet
  1 sibling, 1 reply; 14+ messages in thread

From: David Wang @ 2024-09-21 16:02 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

Hi,

At 2024-09-09 21:37:35, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>
>Big standard deviation (high tail latency?) is something we'd want to
>track down. There's a bunch of time_stats in sysfs, but they're mostly
>for the write paths. If you're trying to identify where the latencies
>are coming from, we can look at adding some new time stats to isolate.

About performance, I have a theory based on some observations I made recently:
when a user-space app makes a 4K (8-sector) direct write, bcachefs initiates a
write request of ~11 sectors, including the checksum data, right? This may not
be an offset+size pattern the block layer handles well. (I did get very bad
performance on ext4 when writing with a 5K size.)

So would it be feasible to place the checksum sectors on a 4- or 8-sector
boundary? This would waste more disk space, but might make the block layer
happy.

Thanks
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-21 16:02       ` David Wang
@ 2024-09-21 16:12         ` Kent Overstreet
  2024-09-22  1:39           ` David Wang
  0 siblings, 1 reply; 14+ messages in thread

From: Kent Overstreet @ 2024-09-21 16:12 UTC (permalink / raw)
To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
> About performance, I have a theory based on some observation I made recently:
> When user space app make a 4k(8 sectors) direct write,
> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
> This may not be a good offset+size pattern of block layer for performance.
> (I did get a very-very bad performance on ext4 if write with 5K size.)

The checksum isn't inline with the data, it's stored with the pointer -
so if you're seeing 11 sector writes, something really odd is going
on...

I would suggest doing some testing with data checksums off first, to
isolate the issue; then it sounds like that IO pattern needs to be
looked at.

Check the extents btree in debugfs as well, to make sure the extents are
getting written out as you think they are.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-21 16:12         ` Kent Overstreet
@ 2024-09-22  1:39           ` David Wang
  2024-09-22  8:31             ` David Wang
  0 siblings, 1 reply; 14+ messages in thread

From: David Wang @ 2024-09-22 1:39 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

Hi,

At 2024-09-22 00:12:01, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>> About performance, I have a theory based on some observation I made recently:
>> When user space app make a 4k(8 sectors) direct write,
>> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
>> This may not be a good offset+size pattern of block layer for performance.
>> (I did get a very-very bad performance on ext4 if write with 5K size.)
>
>The checksum isn't inline with the data, it's stored with the pointer -
>so if you're seeing 11 sector writes, something really odd is going
>on...

.... This really contradicts my observations:
1. fio stats yield an average of 50K IOPS for a 400-second random direct-write test.
2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per second is also 50K.
(Here I conclude the performance issue is not caused by extra IOPS for checksums.)
3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk "busy" time per second is about ~0.9 seconds, similar to the ext4 result.
(Here I conclude the performance issue is not caused by failing to push the disk hard enough.)
4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes completed) over a 5-minute interval is 11 sectors/write.
(This is why I formed the theory that the checksum is stored with the raw data... I thought it was reasonable...)

I will write some debug code to collect sector-count patterns.

>I would suggest doing some testing with data checksums off first, to
>isolate the issue; then it sounds like that IO pattern needs to be
>looked at.

I will try it.

>
>Check the extents btree in debugfs as well, to make sure the extents are
>getting written out as you think they are.

Thanks
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite
  2024-09-22  1:39           ` David Wang
@ 2024-09-22  8:31             ` David Wang
  2024-09-22  8:47               ` David Wang
  0 siblings, 1 reply; 14+ messages in thread

From: David Wang @ 2024-09-22 8:31 UTC (permalink / raw)
To: 00107082, kent.overstreet; +Cc: linux-bcachefs, linux-kernel

>At 2024-09-22 00:12:01, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>>> About performance, I have a theory based on some observation I made recently:
>>> When user space app make a 4k(8 sectors) direct write,
>>> bcachefs would initiate a write request of ~11 sectors, including the checksum data, right?
>>
>>The checksum isn't inline with the data, it's stored with the pointer -
>>so if you're seeing 11 sector writes, something really odd is going
>>on...
>
>.... This really contradicts my observations:
>1. fio stats yield an average of 50K IOPS for a 400-second random direct-write test.
>2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per second is also 50K.
>(Here I conclude the performance issue is not caused by extra IOPS for checksums.)
>3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk "busy" time per second is about ~0.9 seconds, similar to the ext4 result.
>(Here I conclude the performance issue is not caused by failing to push the disk hard enough.)
>4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes completed) over a 5-minute interval is 11 sectors/write.
>(This is why I formed the theory that the checksum is stored with the raw data... I thought it was reasonable...)
>
>I will write some debug code to collect sector-count patterns.
>

I collected sector counts at the beginning of submit_bio in block/blk-core.c.
It turns out my guess was totally wrong: the user data is a clean 8 sectors,
and the ~11 sectors I observed was just the average sectors per write. Sorry,
I assumed too much; I thought each user write would be accompanied by a
checksum write...

During a stress direct-4K-write test, the top-20 write sector-count pattern is:

+---------+------------+
| sectors | percentage |
+---------+------------+
|       8 |    97.637% |
|       1 |     0.813% |
|     510 |     0.315% |  <== large  <-- journal_write_submit
|       4 |     0.123% |
|       3 |     0.118% |
|       2 |     0.117% |
|     508 |     0.113% |  <==
|     509 |     0.094% |  <==
|       5 |     0.075% |
|       6 |     0.037% |
|     507 |     0.032% |  <==
|      14 |     0.024% |
|      13 |     0.020% |
|      11 |     0.020% |
|      15 |     0.020% |
|      10 |     0.020% |
|      16 |     0.018% |
|      12 |     0.018% |
|       7 |     0.017% |
|      20 |     0.017% |
+---------+------------+

The btree_io write pattern, collected from btree_node_write_endio, is kind of
uniformly/flatly distributed, not on block-friendly size boundaries (I think):

+---------+------------+
| sectors | percentage |
+---------+------------+
|       1 |     9.021% |
|       3 |     1.440% |
|       4 |     1.249% |
|       2 |     1.157% |
|       5 |     0.804% |
|       6 |     0.409% |
|      14 |     0.259% |
|      15 |     0.253% |
|      16 |     0.228% |
|       7 |     0.226% |
|      11 |     0.223% |
|      10 |     0.223% |
|      13 |     0.222% |
|       9 |     0.213% |
|      12 |     0.202% |
|      41 |     0.194% |
|      17 |     0.183% |
|       8 |     0.182% |
|      18 |     0.167% |
|      20 |     0.167% |
|      19 |     0.163% |
|      21 |     0.160% |
|     205 |     0.158% |
|      22 |     0.145% |
|      23 |     0.117% |
|      24 |     0.093% |
|      51 |     0.089% |
|      25 |     0.080% |
|     204 |     0.079% |
+---------+------------+

Now it seems that journal_io's big chunks of IO and btree_io's irregular IO
sizes are the main factors halving direct-4K-write user-IO bandwidth compared
with ext4. Maybe btree_io's irregular IO sizes could be regularized?

>I would suggest doing some testing with data checksums off first, to
>isolate the issue; then it sounds like that IO pattern needs to be
>looked at.

I formatted the partition with `sudo bcachefs format --metadata_checksum=none --data_checksum=none /dev/nvme0n1p1`.
It doesn't help write performance significantly:
"IOPS=53.3k, BW=208MiB/s" --> "IOPS=55.3k, BW=216MiB/s",
and the btree write's irregular IO size pattern still shows up.
But it improves direct-4K-read performance significantly; I guess that is
expected, considering no extra data needs to be fetched for each read.

>
>Check the extents btree in debugfs as well, to make sure the extents are
>getting written out as you think they are.

Thanks
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite
  2024-09-22  8:31             ` David Wang
@ 2024-09-22  8:47               ` David Wang
  0 siblings, 0 replies; 14+ messages in thread

From: David Wang @ 2024-09-22 8:47 UTC (permalink / raw)
To: kent.overstreet; +Cc: linux-bcachefs, linux-kernel

At 2024-09-22 16:31:48, "David Wang" <00107082@163.com> wrote:
>btree_io write pattern, collected from btree_node_write_endio,
>is kind of uniform/flat distributed, not on block-friendly size
>boundaries (I think):
> [table trimmed]

Oops... the wrong weight was used to calculate the percentages; it should be:

+---------+------------+
| sectors | percentage |
+---------+------------+
|       1 |    45.105% |
|       3 |     7.200% |
|       4 |     6.244% |
|       2 |     5.785% |
|       5 |     4.018% |
|       6 |     2.045% |
|      14 |     1.296% |
|      15 |     1.264% |
|      16 |     1.141% |
|       7 |     1.129% |
|      11 |     1.117% |
|      10 |     1.113% |
|      13 |     1.111% |
|       9 |     1.065% |
|      12 |     1.011% |
|      41 |     0.971% |
|      17 |     0.913% |
|       8 |     0.912% |
|      18 |     0.836% |
|      20 |     0.835% |
|      19 |     0.812% |
|      21 |     0.799% |
|     205 |     0.791% |
|      22 |     0.724% |
|      23 |     0.587% |
|      24 |     0.465% |
|      51 |     0.443% |
|      25 |     0.398% |
|     204 |     0.396% |
+---------+------------+

David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-07 10:34 ` David Wang
  2024-09-09 13:37 ` Kent Overstreet
@ 2024-09-24 11:08 ` David Wang
  2024-09-24 11:30 ` Kent Overstreet
  1 sibling, 1 reply; 14+ messages in thread
From: David Wang @ 2024-09-24 11:08 UTC (permalink / raw)
  To: kent.overstreet; +Cc: 00107082, linux-bcachefs, linux-kernel

Hi,

At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
>At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>That's because checksums are at extent granularity, not block: if you're
>>doing O_DIRECT reads that are smaller than the writes the data was
>>written with, performance will be bad because we have to read the entire
>>extent to verify the checksum.
>
>Based on the result:
>1. The row with prepare-write size 4K stands out here.
>   When files were prepared with a 4K write size, the subsequent
>   read performance is worse. (I did double-check the result,
>   but it is possible that I missed some affecting factors.)
>2. Without O_DIRECT, read performance seems correlated with the difference
>   between read size and prepare-write size; with O_DIRECT, the correlation is not obvious.
>
>And, to mention it again, if I overwrite the files **thoroughly** with an fio write test
>(using the same size), the read performance afterwards is very good:
>

Update on the IO pattern (bio start address and size, in sectors, with address &= -address),
between bcachefs and the block layer:

4K-Direct-Read of a file created by a loop of `write(fd, buf, 1024*4)`:

+--------------------------+--------+--------+--------+--------+---------+
|       offset\size        |      1 |      6 |      7 |      8 |     128 |
+--------------------------+--------+--------+--------+--------+---------+
|                        1 | 0.015% | 0.003% |      - |      - |       - |
|                       10 | 0.008% | 0.001% |      - | 0.000% |       - |
|                      100 | 0.003% | 0.001% | 0.000% |      - |       - |
|                     1000 | 0.002% | 0.000% |      - |      - |       - |
|                    10000 | 0.001% | 0.000% |      - |      - |       - |
|                   100000 | 0.000% |      - |      - |      - |       - |
|                  1000000 | 0.000% |      - |      - |      - |       - |
|                 10000000 | 0.000% |      - |      - |      - | 49.989% |
|                100000000 | 0.001% |      - |      - |      - | 24.994% |
|               1000000000 |      - |      - |      - |      - | 12.486% |
|              10000000000 |      - |      - |      - |      - |  6.253% |
|             100000000000 |      - |      - |      - |      - |  3.120% |
|            1000000000000 |      - | 0.000% |      - |      - |  1.561% |
|           10000000000000 |      - |      - |      - |      - |  0.781% |
|          100000000000000 |      - |      - |      - |      - |  0.391% |
|         1000000000000000 |      - |      - |      - |      - |  0.195% |
|        10000000000000000 |      - |      - |      - |      - |  0.098% |
|       100000000000000000 |      - |      - |      - |      - |  0.049% |
|      1000000000000000000 |      - |      - |      - |      - |  0.024% |
|     10000000000000000000 |      - |      - |      - |      - |  0.013% |
|    100000000000000000000 |      - |      - |      - |      - |  0.006% |
|  10000000000000000000000 |      - |      - |      - |      - |  0.006% |
+--------------------------+--------+--------+--------+--------+---------+

4K-Direct-Read of a file created by `dd if=/dev/urandom ...`:

+--------------------------+---------+
|       offset\size        |     128 |
+--------------------------+---------+
|                 10000000 | 50.003% |
|                100000000 | 24.993% |
|               1000000000 | 12.508% |
|              10000000000 |  6.252% |
|             100000000000 |  3.118% |
|            1000000000000 |  1.561% |
|           10000000000000 |  0.782% |
|          100000000000000 |  0.391% |
|         1000000000000000 |  0.196% |
|        10000000000000000 |  0.098% |
|       100000000000000000 |  0.049% |
|      1000000000000000000 |  0.025% |
|     10000000000000000000 |  0.012% |
|    100000000000000000000 |  0.006% |
|   1000000000000000000000 |  0.006% |
+--------------------------+---------+

4K-Direct-Read of a file which was *overwritten* by random fio 4K-direct-write for 10 minutes:

+--------------------------+---------+--------+--------+
|       offset\size        |       8 |     16 |     24 |
+--------------------------+---------+--------+--------+
|                     1000 | 49.912% | 0.028% | 0.004% |
|                    10000 | 25.024% | 0.018% | 0.001% |
|                   100000 | 12.507% | 0.012% | 0.001% |
|                  1000000 |  6.273% | 0.002% | 0.001% |
|                 10000000 |  3.121% | 0.002% |      - |
|                100000000 |  1.548% |      - |      - |
|               1000000000 |  0.778% | 0.001% |      - |
|              10000000000 |  0.386% |      - |      - |
|             100000000000 |  0.194% |      - |      - |
|            1000000000000 |  0.098% |      - |      - |
|           10000000000000 |  0.046% |      - |      - |
|          100000000000000 |  0.023% |      - |      - |
|         1000000000000000 |  0.011% |      - |      - |
|        10000000000000000 |  0.006% |      - |      - |
|       100000000000000000 |  0.003% |      - |      - |
|      1000000000000000000 |  0.002% |      - |      - |
|     10000000000000000000 |  0.001% |      - |      - |
|  10000000000000000000000 |  0.000% |      - |      - |
+--------------------------+---------+--------+--------+

Those reads of 1-sector size in the first IO pattern may need attention? (@Kent)

(The file was created via the following code:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define KN 4
char name[32];
/* O_DIRECT writes need an aligned buffer */
char buf[1024*KN] __attribute__((aligned(4096)));

int main() {
    int i, m = 1024*1024/KN, k, fd;
    for (i=0; i<1; i++) {
        sprintf(name, "test.%d.0", i);
        /* O_CREAT requires the mode argument */
        fd = open(name, O_CREAT|O_DIRECT|O_SYNC|O_TRUNC|O_WRONLY, 0644);
        for (k=0; k<m; k++)
            write(fd, buf, sizeof(buf));
        close(fd);
    }
    return 0;
}
)

I also collected latency between the FS and BIO layers (submit_bio --> bio_endio), and did not
observe a difference between bcachefs and ext4 when the extent size is mostly 4K. On my SSD,
one 4K-direct-read test even shows bcachefs latency is better: average 171086ns for ext4 vs
133304ns for bcachefs.
But for overall performance, from fio's point of view, bcachefs is only half of ext4's, and
its CPU usage is much lower than ext4's: 60%- vs 90%+.
(The bottleneck should be within bcachefs, I guess? But I don't have any idea of how to measure it.)

Glad to hear about those new patches for 6.12:
https://lore.kernel.org/lkml/CAHk-=wh+atcBWa34mDdG1bFGRc28eJas3tP+9QrYXX6C7BX0JQ@mail.gmail.com/T/#m27c78e1f04c556ab064bec06520b8d7fcf4518c5
They really look promising; looking forward to testing them next week~!!

Thanks
David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-24 11:08 ` David Wang
@ 2024-09-24 11:30 ` Kent Overstreet
  2024-09-24 12:38 ` David Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Kent Overstreet @ 2024-09-24 11:30 UTC (permalink / raw)
  To: David Wang; +Cc: linux-bcachefs, linux-kernel

On Tue, Sep 24, 2024 at 07:08:07PM GMT, David Wang wrote:
> Hi,
>
> At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
> >At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >>That's because checksums are at extent granularity, not block: if you're
> >>doing O_DIRECT reads that are smaller than the writes the data was
> >>written with, performance will be bad because we have to read the entire
> >>extent to verify the checksum.
> >
> >Based on the result:
> >1. The row with prepare-write size 4K stands out here.
> >   When files were prepared with a 4K write size, the subsequent
> >   read performance is worse. (I did double-check the result,
> >   but it is possible that I missed some affecting factors.)
> >2. Without O_DIRECT, read performance seems correlated with the difference
> >   between read size and prepare-write size; with O_DIRECT, the correlation is not obvious.
> >
> >And, to mention it again, if I overwrite the files **thoroughly** with an fio write test
> >(using the same size), the read performance afterwards is very good:
> >
>
> Update on the IO pattern (bio start address and size, in sectors, with address &= -address),
> between bcachefs and the block layer:
>
> 4K-Direct-Read of a file created by a loop of `write(fd, buf, 1024*4)`:

You're still testing small reads to big extents. Flip off data
checksumming if you want to test that, or wait for block granular
checksums to land.

I already explained what's going on, so this isn't very helpful.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [BUG?] bcachefs performance: read is way too slow when a file has no overwrite.
  2024-09-24 11:30 ` Kent Overstreet
@ 2024-09-24 12:38 ` David Wang
  0 siblings, 0 replies; 14+ messages in thread
From: David Wang @ 2024-09-24 12:38 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-kernel

At 2024-09-24 19:30:44, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Tue, Sep 24, 2024 at 07:08:07PM GMT, David Wang wrote:
>> Hi,
>>
>> At 2024-09-07 18:34:37, "David Wang" <00107082@163.com> wrote:
>> >At 2024-09-07 01:38:11, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >>That's because checksums are at extent granularity, not block: if you're
>> >>doing O_DIRECT reads that are smaller than the writes the data was
>> >>written with, performance will be bad because we have to read the entire
>> >>extent to verify the checksum.
>> >
>> >Based on the result:
>> >1. The row with prepare-write size 4K stands out here.
>> >   When files were prepared with a 4K write size, the subsequent
>> >   read performance is worse. (I did double-check the result,
>> >   but it is possible that I missed some affecting factors.)
>> >2. Without O_DIRECT, read performance seems correlated with the difference
>> >   between read size and prepare-write size; with O_DIRECT, the correlation is not obvious.
>> >
>> >And, to mention it again, if I overwrite the files **thoroughly** with an fio write test
>> >(using the same size), the read performance afterwards is very good:
>> >
>>
>> Update on the IO pattern (bio start address and size, in sectors, with address &= -address),
>> between bcachefs and the block layer:
>>
>> 4K-Direct-Read of a file created by a loop of `write(fd, buf, 1024*4)`:
>
>You're still testing small reads to big extents. Flip off data
>checksumming if you want to test that, or wait for block granular
>checksums to land.
>
>I already explained what's going on, so this isn't very helpful.

Hi, I do understand it now, sorry for bothering.
Mostly I wanted to explain the difference to myself....
Besides that, I just want to mention there are some IOs of 1-sector size, which feels strange...

David

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-09-24 12:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2024-09-06 15:43 [BUG?] bcachefs performance: read is way too slow when a file has no overwrite David Wang
2024-09-06 17:38 ` Kent Overstreet
2024-09-07 10:34 ` David Wang
2024-09-09 13:37 ` Kent Overstreet
2024-09-12  2:39 ` David Wang
2024-09-12  7:52 ` David Wang
2024-09-21 16:02 ` David Wang
2024-09-21 16:12 ` Kent Overstreet
2024-09-22  1:39 ` David Wang
2024-09-22  8:31 ` David Wang
2024-09-22  8:47 ` David Wang
2024-09-24 11:08 ` David Wang
2024-09-24 11:30 ` Kent Overstreet
2024-09-24 12:38 ` David Wang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox