* [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-23 18:36 UTC
To: Coly Li
Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 17 May 2022, Eric Wheeler wrote:
> /sys/fs/bcache/<cset-uuid>/journal_delay_ms
>     Journal writes will delay for up to this many milliseconds, unless a
>     cache flush happens sooner. Defaults to 100.
>
> I just noticed that journal_delay_ms says "unless a cache flush happens
> sooner", but cache flushes can be re-ordered, so flushing the journal when
> REQ_OP_FLUSH comes through may not be useful -- especially if there is a
> high volume of flushes coming down the pipe, because the flushes could kill
> the NVMe's cache---and maybe the 1.5ms ping is actual flash latency. It
> would flush data and journal.
>
> Maybe there should be a cachedev_noflush sysfs option for those with some
> kind of power-loss protection on their SSDs. It looks like this is
> handled in request.c when these functions call bch_journal_meta():
>
>   1053: static void cached_dev_nodata(struct closure *cl)
>   1263: static void flash_dev_nodata(struct closure *cl)
>
> Coly, can you comment on journal flush semantics with respect to
> performance vs. correctness and crash safety?
>
> Adriano, as a test, you could change this line in search_alloc() in
> request.c:
>
> -       s->iop.flush_journal    = op_is_flush(bio->bi_opf);
> +       s->iop.flush_journal    = 0;
>
> and see how performance changes.

Hi Coly, all:

Can you think of any reason that forcing iop.flush_journal=0 for bcache
devices backed by a non-volatile cache would be unsafe?

If it is safe, then three new sysctl flags to optionally drop flushes
would increase overall bcache performance by avoiding controller flushes,
especially on the spinning disks. These would of course default to 0:

  - noflush_journal - no flush on journal writes
  - noflush_cache   - no flush on normal cache IO writes
  - noflush_bdev    - no flush on normal bdev IO writes

What do you think?

From Coly's iopings:

> # ./ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=144.3 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=84.1 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=71.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=68.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=69.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=68.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=70.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=68.5 us

  ^ Average is 71.1 us.
> # ./ioping -c10 /dev/bcache0 -D -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=127.8 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=67.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=60.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=46.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=52.6 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=43.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=52.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=44.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=52.0 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=44.6 us

  ^ Average is 51.7 us.

Dropping sync write flushes provides a 27% reduction in SSD latency!

--
Eric Wheeler

> Someone correct me if I'm wrong, but I don't think flush_journal=0 will
> affect correctness unless there is a crash. If that /is/ the performance
> problem then it would narrow the scope of this discussion.
>
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can
>    you set your CPU governor to run at full clock speed and then slowest
>    clock speed to see if it is a CPU limit somewhere, as we expect?
>
>    You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure
>    the governor did its job.
>
>    If it scales with CPU then something in bcache is working too hard.
>    Maybe garbage collection? Other devs would need to chime in here to
>    steer the troubleshooting if that is the case.
>
> 5. I'm not sure if garbage collection is the issue, but you might try
>    Mingzhe's dynamic incremental gc patch:
>    https://www.spinics.net/lists/linux-bcache/msg11185.html
>
> 6. Try dm-cache and see if its IO latency is similar to bcache: if it is
>    about the same then that would indicate an issue in the block layer
>    somewhere outside of bcache. If dm-cache is better, then that confirms
>    a bcache issue.
>
> > The cache was configured directly on one of the NVMe partitions (in this
> > case, the first partition). I did several tests using fio and ioping,
> > testing on a partition on the NVMe device, without a partition and
> > directly on the raw block, on a first partition, on the second, with or
> > without configuring bcache. I did all this to remove any doubt as to the
> > method. The results of tests performed directly on the hardware device,
> > without going through bcache, are always fast and similar.
> >
> > But tests in bcache are always slower. If you use writethrough, of
> > course, it gets much worse, because the performance is equal to the raw
> > spinning disk.
>
> > Using writeback improves a lot, but still doesn't use the full speed of
> > NVMe (honestly, much less than full speed).
>
> Indeed, I hope this can be fixed! A 20x improvement in bcache would
> be awesome.
>
> > But I've also noticed that there is a limit on writing sequential data,
> > which is a little more than half of the maximum write rate shown in
> > direct tests by the NVMe device.
>
> For sync, async, or both?
>
> > Processing doesn't seem to be going up like the tests.
>
> What do you mean by "processing"?
>
> -Eric
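A minimal sketch of the governor A/B test described in point 4 of the quoted
mail above. The device path and ioping flags follow the examples in this
thread; the loop itself and the governor names are assumptions (available
governors depend on the cpufreq driver in use):

    #!/bin/bash
    # Measure sync-write latency at the highest and then the lowest CPU clock
    # rate to see whether the 1.5ms ioping result scales with CPU speed.
    DEV=/dev/bcache0    # device under test (assumed)

    for gov in performance powersave; do
        for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
            echo "$gov" > "$c"
        done
        sleep 1
        echo "== governor: $gov =="
        grep MHz /proc/cpuinfo | sort | uniq -c   # confirm the clock actually changed
        ioping -c10 "$DEV" -D -Y -WWW -s4k | tail -n2
    done

If the average latency tracks the governor setting, that would point at a
CPU-bound path inside bcache rather than at the cache device itself.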
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-24  5:34 UTC
To: Eric Wheeler
Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

... wait.

Can someone explain what this is all about?  Devices with power fail
protection will advertise that (using the VWC flag in NVMe, for example)
and we will never send flushes.  So anything that explicitly disables
flushes will generally cause data corruption.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-24 20:14 UTC
To: Christoph Hellwig
Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

Hi Christoph,

On Mon, 23 May 2022, Christoph Hellwig wrote:
> ... wait.
>
> Can someone explain what this is all about?  Devices with power fail
> protection will advertise that (using the VWC flag in NVMe, for example)
> and we will never send flushes.  So anything that explicitly disables
> flushes will generally cause data corruption.

Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache
(instead of the expected ~70us), so perhaps the NVMe flushes were killing
performance if every write was also forcing an erase cycle.

The suggestion was to disable flushes in bcache as a troubleshooting step
to see if that solved the problem, but with the warning that it could be
unsafe.

Questions:

1. If a user knows their disks have a non-volatile cache then is it safe
   to drop flushes?

2. If not, then under what circumstances is it unsafe with a non-volatile
   cache?

3. Since the block layer won't send flushes when the hardware reports that
   the cache is non-volatile, how do you query the device to make sure it
   is reporting correctly?  For NVMe you can get VWC as:

     nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc

   ...but how do you query a block device (like a RAID LUN) to make sure
   it is reporting a non-volatile cache correctly?

--
Eric Wheeler
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Keith Busch @ 2022-05-24 20:34 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Questions:
>
> 1. If a user knows their disks have a non-volatile cache then is it safe
>    to drop flushes?
>
> 2. If not, then under what circumstances is it unsafe with a non-volatile
>    cache?
>
> 3. Since the block layer won't send flushes when the hardware reports that
>    the cache is non-volatile, how do you query the device to make sure it
>    is reporting correctly?  For NVMe you can get VWC as:
>
>      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
>
>    ...but how do you query a block device (like a RAID LUN) to make sure
>    it is reporting a non-volatile cache correctly?

You can check the queue attribute, /sys/block/<disk>/queue/write_cache.
If the value is "write through", then the device is reporting it doesn't
have a volatile cache.  If it is "write back", then it has a volatile
cache.
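For reference, a quick way to dump that attribute for every disk at once,
together with the raw NVMe VWC bit mentioned earlier in the thread -- a
small sketch that assumes nvme-cli is installed and that controllers appear
as /dev/nvme0, /dev/nvme1, and so on:

    # Volatile-cache reporting for every block device:
    grep -H . /sys/block/*/queue/write_cache

    # Raw VWC bit for each NVMe controller (single-digit controllers only,
    # for brevity):
    for ctrl in /dev/nvme[0-9]; do
        echo "== $ctrl =="
        nvme id-ctrl -H "$ctrl" | grep -A1 vwc
    done

If the two disagree (VWC says there is no volatile cache but write_cache
says "write back", or the other way around), that is worth investigating
before touching any flush behavior.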
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-24 21:34 UTC
To: Keith Busch
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Keith Busch wrote:
> On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> > 3. Since the block layer won't send flushes when the hardware reports that
> >    the cache is non-volatile, how do you query the device to make sure it
> >    is reporting correctly?  For NVMe you can get VWC as:
> >      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
> >
> >    ...but how do you query a block device (like a RAID LUN) to make sure
> >    it is reporting a non-volatile cache correctly?
>
> You can check the queue attribute, /sys/block/<disk>/queue/write_cache.
> If the value is "write through", then the device is reporting it doesn't
> have a volatile cache.  If it is "write back", then it has a volatile
> cache.

Thanks, Keith!

Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
to "none", or does the write_cache flag operate independently of the
selected scheduler?

Does the block layer stop sending flushes at the first device in the stack
that is set to "write back"?  For example, if a device mapper target is
writeback, will it strip flushes on the way to the backing device?

This confirms what I have suspected all along: we have an LSI MegaRAID
SAS-3516 where the write policy is "write back" in the LUN, but the cache
is flagged in Linux as write-through:

  ]# cat /sys/block/sdb/queue/write_cache
  write through

I guess this is the correct place to adjust that behavior!

--
Eric Wheeler
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-25  5:20 UTC
To: Eric Wheeler
Cc: Keith Busch, Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
> to "none", or does the write_cache flag operate independently of the
> selected scheduler?

This is completely independent from the scheduler.

> Does the block layer stop sending flushes at the first device in the stack
> that is set to "write back"?  For example, if a device mapper target is
> writeback, will it strip flushes on the way to the backing device?

This is up to the stacking driver.  dm and md tend to pass through flushes
where needed.

> This confirms what I have suspected all along: we have an LSI MegaRAID
> SAS-3516 where the write policy is "write back" in the LUN, but the cache
> is flagged in Linux as write-through:
>
>   ]# cat /sys/block/sdb/queue/write_cache
>   write through
>
> I guess this is the correct place to adjust that behavior!

MegaRAID has had all kinds of unsafe policies in the past, unfortunately.
I'm not even sure all of them could pass through flushes properly if we
asked them to :(
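Since each stacking layer decides this for itself, it can help to look at
the attribute for every device in the stack rather than just the top. A
sketch using lsblk to walk the dependencies (the LV path is an example
taken from later in this thread):

    # Print write_cache for each layer under a stacked device (for example
    # LVM on md on NVMe).  Partitions have no queue/ directory of their own,
    # so errors from them are silenced.
    TOP=/dev/ssd/ssd-test    # example stacked device (assumed)
    lsblk -nso KNAME "$TOP" | sort -u | while read -r d; do
        grep -H . "/sys/block/$d/queue/write_cache" 2>/dev/null
    done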
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-25 18:44 UTC
To: Christoph Hellwig
Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Christoph Hellwig wrote:
> This is up to the stacking driver.  dm and md tend to pass through flushes
> where needed.
>
> > This confirms what I have suspected all along: we have an LSI MegaRAID
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache
> > is flagged in Linux as write-through:
> >
> >   ]# cat /sys/block/sdb/queue/write_cache
> >   write through
> >
> > I guess this is the correct place to adjust that behavior!
>
> MegaRAID has had all kinds of unsafe policies in the past, unfortunately.
> I'm not even sure all of them could pass through flushes properly if we
> asked them to :(

Thanks for the feedback, great info!

In your experience, which SAS/SATA RAID controllers are best behaved in
terms of policies and reporting things like io_opt and
writeback/writethrough to the kernel?
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-26  9:06 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Wed, May 25, 2022 at 11:44:01AM -0700, Eric Wheeler wrote:
> In your experience, which SAS/SATA RAID controllers are best behaved in
> terms of policies and reporting things like io_opt and
> writeback/writethrough to the kernel?

I never had actually good experiences with any of them.  That being said,
I also haven't used one for years.  For SAS or SATA attached to expanders
I've mostly used the mpt2/3 family of controllers, which are doing okay.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-28  1:52 UTC
To: Christoph Hellwig
Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> > Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
> > to "none", or does the write_cache flag operate independently of the
> > selected scheduler?
>
> This is completely independent from the scheduler.
>
> > Does the block layer stop sending flushes at the first device in the stack
> > that is set to "write back"?  For example, if a device mapper target is
> > writeback, will it strip flushes on the way to the backing device?
>
> This is up to the stacking driver.  dm and md tend to pass through flushes
> where needed.
>
> > This confirms what I have suspected all along: we have an LSI MegaRAID
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache
> > is flagged in Linux as write-through:
> >
> >   ]# cat /sys/block/sdb/queue/write_cache
> >   write through

Hi Keith, Christoph:

Adriano, who started this thread (cc'ed), reported that setting
queue/write_cache to "write back" results in much higher latency on his
NVMe than "write through"; I tested a system here and found the same
thing.

Here is Adriano's summary:

  # cat /sys/block/nvme0n1/queue/write_cache
  write through
  # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
  ...
  min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
                               ^^^^ ^^

  # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
  # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
  ...
  min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
                               ^^^^ ^^

Interestingly, Adriano's latency is 24.01x higher and ours is 23.97x
higher (see below).  These 24x numbers seem too similar to be a
coincidence on such different configurations.  He's running Linux 5.4 and
we are on 4.19.

Is this expected?

More info: the stack where I verified the behavior Adriano reported is
slightly different -- the NVMe's are under md RAID1 with LVM on top, so
latency is higher, but there is still basically the same large latency
difference with writeback enabled:

  ]# cat /sys/block/nvme[01]n1/queue/write_cache
  write through
  write through

  ]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
  ...
  min/avg/max/mdev = 119.1 us / 754.9 us / 2.67 ms / 1.02 ms

  ]# cat /sys/block/nvme[01]n1/queue/write_cache
  write back
  write back

  ]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
  ...
  min/avg/max/mdev = 113.4 us / 18.1 ms / 29.2 ms / 9.53 ms

--
Eric Wheeler
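The comparison above can be scripted so that both settings are measured
back to back on the same device. A sketch built only from commands already
shown in this thread (adjust DEV; note that "write through" makes the block
layer drop flushes and FUA, as discussed elsewhere in the thread, so only
run this on a scratch device or one with power-loss protection):

    #!/bin/bash
    # A/B the queue/write_cache setting and report ioping's summary lines.
    DEV=nvme0n1    # kernel device name, not a full path (assumed)

    for mode in "write through" "write back"; do
        echo "$mode" > "/sys/block/$DEV/queue/write_cache"
        echo "== $mode =="
        ionice -c1 -n1 ioping -c10 "/dev/$DEV" -D -s4k -WWW -Y | tail -n2
    done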
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Keith Busch @ 2022-05-28  3:57 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Hi Keith, Christoph:
>
> Adriano, who started this thread (cc'ed), reported that setting
> queue/write_cache to "write back" results in much higher latency on his
> NVMe than "write through"; I tested a system here and found the same
> thing.
>
> Here is Adriano's summary:
>
>   # cat /sys/block/nvme0n1/queue/write_cache
>   write through
>   # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>   ...
>   min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
>
>   # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
>   # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>   ...
>   min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us

With the "write back" setting, I find that the writes dispatched from
ioping will have the force-unit-access bit set in the commands, so it is
expected to take longer.
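One way to confirm that is to watch the requests as they reach the device
while ioping runs. A sketch assuming blktrace/blkparse are installed; the
meaning of the RWBS flag letters in the comment is an assumption, so check
blkparse(1) for the exact legend:

    # Terminal 1: trace the device; the RWBS column shows per-request flags,
    # and FUA/flush writes should look different (e.g. "WFS"/"FWS" rather
    # than plain "WS") when write_cache is "write back".
    blktrace -d /dev/nvme0n1 -o - | blkparse -i - | grep ' W'

    # Terminal 2: generate the same sync writes as above:
    ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4k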
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-28  4:59 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Adriano, who started this thread (cc'ed), reported that setting
> queue/write_cache to "write back" results in much higher latency on his
> NVMe than "write through"; I tested a system here and found the same
> thing.
>
> [...]
>
> Is this expected?

Once you do that, the block layer ignores all flushes and FUA bits, so
yes, it is going to be a lot faster.  But it is also completely unsafe
because it does not provide any data durability guarantees.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Adriano Silva @ 2022-06-01 19:27 UTC
To: Keith Busch, Eric Wheeler, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

Thank you,

I don't know if my NVMe devices use a 4K LBA format; I don't think so.
They are all the same model and manufacturer.  I know that they work with
512-byte blocks, but their latency is very high when processing blocks of
that size.  In all the tests I do with 4K blocks, the result is much
better, so I always use 4K blocks -- in real life I don't think I'll use
blocks smaller than 4K.

> You can remove the kernel interpretation using passthrough commands.
> Here's an example comparing with and without FUA assuming a 512b logical
> block format:
>
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
>
> If you have a 4k LBA format, use "--block-count=0".
>
> And you may want to run each of the above several times to get an average
> since other factors can affect the reported latency.

I wrote a bash script that runs the two commands you suggested repeatedly
over a 10-second window, to get a more meaningful average.  The result is
the following:

root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write back
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 3027 tests. Latency (us) : min: 29 / avr: 37 / max: 98
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2985 tests. Latency (us) : min: 29 / avr: 37 / max: 111
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2556 tests. Latency (us) : min: 404 / avr: 428 / max: 492
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2521 tests. Latency (us) : min: 403 / avr: 428 / max: 496
root@pve-21:~#
root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 2988 tests. Latency (us) : min: 29 / avr: 37 / max: 114
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2926 tests. Latency (us) : min: 29 / avr: 36 / max: 71
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2456 tests. Latency (us) : min: 31 / avr: 428 / max: 496
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2627 tests. Latency (us) : min: 402 / avr: 428 / max: 509

As we can see above, across almost 3k tests run over ten seconds with each
of the commands, I got even better results than I had already gotten with
ioping.  I also ran the commands individually, but wrote the bash script
to execute many of them in a short period and take an average.  We see an
average of about 37us in every case -- very low!

However, with the suggested --block-count=0 the latency is much higher in
every case, around 428us.
But as we can see, using the nvme command the latency is essentially the
same in every scenario, with or without --force-unit-access; the only
difference is between the two block-count forms of the command (the one
aimed at 4K-LBA devices versus the one that is not).

What do you think?

Thanks,

On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote:

On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> So why the slowness?  Is it just the time spent in kernel code to set FUA
> and Flush Cache bits on writes that would cause all this latency
> increment (84us to 1.89ms)?

I don't think the kernel's handling accounts for that great of a
difference.  I think the difference is probably on the controller side.

The NVMe spec says that a Write command with FUA set:

  "the controller shall write that data and metadata, if any, to
   non-volatile media before indicating command completion."

So if the memory is non-volatile, it can complete the command without
writing to the backing media.  It can also commit the data to the backing
media if it wants to before completing the command, but that's
implementation specific details.

You can remove the kernel interpretation using passthrough commands.
Here's an example comparing with and without FUA assuming a 512b logical
block format:

  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency

If you have a 4k LBA format, use "--block-count=0".

And you may want to run each of the above several times to get an average
since other factors can affect the reported latency.
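Adriano's nvme_write.sh is not included above; a minimal sketch of what
such a 10-second loop might look like, timing each passthrough write from
userspace with date rather than parsing nvme's --latency output, and
defaulting to the 512b example from Keith's mail when no flags are given:

    #!/bin/bash
    # Usage: ./nvme_write.sh [nvme-write flags, e.g. --force-unit-access --block-count=0]
    DEV=/dev/nvme0n1
    [ $# -eq 0 ] && set -- --block-count=7    # default: the 512b-format example

    END=$(( $(date +%s) + 10 ))
    n=0; sum=0; min=999999999; max=0

    while [ "$(date +%s)" -lt "$END" ]; do
        t0=$(date +%s%N)
        echo "" | nvme write "$DEV" --data-size=4k "$@" >/dev/null 2>&1
        t1=$(date +%s%N)
        us=$(( (t1 - t0) / 1000 ))
        n=$(( n + 1 )); sum=$(( sum + us ))
        [ "$us" -lt "$min" ] && min=$us
        [ "$us" -gt "$max" ] && max=$us
    done

    echo "Total: 10 seconds, $n tests. Latency (us) : min: $min / avr: $(( sum / n )) / max: $max"

Wall-clock timing from the shell adds a little overhead of its own, so the
absolute numbers will be slightly higher than what --latency reports, but
the relative comparison between the flag combinations still holds.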
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-06-01 21:11 UTC
To: Adriano Silva
Cc: Keith Busch, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

On Wed, 1 Jun 2022, Adriano Silva wrote:
> I don't know if my NVMe devices use a 4K LBA format; I don't think so.
> They are all the same model and manufacturer.  I know that they work with
> 512-byte blocks, but their latency is very high when processing blocks of
> that size.

Ok, it should be safe in terms of the possible bcache bug I was referring
to if it supports 512b IOs.

> In all the tests I do with 4K blocks, the result is much better, so I
> always use 4K blocks -- in real life I don't think I'll use blocks
> smaller than 4K.

Makes sense, format with -w 4k.  There is probably some CPU benefit to
having page-aligned IOs, too.

> > You can remove the kernel interpretation using passthrough commands.
> > Here's an example comparing with and without FUA assuming a 512b logical
> > block format:
> >
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> >
> > If you have a 4k LBA format, use "--block-count=0".
> >
> > And you may want to run each of the above several times to get an average
> > since other factors can affect the reported latency.
>
> I wrote a bash script that runs the two commands you suggested repeatedly
> over a 10-second window, to get a more meaningful average.  The result is
> the following:
>
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write back
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 3027 tests. Latency (us) : min: 29 / avr: 37 / max: 98
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2985 tests. Latency (us) : min: 29 / avr: 37 / max: 111
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2556 tests. Latency (us) : min: 404 / avr: 428 / max: 492
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2521 tests. Latency (us) : min: 403 / avr: 428 / max: 496
> root@pve-21:~#
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 2988 tests. Latency (us) : min: 29 / avr: 37 / max: 114
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2926 tests. Latency (us) : min: 29 / avr: 36 / max: 71
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2456 tests. Latency (us) : min: 31 / avr: 428 / max: 496
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2627 tests. Latency (us) : min: 402 / avr: 428 / max: 509
>
> As we can see above, across almost 3k tests run over ten seconds with each
> of the commands, I got even better results than I had already gotten with
> ioping.
> I also ran the commands individually, but wrote the bash script to
> execute many of them in a short period and take an average.  We see an
> average of about 37us in every case -- very low!
>
> However, with the suggested --block-count=0 the latency is much higher in
> every case, around 428us.
>
> But as we can see, using the nvme command the latency is essentially the
> same in every scenario, with or without --force-unit-access; the only
> difference is between the two block-count forms of the command (the one
> aimed at 4K-LBA devices versus the one that is not).
>
> What do you think?

It looks like the NVMe works well except in 512b situations.  It's
interesting that --force-unit-access doesn't increase the latency: perhaps
the NVMe ignores sync flags since it knows it has a non-volatile cache.

-Eric

> Thanks,
>
> On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote:
>
> On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> > So why the slowness?  Is it just the time spent in kernel code to set FUA
> > and Flush Cache bits on writes that would cause all this latency
> > increment (84us to 1.89ms)?
>
> I don't think the kernel's handling accounts for that great of a
> difference.  I think the difference is probably on the controller side.
>
> The NVMe spec says that a Write command with FUA set:
>
>   "the controller shall write that data and metadata, if any, to
>    non-volatile media before indicating command completion."
>
> So if the memory is non-volatile, it can complete the command without
> writing to the backing media.  It can also commit the data to the backing
> media if it wants to before completing the command, but that's
> implementation specific details.
>
> You can remove the kernel interpretation using passthrough commands.
> Here's an example comparing with and without FUA assuming a 512b logical
> block format:
>
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
>
> If you have a 4k LBA format, use "--block-count=0".
>
> And you may want to run each of the above several times to get an average
> since other factors can affect the reported latency.
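To check whether a namespace is actually formatted with a 4K LBA (and
whether a 4K format is available at all), both nvme-cli and sysfs can be
asked. A short sketch, assuming the device naming used earlier in the
thread:

    # List the namespace's LBA formats; the active one is marked "(in use)".
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

    # The logical block size the kernel sees:
    cat /sys/block/nvme0n1/queue/logical_block_size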
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-06-02  5:26 UTC
To: Eric Wheeler
Cc: Adriano Silva, Keith Busch, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

On Wed, Jun 01, 2022 at 02:11:35PM -0700, Eric Wheeler wrote:
> It looks like the NVMe works well except in 512b situations.  It's
> interesting that --force-unit-access doesn't increase the latency: perhaps
> the NVMe ignores sync flags since it knows it has a non-volatile cache.

NVMe (and other interface) SSDs generally come in two flavors:

 - consumer ones have a volatile write cache, and FUA/Flush has a lot of
   overhead
 - enterprise ones with the grossly misnamed "power loss protection"
   feature have a non-volatile write cache, and FUA/Flush has no overhead
   at all

If this is an enterprise drive the behavior is expected.  If on the other
hand it is a cheap consumer drive, chances are it just lies -- there have
been a few instances of that.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-25  5:17 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache
> (instead of the expected ~70us), so perhaps the NVMe flushes were killing
> performance if every write was also forcing an erase cycle.

This sounds very typical of a low-end consumer grade NVMe SSD, yes.

> The suggestion was to disable flushes in bcache as a troubleshooting step
> to see if that solved the problem, but with the warning that it could be
> unsafe.

If you want to disable the cache (despite this being unsafe!) you can do
this for every block device:

  echo "write through" > /sys/block/XXX/queue/write_cache

> Questions:
>
> 1. If a user knows their disks have a non-volatile cache then is it safe
>    to drop flushes?

It is, but in that case the disk will not advertise a write cache, and the
flushes will not make it past submit_bio and never reach the driver.

> 3. Since the block layer won't send flushes when the hardware reports that
>    the cache is non-volatile, how do you query the device to make sure it
>    is reporting correctly?  For NVMe you can get VWC as:
>      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
>
>    ...but how do you query a block device (like a RAID LUN) to make sure
>    it is reporting a non-volatile cache correctly?

  cat /sys/block/XXX/queue/write_cache