* [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-23 18:36 UTC
To: Coly Li
Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 17 May 2022, Eric Wheeler wrote:
> /sys/fs/bcache/<cset-uuid>/journal_delay_ms
>     Journal writes will delay for up to this many milliseconds, unless a
>     cache flush happens sooner. Defaults to 100.
>
> I just noticed that journal_delay_ms says "unless a cache flush happens
> sooner", but cache flushes can be re-ordered, so flushing the journal when
> REQ_OP_FLUSH comes through may not be useful -- especially if there is a
> high volume of flushes coming down the pipe, because the flushes could kill
> the NVMe's cache---and maybe the 1.5ms ping is actual flash latency. It
> would flush data and journal.
>
> Maybe there should be a cachedev_noflush sysfs option for those with some
> kind of power-loss protection on their SSDs. It looks like this is
> handled in request.c when these functions call bch_journal_meta():
>
>   1053: static void cached_dev_nodata(struct closure *cl)
>   1263: static void flash_dev_nodata(struct closure *cl)
>
> Coly, can you comment on journal flush semantics with respect to
> performance vs. correctness and crash safety?
>
> Adriano, as a test, you could change this line in search_alloc() in
> request.c:
>
> -       s->iop.flush_journal    = op_is_flush(bio->bi_opf);
> +       s->iop.flush_journal    = 0;
>
> and see how performance changes.

Hi Coly, all:

Can you think of any reason that forcing iop.flush_journal=0 for bcache
devices backed by a non-volatile cache would be unsafe?

If it is safe, then three new sysctl flags to optionally drop flushes
would increase overall bcache performance by avoiding controller flushes,
especially on the spinning disks. These would of course default to 0:

  - noflush_journal - no flush on journal writes
  - noflush_cache   - no flush on normal cache IO writes
  - noflush_bdev    - no flush on normal bdev IO writes

What do you think?

From Coly's iopings:

> # ./ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=144.3 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=84.1 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=71.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=68.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=69.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=68.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=70.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=68.5 us

  ^ Average is 71.1 us.
> # ./ioping -c10 /dev/bcache0 -D -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=127.8 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=67.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=60.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=46.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=52.6 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=43.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=52.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=44.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=52.0 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=44.6 us

  ^ Average is 51.7 us.

Dropping sync write flushes provides a 27% reduction in SSD latency!

--
Eric Wheeler

> Someone correct me if I'm wrong, but I don't think flush_journal=0 will
> affect correctness unless there is a crash. If that /is/ the performance
> problem then it would narrow the scope of this discussion.
>
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can
>    you set your CPU governor to run at full clock speed and then slowest
>    clock speed to see if it is a CPU limit somewhere, as we expect?
>
>    You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure
>    the governor did its job.
>
>    If it scales with CPU then something in bcache is working too hard.
>    Maybe garbage collection? Other devs would need to chime in here to
>    steer the troubleshooting if that is the case.
>
> 5. I'm not sure if garbage collection is the issue, but you might try
>    Mingzhe's dynamic incremental gc patch:
>    https://www.spinics.net/lists/linux-bcache/msg11185.html
>
> 6. Try dm-cache and see if its IO latency is similar to bcache: if it is
>    about the same then that would indicate an issue in the block layer
>    somewhere outside of bcache. If dm-cache is better, then that confirms
>    a bcache issue.
>
> > The cache was configured directly on one of the NVMe partitions (in this
> > case, the first partition). I did several tests using fio and ioping,
> > testing on a partition on the NVMe device, without a partition and
> > directly on the raw block, on a first partition, on the second, with or
> > without configuring bcache. I did all this to remove any doubt as to the
> > method. The results of tests performed directly on the hardware device,
> > without going through bcache, are always fast and similar.
> >
> > But tests in bcache are always slower. If you use writethrough, of
> > course, it gets much worse, because the performance is equal to the raw
> > spinning disk.
>
> > Using writeback improves a lot, but still doesn't use the full speed of
> > NVMe (honestly, much less than full speed).
>
> Indeed, I hope this can be fixed! A 20x improvement in bcache would
> be awesome.
>
> > But I've also noticed that there is a limit on writing sequential data,
> > which is a little more than half of the maximum write rate shown in
> > direct tests by the NVMe device.
>
> For sync, async, or both?
>
> > Processing doesn't seem to be going up like the tests.
>
> What do you mean by "processing"?
>
> -Eric
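A minimal sketch of the governor A/B test described in point 4 of the quoted
mail above. The device path and ioping flags follow the examples in this
thread; the loop itself and the governor names are assumptions (available
governors depend on the cpufreq driver in use):

    #!/bin/bash
    # Measure sync-write latency at the highest and then the lowest CPU clock
    # rate to see whether the 1.5ms ioping result scales with CPU speed.
    DEV=/dev/bcache0    # device under test (assumed)

    for gov in performance powersave; do
        for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
            echo "$gov" > "$c"
        done
        sleep 1
        echo "== governor: $gov =="
        grep MHz /proc/cpuinfo | sort | uniq -c   # confirm the clock actually changed
        ioping -c10 "$DEV" -D -Y -WWW -s4k | tail -n2
    done

If the average latency tracks the governor setting, that would point at a
CPU-bound path inside bcache rather than at the cache device itself.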
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-24  5:34 UTC
To: Eric Wheeler
Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

... wait.

Can someone explain what this is all about?  Devices with power fail
protection will advertise that (using the VWC flag in NVMe, for example)
and we will never send flushes.  So anything that explicitly disables
flushes will generally cause data corruption.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-24 20:14 UTC
To: Christoph Hellwig
Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

Hi Christoph,

On Mon, 23 May 2022, Christoph Hellwig wrote:
> ... wait.
>
> Can someone explain what this is all about?  Devices with power fail
> protection will advertise that (using the VWC flag in NVMe, for example)
> and we will never send flushes.  So anything that explicitly disables
> flushes will generally cause data corruption.

Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache
(instead of the expected ~70us), so perhaps the NVMe flushes were killing
performance if every write was also forcing an erase cycle.

The suggestion was to disable flushes in bcache as a troubleshooting step
to see if that solved the problem, but with the warning that it could be
unsafe.

Questions:

1. If a user knows their disks have a non-volatile cache then is it safe
   to drop flushes?

2. If not, then under what circumstances is it unsafe with a non-volatile
   cache?

3. Since the block layer won't send flushes when the hardware reports that
   the cache is non-volatile, how do you query the device to make sure it
   is reporting correctly?  For NVMe you can get VWC as:

     nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc

   ...but how do you query a block device (like a RAID LUN) to make sure
   it is reporting a non-volatile cache correctly?

--
Eric Wheeler
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Keith Busch @ 2022-05-24 20:34 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Questions:
>
> 1. If a user knows their disks have a non-volatile cache then is it safe
>    to drop flushes?
>
> 2. If not, then under what circumstances is it unsafe with a non-volatile
>    cache?
>
> 3. Since the block layer won't send flushes when the hardware reports that
>    the cache is non-volatile, how do you query the device to make sure it
>    is reporting correctly?  For NVMe you can get VWC as:
>
>      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
>
>    ...but how do you query a block device (like a RAID LUN) to make sure
>    it is reporting a non-volatile cache correctly?

You can check the queue attribute, /sys/block/<disk>/queue/write_cache.
If the value is "write through", then the device is reporting it doesn't
have a volatile cache.  If it is "write back", then it has a volatile
cache.
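For reference, a quick way to dump that attribute for every disk at once,
together with the raw NVMe VWC bit mentioned earlier in the thread -- a
small sketch that assumes nvme-cli is installed and that controllers appear
as /dev/nvme0, /dev/nvme1, and so on:

    # Volatile-cache reporting for every block device:
    grep -H . /sys/block/*/queue/write_cache

    # Raw VWC bit for each NVMe controller (single-digit controllers only,
    # for brevity):
    for ctrl in /dev/nvme[0-9]; do
        echo "== $ctrl =="
        nvme id-ctrl -H "$ctrl" | grep -A1 vwc
    done

If the two disagree (VWC says there is no volatile cache but write_cache
says "write back", or the other way around), that is worth investigating
before touching any flush behavior.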
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-24 21:34 UTC
To: Keith Busch
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Keith Busch wrote:
> On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> > 3. Since the block layer won't send flushes when the hardware reports that
> >    the cache is non-volatile, how do you query the device to make sure it
> >    is reporting correctly?  For NVMe you can get VWC as:
> >      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
> >
> >    ...but how do you query a block device (like a RAID LUN) to make sure
> >    it is reporting a non-volatile cache correctly?
>
> You can check the queue attribute, /sys/block/<disk>/queue/write_cache.
> If the value is "write through", then the device is reporting it doesn't
> have a volatile cache.  If it is "write back", then it has a volatile
> cache.

Thanks, Keith!

Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
to "none", or does the write_cache flag operate independently of the
selected scheduler?

Does the block layer stop sending flushes at the first device in the stack
that is set to "write back"?  For example, if a device mapper target is
writeback, will it strip flushes on the way to the backing device?

This confirms what I have suspected all along: we have an LSI MegaRAID
SAS-3516 where the write policy is "write back" in the LUN, but the cache
is flagged in Linux as write-through:

  ]# cat /sys/block/sdb/queue/write_cache
  write through

I guess this is the correct place to adjust that behavior!

--
Eric Wheeler
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-25  5:20 UTC
To: Eric Wheeler
Cc: Keith Busch, Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
> to "none", or does the write_cache flag operate independently of the
> selected scheduler?

This is completely independent from the scheduler.

> Does the block layer stop sending flushes at the first device in the stack
> that is set to "write back"?  For example, if a device mapper target is
> writeback, will it strip flushes on the way to the backing device?

This is up to the stacking driver.  dm and md tend to pass through flushes
where needed.

> This confirms what I have suspected all along: we have an LSI MegaRAID
> SAS-3516 where the write policy is "write back" in the LUN, but the cache
> is flagged in Linux as write-through:
>
>   ]# cat /sys/block/sdb/queue/write_cache
>   write through
>
> I guess this is the correct place to adjust that behavior!

MegaRAID has had all kinds of unsafe policies in the past, unfortunately.
I'm not even sure all of them could pass through flushes properly if we
asked them to :(
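Since each stacking layer decides this for itself, it can help to look at
the attribute for every device in the stack rather than just the top. A
sketch using lsblk to walk the dependencies (the LV path is an example
taken from later in this thread):

    # Print write_cache for each layer under a stacked device (for example
    # LVM on md on NVMe).  Partitions have no queue/ directory of their own,
    # so errors from them are silenced.
    TOP=/dev/ssd/ssd-test    # example stacked device (assumed)
    lsblk -nso KNAME "$TOP" | sort -u | while read -r d; do
        grep -H . "/sys/block/$d/queue/write_cache" 2>/dev/null
    done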
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-25 18:44 UTC
To: Christoph Hellwig
Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Christoph Hellwig wrote:
> This is up to the stacking driver.  dm and md tend to pass through flushes
> where needed.
>
> > This confirms what I have suspected all along: we have an LSI MegaRAID
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache
> > is flagged in Linux as write-through:
> >
> >   ]# cat /sys/block/sdb/queue/write_cache
> >   write through
> >
> > I guess this is the correct place to adjust that behavior!
>
> MegaRAID has had all kinds of unsafe policies in the past, unfortunately.
> I'm not even sure all of them could pass through flushes properly if we
> asked them to :(

Thanks for the feedback, great info!

In your experience, which SAS/SATA RAID controllers are best behaved in
terms of policies and reporting things like io_opt and
writeback/writethrough to the kernel?
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-26  9:06 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Wed, May 25, 2022 at 11:44:01AM -0700, Eric Wheeler wrote:
> In your experience, which SAS/SATA RAID controllers are best behaved in
> terms of policies and reporting things like io_opt and
> writeback/writethrough to the kernel?

I never had actually good experiences with any of them.  That being said,
I also haven't used one for years.  For SAS or SATA attached to expanders
I've mostly used the mpt2/3 family of controllers, which are doing okay.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-05-28  1:52 UTC
To: Christoph Hellwig
Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> > Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set
> > to "none", or does the write_cache flag operate independently of the
> > selected scheduler?
>
> This is completely independent from the scheduler.
>
> > Does the block layer stop sending flushes at the first device in the stack
> > that is set to "write back"?  For example, if a device mapper target is
> > writeback, will it strip flushes on the way to the backing device?
>
> This is up to the stacking driver.  dm and md tend to pass through flushes
> where needed.
>
> > This confirms what I have suspected all along: we have an LSI MegaRAID
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache
> > is flagged in Linux as write-through:
> >
> >   ]# cat /sys/block/sdb/queue/write_cache
> >   write through

Hi Keith, Christoph:

Adriano, who started this thread (cc'ed), reported that setting
queue/write_cache to "write back" results in much higher latency on his
NVMe than "write through"; I tested a system here and found the same
thing.

Here is Adriano's summary:

  # cat /sys/block/nvme0n1/queue/write_cache
  write through
  # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
  ...
  min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
                               ^^^^ ^^

  # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
  # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
  ...
  min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
                               ^^^^ ^^

Interestingly, Adriano's latency is 24.01x higher and ours is 23.97x
higher (see below).  These 24x numbers seem too similar to be a
coincidence on such different configurations.  He's running Linux 5.4 and
we are on 4.19.

Is this expected?

More info: the stack where I verified the behavior Adriano reported is
slightly different -- the NVMe's are under md RAID1 with LVM on top, so
latency is higher, but there is still basically the same large latency
difference with writeback enabled:

  ]# cat /sys/block/nvme[01]n1/queue/write_cache
  write through
  write through

  ]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
  ...
  min/avg/max/mdev = 119.1 us / 754.9 us / 2.67 ms / 1.02 ms

  ]# cat /sys/block/nvme[01]n1/queue/write_cache
  write back
  write back

  ]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
  ...
  min/avg/max/mdev = 113.4 us / 18.1 ms / 29.2 ms / 9.53 ms

--
Eric Wheeler
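The comparison above can be scripted so that both settings are measured
back to back on the same device. A sketch built only from commands already
shown in this thread (adjust DEV; note that "write through" makes the block
layer drop flushes and FUA, as discussed elsewhere in the thread, so only
run this on a scratch device or one with power-loss protection):

    #!/bin/bash
    # A/B the queue/write_cache setting and report ioping's summary lines.
    DEV=nvme0n1    # kernel device name, not a full path (assumed)

    for mode in "write through" "write back"; do
        echo "$mode" > "/sys/block/$DEV/queue/write_cache"
        echo "== $mode =="
        ionice -c1 -n1 ioping -c10 "/dev/$DEV" -D -s4k -WWW -Y | tail -n2
    done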
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Keith Busch @ 2022-05-28  3:57 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Hi Keith, Christoph:
>
> Adriano, who started this thread (cc'ed), reported that setting
> queue/write_cache to "write back" results in much higher latency on his
> NVMe than "write through"; I tested a system here and found the same
> thing.
>
> Here is Adriano's summary:
>
>   # cat /sys/block/nvme0n1/queue/write_cache
>   write through
>   # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>   ...
>   min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
>
>   # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
>   # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>   ...
>   min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us

With the "write back" setting, I find that the writes dispatched from
ioping will have the force-unit-access bit set in the commands, so it is
expected to take longer.
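One way to confirm that is to watch the requests as they reach the device
while ioping runs. A sketch assuming blktrace/blkparse are installed; the
meaning of the RWBS flag letters in the comment is an assumption, so check
blkparse(1) for the exact legend:

    # Terminal 1: trace the device; the RWBS column shows per-request flags,
    # and FUA/flush writes should look different (e.g. "WFS"/"FWS" rather
    # than plain "WS") when write_cache is "write back".
    blktrace -d /dev/nvme0n1 -o - | blkparse -i - | grep ' W'

    # Terminal 2: generate the same sync writes as above:
    ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4k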
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-28  4:59 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Adriano, who started this thread (cc'ed), reported that setting
> queue/write_cache to "write back" results in much higher latency on his
> NVMe than "write through"; I tested a system here and found the same
> thing.
>
> [...]
>
> Is this expected?

Once you do that, the block layer ignores all flushes and FUA bits, so
yes, it is going to be a lot faster.  But it is also completely unsafe
because it does not provide any data durability guarantees.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Adriano Silva @ 2022-06-01 19:27 UTC
To: Keith Busch, Eric Wheeler, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

Thank you,

I don't know if my NVMe devices use a 4K LBA format; I don't think so.
They are all the same model and manufacturer.  I know that they work with
512-byte blocks, but their latency is very high when processing blocks of
that size.  In all the tests I do with 4K blocks, the result is much
better, so I always use 4K blocks -- in real life I don't think I'll use
blocks smaller than 4K.

> You can remove the kernel interpretation using passthrough commands.
> Here's an example comparing with and without FUA assuming a 512b logical
> block format:
>
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
>
> If you have a 4k LBA format, use "--block-count=0".
>
> And you may want to run each of the above several times to get an average
> since other factors can affect the reported latency.

I wrote a bash script that runs the two commands you suggested repeatedly
over a 10-second window, to get a more meaningful average.  The result is
the following:

root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write back
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 3027 tests. Latency (us) : min: 29 / avr: 37 / max: 98
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2985 tests. Latency (us) : min: 29 / avr: 37 / max: 111
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2556 tests. Latency (us) : min: 404 / avr: 428 / max: 492
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2521 tests. Latency (us) : min: 403 / avr: 428 / max: 496
root@pve-21:~#
root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 2988 tests. Latency (us) : min: 29 / avr: 37 / max: 114
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2926 tests. Latency (us) : min: 29 / avr: 36 / max: 71
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2456 tests. Latency (us) : min: 31 / avr: 428 / max: 496
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2627 tests. Latency (us) : min: 402 / avr: 428 / max: 509

As we can see above, across almost 3k tests run over ten seconds with each
of the commands, I got even better results than I had already gotten with
ioping.  I also ran the commands individually, but wrote the bash script
to execute many of them in a short period and take an average.  We see an
average of about 37us in every case -- very low!

However, with the suggested --block-count=0 the latency is much higher in
every case, around 428us.
But as we can see, using the nvme command the latency is essentially the
same in every scenario, with or without --force-unit-access; the only
difference is between the two block-count forms of the command (the one
aimed at 4K-LBA devices versus the one that is not).

What do you think?

Thanks,

On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote:

On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> So why the slowness?  Is it just the time spent in kernel code to set FUA
> and Flush Cache bits on writes that would cause all this latency
> increment (84us to 1.89ms)?

I don't think the kernel's handling accounts for that great of a
difference.  I think the difference is probably on the controller side.

The NVMe spec says that a Write command with FUA set:

  "the controller shall write that data and metadata, if any, to
   non-volatile media before indicating command completion."

So if the memory is non-volatile, it can complete the command without
writing to the backing media.  It can also commit the data to the backing
media if it wants to before completing the command, but that's
implementation specific details.

You can remove the kernel interpretation using passthrough commands.
Here's an example comparing with and without FUA assuming a 512b logical
block format:

  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency

If you have a 4k LBA format, use "--block-count=0".

And you may want to run each of the above several times to get an average
since other factors can affect the reported latency.
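Adriano's nvme_write.sh is not included above; a minimal sketch of what
such a 10-second loop might look like, timing each passthrough write from
userspace with date rather than parsing nvme's --latency output, and
defaulting to the 512b example from Keith's mail when no flags are given:

    #!/bin/bash
    # Usage: ./nvme_write.sh [nvme-write flags, e.g. --force-unit-access --block-count=0]
    DEV=/dev/nvme0n1
    [ $# -eq 0 ] && set -- --block-count=7    # default: the 512b-format example

    END=$(( $(date +%s) + 10 ))
    n=0; sum=0; min=999999999; max=0

    while [ "$(date +%s)" -lt "$END" ]; do
        t0=$(date +%s%N)
        echo "" | nvme write "$DEV" --data-size=4k "$@" >/dev/null 2>&1
        t1=$(date +%s%N)
        us=$(( (t1 - t0) / 1000 ))
        n=$(( n + 1 )); sum=$(( sum + us ))
        [ "$us" -lt "$min" ] && min=$us
        [ "$us" -gt "$max" ] && max=$us
    done

    echo "Total: 10 seconds, $n tests. Latency (us) : min: $min / avr: $(( sum / n )) / max: $max"

Wall-clock timing from the shell adds a little overhead of its own, so the
absolute numbers will be slightly higher than what --latency reports, but
the relative comparison between the flag combinations still holds.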
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Eric Wheeler @ 2022-06-01 21:11 UTC
To: Adriano Silva
Cc: Keith Busch, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

On Wed, 1 Jun 2022, Adriano Silva wrote:
> I don't know if my NVMe devices use a 4K LBA format; I don't think so.
> They are all the same model and manufacturer.  I know that they work with
> 512-byte blocks, but their latency is very high when processing blocks of
> that size.

Ok, it should be safe in terms of the possible bcache bug I was referring
to if it supports 512b IOs.

> In all the tests I do with 4K blocks, the result is much better, so I
> always use 4K blocks -- in real life I don't think I'll use blocks
> smaller than 4K.

Makes sense, format with -w 4k.  There is probably some CPU benefit to
having page-aligned IOs, too.

> > You can remove the kernel interpretation using passthrough commands.
> > Here's an example comparing with and without FUA assuming a 512b logical
> > block format:
> >
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> >
> > If you have a 4k LBA format, use "--block-count=0".
> >
> > And you may want to run each of the above several times to get an average
> > since other factors can affect the reported latency.
>
> I wrote a bash script that runs the two commands you suggested repeatedly
> over a 10-second window, to get a more meaningful average.  The result is
> the following:
>
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write back
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 3027 tests. Latency (us) : min: 29 / avr: 37 / max: 98
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2985 tests. Latency (us) : min: 29 / avr: 37 / max: 111
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2556 tests. Latency (us) : min: 404 / avr: 428 / max: 492
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2521 tests. Latency (us) : min: 403 / avr: 428 / max: 496
> root@pve-21:~#
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 2988 tests. Latency (us) : min: 29 / avr: 37 / max: 114
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2926 tests. Latency (us) : min: 29 / avr: 36 / max: 71
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2456 tests. Latency (us) : min: 31 / avr: 428 / max: 496
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2627 tests. Latency (us) : min: 402 / avr: 428 / max: 509
>
> As we can see above, across almost 3k tests run over ten seconds with each
> of the commands, I got even better results than I had already gotten with
> ioping.
> I also ran the commands individually, but wrote the bash script to
> execute many of them in a short period and take an average.  We see an
> average of about 37us in every case -- very low!
>
> However, with the suggested --block-count=0 the latency is much higher in
> every case, around 428us.
>
> But as we can see, using the nvme command the latency is essentially the
> same in every scenario, with or without --force-unit-access; the only
> difference is between the two block-count forms of the command (the one
> aimed at 4K-LBA devices versus the one that is not).
>
> What do you think?

It looks like the NVMe works well except in 512b situations.  It's
interesting that --force-unit-access doesn't increase the latency: perhaps
the NVMe ignores sync flags since it knows it has a non-volatile cache.

-Eric

> Thanks,
>
> On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote:
>
> On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> > So why the slowness?  Is it just the time spent in kernel code to set FUA
> > and Flush Cache bits on writes that would cause all this latency
> > increment (84us to 1.89ms)?
>
> I don't think the kernel's handling accounts for that great of a
> difference.  I think the difference is probably on the controller side.
>
> The NVMe spec says that a Write command with FUA set:
>
>   "the controller shall write that data and metadata, if any, to
>    non-volatile media before indicating command completion."
>
> So if the memory is non-volatile, it can complete the command without
> writing to the backing media.  It can also commit the data to the backing
> media if it wants to before completing the command, but that's
> implementation specific details.
>
> You can remove the kernel interpretation using passthrough commands.
> Here's an example comparing with and without FUA assuming a 512b logical
> block format:
>
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
>
> If you have a 4k LBA format, use "--block-count=0".
>
> And you may want to run each of the above several times to get an average
> since other factors can affect the reported latency.
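To check whether a namespace is actually formatted with a 4K LBA (and
whether a 4K format is available at all), both nvme-cli and sysfs can be
asked. A short sketch, assuming the device naming used earlier in the
thread:

    # List the namespace's LBA formats; the active one is marked "(in use)".
    nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

    # The logical block size the kernel sees:
    cat /sys/block/nvme0n1/queue/logical_block_size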
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-06-02  5:26 UTC
To: Eric Wheeler
Cc: Adriano Silva, Keith Busch, Matthias Ferdinand, Bcache Linux, Coly Li, Christoph Hellwig, linux-block@vger.kernel.org

On Wed, Jun 01, 2022 at 02:11:35PM -0700, Eric Wheeler wrote:
> It looks like the NVMe works well except in 512b situations.  It's
> interesting that --force-unit-access doesn't increase the latency: perhaps
> the NVMe ignores sync flags since it knows it has a non-volatile cache.

NVMe (and other interface) SSDs generally come in two flavors:

 - consumer ones have a volatile write cache, and FUA/Flush has a lot of
   overhead
 - enterprise ones with the grossly misnamed "power loss protection"
   feature have a non-volatile write cache, and FUA/Flush has no overhead
   at all

If this is an enterprise drive the behavior is expected.  If on the other
hand it is a cheap consumer drive, chances are it just lies -- there have
been a few instances of that.
* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
From: Christoph Hellwig @ 2022-05-25  5:17 UTC
To: Eric Wheeler
Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache
> (instead of the expected ~70us), so perhaps the NVMe flushes were killing
> performance if every write was also forcing an erase cycle.

This sounds very typical of a low-end consumer grade NVMe SSD, yes.

> The suggestion was to disable flushes in bcache as a troubleshooting step
> to see if that solved the problem, but with the warning that it could be
> unsafe.

If you want to disable the cache (despite this being unsafe!) you can do
this for every block device:

  echo "write through" > /sys/block/XXX/queue/write_cache

> Questions:
>
> 1. If a user knows their disks have a non-volatile cache then is it safe
>    to drop flushes?

It is, but in that case the disk will not advertise a write cache, and the
flushes will not make it past submit_bio and never reach the driver.

> 3. Since the block layer won't send flushes when the hardware reports that
>    the cache is non-volatile, how do you query the device to make sure it
>    is reporting correctly?  For NVMe you can get VWC as:
>      nvme id-ctrl -H /dev/nvme0 | grep -A1 vwc
>
>    ...but how do you query a block device (like a RAID LUN) to make sure
>    it is reporting a non-volatile cache correctly?

  cat /sys/block/XXX/queue/write_cache