* Re: 4.8.8, bcache deadlock and hard lockup
[not found] ` <alpine.LRH.2.11.1611291255350.1914@mail.ewheeler.net>
@ 2016-11-30 16:46 ` Marc MERLIN
2016-11-30 17:16 ` Marc MERLIN
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Marc MERLIN @ 2016-11-30 16:46 UTC (permalink / raw)
To: Eric Wheeler; +Cc: Coly Li, linux-bcache, Btrfs BTRFS
+btrfs mailing list, see below why
On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> On Mon, 27 Nov 2016, Coly Li wrote:
> >
> > Yes, too many work queues... I guess the locking might be caused by some
> > very obscure reference of closure code. I cannot have any clue if I
> > cannot find a stable procedure to reproduce this issue.
> >
> > Hmm, if there is a tool to clone all the meta data of the back end cache
> > and whole cached device, there might be a method to replay the oops much
> > easier.
> >
> > Eric, do you have any hint ?
>
> Note that the backing device doesn't have any metadata, just a superblock.
> You can easily dd that off onto some other volume without transferring the
> data. By default, data starts at 8k, or whatever you used in `make-bcache
> -w`.
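A minimal sketch of what Eric describes, for anyone following along: assuming the default 8 KiB data offset (adjust if `make-bcache` was given a different one), something like the below saves just the superblock region. Here a scratch file stands in for the real backing device (which would be something like /dev/sdX):

```shell
# Grab only the bcache superblock region (first 8 KiB by default) from a
# backing device, without transferring the data behind it.
DEV=$(mktemp)                                   # stand-in for /dev/sdX
dd if=/dev/urandom of="$DEV" bs=4096 count=16 2>/dev/null

dd if="$DEV" of=bcache-sb.img bs=4096 count=2 2>/dev/null  # 2 x 4 KiB = 8 KiB
stat -c %s bcache-sb.img                        # prints 8192
```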
Ok, Linus helped me find a workaround for this problem:
https://lkml.org/lkml/2016/11/29/667
namely:
echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio
(it's a 24GB system, so the defaults of 20 and 10 were creating too many
requests in the buffers)
Note that this is only a workaround, not a fix.
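To put numbers on it (my arithmetic, not from the mails above): a dirty ratio is a percentage of dirtyable memory, so on a 24GB box the defaults allow gigabytes of dirty pages to pile up before writeback is forced:

```shell
# How much dirty data each ratio allows on a 24GB box
# (integer MiB; simplifying assumption: all RAM counts as dirtyable).
MEM_GB=24
for pct in 20 10 2 1; do
  echo "dirty_ratio=${pct}% -> $(( MEM_GB * 1024 * pct / 100 )) MiB dirty allowed"
done
```

So the workaround drops the ceiling from roughly 4.9GB down to under 500MB.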
When I did this and retried my big copy, I still got 100+ kernel
work queues, but apparently the underlying swraid5 was able to unblock
and satisfy the write requests before too many accumulated and crashed
the kernel.
I'm not a kernel coder, but it seems to me that bcache needs a way to
throttle incoming requests if there are too many, so that it does not
end up in a state where things blow up due to too many piled-up requests.
You should be able to reproduce this by taking 5 spinning rust drives,
putting raid5 on top, then dmcrypt, then bcache, and hopefully any
filesystem (although I used btrfs), and sending lots of requests.
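For reference, that whole stack could be stood up with loop devices, along these lines (an entirely hypothetical sketch, not a tested recipe: needs root, device names and sizes are placeholders, and `make-bcache` comes from bcache-tools):

```shell
# Hypothetical repro stack: 5 "drives" (sparse files on loop devices)
# -> raid5 -> dmcrypt -> bcache -> btrfs, then heavy write traffic.
for i in 0 1 2 3 4; do
    truncate -s 10G disk$i.img
    losetup /dev/loop$i disk$i.img
done
mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/loop[0-4]
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 crypt0
make-bcache -B /dev/mapper/crypt0      # backing device; add -C <ssd> for a cache
mkfs.btrfs /dev/bcache0
mount /dev/bcache0 /mnt
# Then hammer it, e.g. with a big btrfs send/receive, or crudely:
#   dd if=/dev/zero of=/mnt/bigfile bs=1M count=100000
```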
Actually to be honest, the problems have mostly been happening when I do
btrfs scrub and btrfs send/receive which both generate I/O from within
the kernel instead of user space.
So here, btrfs may be a contributor to the problem too, but while btrfs
still trashes my system if I remove the caching device on bcache (and
with the default dirty ratio values), it doesn't crash the kernel.
I'll start another separate thread with the btrfs folks on how much
pressure is put on the system, but on your side it would be good to help
ensure that bcache doesn't crash the system altogether if too many
requests are allowed to pile up.
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: 4.8.8, bcache deadlock and hard lockup
2016-11-30 16:46 ` 4.8.8, bcache deadlock and hard lockup Marc MERLIN
@ 2016-11-30 17:16 ` Marc MERLIN
2016-11-30 17:18 ` btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off Marc MERLIN
2016-11-30 23:57 ` 4.8.8, bcache deadlock and hard lockup Eric Wheeler
2 siblings, 0 replies; 12+ messages in thread
From: Marc MERLIN @ 2016-11-30 17:16 UTC (permalink / raw)
To: Eric Wheeler, Linus Torvalds; +Cc: Coly Li, linux-bcache, Btrfs BTRFS
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:
> +btrfs mailing list, see below why
>
> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> > On Mon, 27 Nov 2016, Coly Li wrote:
> > >
> > > Yes, too many work queues... I guess the locking might be caused by some
> > > very obscure reference of closure code. I cannot have any clue if I
> > > cannot find a stable procedure to reproduce this issue.
> > >
> > > Hmm, if there is a tool to clone all the meta data of the back end cache
> > > and whole cached device, there might be a method to replay the oops much
> > > easier.
> > >
> > > Eric, do you have any hint ?
> >
> > Note that the backing device doesn't have any metadata, just a superblock.
> > You can easily dd that off onto some other volume without transferring the
> > data. By default, data starts at 8k, or whatever you used in `make-bcache
> > -w`.
>
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
> echo 2 > /proc/sys/vm/dirty_ratio
> echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in the buffers)
>
> Note that this is only a workaround, not a fix.
Actually, I'm even more worried about the general bcache situation when
caching is enabled. In the message above, Linus wrote:
"One situation where I've seen something like this happen is
(a) lots and lots of dirty data queued up
(b) horribly slow storage
(c) filesystem that ends up serializing on writeback under certain
circumstances
The usual case for (b) in the modern world is big SSD's that have bad
worst-case behavior (ie they may do gbps speeds when doing well, and
then they come to a screeching halt when their buffers fill up and
they have to do rewrites, and their gbps throughput drops to mbps or
lower).
Generally you only find that kind of really nasty SSD in the USB stick
world these days."
Well, come to think of it, this is _exactly_ what bcache will create, by
design. It will swallow a lot of I/O into the SSD cache, until the SSD
buffers fill up and then things hang while bcache struggles to write it
all back to the slower spinning rust storage.
Looks to me like bcache and dirty_ratio need to be synced somehow, or
things will fall over reliably.
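For what it's worth, bcache does expose its own writeback throttling knob that can at least be turned in the same direction as the vm sysctls; a hedged sketch (sysfs paths from my setup, bcache0 is whichever bcache device you have, and knob names may vary by kernel version):

```shell
# Make bcache start flushing dirty data to the backing device sooner:
# writeback_percent is roughly how much of the cache may be dirty
# before background writeback ramps up (default is around 10).
echo 5 > /sys/block/bcache0/bcache/writeback_percent

# How much dirty data is currently sitting in the cache:
cat /sys/block/bcache0/bcache/dirty_data
```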
What do you think?
Thanks,
Marc
> [rest of quoted message snipped]
* Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
2016-11-30 16:46 ` 4.8.8, bcache deadlock and hard lockup Marc MERLIN
2016-11-30 17:16 ` Marc MERLIN
@ 2016-11-30 17:18 ` Marc MERLIN
2016-11-30 18:00 ` Austin S. Hemmelgarn
2016-11-30 23:57 ` 4.8.8, bcache deadlock and hard lockup Eric Wheeler
2 siblings, 1 reply; 12+ messages in thread
From: Marc MERLIN @ 2016-11-30 17:18 UTC (permalink / raw)
To: Btrfs BTRFS
On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:
> +btrfs mailing list, see below why
>
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
> echo 2 > /proc/sys/vm/dirty_ratio
> echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in the buffers)
I'll remove the bcache list on this followup since I want to concentrate
here on the fact that btrfs does behave badly with the default
dirty_ratio values.
As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays
on spinning rust.
swraid5 < bcache < dmcrypt < btrfs
Copying with btrfs send/receive causes massive hangs on the system.
Please see this explanation from Linus on why the workaround was
suggested:
https://lkml.org/lkml/2016/11/29/667
The hangs that I'm getting with the bcache cache turned off (i.e.
passthrough) are now very likely due only to btrfs; they mess up anything
doing file I/O (operations end up timing out), break USB (reads time out
in the middle of USB requests, interrupts get lost), and so forth.
All of this mostly went away with Linus' suggestion:
echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio
But that's hiding the symptom, which I think is that btrfs piles up too
many I/O requests during btrfs send/receive and btrfs scrub (probably
balance too) without looking at the resulting impact on system health.
Is there a way to stop flooding the entire system with I/O and putting
so much strain on it?
(I realize this may be hard if there is a caching layer underneath that
just takes requests and says thank you, without giving other clues that
bad things are happening below, but I'm asking anyway :)
[10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750
[12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32
[17761.122238] usb 4-1.4: USB disconnect, device number 39
[17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108
[17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd
[24130.574425] hpet1: lost 2306 rtc interrupts
[24156.034950] hpet1: lost 1628 rtc interrupts
[24173.314738] hpet1: lost 1104 rtc interrupts
[24180.129950] hpet1: lost 436 rtc interrupts
[24257.557955] hpet1: lost 4954 rtc interrupts
[24267.522656] hpet1: lost 637 rtc interrupts
[28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28035.025429] btrfs D ffff91154d33fc70 0 5618 5372 0x00000080
[28035.047717] ffff91154d33fc70 0000000000200246 ffff911842f880c0 ffff9115a4cf01c0
[28035.071020] ffff91154d33fc58 ffff91154d340000 ffff91165493bca0 ffff9115623773f0
[28035.094252] 0000000000001000 0000000000000001 ffff91154d33fc88 ffffffffb86cf1a6
[28035.117538] Call Trace:
[28035.125791] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28035.141550] [<ffffffffb82bd18e>] btrfs_start_ordered_extent+0xce/0x122
[28035.162457] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28035.180891] [<ffffffffb82bd434>] btrfs_wait_ordered_range+0xa9/0x10d
[28035.201723] [<ffffffffb82aec04>] btrfs_truncate+0x40/0x24b
[28035.219269] [<ffffffffb82af437>] btrfs_setattr+0x1da/0x2d7
[28035.237032] [<ffffffffb81c7507>] notify_change+0x252/0x39c
[28035.254566] [<ffffffffb81ad35b>] do_truncate+0x81/0xb4
[28035.271057] [<ffffffffb81ad467>] vfs_truncate+0xd9/0xf9
[28035.287782] [<ffffffffb81ad4ea>] do_sys_truncate+0x63/0xa7
[28155.781987] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28155.802229] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28155.827894] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28155.852479] btrfs D ffff91154d33fc70 0 5618 5372 0x00000080
[28155.874761] ffff91154d33fc70 0000000000200246 ffff911842f880c0 ffff9115a4cf01c0
[28155.898059] ffff91154d33fc58 ffff91154d340000 ffff91165493bca0 ffff9115623773f0
[28155.921464] 0000000000001000 0000000000000001 ffff91154d33fc88 ffffffffb86cf1a6
[28155.944720] Call Trace:
[28155.953176] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28155.968945] [<ffffffffb82bd18e>] btrfs_start_ordered_extent+0xce/0x122
[28155.989811] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28156.008195] [<ffffffffb82bd434>] btrfs_wait_ordered_range+0xa9/0x10d
[28156.028498] [<ffffffffb82aec04>] btrfs_truncate+0x40/0x24b
[28156.046081] [<ffffffffb82af437>] btrfs_setattr+0x1da/0x2d7
[28156.063621] [<ffffffffb81c7507>] notify_change+0x252/0x39c
[28156.081667] [<ffffffffb81ad35b>] do_truncate+0x81/0xb4
[28156.098732] [<ffffffffb81ad467>] vfs_truncate+0xd9/0xf9
[28156.115489] [<ffffffffb81ad4ea>] do_sys_truncate+0x63/0xa7
[28156.133389] [<ffffffffb81ad63a>] SyS_truncate+0xe/0x10
[28156.149831] [<ffffffffb80039ec>] do_syscall_64+0x61/0x72
[28156.167179] [<ffffffffb86d2265>] entry_SYSCALL64_slow_path+0x25/0x25
[28397.436986] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28397.456798] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28397.482122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28397.506359] btrfs D ffff91154d33fc70 0 5618 5372 0x00000080
[28397.528494] ffff91154d33fc70 0000000000200246 ffff911842fa41c0 ffff9115a4cf01c0
[28397.551645] ffff91154d33fc58 ffff91154d340000 ffff91142ba20140 ffff9115ba3ad7e8
[28397.574829] 0000000000001000 0000000000000001 ffff91154d33fc88 ffffffffb86cf1a6
[28397.598015] Call Trace:
[28397.606120] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28397.621772] [<ffffffffb82bd18e>] btrfs_start_ordered_extent+0xce/0x122
[28397.642376] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28397.660661] [<ffffffffb82bd434>] btrfs_wait_ordered_range+0xa9/0x10d
[28397.680801] [<ffffffffb82aec04>] btrfs_truncate+0x40/0x24b
[28397.698281] [<ffffffffb82af437>] btrfs_setattr+0x1da/0x2d7
[28397.715769] [<ffffffffb81c7507>] notify_change+0x252/0x39c
[28397.733279] [<ffffffffb81ad35b>] do_truncate+0x81/0xb4
[28397.749731] [<ffffffffb81ad467>] vfs_truncate+0xd9/0xf9
[28397.766379] [<ffffffffb81ad4ea>] do_sys_truncate+0x63/0xa7
[28397.783878] [<ffffffffb81ad63a>] SyS_truncate+0xe/0x10
[28397.800319] [<ffffffffb80039ec>] do_syscall_64+0x61/0x72
[28397.817286] [<ffffffffb86d2265>] entry_SYSCALL64_slow_path+0x25/0x25
[28518.264522] INFO: task btrfs:5618 blocked for more than 120 seconds.
[28518.285084] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28518.310600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28518.335590] btrfs D ffff91154d33fc70 0 5618 5372 0x00000080
[28518.357716] ffff91154d33fc70 0000000000200246 ffff911842fa41c0 ffff9115a4cf01c0
[28518.380902] ffff91154d33fc58 ffff91154d340000 ffff91142ba20140 ffff9115ba3ad7e8
[28518.404075] 0000000000001000 0000000000000001 ffff91154d33fc88 ffffffffb86cf1a6
[28518.427227] Call Trace:
[28518.435384] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28518.451151] [<ffffffffb82bd18e>] btrfs_start_ordered_extent+0xce/0x122
[28518.471821] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28518.490186] [<ffffffffb82bd434>] btrfs_wait_ordered_range+0xa9/0x10d
[28518.510301] [<ffffffffb82aec04>] btrfs_truncate+0x40/0x24b
[28518.527818] [<ffffffffb82af437>] btrfs_setattr+0x1da/0x2d7
[28518.545320] [<ffffffffb81c7507>] notify_change+0x252/0x39c
[28518.562924] [<ffffffffb81ad35b>] do_truncate+0x81/0xb4
[28518.579434] [<ffffffffb81ad467>] vfs_truncate+0xd9/0xf9
[28518.596132] [<ffffffffb81ad4ea>] do_sys_truncate+0x63/0xa7
[28518.613687] [<ffffffffb81ad63a>] SyS_truncate+0xe/0x10
[28518.630115] [<ffffffffb80039ec>] do_syscall_64+0x61/0x72
[28518.647003] [<ffffffffb86d2265>] entry_SYSCALL64_slow_path+0x25/0x25
[28518.667036] INFO: task kworker/u16:34:27172 blocked for more than 120 seconds.
[28518.689602] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28518.715144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28518.739569] kworker/u16:34 D ffff9117ec52bc80 0 27172 2 0x00000080
[28518.761628] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[28518.780669] ffff9117ec52bc80 0000000000200246 ffff911842f9e180 ffff91157c386000
[28518.803719] ffff9117ec52bc68 ffff9117ec52c000 ffff9115cbf779f0 ffff9115cbf77800
[28518.826783] 0000000000000000 ffff9116af353500 ffff9117ec52bc98 ffffffffb86cf1a6
[28518.849863] Call Trace:
[28518.858011] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28518.873624] [<ffffffffb82a364f>] wait_current_trans.isra.16+0xcc/0x107
[28518.894158] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28518.912370] [<ffffffffb82a58e1>] start_transaction+0x2ba/0x418
[28518.930827] [<ffffffffb82a5a8a>] btrfs_join_transaction+0x17/0x19
[28518.950105] [<ffffffffb82ad3ec>] btrfs_finish_ordered_io+0x1e9/0x4db
[28518.970132] [<ffffffffb8088000>] ? preempt_latency_start+0x34/0x5d
[28518.989640] [<ffffffffb82ad946>] finish_ordered_fn+0x15/0x17
[28519.007579] [<ffffffffb82cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[28519.027844] [<ffffffffb807b8c1>] ? pwq_activate_delayed_work+0x4d/0x5b
[28519.048420] [<ffffffffb82cff40>] btrfs_endio_write_helper+0xe/0x10
[28519.067907] [<ffffffffb807d068>] process_one_work+0x186/0x29d
[28519.086158] [<ffffffffb807d66d>] worker_thread+0x1ea/0x2ba
[28519.103608] [<ffffffffb807d483>] ? rescuer_thread+0x2d1/0x2d1
[28519.121783] [<ffffffffb8081df3>] kthread+0xb4/0xbc
[28519.137100] [<ffffffffb86d23df>] ret_from_fork+0x1f/0x40
[28519.153938] [<ffffffffb8081d3f>] ? init_completion+0x24/0x24
[28519.171842] INFO: task kworker/u16:8:8909 blocked for more than 120 seconds.
[28519.193688] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[28519.218851] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28519.243006] kworker/u16:8 D ffff9117b992fc80 0 8909 2 0x00000080
[28519.264933] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[28519.284037] ffff9117b992fc80 0000000000200246 ffff911842f9e180 ffff91137ef54100
[28519.307073] ffff9117b992fc68 ffff9117b9930000 ffff9115cbf779f0 ffff9115cbf77800
[28519.330159] 0000000000000000 ffff911282f35e60 ffff9117b992fc98 ffffffffb86cf1a6
[28519.353218] Call Trace:
[28519.361192] [<ffffffffb86cf1a6>] schedule+0x8b/0xa3
[28519.376747] [<ffffffffb82a364f>] wait_current_trans.isra.16+0xcc/0x107
[28519.397211] [<ffffffffb809af6c>] ? wake_up_atomic_t+0x2c/0x2c
[28519.415378] [<ffffffffb82a58e1>] start_transaction+0x2ba/0x418
[28519.433752] [<ffffffffb82a5a8a>] btrfs_join_transaction+0x17/0x19
[28519.452929] [<ffffffffb82ad3ec>] btrfs_finish_ordered_io+0x1e9/0x4db
[28519.472945] [<ffffffffb8088000>] ? preempt_latency_start+0x34/0x5d
[28519.492434] [<ffffffffb82ad946>] finish_ordered_fn+0x15/0x17
[28519.510364] [<ffffffffb82cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[28519.530582] [<ffffffffb807b8c1>] ? pwq_activate_delayed_work+0x4d/0x5b
[28519.551062] [<ffffffffb82cff40>] btrfs_endio_write_helper+0xe/0x10
[28519.570519] [<ffffffffb807d068>] process_one_work+0x186/0x29d
[28519.588711] [<ffffffffb807d66d>] worker_thread+0x1ea/0x2ba
[28519.606111] [<ffffffffb807d483>] ? rescuer_thread+0x2d1/0x2d1
[28519.624269] [<ffffffffb8081df3>] kthread+0xb4/0xbc
[28519.639539] [<ffffffffb86d23df>] ret_from_fork+0x1f/0x40
[28519.656380] [<ffffffffb8081d3f>] ? init_completion+0x24/0x24
* Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
2016-11-30 17:18 ` btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off Marc MERLIN
@ 2016-11-30 18:00 ` Austin S. Hemmelgarn
2016-11-30 18:16 ` Marc MERLIN
0 siblings, 1 reply; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2016-11-30 18:00 UTC (permalink / raw)
To: Marc MERLIN, Btrfs BTRFS
On 2016-11-30 12:18, Marc MERLIN wrote:
> On Wed, Nov 30, 2016 at 08:46:46AM -0800, Marc MERLIN wrote:
>> +btrfs mailing list, see below why
>>
>> Ok, Linus helped me find a workaround for this problem:
>> https://lkml.org/lkml/2016/11/29/667
>> namely:
>> echo 2 > /proc/sys/vm/dirty_ratio
>> echo 1 > /proc/sys/vm/dirty_background_ratio
>> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
>> requests in the buffers)
>
> I'll remove the bcache list on this followup since I want to concentrate
> here on the fact that btrfs does behave badly with the default
> dirty_ratio values.
I will comment that on big systems, almost everything behaves badly with
the default dirty ratios, they're leftovers from when 1GB was a huge
amount of RAM. As usual though, BTRFS has pathological behavior
compared to other options.
> As a reminder, it's a btrfs send/receive copy between 2 swraid5 arrays
> on spinning rust.
> swraid5 < bcache < dmcrypt < btrfs
>
> Copying with btrfs send/receive causes massive hangs on the system.
> Please see this explanation from Linus on why the workaround was
> suggested:
> https://lkml.org/lkml/2016/11/29/667
And Linus' assessment is absolutely correct (at least, the general
assessment is; I have no idea about btrfs_start_shared_extent, but I'm
more than willing to bet he's correct that that's the culprit).
>
> The hangs that I'm getting with bcache cache turned off (i.e.
> passthrough) are now very likely only due to btrfs and mess up anything
> doing file IO that ends up timing out, break USB even as reads time out
> in the middle of USB requests, interrupts lost, and so forth.
>
> All of this mostly went away with Linus' suggestion:
> echo 2 > /proc/sys/vm/dirty_ratio
> echo 1 > /proc/sys/vm/dirty_background_ratio
>
> But that's hiding the symptom which I think is that btrfs is piling up too many I/O
> requests during btrfs send/receive and btrfs scrub (probably balance too) and not
> looking at resulting impact to system health.
I see pretty much identical behavior using any number of other storage
configurations on a USB 2.0 flash drive connected to a system with 16GB
of RAM with the default dirty ratios, because the kernel tries to cache
up to 3.2GB of data for writeback. While BTRFS is doing highly
sub-optimal things here, the ancient default writeback ratios are just
as much a culprit. I would suggest the default be changed to 200MB or
20% of RAM, whichever is smaller, which would give overall almost
identical behavior to x86-32, which in turn works reasonably well for
most cases. I sadly don't have the time, patience, or expertise to
write up such a patch myself though.
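That proposed default can already be approximated per-machine today with the byte-based knobs, since vm.dirty_bytes overrides vm.dirty_ratio once set. A sketch (root needed to actually apply it):

```shell
# Compute min(200MB, 20% of RAM) and a background threshold of half
# that, then print what would go into the byte-based sysctls.
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
cap=$(( mem_kb * 1024 / 5 ))            # 20% of RAM, in bytes
limit=$(( 200 * 1024 * 1024 ))          # 200MB
if [ "$cap" -lt "$limit" ]; then        # take the smaller of the two
    limit=$cap
fi
echo "vm.dirty_bytes=$limit vm.dirty_background_bytes=$(( limit / 2 ))"
# To apply (as root):
#   echo $limit           > /proc/sys/vm/dirty_bytes
#   echo $(( limit / 2 )) > /proc/sys/vm/dirty_background_bytes
```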
>
> Is there a way to stop flodding the entire system with I/O and causing
> so much strain on it?
> (I realize that if there is a caching layer underneath that just takes
> requests and says thank you without giving other clues that underneath
> bad things are happening, it may be hard, but I'm asking anyway :)
>
>
> [quoted dmesg output and hung-task stack traces snipped; they appear
> in full in the message above]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
2016-11-30 18:00 ` Austin S. Hemmelgarn
@ 2016-11-30 18:16 ` Marc MERLIN
2016-12-01 15:49 ` Michal Hocko
0 siblings, 1 reply; 12+ messages in thread
From: Marc MERLIN @ 2016-11-30 18:16 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: Btrfs BTRFS, Michal Hocko, Vlastimil Babka, linux-mm, Joonsoo Kim,
torvalds
+folks from linux-mm thread for your suggestion
On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
> > swraid5 < bcache < dmcrypt < btrfs
> >
> > Copying with btrfs send/receive causes massive hangs on the system.
> > Please see this explanation from Linus on why the workaround was
> > suggested:
> > https://lkml.org/lkml/2016/11/29/667
> And Linus' assessment is absolutely correct (at least, the general
> assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
> than willing to bet he's correct that that's the culprit).
> > All of this mostly went away with Linus' suggestion:
> > echo 2 > /proc/sys/vm/dirty_ratio
> > echo 1 > /proc/sys/vm/dirty_background_ratio
> >
> > But that's hiding the symptom which I think is that btrfs is piling up too many I/O
> > requests during btrfs send/receive and btrfs scrub (probably balance too) and not
> > looking at resulting impact to system health.
> I see pretty much identical behavior using any number of other storage
> configurations on a USB 2.0 flash drive connected to a system with 16GB of
> RAM with the default dirty ratios because it's trying to cache up to 3.2GB
> of data for writeback. While BTRFS is doing highly sub-optimal things here,
> the ancient default writeback ratios are just as much a culprit. I would
> suggest that be changed to 200MB or 20% of RAM, whichever is smaller, which
> would give overall almost identical behavior to x86-32, which in turn works
> reasonably well for most cases. I sadly don't have the time, patience, or
> expertise to write up such a patch myself though.
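For concreteness, the proposed default (whichever is smaller of 200 MB and 20% of RAM) can be computed from /proc/meminfo. This is only a sketch of the suggestion above; the resulting byte value would be a candidate for vm.dirty_bytes rather than the percentage knobs:

```shell
# Compute min(200 MB, 20% of RAM) in bytes, per the suggestion above.
# MemTotal in /proc/meminfo is reported in kB.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
cap=$((200 * 1024 * 1024))       # the proposed 200 MB ceiling
pct=$((mem_kb * 1024 / 5))       # 20% of RAM
dirty=$(( pct < cap ? pct : cap ))
echo "$dirty"
```

On a 24GB box like Marc's this comes out at the 200 MB cap; on a 512 MB box it would be about 102 MB.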
Dear linux-mm folks, is that something you could consider (changing the
dirty_ratio defaults) given that it affects at least bcache and btrfs
(with or without bcache)?
By the way, on the 200MB max suggestion, when I had 2 and 1% (or 480MB
and 240MB on my 24GB system), this was enough to make btrfs behave
sanely, but only if I had bcache turned off.
With bcache enabled, those values were just enough so that bcache didn't
crash my system, but not enough to prevent undesirable behaviour
(things hanging, 100+ bcache kworkers piled up, and more). However, the
copy did succeed, despite the relative impact on the system, so it's
better than nothing :)
But the impact from bcache probably goes beyond what btrfs is
responsible for, so I have a separate thread on the bcache list:
http://marc.info/?l=linux-bcache&m=148052441423532&w=2
http://marc.info/?l=linux-bcache&m=148052620524162&w=2
On the plus side, btrfs did ok with 0 visible impact to my system with
those 480 and 240MB dirty ratio values.
Thanks for your reply, Austin.
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: 4.8.8, bcache deadlock and hard lockup
2016-11-30 16:46 ` 4.8.8, bcache deadlock and hard lockup Marc MERLIN
2016-11-30 17:16 ` Marc MERLIN
2016-11-30 17:18 ` btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off Marc MERLIN
@ 2016-11-30 23:57 ` Eric Wheeler
2016-12-01 0:09 ` Marc MERLIN
2016-12-01 0:48 ` Chris Murphy
2 siblings, 2 replies; 12+ messages in thread
From: Eric Wheeler @ 2016-11-30 23:57 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Coly Li, linux-bcache, Btrfs BTRFS
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> +btrfs mailing list, see below why
>
> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
> > On Mon, 27 Nov 2016, Coly Li wrote:
> > >
> > > Yes, too many work queues... I guess the locking might be caused by some
> > > very obscure reference of closure code. I cannot have any clue if I
> > > cannot find a stable procedure to reproduce this issue.
> > >
> > > Hmm, if there is a tool to clone all the meta data of the back end cache
> > > and whole cached device, there might be a method to replay the oops much
> > > easier.
> > >
> > > Eric, do you have any hint ?
> >
> > Note that the backing device doesn't have any metadata, just a superblock.
> > You can easily dd that off onto some other volume without transferring the
> > data. By default, data starts at 8k, or whatever you used in `make-bcache
> > -w`.
>
> Ok, Linus helped me find a workaround for this problem:
> https://lkml.org/lkml/2016/11/29/667
> namely:
> echo 2 > /proc/sys/vm/dirty_ratio
> echo 1 > /proc/sys/vm/dirty_background_ratio
> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
> requests in the buffers)
>
> Note that this is only a workaround, not a fix.
>
> When I did this and retried my big copy again, I still got 100+ kernel
> work queues, but apparently the underlying swraid5 was able to unblock
> and satisfy the write requests before too many accumulated and crashed
> the kernel.
>
> I'm not a kernel coder, but seems to me that bcache needs a way to
> throttle incoming requests if there are too many so that it does not end
> up in a state where things blow up due to too many piled up requests.
>
> You should be able to reproduce this by taking 5 spinning rust drives,
> putting raid5 on top, then dmcrypt, then bcache, and hopefully any
> filesystem (although I used btrfs), and sending lots of requests.
> Actually to be honest, the problems have mostly been happening when I do
> btrfs scrub and btrfs send/receive which both generate I/O from within
> the kernel instead of user space.
> So here, btrfs may be a contributor to the problem too, but while btrfs
> still thrashes my system if I remove the caching device on bcache (and
> with the default dirty ratio values), it doesn't crash the kernel.
>
> I'll start another separate thread with the btrfs folks on how much
> pressure is put on the system, but on your side it would be good to help
> ensure that bcache doesn't crash the system altogether if too many
> requests are allowed to pile up.
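A dry-run sketch of the reproduction stack described above, layered in the order Marc gives. Device names are hypothetical, and the commands are printed rather than executed, since they are destructive and need root:

```shell
# Print, don't run: building swraid5 -> dmcrypt -> bcache -> btrfs.
recipe=$(cat <<'EOF'
mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 crypt0
make-bcache -C /dev/nvme0n1 -B /dev/mapper/crypt0
mkfs.btrfs /dev/bcache0
EOF
)
echo "$recipe"
```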
Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
writes at the request queue on its way to the spinning disk or SSD:
http://algo.ing.unimo.it/people/paolo/disk_sched/
use the latest BFQ git here, merge it into v4.8.y:
https://github.com/linusw/linux-bfq/commits/bfq-v8
This doesn't completely fix the dirty_ratio problem, but it is far better
than CFQ or deadline in my opinion (and experience).
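To see which scheduler each device is currently using, and switch at runtime, the sysfs queue files can be inspected; read-only as written, with the bfq write commented out (it needs root and assumes the module is available):

```shell
# The bracketed name in each scheduler file is the active elevator.
for q in /sys/block/*/queue/scheduler; do
    [ -e "$q" ] || continue            # no devices visible, e.g. in a container
    printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
    # echo bfq > "$q"                  # switch this device to BFQ (root only)
done
```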
-Eric
--
Eric Wheeler
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
> .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: 4.8.8, bcache deadlock and hard lockup
2016-11-30 23:57 ` 4.8.8, bcache deadlock and hard lockup Eric Wheeler
@ 2016-12-01 0:09 ` Marc MERLIN
2016-12-01 21:58 ` Eric Wheeler
2016-12-01 0:48 ` Chris Murphy
1 sibling, 1 reply; 12+ messages in thread
From: Marc MERLIN @ 2016-12-01 0:09 UTC (permalink / raw)
To: Eric Wheeler; +Cc: Coly Li, linux-bcache, Btrfs BTRFS
On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > I'll start another separate thread with the btrfs folks on how much
> > pressure is put on the system, but on your side it would be good to help
> > ensure that bcache doesn't crash the system altogether if too many
> > requests are allowed to pile up.
>
> Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> writes at the request queue on its way to the spinning disk or SSD:
> http://algo.ing.unimo.it/people/paolo/disk_sched/
>
> use the latest BFQ git here, merge it into v4.8.y:
> https://github.com/linusw/linux-bfq/commits/bfq-v8
>
> This doesn't completely fix the dirty_ratio problem, but it is far better
> than CFQ or deadline in my opinion (and experience).
That's good to know, thanks.
But from my uninformed perspective, is there anything bcache can do to
throttle incoming requests if they are piling up? Or are they coming from
producers upstream, so that bcache has no choice but to try to process them
as quickly as possible, with no way to block the sender if too many are
coming?
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
* Re: 4.8.8, bcache deadlock and hard lockup
2016-11-30 23:57 ` 4.8.8, bcache deadlock and hard lockup Eric Wheeler
2016-12-01 0:09 ` Marc MERLIN
@ 2016-12-01 0:48 ` Chris Murphy
2016-12-01 12:30 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2016-12-01 0:48 UTC (permalink / raw)
To: Eric Wheeler; +Cc: Marc MERLIN, Coly Li, linux-bcache, Btrfs BTRFS
On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> On Wed, 30 Nov 2016, Marc MERLIN wrote:
>> +btrfs mailing list, see below why
>>
>> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
>> > On Mon, 27 Nov 2016, Coly Li wrote:
>> > >
>> > > Yes, too many work queues... I guess the locking might be caused by some
>> > > very obscure reference of closure code. I cannot have any clue if I
>> > > cannot find a stable procedure to reproduce this issue.
>> > >
>> > > Hmm, if there is a tool to clone all the meta data of the back end cache
>> > > and whole cached device, there might be a method to replay the oops much
>> > > easier.
>> > >
>> > > Eric, do you have any hint ?
>> >
>> > Note that the backing device doesn't have any metadata, just a superblock.
>> > You can easily dd that off onto some other volume without transferring the
>> > data. By default, data starts at 8k, or whatever you used in `make-bcache
>> > -w`.
>>
>> Ok, Linus helped me find a workaround for this problem:
>> https://lkml.org/lkml/2016/11/29/667
>> namely:
>> echo 2 > /proc/sys/vm/dirty_ratio
>> echo 1 > /proc/sys/vm/dirty_background_ratio
>> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
>> requests in the buffers)
>>
>> Note that this is only a workaround, not a fix.
>>
>> When I did this and retried my big copy again, I still got 100+ kernel
>> work queues, but apparently the underlying swraid5 was able to unblock
>> and satisfy the write requests before too many accumulated and crashed
>> the kernel.
>>
>> I'm not a kernel coder, but seems to me that bcache needs a way to
>> throttle incoming requests if there are too many so that it does not end
>> up in a state where things blow up due to too many piled up requests.
>>
>> You should be able to reproduce this by taking 5 spinning rust drives,
>> putting raid5 on top, then dmcrypt, then bcache, and hopefully any
>> filesystem (although I used btrfs), and sending lots of requests.
>> Actually to be honest, the problems have mostly been happening when I do
>> btrfs scrub and btrfs send/receive which both generate I/O from within
>> the kernel instead of user space.
>> So here, btrfs may be a contributor to the problem too, but while btrfs
>> still thrashes my system if I remove the caching device on bcache (and
>> with the default dirty ratio values), it doesn't crash the kernel.
>>
>> I'll start another separate thread with the btrfs folks on how much
>> pressure is put on the system, but on your side it would be good to help
>> ensure that bcache doesn't crash the system altogether if too many
>> requests are allowed to pile up.
>
>
> Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> writes at the request queue on its way to the spinning disk or SSD:
> http://algo.ing.unimo.it/people/paolo/disk_sched/
>
> use the latest BFQ git here, merge it into v4.8.y:
> https://github.com/linusw/linux-bfq/commits/bfq-v8
>
> This doesn't completely fix the dirty_ratio problem, but it is far better
> than CFQ or deadline in my opinion (and experience).
There are several threads over the past year with users having
problems no one else had previously reported, and they were using BFQ.
But there's no evidence whether BFQ was the cause, or exposing some
existing bug that another scheduler doesn't. Anyway, I'd say using an
out of tree scheduler means higher burden of testing and skepticism.
--
Chris Murphy
* Re: 4.8.8, bcache deadlock and hard lockup
2016-12-01 0:48 ` Chris Murphy
@ 2016-12-01 12:30 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2016-12-01 12:30 UTC (permalink / raw)
To: Chris Murphy, Eric Wheeler
Cc: Marc MERLIN, Coly Li, linux-bcache, Btrfs BTRFS
On 2016-11-30 19:48, Chris Murphy wrote:
> On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> On Wed, 30 Nov 2016, Marc MERLIN wrote:
>>> +btrfs mailing list, see below why
>>>
>>> On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:
>>>> On Mon, 27 Nov 2016, Coly Li wrote:
>>>>>
>>>>> Yes, too many work queues... I guess the locking might be caused by some
>>>>> very obscure reference of closure code. I cannot have any clue if I
>>>>> cannot find a stable procedure to reproduce this issue.
>>>>>
>>>>> Hmm, if there is a tool to clone all the meta data of the back end cache
>>>>> and whole cached device, there might be a method to replay the oops much
>>>>> easier.
>>>>>
>>>>> Eric, do you have any hint ?
>>>>
>>>> Note that the backing device doesn't have any metadata, just a superblock.
>>>> You can easily dd that off onto some other volume without transferring the
>>>> data. By default, data starts at 8k, or whatever you used in `make-bcache
>>>> -w`.
>>>
>>> Ok, Linus helped me find a workaround for this problem:
>>> https://lkml.org/lkml/2016/11/29/667
>>> namely:
>>> echo 2 > /proc/sys/vm/dirty_ratio
>>> echo 1 > /proc/sys/vm/dirty_background_ratio
>>> (it's a 24GB system, so the defaults of 20 and 10 were creating too many
>>> requests in the buffers)
>>>
>>> Note that this is only a workaround, not a fix.
>>>
>>> When I did this and retried my big copy again, I still got 100+ kernel
>>> work queues, but apparently the underlying swraid5 was able to unblock
>>> and satisfy the write requests before too many accumulated and crashed
>>> the kernel.
>>>
>>> I'm not a kernel coder, but seems to me that bcache needs a way to
>>> throttle incoming requests if there are too many so that it does not end
>>> up in a state where things blow up due to too many piled up requests.
>>>
>>> You should be able to reproduce this by taking 5 spinning rust drives,
>>> putting raid5 on top, then dmcrypt, then bcache, and hopefully any
>>> filesystem (although I used btrfs), and sending lots of requests.
>>> Actually to be honest, the problems have mostly been happening when I do
>>> btrfs scrub and btrfs send/receive which both generate I/O from within
>>> the kernel instead of user space.
>>> So here, btrfs may be a contributor to the problem too, but while btrfs
>>> still thrashes my system if I remove the caching device on bcache (and
>>> with the default dirty ratio values), it doesn't crash the kernel.
>>>
>>> I'll start another separate thread with the btrfs folks on how much
>>> pressure is put on the system, but on your side it would be good to help
>>> ensure that bcache doesn't crash the system altogether if too many
>>> requests are allowed to pile up.
>>
>>
>> Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
>> writes at the request queue on its way to the spinning disk or SSD:
>> http://algo.ing.unimo.it/people/paolo/disk_sched/
>>
>> use the latest BFQ git here, merge it into v4.8.y:
>> https://github.com/linusw/linux-bfq/commits/bfq-v8
>>
>> This doesn't completely fix the dirty_ratio problem, but it is far better
>> than CFQ or deadline in my opinion (and experience).
>
> There are several threads over the past year with users having
> problems no one else had previously reported, and they were using BFQ.
> But there's no evidence whether BFQ was the cause, or exposing some
> existing bug that another scheduler doesn't. Anyway, I'd say using an
> out of tree scheduler means higher burden of testing and skepticism.
Normally I'd agree on this, but BFQ is a bit of a different situation
from usual because:
1. 90% of the reason that BFQ isn't in mainline is that the block
maintainers have declared the legacy (non blk-mq) code deprecated and
refuse to take anything new there despite having absolutely zero
scheduling in blk-mq.
2. It's been around for years with hundreds of thousands of users over
the years who have had no issues with it.
* Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
2016-11-30 18:16 ` Marc MERLIN
@ 2016-12-01 15:49 ` Michal Hocko
2016-12-01 21:21 ` Janos Toth F.
0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2016-12-01 15:49 UTC (permalink / raw)
To: Marc MERLIN
Cc: Austin S. Hemmelgarn, Btrfs BTRFS, Vlastimil Babka, linux-mm,
Joonsoo Kim, torvalds
On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
> +folks from linux-mm thread for your suggestion
>
> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
> > > swraid5 < bcache < dmcrypt < btrfs
> > >
> > > Copying with btrfs send/receive causes massive hangs on the system.
> > > Please see this explanation from Linus on why the workaround was
> > > suggested:
> > > https://lkml.org/lkml/2016/11/29/667
> > And Linus' assessment is absolutely correct (at least, the general
> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
> > than willing to bet he's correct that that's the culprit).
>
> > > All of this mostly went away with Linus' suggestion:
> > > echo 2 > /proc/sys/vm/dirty_ratio
> > > echo 1 > /proc/sys/vm/dirty_background_ratio
> > >
> > > But that's hiding the symptom which I think is that btrfs is piling up too many I/O
> > > requests during btrfs send/receive and btrfs scrub (probably balance too) and not
> > > looking at resulting impact to system health.
>
> > I see pretty much identical behavior using any number of other storage
> > configurations on a USB 2.0 flash drive connected to a system with 16GB of
> > RAM with the default dirty ratios because it's trying to cache up to 3.2GB
> > of data for writeback. While BTRFS is doing highly sub-optimal things here,
> > the ancient default writeback ratios are just as much a culprit. I would
> > suggest that be changed to 200MB or 20% of RAM, whichever is smaller, which
> > would give overall almost identical behavior to x86-32, which in turn works
> > reasonably well for most cases. I sadly don't have the time, patience, or
> > expertise to write up such a patch myself though.
>
> Dear linux-mm folks, is that something you could consider (changing the
> dirty_ratio defaults) given that it affects at least bcache and btrfs
> (with or without bcache)?
As much as the dirty_*ratio defaults are a major PITA, this is not something
that would be _easy_ to change without high risks of regressions. This
topic has been discussed many times with many good ideas, nothing really
materialized from them though :/
To be honest I really do hate dirty_*ratio and have seen many issues on
very large machines and have always suggested using dirty_bytes instead, but
a particular value has always been a challenge to get right. It has
always been very workload specific.
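For reference, the byte-based knobs sit alongside the ratio ones; writing a non-zero value makes the kernel ignore the corresponding *_ratio setting. A sketch only, with purely illustrative numbers rather than a recommendation from this thread:

```shell
# Needs root; a non-zero *_bytes value overrides the matching *_ratio.
echo $((512 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes              # 512 MB, illustrative
echo $((256 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes   # 256 MB, illustrative
```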
That being said this is something more for IO people than MM IMHO.
--
Michal Hocko
SUSE Labs
* Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off
2016-12-01 15:49 ` Michal Hocko
@ 2016-12-01 21:21 ` Janos Toth F.
0 siblings, 0 replies; 12+ messages in thread
From: Janos Toth F. @ 2016-12-01 21:21 UTC (permalink / raw)
To: Btrfs BTRFS
Is there any fundamental reason not to support huge writeback caches?
(I mean, besides working around bugs and/or questionably poor design
choices which no one wishes to fix.)
The obvious drawback is the increased risk of data loss upon hardware
failure or kernel panic but why couldn't the user be allowed to draw
the line between probability of data loss and potential performance
gains?
The last time I changed hardware, I put double the amount of RAM into
my little home server for the sole purpose of using a relatively huge
cache, especially a huge writeback cache. However, I soon realized that
writeback ratios like 20/45 make the system unstable (OOM reaping) even
if ~90% of the memory is theoretically free (i.e. used as some form of
cache, read or write, depending on this ratio parameter), and I ended up
below the defaults to get rid of The Reaper.
My plan was to try and decrease the fragmentation of files which are
created by dumping several parallel real-time video streams into
separate files (and also minimize the HDD head seeks due to that).
(The computer in question is on a UPS.)
On Thu, Dec 1, 2016 at 4:49 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
>> +folks from linux-mm thread for your suggestion
>>
>> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
>> > > swraid5 < bcache < dmcrypt < btrfs
>> > >
>> > > Copying with btrfs send/receive causes massive hangs on the system.
>> > > Please see this explanation from Linus on why the workaround was
>> > > suggested:
>> > > https://lkml.org/lkml/2016/11/29/667
>> > And Linus' assessment is absolutely correct (at least, the general
>> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
>> > than willing to bet he's correct that that's the culprit).
>>
>> > > All of this mostly went away with Linus' suggestion:
>> > > echo 2 > /proc/sys/vm/dirty_ratio
>> > > echo 1 > /proc/sys/vm/dirty_background_ratio
>> > >
>> > > But that's hiding the symptom which I think is that btrfs is piling up too many I/O
>> > > requests during btrfs send/receive and btrfs scrub (probably balance too) and not
>> > > looking at resulting impact to system health.
>>
>> > I see pretty much identical behavior using any number of other storage
>> > configurations on a USB 2.0 flash drive connected to a system with 16GB of
>> > RAM with the default dirty ratios because it's trying to cache up to 3.2GB
>> > of data for writeback. While BTRFS is doing highly sub-optimal things here,
>> > the ancient default writeback ratios are just as much a culprit. I would
>> > suggest that be changed to 200MB or 20% of RAM, whichever is smaller, which
>> > would give overall almost identical behavior to x86-32, which in turn works
>> > reasonably well for most cases. I sadly don't have the time, patience, or
>> > expertise to write up such a patch myself though.
>>
>> Dear linux-mm folks, is that something you could consider (changing the
>> dirty_ratio defaults) given that it affects at least bcache and btrfs
>> (with or without bcache)?
>
> As much as the dirty_*ratio defaults are a major PITA, this is not something
> that would be _easy_ to change without high risks of regressions. This
> topic has been discussed many times with many good ideas, nothing really
> materialized from them though :/
>
> To be honest I really do hate dirty_*ratio and have seen many issues on
> very large machines and have always suggested using dirty_bytes instead, but
> a particular value has always been a challenge to get right. It has
> always been very workload specific.
>
> That being said this is something more for IO people than MM IMHO.
>
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: 4.8.8, bcache deadlock and hard lockup
2016-12-01 0:09 ` Marc MERLIN
@ 2016-12-01 21:58 ` Eric Wheeler
0 siblings, 0 replies; 12+ messages in thread
From: Eric Wheeler @ 2016-12-01 21:58 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Coly Li, linux-bcache, Btrfs BTRFS
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > > I'll start another separate thread with the btrfs folks on how much
> > > pressure is put on the system, but on your side it would be good to help
> > > ensure that bcache doesn't crash the system altogether if too many
> > > requests are allowed to pile up.
> >
> > Try BFQ. It is AWESOME and helps reduce the congestion problem with bulk
> > writes at the request queue on its way to the spinning disk or SSD:
> > http://algo.ing.unimo.it/people/paolo/disk_sched/
> >
> > use the latest BFQ git here, merge it into v4.8.y:
> > https://github.com/linusw/linux-bfq/commits/bfq-v8
> >
> > This doesn't completely fix the dirty_ratio problem, but it is far better
> > than CFQ or deadline in my opinion (and experience).
>
> That's good to know, thanks.
> But from my uninformed perspective, is there anything bcache can do to
> throttle incoming requests if they are piling up? Or are they coming from
> producers upstream, so that bcache has no choice but to try to process them
> as quickly as possible, with no way to block the sender if too many are
> coming?
Not really. The congestion isn't in bcache, it's at the disk queue beyond
bcache, but userspace processes are blocked by the (huge) pagecache dirty
writeback which happens before bcache gets it and must complete before
userspace may proceed:
fs -> pagecache -> bcache -> {ssd,disk}
The real issue is that the dirty page cache gets really big, flushes,
waits for downstream devices (bcache->ssd,disk) to finish, and then
returns to userspace. The only way to limit the dirty cache is via the options
that Linus mentioned.
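A read-only way to watch that pile-up from userspace is to sample the dirty/writeback counters while a big copy runs (no root needed):

```shell
# Dirty = pagecache bytes waiting for writeback; Writeback = in flight.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

Run in a loop while a send/receive is in flight, this shows the difference between the 20/10 defaults and the 2/1 workaround: with the defaults on a 24GB box, Dirty can climb to several gigabytes before the flush.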
BFQ can help for processes not tied to the flush because it may re-order
other process requests ahead of the big flush---so even though a big flush
is happening and that process is stalled, others might proceed without
delay.
See this thread, too:
https://groups.google.com/forum/#!msg/bfq-iosched/M2M_UhbC05A/hf6Ni9JbAQAJ
--
Eric Wheeler
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
> .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/
end of thread, other threads:[~2016-12-01 21:59 UTC | newest]
Thread overview: 12+ messages
[not found] <20161118164643.g7ttuzgsj74d6fbz@merlins.org>
[not found] ` <20161118184915.j6dlazbgminxnxzx@merlins.org>
[not found] ` <b6c3daab-d990-e873-4d0f-0f0afe2259b1@coly.li>
[not found] ` <alpine.LRH.2.11.1611291255350.1914@mail.ewheeler.net>
2016-11-30 16:46 ` 4.8.8, bcache deadlock and hard lockup Marc MERLIN
2016-11-30 17:16 ` Marc MERLIN
2016-11-30 17:18 ` btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off Marc MERLIN
2016-11-30 18:00 ` Austin S. Hemmelgarn
2016-11-30 18:16 ` Marc MERLIN
2016-12-01 15:49 ` Michal Hocko
2016-12-01 21:21 ` Janos Toth F.
2016-11-30 23:57 ` 4.8.8, bcache deadlock and hard lockup Eric Wheeler
2016-12-01 0:09 ` Marc MERLIN
2016-12-01 21:58 ` Eric Wheeler
2016-12-01 0:48 ` Chris Murphy
2016-12-01 12:30 ` Austin S. Hemmelgarn