From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Priebe - Profihost AG Subject: Re: bcache_writebac gets stuck on DRBD Date: Thu, 14 Jul 2016 19:47:13 +0200 Message-ID: <5787D021.3030305@profihost.ag> References: <5786A606.8030106@profihost.ag> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Return-path: Received: from cloud1-vm154.de-nserver.de ([178.250.10.56]:42557 "EHLO cloud1-vm154.de-nserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751051AbcGNRrR (ORCPT ); Thu, 14 Jul 2016 13:47:17 -0400 In-Reply-To: Sender: linux-bcache-owner@vger.kernel.org List-Id: linux-bcache@vger.kernel.org To: Eric Wheeler Cc: Jackie Tea , linux-bcache@vger.kernel.org Am 13.07.2016 um 23:18 schrieb Eric Wheeler: > On Wed, 13 Jul 2016, Stefan Priebe - Profihost AG wrote: > >> >> Am 13.07.2016 um 08:18 schrieb Eric Wheeler: >>> On Mon, 11 Jul 2016, Eric Wheeler wrote: >>> >>>> On Tue, 5 Jul 2016, Jackie Tea wrote: >>>> >>>>> I've been using bcache on a Xen host (sitting on an LVM partition) for >>>>> many months without problems >>>>> >>>>> Now that I added DRBD on top of bcache, bcache locks up within a few >>>>> hours, giving me this message: >>>> >>>> This might be related to the thread to this thread: >>>> >>>> https://lkml.org/lkml/2016/6/22/172 >>>> and this patch: >>>> https://patchwork.kernel.org/patch/9223697/ >>>> >>>> Notably, this bug was introduced in 4.3 and newer. You might try 4.1 and >>>> see if your problem resolves itself, or 4.8 is supposed to fix this too if >>>> the patch gets in. >>> >>> Make sure you have these patches, too. Looking at your stack, you >>> definitely want "A", maybe the others as well: >>> >>> A. [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs >>> by Ming Lei: https://patchwork.kernel.org/patch/9169483/ >>> >>> B. block: don't make BLK_DEF_MAX_SECTORS too big >>> by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html >>> >>> C. [1/3] block: flush queued bios when process blocks to avoid deadlock >>> by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/ >>> >>> D. dm-crypt: Fix error with too large bios >>> by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/ >>> >>> E. block: fix blk_queue_split() resource exhaustion >>> by Lars Ellenberg: https://patchwork.kernel.org/patch/9223697/ >>> (as above, previous post. >>> >> >> Just want to let you now that here is stated that this does not always >> fix C also a backport to 4.4 and others is currently missing: >> https://bugzilla.kernel.org/show_bug.cgi?id=119841 > > Interesting, so Mikulas' "C" with threading/timer is fixing things that > Lars' "E" did not? Well, here's hoping for a good fix. There are a lot > of great developers working on this. More detail here: > https://www.redhat.com/archives/dm-devel/2016-June/msg00425.html Yes that it makes it really complicated to know which patches to apply to 4.4. Greets, Stefan > > > -- > Eric Wheeler > > >> >> Stefan >> >>> You probably don't need "D" for your setup by looking at lsblk. >>> >>> [ "C" excluded on purpose because "E" supersedes it, but here it is if >>> you'd like to try. Note that "E" and "C" are mutually exclusive: >>> >>> C. [1/3] block: flush queued bios when process blocks to avoid deadlock >>> by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/ >>> (was https://patchwork.kernel.org/patch/7398411/) >>> ] >>> >>> Please try and let us know which (if any of these) fix your issue! >>> >>> -Eric >>> >>> >>> >>> -- >>> Eric Wheeler >>> >>> >>>> >>>> >>>> >>>>> >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136864] INFO: task >>>>> bcache_writebac:1342 blocked for more than 120 seconds. >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136896] Not tainted >>>>> 4.4.0-28-generic #47-Ubuntu >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136916] "echo 0 > >>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136946] bcache_writebac D >>>>> ffff880000983d18 0 1342 2 0x00000000 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136952] ffff880000983d18 >>>>> 0000000000007ff0 ffffffff81e11500 ffff880035705780 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136955] ffff880000984000 >>>>> ffff880005080af0 ffff880005080ad8 ffffffff00000000 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136958] ffffffff00000003 >>>>> ffff880000983d30 ffffffff818235b5 ffff880035705780 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136962] Call Trace: >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136971] [] >>>>> schedule+0x35/0x80 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136974] [] >>>>> rwsem_down_write_failed+0x202/0x350 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136990] [] ? >>>>> read_dirty+0x370/0x370 [bcache] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136995] [] >>>>> call_rwsem_down_write_failed+0x13/0x20 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.136999] [] ? >>>>> down_write+0x2d/0x40 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137009] [] >>>>> bch_writeback_thread+0x5b/0x4c0 [bcache] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137020] [] ? >>>>> read_dirty+0x370/0x370 [bcache] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137024] [] >>>>> kthread+0xd8/0xf0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137027] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137029] [] >>>>> ret_from_fork+0x3f/0x70 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137032] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137043] INFO: task >>>>> blkback.5.hda:3357 blocked for more than 120 seconds. >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137069] Not tainted >>>>> 4.4.0-28-generic #47-Ubuntu >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137090] "echo 0 > >>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137119] blkback.5.hda D >>>>> ffff88002341ba48 0 3357 2 0x00000000 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137122] ffff88002341ba48 >>>>> ffff88002341ba08 ffffffff81e11500 ffff880034189900 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137124] ffff88002341c000 >>>>> ffff880005080ad8 ffff880005080af0 ffff880023e08e60 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137126] ffff880005080000 >>>>> ffff88002341ba60 ffffffff818235b5 ffff880034189900 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137128] Call Trace: >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137131] [] >>>>> schedule+0x35/0x80 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137133] [] >>>>> rwsem_down_read_failed+0xe0/0x140 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137136] [] ? >>>>> mempool_alloc_slab+0x15/0x20 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137138] [] >>>>> call_rwsem_down_read_failed+0x14/0x30 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137140] [] ? >>>>> down_read+0x20/0x30 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137148] [] >>>>> cached_dev_make_request+0x682/0xcc0 [bcache] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137151] [] >>>>> generic_make_request+0xf2/0x1d0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137153] [] >>>>> submit_bio+0x76/0x170 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137157] [] >>>>> dispatch_rw_block_io+0x7b7/0xab0 [xen_blkback] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137159] [] ? >>>>> xen_load_sp0+0x84/0x180 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137162] [] ? >>>>> __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137165] [] >>>>> __do_block_io_op+0x395/0x710 [xen_blkback] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137167] [] ? >>>>> del_timer_sync+0x48/0x50 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137170] [] >>>>> xen_blkif_schedule+0xdd/0x7b0 [xen_blkback] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137172] [] ? >>>>> finish_task_switch+0x7d/0x220 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137175] [] ? >>>>> wake_atomic_t_function+0x60/0x60 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137177] [] ? >>>>> xen_blkif_be_int+0x30/0x30 [xen_blkback] >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137179] [] >>>>> kthread+0xd8/0xf0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137181] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137183] [] >>>>> ret_from_fork+0x3f/0x70 >>>>> Jul 5 04:06:43 dom0 kernel: [23043.137185] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> >>>>> At the same time (but always after bcache), Xen's blkback will also >>>>> give a similar message: >>>>> >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135027] INFO: task >>>>> bcache_writebac:1342 blocked for more than 120 seconds. >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135056] Not tainted >>>>> 4.4.0-28-generic #47-Ubuntu >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135076] "echo 0 > >>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135105] bcache_writebac D >>>>> ffff880000983d18 0 1342 2 0x00000000 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135109] ffff880000983d18 >>>>> 0000000000007ff0 ffffffff81e11500 ffff880035705780 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135112] ffff880000984000 >>>>> ffff880005080af0 ffff880005080ad8 ffffffff00000000 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135114] ffffffff00000003 >>>>> ffff880000983d30 ffffffff818235b5 ffff880035705780 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135116] Call Trace: >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135124] [] >>>>> schedule+0x35/0x80 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135127] [] >>>>> rwsem_down_write_failed+0x202/0x350 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135139] [] ? >>>>> read_dirty+0x370/0x370 [bcache] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135143] [] >>>>> call_rwsem_down_write_failed+0x13/0x20 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135145] [] ? >>>>> down_write+0x2d/0x40 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135153] [] >>>>> bch_writeback_thread+0x5b/0x4c0 [bcache] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135160] [] ? >>>>> read_dirty+0x370/0x370 [bcache] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135162] [] >>>>> kthread+0xd8/0xf0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135164] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135166] [] >>>>> ret_from_fork+0x3f/0x70 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135168] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135178] INFO: task >>>>> blkback.5.hda:3357 blocked for more than 120 seconds. >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135204] Not tainted >>>>> 4.4.0-28-generic #47-Ubuntu >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135224] "echo 0 > >>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135286] blkback.5.hda D >>>>> ffff88002341ba48 0 3357 2 0x00000000 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135289] ffff88002341ba48 >>>>> ffff88002341ba08 ffffffff81e11500 ffff880034189900 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135291] ffff88002341c000 >>>>> ffff880005080ad8 ffff880005080af0 ffff880023e08e60 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135293] ffff880005080000 >>>>> ffff88002341ba60 ffffffff818235b5 ffff880034189900 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135295] Call Trace: >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135299] [] >>>>> schedule+0x35/0x80 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135301] [] >>>>> rwsem_down_read_failed+0xe0/0x140 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135304] [] ? >>>>> mempool_alloc_slab+0x15/0x20 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135307] [] >>>>> call_rwsem_down_read_failed+0x14/0x30 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135309] [] ? >>>>> down_read+0x20/0x30 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135316] [] >>>>> cached_dev_make_request+0x682/0xcc0 [bcache] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135319] [] >>>>> generic_make_request+0xf2/0x1d0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135321] [] >>>>> submit_bio+0x76/0x170 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135326] [] >>>>> dispatch_rw_block_io+0x7b7/0xab0 [xen_blkback] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135328] [] ? >>>>> xen_load_sp0+0x84/0x180 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135331] [] ? >>>>> __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135334] [] >>>>> __do_block_io_op+0x395/0x710 [xen_blkback] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135337] [] ? >>>>> del_timer_sync+0x48/0x50 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135340] [] >>>>> xen_blkif_schedule+0xdd/0x7b0 [xen_blkback] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135343] [] ? >>>>> finish_task_switch+0x7d/0x220 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135346] [] ? >>>>> wake_atomic_t_function+0x60/0x60 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135348] [] ? >>>>> xen_blkif_be_int+0x30/0x30 [xen_blkback] >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135350] [] >>>>> kthread+0xd8/0xf0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135352] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135354] [] >>>>> ret_from_fork+0x3f/0x70 >>>>> Jul 5 04:08:43 dom0 kernel: [23163.135356] [] ? >>>>> kthread_create_on_node+0x1e0/0x1e0 >>>>> >>>>> Access to the storage is no longer possible, then (it'll just appear to hang). >>>>> This is on the newest Ubuntu 16.04 LTS, with their 4.4.0-28 kernel. >>>>> >>>>> # lsb_release -d >>>>> Description: Ubuntu 16.04 LTS >>>>> # uname -r >>>>> 4.4.0-28-generic >>>>> >>>>> bcache is set to writeback. >>>>> >>>>> # cat /sys/block/bcache0/bcache/cache_mode >>>>> writethrough [writeback] writearound none >>>>> # cat /sys/block/bcache0/bcache/state >>>>> dirty >>>>> # cat /sys/block/bcache0/bcache/dirty_data >>>>> 56.0k >>>>> >>>>> The relevant disk stack looks like this: >>>>> >>>>> # lsblk >>>>> sdd 8:48 0 232.9G 0 disk >>>>> `-sdd1 8:49 0 232.9G 0 part >>>>> `-md1 9:1 0 232.8G 0 raid1 >>>>> `-ssd-bcache--back 252:4 0 30G 0 lvm >>>>> `-bcache0 251:0 0 2T 0 disk >>>>> `-storage-drback--test 252:6 0 200G 0 lvm >>>>> >>>>> storage-drback--test is the backing device for DRBD. >>>>> >>>>> This is reproducible, but I do not know what causes it, or how to >>>>> cause it on purpose. All I know is that once I put DRBD on top of >>>>> bcache, and produce minor load over time (I just have a single idle >>>>> Windows 2012R2 server on it as test - as Windows would always do >>>>> *something* ;-) it will lock up within 8 hours. >>>>> >>>>> Do you have any clues as to what could be causing this? >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>