linux-btrfs.vger.kernel.org archive mirror
* task btrfs-transacti:651 blocked for more than 120 seconds
@ 2017-09-28 10:16 Olivier Bonvalet
  2017-09-28 11:18 ` Nikolay Borisov
  0 siblings, 1 reply; 4+ messages in thread
From: Olivier Bonvalet @ 2017-09-28 10:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: xen-devel

Hi!

I have a virtual server (Xen) which very frequently hangs, with only
this error in the logs:

[ 1330.144124] INFO: task btrfs-transacti:651 blocked for more than 120 seconds.
[ 1330.144141]       Not tainted 4.9-dae-xen #2
[ 1330.144146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.144179] btrfs-transacti D    0   651      2 0x00000000
[ 1330.144184]  ffff8803a6c85b40 0000000000000000 ffff8803af857880 ffff8803a9762180
[ 1330.144190]  ffff8803a7bb8140 ffffc900173bfb10 ffffffff8150ff1f 0000000000000000
[ 1330.144195]  ffff8803a7bb8140 7fffffffffffffff ffffffff81510710 ffffc900173bfc18
[ 1330.144200] Call Trace:
[ 1330.144211]  [<ffffffff8150ff1f>] ? __schedule+0x17f/0x530
[ 1330.144215]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 1330.144218]  [<ffffffff815102fd>] ? schedule+0x2d/0x80
[ 1330.144221]  [<ffffffff815132be>] ? schedule_timeout+0x17e/0x2a0
[ 1330.144226]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
[ 1330.144231]  [<ffffffff810f2196>] ? ktime_get+0x36/0xa0
[ 1330.144234]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 1330.144237]  [<ffffffff8150fd38>] ? io_schedule_timeout+0x98/0x100
[ 1330.144240]  [<ffffffff81513de1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 1330.144246]  [<ffffffff81510722>] ? bit_wait_io+0x12/0x60
[ 1330.144250]  [<ffffffff815107be>] ? __wait_on_bit+0x4e/0x80
[ 1330.144256]  [<ffffffff8113772c>] ? wait_on_page_bit+0x6c/0x80
[ 1330.144261]  [<ffffffff810d4ab0>] ? autoremove_wake_function+0x30/0x30
[ 1330.144265]  [<ffffffff81137808>] ? __filemap_fdatawait_range+0xc8/0x110
[ 1330.144270]  [<ffffffff81137859>] ? filemap_fdatawait_range+0x9/0x20
[ 1330.144298]  [<ffffffffa014b033>] ? btrfs_wait_ordered_range+0x63/0x100 [btrfs]
[ 1330.144310]  [<ffffffffa0175a68>] ? btrfs_wait_cache_io+0x58/0x1e0 [btrfs]
[ 1330.144320]  [<ffffffffa011ded2>] ? btrfs_start_dirty_block_groups+0x1c2/0x450 [btrfs]
[ 1330.144328]  [<ffffffff810a2ba5>] ? do_group_exit+0x35/0xa0
[ 1330.144338]  [<ffffffffa012efa7>] ? btrfs_commit_transaction+0x147/0x9b0 [btrfs]
[ 1330.144348]  [<ffffffffa012f8a2>] ? start_transaction+0x92/0x3f0 [btrfs]
[ 1330.144357]  [<ffffffffa012a0e7>] ? transaction_kthread+0x1d7/0x1f0 [btrfs]
[ 1330.144366]  [<ffffffffa0129f10>] ? btrfs_cleanup_transaction+0x4f0/0x4f0 [btrfs]
[ 1330.144373]  [<ffffffff810ba352>] ? kthread+0xc2/0xe0
[ 1330.144377]  [<ffffffff810ba290>] ? kthread_create_on_node+0x40/0x40
[ 1330.144381]  [<ffffffff81514405>] ? ret_from_fork+0x25/0x30


It's a Debian Stretch system, running a 4.9.52 Linux kernel (on a Xen 4.8.2 hypervisor).
With an older 4.1.x Linux kernel, I didn't have this problem.


Is this a Btrfs bug? Should I try a more recent kernel? (Which one?)
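
For reference, D-state (uninterruptible sleep) tasks like the one reported above can be listed directly from userspace; here is a minimal sketch using standard ps/awk, with nothing in it specific to this system:

```shell
# List tasks currently in uninterruptible sleep (state "D") -- the kind
# of task the hung-task watchdog complains about.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'

# The 120-second threshold quoted in the log is this sysctl (guarded,
# since the file only exists when CONFIG_DETECT_HUNG_TASK is enabled):
f=/proc/sys/kernel/hung_task_timeout_secs
[ -r "$f" ] && cat "$f" || true
```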

Thanks in advance,

Olivier


* Re: task btrfs-transacti:651 blocked for more than 120 seconds
  2017-09-28 10:16 task btrfs-transacti:651 blocked for more than 120 seconds Olivier Bonvalet
@ 2017-09-28 11:18 ` Nikolay Borisov
  2017-09-28 14:28   ` Re : " Olivier Bonvalet
  0 siblings, 1 reply; 4+ messages in thread
From: Nikolay Borisov @ 2017-09-28 11:18 UTC (permalink / raw)
  To: Olivier Bonvalet, linux-btrfs; +Cc: xen-devel



On 28.09.2017 13:16, Olivier Bonvalet wrote:
> Hi!
> 
> I have a virtual server (Xen) which very frequently hangs, with only
> this error in the logs:
> 
> [ 1330.144124] INFO: task btrfs-transacti:651 blocked for more than 120 seconds.
> [ 1330.144141]       Not tainted 4.9-dae-xen #2
> [ 1330.144146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1330.144179] btrfs-transacti D    0   651      2 0x00000000
> [ 1330.144184]  ffff8803a6c85b40 0000000000000000 ffff8803af857880 ffff8803a9762180
> [ 1330.144190]  ffff8803a7bb8140 ffffc900173bfb10 ffffffff8150ff1f 0000000000000000
> [ 1330.144195]  ffff8803a7bb8140 7fffffffffffffff ffffffff81510710 ffffc900173bfc18
> [ 1330.144200] Call Trace:
> [ 1330.144211]  [<ffffffff8150ff1f>] ? __schedule+0x17f/0x530
> [ 1330.144215]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
> [ 1330.144218]  [<ffffffff815102fd>] ? schedule+0x2d/0x80
> [ 1330.144221]  [<ffffffff815132be>] ? schedule_timeout+0x17e/0x2a0
> [ 1330.144226]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
> [ 1330.144231]  [<ffffffff810f2196>] ? ktime_get+0x36/0xa0
> [ 1330.144234]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
> [ 1330.144237]  [<ffffffff8150fd38>] ? io_schedule_timeout+0x98/0x100
> [ 1330.144240]  [<ffffffff81513de1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
> [ 1330.144246]  [<ffffffff81510722>] ? bit_wait_io+0x12/0x60
> [ 1330.144250]  [<ffffffff815107be>] ? __wait_on_bit+0x4e/0x80
> [ 1330.144256]  [<ffffffff8113772c>] ? wait_on_page_bit+0x6c/0x80
> [ 1330.144261]  [<ffffffff810d4ab0>] ? autoremove_wake_function+0x30/0x30
> [ 1330.144265]  [<ffffffff81137808>] ? __filemap_fdatawait_range+0xc8/0x110
> [ 1330.144270]  [<ffffffff81137859>] ? filemap_fdatawait_range+0x9/0x20
> [ 1330.144298]  [<ffffffffa014b033>] ? btrfs_wait_ordered_range+0x63/0x100 [btrfs]
> [ 1330.144310]  [<ffffffffa0175a68>] ? btrfs_wait_cache_io+0x58/0x1e0 [btrfs]
> [ 1330.144320]  [<ffffffffa011ded2>] ? btrfs_start_dirty_block_groups+0x1c2/0x450 [btrfs]
> [ 1330.144328]  [<ffffffff810a2ba5>] ? do_group_exit+0x35/0xa0
> [ 1330.144338]  [<ffffffffa012efa7>] ? btrfs_commit_transaction+0x147/0x9b0 [btrfs]
> [ 1330.144348]  [<ffffffffa012f8a2>] ? start_transaction+0x92/0x3f0 [btrfs]
> [ 1330.144357]  [<ffffffffa012a0e7>] ? transaction_kthread+0x1d7/0x1f0 [btrfs]
> [ 1330.144366]  [<ffffffffa0129f10>] ? btrfs_cleanup_transaction+0x4f0/0x4f0 [btrfs]
> [ 1330.144373]  [<ffffffff810ba352>] ? kthread+0xc2/0xe0
> [ 1330.144377]  [<ffffffff810ba290>] ? kthread_create_on_node+0x40/0x40
> [ 1330.144381]  [<ffffffff81514405>] ? ret_from_fork+0x25/0x30

So what this stack trace means is that the transaction commit has hung,
judging by the called functions (assuming they are correct, though the
'?' markers aren't very encouraging). Concretely, it means that I/O has
been started for a certain range of addresses and the transaction commit
is now waiting to be woken on completion of that write. When this
occurs, can you check whether there is I/O activity from that particular
guest (assuming you have access to the hypervisor)? It might be a bug in
btrfs, or you might be hitting something else in the hypervisor.
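
To make that check concrete, here is a small sketch for the dom0 that polls the in-flight I/O counter for the guest's backing device (the device name is a placeholder; field 12 of /proc/diskstats is the kernel's "I/Os currently in progress" counter):

```shell
dev=sda1   # placeholder: substitute the backing device of the guest's disk
for i in 1 2 3; do
    # /proc/diskstats: field 3 is the device name, field 12 is the
    # number of I/Os currently in progress on that device.
    awk -v d="$dev" '$3 == d { print d, "in-flight:", $12 }' /proc/diskstats
    sleep 1
done
```

If the counter stays above zero with no completions, a request is stuck below the guest (blkback or the storage stack); if it is zero, the guest is waiting for a completion that was apparently never delivered.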


> 
> 
> It's a Debian Stretch system, running a 4.9.52 Linux kernel (on a Xen 4.8.2 hypervisor).
> With an older 4.1.x Linux kernel, I didn't have this problem.
> 
> 
> Is this a Btrfs bug? Should I try a more recent kernel? (Which one?)
> 
> Thanks in advance,
> 
> Olivier
> 


* Re : task btrfs-transacti:651 blocked for more than 120 seconds
  2017-09-28 11:18 ` Nikolay Borisov
@ 2017-09-28 14:28   ` Olivier Bonvalet
  2017-09-28 16:49     ` Re : [Xen-devel] " Olivier Bonvalet
  0 siblings, 1 reply; 4+ messages in thread
From: Olivier Bonvalet @ 2017-09-28 14:28 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs; +Cc: xen-devel

On Thursday, 28 September 2017 at 14:18 +0300, Nikolay Borisov wrote:
> So what this stack trace means is that the transaction commit has
> hung, judging by the called functions (assuming they are correct,
> though the '?' markers aren't very encouraging). Concretely, it means
> that I/O has been started for a certain range of addresses and the
> transaction commit is now waiting to be woken on completion of that
> write. When this occurs, can you check whether there is I/O activity
> from that particular guest (assuming you have access to the
> hypervisor)? It might be a bug in btrfs, or you might be hitting
> something else in the hypervisor.


Hello,

Thanks for your answer. From the hypervisor, I don't see any I/O during
this hang.

I tried to clone the VM to reproduce the problem, and I also hit it
without Btrfs:

[ 3263.452023] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3263.452040]       Tainted: G        W       4.9-dae-xen #2
[ 3263.452044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3263.452052] systemd         D    0     1      0 0x00000000
[ 3263.452060]  ffff8803a71ca000 0000000000000000 ffff8803af857880 ffff8803a9762dc0
[ 3263.452070]  ffff8803a96fcc80 ffffc9001623f990 ffffffff8150ff1f 0000000000000000
[ 3263.452079]  ffff8803a96fcc80 7fffffffffffffff ffffffff81510710 ffffc9001623faa0
[ 3263.452087] Call Trace:
[ 3263.452099]  [<ffffffff8150ff1f>] ? __schedule+0x17f/0x530
[ 3263.452105]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 3263.452110]  [<ffffffff815102fd>] ? schedule+0x2d/0x80
[ 3263.452116]  [<ffffffff815132be>] ? schedule_timeout+0x17e/0x2a0
[ 3263.452121]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
[ 3263.452126]  [<ffffffff810f2196>] ? ktime_get+0x36/0xa0
[ 3263.452130]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 3263.452134]  [<ffffffff8150fd38>] ? io_schedule_timeout+0x98/0x100
[ 3263.452137]  [<ffffffff81513de1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 3263.452141]  [<ffffffff81510722>] ? bit_wait_io+0x12/0x60
[ 3263.452145]  [<ffffffff815107be>] ? __wait_on_bit+0x4e/0x80
[ 3263.452149]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 3263.452153]  [<ffffffff81510859>] ? out_of_line_wait_on_bit+0x69/0x80
[ 3263.452157]  [<ffffffff810d4ab0>] ? autoremove_wake_function+0x30/0x30
[ 3263.452163]  [<ffffffff81220ed0>] ? ext4_find_entry+0x350/0x5d0
[ 3263.452168]  [<ffffffff811b9020>] ? d_alloc_parallel+0xa0/0x480
[ 3263.452172]  [<ffffffff811b6d18>] ? __d_lookup_done+0x68/0xd0
[ 3263.452175]  [<ffffffff811b7f38>] ? d_splice_alias+0x158/0x3b0
[ 3263.452179]  [<ffffffff81221662>] ? ext4_lookup+0x42/0x1f0
[ 3263.452184]  [<ffffffff811ab28e>] ? lookup_slow+0x8e/0x130
[ 3263.452187]  [<ffffffff811ab71a>] ? walk_component+0x1ca/0x300
[ 3263.452193]  [<ffffffff811ac0fe>] ? link_path_walk+0x18e/0x570
[ 3263.452199]  [<ffffffff811abe13>] ? path_init+0x1c3/0x320
[ 3263.452207]  [<ffffffff811ae4c2>] ? path_openat+0xe2/0x1380
[ 3263.452214]  [<ffffffff811b0329>] ? do_filp_open+0x79/0xd0
[ 3263.452222]  [<ffffffff81185fc1>] ? kmem_cache_alloc+0x71/0x400
[ 3263.452228]  [<ffffffff8119d507>] ? __check_object_size+0xf7/0x1c4
[ 3263.452235]  [<ffffffff8119f8cf>] ? do_sys_open+0x11f/0x1f0
[ 3263.452238]  [<ffffffff815141b7>] ? entry_SYSCALL_64_fastpath+0x1a/0xa9


So I will follow this up with the Xen developers.
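
When doing so, it may help to attach hypervisor-side details; a sketch assuming the standard `xl` toolstack in the dom0 (the report filename is arbitrary):

```shell
# Collect kernel and hypervisor version info plus the tail of the
# hypervisor log ring into one report file.
{
    echo "== dom0 kernel =="
    uname -r
    echo "== xen =="
    xl info 2>&1 | grep -E 'xen_version|xen_caps' || true
    echo "== hypervisor log (tail) =="
    xl dmesg 2>&1 | tail -n 100 || true
} > xen-hang-report.txt
cat xen-hang-report.txt
```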

Thanks,

Olivier


* Re : [Xen-devel] Re : task btrfs-transacti:651 blocked for more than 120 seconds
  2017-09-28 14:28   ` Re : " Olivier Bonvalet
@ 2017-09-28 16:49     ` Olivier Bonvalet
  0 siblings, 0 replies; 4+ messages in thread
From: Olivier Bonvalet @ 2017-09-28 16:49 UTC (permalink / raw)
  To: xen-devel; +Cc: linux-btrfs

On Thursday, 28 September 2017 at 16:28 +0200, Olivier Bonvalet wrote:
> [ 3263.452023] INFO: task systemd:1 blocked for more than 120 seconds.
> [ 3263.452040]       Tainted: G        W       4.9-dae-xen #2
> [ 3263.452044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3263.452052] systemd         D    0     1      0 0x00000000
> [ 3263.452060]  ffff8803a71ca000 0000000000000000 ffff8803af857880 ffff8803a9762dc0
> [ 3263.452070]  ffff8803a96fcc80 ffffc9001623f990 ffffffff8150ff1f 0000000000000000
> [ 3263.452079]  ffff8803a96fcc80 7fffffffffffffff ffffffff81510710 ffffc9001623faa0
> [ 3263.452087] Call Trace:
> [ 3263.452099]  [<ffffffff8150ff1f>] ? __schedule+0x17f/0x530
> [ 3263.452105]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
> [ 3263.452110]  [<ffffffff815102fd>] ? schedule+0x2d/0x80
> [ 3263.452116]  [<ffffffff815132be>] ? schedule_timeout+0x17e/0x2a0
> [ 3263.452121]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
> [ 3263.452126]  [<ffffffff810f2196>] ? ktime_get+0x36/0xa0
> [ 3263.452130]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
> [ 3263.452134]  [<ffffffff8150fd38>] ? io_schedule_timeout+0x98/0x100
> [ 3263.452137]  [<ffffffff81513de1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
> [ 3263.452141]  [<ffffffff81510722>] ? bit_wait_io+0x12/0x60
> [ 3263.452145]  [<ffffffff815107be>] ? __wait_on_bit+0x4e/0x80
> [ 3263.452149]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
> [ 3263.452153]  [<ffffffff81510859>] ? out_of_line_wait_on_bit+0x69/0x80
> [ 3263.452157]  [<ffffffff810d4ab0>] ? autoremove_wake_function+0x30/0x30
> [ 3263.452163]  [<ffffffff81220ed0>] ? ext4_find_entry+0x350/0x5d0
> [ 3263.452168]  [<ffffffff811b9020>] ? d_alloc_parallel+0xa0/0x480
> [ 3263.452172]  [<ffffffff811b6d18>] ? __d_lookup_done+0x68/0xd0
> [ 3263.452175]  [<ffffffff811b7f38>] ? d_splice_alias+0x158/0x3b0
> [ 3263.452179]  [<ffffffff81221662>] ? ext4_lookup+0x42/0x1f0
> [ 3263.452184]  [<ffffffff811ab28e>] ? lookup_slow+0x8e/0x130
> [ 3263.452187]  [<ffffffff811ab71a>] ? walk_component+0x1ca/0x300
> [ 3263.452193]  [<ffffffff811ac0fe>] ? link_path_walk+0x18e/0x570
> [ 3263.452199]  [<ffffffff811abe13>] ? path_init+0x1c3/0x320
> [ 3263.452207]  [<ffffffff811ae4c2>] ? path_openat+0xe2/0x1380
> [ 3263.452214]  [<ffffffff811b0329>] ? do_filp_open+0x79/0xd0
> [ 3263.452222]  [<ffffffff81185fc1>] ? kmem_cache_alloc+0x71/0x400
> [ 3263.452228]  [<ffffffff8119d507>] ? __check_object_size+0xf7/0x1c4
> [ 3263.452235]  [<ffffffff8119f8cf>] ? do_sys_open+0x11f/0x1f0
> [ 3263.452238]  [<ffffffff815141b7>] ? entry_SYSCALL_64_fastpath+0x1a/0xa9

Just in case, here is another example:

[ 1088.476044] INFO: task jbd2/xvdb-8:494 blocked for more than 120 seconds.
[ 1088.476058]       Tainted: G        W       4.9-dae-xen #2
[ 1088.476061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.476066] jbd2/xvdb-8     D    0   494      2 0x00000000
[ 1088.476072]  ffff8800fd036480 0000000000000000 ffff8803af8d7880 ffff8803a8c6e580
[ 1088.476079]  ffff88038756d280 ffffc9001737fb90 ffffffff8150ff1f 0000100000000001
[ 1088.476085]  ffff88038756d280 7fffffffffffffff ffffffff81510710 ffffc9001737fc98
[ 1088.476091] Call Trace:
[ 1088.476102]  [<ffffffff8150ff1f>] ? __schedule+0x17f/0x530
[ 1088.476107]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 1088.476114]  [<ffffffff815102fd>] ? schedule+0x2d/0x80
[ 1088.476117]  [<ffffffff815132be>] ? schedule_timeout+0x17e/0x2a0
[ 1088.476123]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
[ 1088.476126]  [<ffffffff8101bb71>] ? xen_clocksource_get_cycles+0x11/0x20
[ 1088.476132]  [<ffffffff810f2196>] ? ktime_get+0x36/0xa0
[ 1088.476136]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 1088.476139]  [<ffffffff8150fd38>] ? io_schedule_timeout+0x98/0x100
[ 1088.476143]  [<ffffffff81513de1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 1088.476147]  [<ffffffff81510722>] ? bit_wait_io+0x12/0x60
[ 1088.476151]  [<ffffffff815107be>] ? __wait_on_bit+0x4e/0x80
[ 1088.476155]  [<ffffffff81510710>] ? bit_wait+0x50/0x50
[ 1088.476159]  [<ffffffff81510859>] ? out_of_line_wait_on_bit+0x69/0x80
[ 1088.476163]  [<ffffffff810d4ab0>] ? autoremove_wake_function+0x30/0x30
[ 1088.476170]  [<ffffffff812528ee>] ? jbd2_journal_commit_transaction+0xe7e/0x1610
[ 1088.476177]  [<ffffffff810eb7f6>] ? lock_timer_base+0x76/0x90
[ 1088.476182]  [<ffffffff81255b0d>] ? kjournald2+0xad/0x230
[ 1088.476189]  [<ffffffff810d4a80>] ? wake_atomic_t_function+0x50/0x50
[ 1088.476193]  [<ffffffff81255a60>] ? commit_timeout+0x10/0x10
[ 1088.476197]  [<ffffffff810a2ba5>] ? do_group_exit+0x35/0xa0
[ 1088.476201]  [<ffffffff810ba352>] ? kthread+0xc2/0xe0
[ 1088.476205]  [<ffffffff810ba290>] ? kthread_create_on_node+0x40/0x40
[ 1088.476209]  [<ffffffff81514405>] ? ret_from_fork+0x25/0x30



And also from the Dom0 (transcribed from a screenshot):

watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [kworker/11:0:26273]
Modules linked in: ...
CPU: 11 PID: 26273 Comm: kworker/11:0 Tainted: G D W L 4.13-dae-dom0 #2
Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 10/12/2016
Workqueue: events wait_rcu_exp_gp
task: ... task.stack: ...
RIP: e030:smp_call_function_single+0x6b/0xc0
...
Call Trace:
 ? sync_rcu_exp_select_cpus+0x2b5/0x410
 ? rcu_barrier_func+0x40/0x40
 ? wait_rcu_exp_gp+0x16/0x30
 ? process_one_work+0x1ad/0x340
 ? worker_thread+0x45/0x3f0
 ? kthread+0xf2/0x130
 ? process_one_work+0x340/0x340
 ? kthread_create_on_node+0x40/0x40
 ? do_group_exit+0x35/0xa0
 ? ret_from_fork+0x25/0x30
...




^ permalink raw reply	[flat|nested] 4+ messages in thread
