Raid5 hang in 3.14.19

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Raid5 hang in 3.14.19
@ 2014-09-26 22:33 BillStuff
  2014-09-29  2:25 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: BillStuff @ 2014-09-26 22:33 UTC (permalink / raw)
  To: linux-raid, Neil Brown

Hi Neil,

I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.

It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.

on this array:
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
       104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
       bitmap: 1/5 pages [4KB], 2048KB chunk

I was running a test doing parallel kernel builds, read/write loops, and 
disk add / remove / check loops,
on both this array and a raid1 array.

I was trying to stress test your recent raid1 fixes, which went well, 
but then after 5 days,
the raid5 array hung up with this in dmesg:

INFO: task kworker/7:2:15726 blocked for more than 120 seconds.
       Tainted: P        W  O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/7:2     D 00000001     0 15726      2 0x00000000
Workqueue: xfs-log/md3 xfs_log_worker [xfs]
  d1809b1c 00000046 e95e0a80 00000001 c15b56a0 00000008 cf92b77a 0001721a
  c17d6700 c17d6700 d198bcc0 d3b6a880 c236dc00 c8290fb0 c2328468 00000352
  000000e2 c209c000 00000001 e9606000 c209c000 d1809b1c c14b05f4 c12d0c4a
Call Trace:
  [<c14b05f4>] ? sch_direct_xmit+0x54/0x180
  [<c12d0c4a>] ? radix_tree_lookup+0xa/0x10
  [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
  [<c153e7f3>] schedule+0x23/0x60
  [<c144fdf1>] md_flush_request+0x51/0xd0
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<f3dc0095>] make_request+0xb55/0xc30 [raid456]
  [<c14be153>] ? ip_finish_output+0x1e3/0x440
  [<c14bf5e7>] ? ip_output+0x77/0x80
  [<c14beeb0>] ? ip_local_out+0x20/0x30
  [<c12ae873>] ? generic_make_request_checks+0x343/0x400
  [<c1049903>] ? internal_add_timer+0x13/0x40
  [<c14501d1>] md_make_request+0xb1/0x1d0
  [<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
  [<c12af629>] generic_make_request+0x89/0xc0
  [<c12af6ba>] submit_bio+0x5a/0x120
  [<c1152cb8>] ? bio_add_page+0x38/0x40
  [<f3c919c0>] _xfs_buf_ioapply+0x250/0x300 [xfs]
  [<f3c91bb4>] xfs_buf_iorequest+0x44/0x80 [xfs]
  [<f3ce7dbe>] ? xlog_bdstrat+0x1e/0x50 [xfs]
  [<f3ce7dbe>] xlog_bdstrat+0x1e/0x50 [xfs]
  [<f3ce9900>] xlog_sync+0x2a0/0x460 [xfs]
  [<f3ce9b34>] xlog_state_release_iclog+0x74/0xa0 [xfs]
  [<f3cea3ef>] xlog_write+0x4af/0x630 [xfs]
  [<f3caa306>] ? kmem_zone_alloc+0x66/0xe0 [xfs]
  [<f3ceb897>] xlog_cil_push+0x1e7/0x3c0 [xfs]
  [<c1054c56>] ? flush_work+0x16/0x100
  [<f3cec059>] xlog_cil_force_lsn+0x79/0x1c0 [xfs]
  [<c106c864>] ? sched_clock_cpu+0x104/0x110
  [<c106d8bc>] ? __dequeue_entity+0x3c/0x40
  [<f3cea670>] _xfs_log_force+0x50/0x230 [xfs]
  [<c153e254>] ? __schedule+0x224/0x7a0
  [<f3cea876>] xfs_log_force+0x26/0x90 [xfs]
  [<f3cebb87>] ? xlog_cil_empty+0x17/0x30 [xfs]
  [<f3cea900>] xfs_log_worker+0x20/0x50 [xfs]
  [<c105521c>] process_one_work+0x10c/0x340
  [<c1055d31>] worker_thread+0x101/0x330
  [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
  [<c105ae21>] kthread+0xa1/0xc0
  [<c1541837>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/u16:2:24280 blocked for more than 120 seconds.
       Tainted: P        W  O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:2   D d56879e0     0 24280      2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-9:3)
  d5687a3c 00000046 00000000 d56879e0 f2fbd600 f3dbc81b 2e86d607 0001721c
  c17d6700 c17d6700 c49cbcc0 e601c6e0 f3dbc81b 00000002 00000001 e9e0fc00
  c2328468 0000003b d5687a28 d5687d74 00000001 d5687a3c c12b0db6 f3dba382
Call Trace:
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<c12b0db6>] ? blk_flush_plug_list+0x86/0x1b0
  [<f3dba382>] ? stripe_set_idx+0xa2/0xc0 [raid456]
  [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
  [<c153e7f3>] schedule+0x23/0x60
  [<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
  [<c145f80b>] ? bitmap_startwrite+0x5b/0x170
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
  [<c10b7193>] ? delayacct_end+0x93/0xa0
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<c14501d1>] md_make_request+0xb1/0x1d0
  [<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
  [<c12af629>] generic_make_request+0x89/0xc0
  [<c12af6ba>] submit_bio+0x5a/0x120
  [<c1152b09>] ? __bio_add_page+0xd9/0x220
  [<c1153248>] ? bvec_alloc+0x78/0xe0
  [<f3c8b612>] xfs_submit_ioend_bio.isra.9+0x32/0x40 [xfs]
  [<f3c8b70c>] xfs_submit_ioend+0xec/0x120 [xfs]
  [<f3c8b981>] xfs_vm_writepage+0x241/0x580 [xfs]
  [<c10ec730>] __writepage+0x10/0x40
  [<c10ec720>] ? mapping_tagged+0x20/0x20
  [<c10ecc30>] write_cache_pages+0x170/0x3d0
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<c10ec720>] ? mapping_tagged+0x20/0x20
  [<c10ecec3>] generic_writepages+0x33/0x60
  [<f3c8aa43>] xfs_vm_writepages+0x33/0x40 [xfs]
  [<c10ee38a>] do_writepages+0x1a/0x40
  [<c11486b2>] __writeback_single_inode+0x42/0x1f0
  [<c107965c>] ? wake_up_bit+0x1c/0x20
  [<c1149423>] writeback_sb_inodes+0x193/0x2e0
  [<c11495ec>] __writeback_inodes_wb+0x7c/0xb0
  [<c11497f2>] wb_writeback+0x1d2/0x270
  [<c1149c11>] bdi_writeback_workfn+0x151/0x380
  [<c153e254>] ? __schedule+0x224/0x7a0
  [<c105521c>] process_one_work+0x10c/0x340
  [<c1069cb0>] ? default_wake_function+0x10/0x20
  [<c1055d31>] worker_thread+0x101/0x330
  [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
  [<c105ae21>] kthread+0xa1/0xc0
  [<c1541837>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/3:3:26958 blocked for more than 120 seconds.
       Tainted: P        W  O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/3:3     D 00000000     0 26958      2 0x00000000
Workqueue: md md_submit_flush_data
  c6d41dd8 00000046 d128391d 00000000 00000000 00000000 2d875421 0001721c
  c17d6700 c17d6700 e9d3bcc0 d4bf5610 00000000 00000000 00000000 ea469748
  df1132aa 0000c335 d4bf5654 d4bf5654 de5a17aa c6d41dec c106fe31 57d98c16
Call Trace:
  [<c106fe31>] ? update_curr+0xf1/0x1a0
  [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
  [<c153e7f3>] schedule+0x23/0x60
  [<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
  [<c107030f>] ? dequeue_entity+0x11f/0x510
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
  [<c153e254>] ? __schedule+0x224/0x7a0
  [<c111bf45>] ? kmem_cache_free+0xe5/0x110
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<c144fd60>] md_submit_flush_data+0x30/0x70
  [<c105521c>] process_one_work+0x10c/0x340
  [<c104a30f>] ? mod_timer+0xef/0x1d0
  [<c1055d31>] worker_thread+0x101/0x330
  [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
  [<c105ae21>] kthread+0xa1/0xc0
  [<c1541837>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110

There's more, mostly xfs threads complaining about being blocked and 
some user processes.
Thinking it was an xfs problem, I did "echo w > /proc/sysrq-trigger" and 
got this in syslog:

Sep 26 12:33:06 xplane kernel: SysRq : Show Blocked State
Sep 26 12:33:06 xplane kernel:   task                PC stack   pid father
Sep 26 12:33:06 xplane kernel: kthreadd        D c106ea4c     0 2      0 
0x00000000
Sep 26 12:33:06 xplane kernel: kswapd0         D 00017219     0 711      
2 0x00000000
Sep 26 12:33:06 xplane kernel: md3_raid5       D ea49d770     0 
1698      2 0x00000000
Sep 26 12:33:06 xplane kernel: xfsaild/md3     D e7d72700     0 
2234      2 0x00000000
Sep 26 12:33:06 xplane kernel: kworker/7:2     D 00000001     0 
15726      2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: xfs-log/md3 xfs_log_worker [xfs]
Sep 26 12:33:06 xplane kernel: kworker/u16:2   D d56879e0     0 
24280      2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: writeback bdi_writeback_workfn 
(flush-9:3)
Sep 26 12:33:06 xplane kernel: kworker/3:3     D 00000000     0 
26958      2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: md md_submit_flush_data
Sep 26 12:33:06 xplane kernel: cat             D 00000000     0 11412  
11410 0x00000000
Sep 26 12:33:06 xplane kernel: cc1             D d00c3bac     0 12102  
12101 0x00000000
Sep 26 12:33:06 xplane kernel: ld              D c16ef900     0 12132  
12130 0x00000000
Sep 26 12:33:06 xplane kernel: make            D 00000000     0 12147  
12047 0x00000000
Sep 26 12:33:06 xplane kernel: make            D d896ea80     0 12374  
11263 0x00000000
Sep 26 12:33:06 xplane kernel: cat             D f2e51d0c     0 23219  
23217 0x00000000
Sep 26 12:33:06 xplane kernel: gcc             D f2e51d0c     0 23564  
23563 0x00000000
Sep 26 12:33:06 xplane kernel: sh              D f2e51d0c     0 25753  
23615 0x00000000

which looks similar to the problem from the above mentioned list-post.

I did a few test writes to the filesystem, and it looked like xfs 
created the files, and could read them
and delete them, but there was never any disk activity and access to 
/sys/block/md3/md/{anything} was hung up

"top" showed consistent I/O wait time, first 25%, then 50%, with no disc 
activity

In the end I had to reboot, and I'm running the test again.

Any thoughts on what's happening here or ideas to help debug it?

Thanks,
Bill


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-26 22:33 Raid5 hang in 3.14.19 BillStuff
@ 2014-09-29  2:25 ` NeilBrown
  2014-09-29  3:56   ` BillStuff
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-09-29  2:25 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 10758 bytes --]

On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> Hi Neil,
> 
> I found something that looks similar to the problem described in
> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> 
> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> 
> on this array:
> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>        104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>        bitmap: 1/5 pages [4KB], 2048KB chunk
> 
> I was running a test doing parallel kernel builds, read/write loops, and 
> disk add / remove / check loops,
> on both this array and a raid1 array.
> 
> I was trying to stress test your recent raid1 fixes, which went well, 
> but then after 5 days,
> the raid5 array hung up with this in dmesg:

I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.

 From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(

The md3_raid5 task is listed as blocked, but not stack trace is given.
If the machine is still in the state, then

 cat /proc/1698/stack

might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)

NeilBrown




> 
> INFO: task kworker/7:2:15726 blocked for more than 120 seconds.
>        Tainted: P        W  O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/7:2     D 00000001     0 15726      2 0x00000000
> Workqueue: xfs-log/md3 xfs_log_worker [xfs]
>   d1809b1c 00000046 e95e0a80 00000001 c15b56a0 00000008 cf92b77a 0001721a
>   c17d6700 c17d6700 d198bcc0 d3b6a880 c236dc00 c8290fb0 c2328468 00000352
>   000000e2 c209c000 00000001 e9606000 c209c000 d1809b1c c14b05f4 c12d0c4a
> Call Trace:
>   [<c14b05f4>] ? sch_direct_xmit+0x54/0x180
>   [<c12d0c4a>] ? radix_tree_lookup+0xa/0x10
>   [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
>   [<c153e7f3>] schedule+0x23/0x60
>   [<c144fdf1>] md_flush_request+0x51/0xd0
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<f3dc0095>] make_request+0xb55/0xc30 [raid456]
>   [<c14be153>] ? ip_finish_output+0x1e3/0x440
>   [<c14bf5e7>] ? ip_output+0x77/0x80
>   [<c14beeb0>] ? ip_local_out+0x20/0x30
>   [<c12ae873>] ? generic_make_request_checks+0x343/0x400
>   [<c1049903>] ? internal_add_timer+0x13/0x40
>   [<c14501d1>] md_make_request+0xb1/0x1d0
>   [<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
>   [<c12af629>] generic_make_request+0x89/0xc0
>   [<c12af6ba>] submit_bio+0x5a/0x120
>   [<c1152cb8>] ? bio_add_page+0x38/0x40
>   [<f3c919c0>] _xfs_buf_ioapply+0x250/0x300 [xfs]
>   [<f3c91bb4>] xfs_buf_iorequest+0x44/0x80 [xfs]
>   [<f3ce7dbe>] ? xlog_bdstrat+0x1e/0x50 [xfs]
>   [<f3ce7dbe>] xlog_bdstrat+0x1e/0x50 [xfs]
>   [<f3ce9900>] xlog_sync+0x2a0/0x460 [xfs]
>   [<f3ce9b34>] xlog_state_release_iclog+0x74/0xa0 [xfs]
>   [<f3cea3ef>] xlog_write+0x4af/0x630 [xfs]
>   [<f3caa306>] ? kmem_zone_alloc+0x66/0xe0 [xfs]
>   [<f3ceb897>] xlog_cil_push+0x1e7/0x3c0 [xfs]
>   [<c1054c56>] ? flush_work+0x16/0x100
>   [<f3cec059>] xlog_cil_force_lsn+0x79/0x1c0 [xfs]
>   [<c106c864>] ? sched_clock_cpu+0x104/0x110
>   [<c106d8bc>] ? __dequeue_entity+0x3c/0x40
>   [<f3cea670>] _xfs_log_force+0x50/0x230 [xfs]
>   [<c153e254>] ? __schedule+0x224/0x7a0
>   [<f3cea876>] xfs_log_force+0x26/0x90 [xfs]
>   [<f3cebb87>] ? xlog_cil_empty+0x17/0x30 [xfs]
>   [<f3cea900>] xfs_log_worker+0x20/0x50 [xfs]
>   [<c105521c>] process_one_work+0x10c/0x340
>   [<c1055d31>] worker_thread+0x101/0x330
>   [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c1541837>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/u16:2:24280 blocked for more than 120 seconds.
>        Tainted: P        W  O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u16:2   D d56879e0     0 24280      2 0x00000000
> Workqueue: writeback bdi_writeback_workfn (flush-9:3)
>   d5687a3c 00000046 00000000 d56879e0 f2fbd600 f3dbc81b 2e86d607 0001721c
>   c17d6700 c17d6700 c49cbcc0 e601c6e0 f3dbc81b 00000002 00000001 e9e0fc00
>   c2328468 0000003b d5687a28 d5687d74 00000001 d5687a3c c12b0db6 f3dba382
> Call Trace:
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c12b0db6>] ? blk_flush_plug_list+0x86/0x1b0
>   [<f3dba382>] ? stripe_set_idx+0xa2/0xc0 [raid456]
>   [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
>   [<c153e7f3>] schedule+0x23/0x60
>   [<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
>   [<c145f80b>] ? bitmap_startwrite+0x5b/0x170
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
>   [<c10b7193>] ? delayacct_end+0x93/0xa0
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c14501d1>] md_make_request+0xb1/0x1d0
>   [<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
>   [<c12af629>] generic_make_request+0x89/0xc0
>   [<c12af6ba>] submit_bio+0x5a/0x120
>   [<c1152b09>] ? __bio_add_page+0xd9/0x220
>   [<c1153248>] ? bvec_alloc+0x78/0xe0
>   [<f3c8b612>] xfs_submit_ioend_bio.isra.9+0x32/0x40 [xfs]
>   [<f3c8b70c>] xfs_submit_ioend+0xec/0x120 [xfs]
>   [<f3c8b981>] xfs_vm_writepage+0x241/0x580 [xfs]
>   [<c10ec730>] __writepage+0x10/0x40
>   [<c10ec720>] ? mapping_tagged+0x20/0x20
>   [<c10ecc30>] write_cache_pages+0x170/0x3d0
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c10ec720>] ? mapping_tagged+0x20/0x20
>   [<c10ecec3>] generic_writepages+0x33/0x60
>   [<f3c8aa43>] xfs_vm_writepages+0x33/0x40 [xfs]
>   [<c10ee38a>] do_writepages+0x1a/0x40
>   [<c11486b2>] __writeback_single_inode+0x42/0x1f0
>   [<c107965c>] ? wake_up_bit+0x1c/0x20
>   [<c1149423>] writeback_sb_inodes+0x193/0x2e0
>   [<c11495ec>] __writeback_inodes_wb+0x7c/0xb0
>   [<c11497f2>] wb_writeback+0x1d2/0x270
>   [<c1149c11>] bdi_writeback_workfn+0x151/0x380
>   [<c153e254>] ? __schedule+0x224/0x7a0
>   [<c105521c>] process_one_work+0x10c/0x340
>   [<c1069cb0>] ? default_wake_function+0x10/0x20
>   [<c1055d31>] worker_thread+0x101/0x330
>   [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c1541837>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/3:3:26958 blocked for more than 120 seconds.
>        Tainted: P        W  O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/3:3     D 00000000     0 26958      2 0x00000000
> Workqueue: md md_submit_flush_data
>   c6d41dd8 00000046 d128391d 00000000 00000000 00000000 2d875421 0001721c
>   c17d6700 c17d6700 e9d3bcc0 d4bf5610 00000000 00000000 00000000 ea469748
>   df1132aa 0000c335 d4bf5654 d4bf5654 de5a17aa c6d41dec c106fe31 57d98c16
> Call Trace:
>   [<c106fe31>] ? update_curr+0xf1/0x1a0
>   [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
>   [<c153e7f3>] schedule+0x23/0x60
>   [<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
>   [<c107030f>] ? dequeue_entity+0x11f/0x510
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
>   [<c153e254>] ? __schedule+0x224/0x7a0
>   [<c111bf45>] ? kmem_cache_free+0xe5/0x110
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c144fd60>] md_submit_flush_data+0x30/0x70
>   [<c105521c>] process_one_work+0x10c/0x340
>   [<c104a30f>] ? mod_timer+0xef/0x1d0
>   [<c1055d31>] worker_thread+0x101/0x330
>   [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c1541837>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> 
> There's more, mostly xfs threads complaining about being blocked and 
> some user processes.
> Thinking it was an xfs problem, I did "echo w > /proc/sysrq-trigger" and 
> got this in syslog:
> 
> Sep 26 12:33:06 xplane kernel: SysRq : Show Blocked State
> Sep 26 12:33:06 xplane kernel:   task                PC stack   pid father
> Sep 26 12:33:06 xplane kernel: kthreadd        D c106ea4c     0 2      0 
> 0x00000000
> Sep 26 12:33:06 xplane kernel: kswapd0         D 00017219     0 711      
> 2 0x00000000
> Sep 26 12:33:06 xplane kernel: md3_raid5       D ea49d770     0 
> 1698      2 0x00000000
> Sep 26 12:33:06 xplane kernel: xfsaild/md3     D e7d72700     0 
> 2234      2 0x00000000
> Sep 26 12:33:06 xplane kernel: kworker/7:2     D 00000001     0 
> 15726      2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: xfs-log/md3 xfs_log_worker [xfs]
> Sep 26 12:33:06 xplane kernel: kworker/u16:2   D d56879e0     0 
> 24280      2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: writeback bdi_writeback_workfn 
> (flush-9:3)
> Sep 26 12:33:06 xplane kernel: kworker/3:3     D 00000000     0 
> 26958      2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: md md_submit_flush_data
> Sep 26 12:33:06 xplane kernel: cat             D 00000000     0 11412  
> 11410 0x00000000
> Sep 26 12:33:06 xplane kernel: cc1             D d00c3bac     0 12102  
> 12101 0x00000000
> Sep 26 12:33:06 xplane kernel: ld              D c16ef900     0 12132  
> 12130 0x00000000
> Sep 26 12:33:06 xplane kernel: make            D 00000000     0 12147  
> 12047 0x00000000
> Sep 26 12:33:06 xplane kernel: make            D d896ea80     0 12374  
> 11263 0x00000000
> Sep 26 12:33:06 xplane kernel: cat             D f2e51d0c     0 23219  
> 23217 0x00000000
> Sep 26 12:33:06 xplane kernel: gcc             D f2e51d0c     0 23564  
> 23563 0x00000000
> Sep 26 12:33:06 xplane kernel: sh              D f2e51d0c     0 25753  
> 23615 0x00000000
> 
> which looks similar to the problem from the above mentioned list-post.
> 
> I did a few test writes to the filesystem, and it looked like xfs 
> created the files, and could read them
> and delete them, but there was never any disk activity and access to 
> /sys/block/md3/md/{anything} was hung up
> 
> "top" showed consistent I/O wait time, first 25%, then 50%, with no disc 
> activity
> 
> In the end I had to reboot, and I'm running the test again.
> 
> Any thoughts on what's happening here or ideas to help debug it?
> 
> Thanks,
> Bill
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29  2:25 ` NeilBrown
@ 2014-09-29  3:56   ` BillStuff
  2014-09-29  4:08     ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: BillStuff @ 2014-09-29  3:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 09/28/2014 09:25 PM, NeilBrown wrote:
> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
> wrote:
>
>> Hi Neil,
>>
>> I found something that looks similar to the problem described in
>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
>>
>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
>>
>> on this array:
>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>>         104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>>         bitmap: 1/5 pages [4KB], 2048KB chunk
>>
>> I was running a test doing parallel kernel builds, read/write loops, and
>> disk add / remove / check loops,
>> on both this array and a raid1 array.
>>
>> I was trying to stress test your recent raid1 fixes, which went well,
>> but then after 5 days,
>> the raid5 array hung up with this in dmesg:
> I think this is different to the workqueue problem you mentioned, though as I
> don't know exactly what caused either I cannot be certain.
>
>   From the data you provided it looks like everything is waiting on
> get_active_stripe(), or on a process that is waiting on that.
> That seems pretty common whenever anything goes wrong in raid5 :-(
>
> The md3_raid5 task is listed as blocked, but not stack trace is given.
> If the machine is still in the state, then
>
>   cat /proc/1698/stack
>
> might be useful.
> (echo t > /proc/sysrq-trigger is always a good idea)

Might this help? I believe the array was doing a "check" when things 
hung up.

md3_raid5       D ea49d770     0  1698      2 0x00000000
  e833dda8 00000046 c106d92d ea49d770 e9d38554 1cc20b58 1e79a404 0001721a
  c17d6700 c17d6700 e956d610 c2217470 c13af054 e9e8f000 00000000 00000000
  e833dd78 00000000 00000000 00000271 00000000 00000005 00000000 0000a193
Call Trace:
  [<c106d92d>] ? __enqueue_entity+0x6d/0x80
  [<c13af054>] ? scsi_init_io+0x24/0xb0
  [<c1072683>] ? enqueue_task_fair+0x2d3/0x660
  [<c153e7f3>] schedule+0x23/0x60
  [<c153db85>] schedule_timeout+0x145/0x1c0
  [<c1065698>] ? update_rq_clock.part.92+0x18/0x50
  [<c1067a65>] ? check_preempt_curr+0x65/0x90
  [<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
  [<c153ef5b>] wait_for_common+0x9b/0x110
  [<c1069ca0>] ? wake_up_process+0x40/0x40
  [<c153f077>] wait_for_completion_killable+0x17/0x30
  [<c105ad0a>] kthread_create_on_node+0x9a/0x110
  [<c1453ecc>] md_register_thread+0x8c/0xc0
  [<c1453f00>] ? md_register_thread+0xc0/0xc0
  [<c145ad14>] md_check_recovery+0x304/0x490
  [<c12b1192>] ? blk_finish_plug+0x12/0x40
  [<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
  [<c104a022>] ? try_to_del_timer_sync+0x42/0x60
  [<c153db3d>] ? schedule_timeout+0xfd/0x1c0
  [<c1453fe8>] md_thread+0xe8/0x100
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<c1453f00>] ? md_register_thread+0xc0/0xc0
  [<c105ae21>] kthread+0xa1/0xc0
  [<c1541837>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110

I've already rebooted the system, but I did get a snapshot of all the 
blocked processes.
It's kind of long but I can post it if it's useful.

Thanks,
Bill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29  3:56   ` BillStuff
@ 2014-09-29  4:08     ` NeilBrown
  2014-09-29  4:28       ` BillStuff
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-09-29  4:08 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2132 bytes --]

On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> On 09/28/2014 09:25 PM, NeilBrown wrote:
> > On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
> > wrote:
> >
> >> Hi Neil,
> >>
> >> I found something that looks similar to the problem described in
> >> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>
> >> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>
> >> on this array:
> >> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>         104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>         bitmap: 1/5 pages [4KB], 2048KB chunk
> >>
> >> I was running a test doing parallel kernel builds, read/write loops, and
> >> disk add / remove / check loops,
> >> on both this array and a raid1 array.
> >>
> >> I was trying to stress test your recent raid1 fixes, which went well,
> >> but then after 5 days,
> >> the raid5 array hung up with this in dmesg:
> > I think this is different to the workqueue problem you mentioned, though as I
> > don't know exactly what caused either I cannot be certain.
> >
> >   From the data you provided it looks like everything is waiting on
> > get_active_stripe(), or on a process that is waiting on that.
> > That seems pretty common whenever anything goes wrong in raid5 :-(
> >
> > The md3_raid5 task is listed as blocked, but not stack trace is given.
> > If the machine is still in the state, then
> >
> >   cat /proc/1698/stack
> >
> > might be useful.
> > (echo t > /proc/sysrq-trigger is always a good idea)
> 
> Might this help? I believe the array was doing a "check" when things 
> hung up.

It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.

That would suggest that we need an async way to start threads...

Thanks,
NeilBrown


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29  4:08     ` NeilBrown
@ 2014-09-29  4:28       ` BillStuff
  2014-09-29  4:43         ` NeilBrown
  2014-09-29 21:59         ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: BillStuff @ 2014-09-29  4:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 09/28/2014 11:08 PM, NeilBrown wrote:
> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
> wrote:
>
>> On 09/28/2014 09:25 PM, NeilBrown wrote:
>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
>>> wrote:
>>>
>>>> Hi Neil,
>>>>
>>>> I found something that looks similar to the problem described in
>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
>>>>
>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
>>>>
>>>> on this array:
>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>>>>          104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>>>>          bitmap: 1/5 pages [4KB], 2048KB chunk
>>>>
>>>> I was running a test doing parallel kernel builds, read/write loops, and
>>>> disk add / remove / check loops,
>>>> on both this array and a raid1 array.
>>>>
>>>> I was trying to stress test your recent raid1 fixes, which went well,
>>>> but then after 5 days,
>>>> the raid5 array hung up with this in dmesg:
>>> I think this is different to the workqueue problem you mentioned, though as I
>>> don't know exactly what caused either I cannot be certain.
>>>
>>>    From the data you provided it looks like everything is waiting on
>>> get_active_stripe(), or on a process that is waiting on that.
>>> That seems pretty common whenever anything goes wrong in raid5 :-(
>>>
>>> The md3_raid5 task is listed as blocked, but not stack trace is given.
>>> If the machine is still in the state, then
>>>
>>>    cat /proc/1698/stack
>>>
>>> might be useful.
>>> (echo t > /proc/sysrq-trigger is always a good idea)
>> Might this help? I believe the array was doing a "check" when things
>> hung up.
> It looks like it was trying to start doing a 'check'.
> The 'resync' thread hadn't been started yet.
> What is 'kthreadd' doing?
> My guess is that it is in try_to_free_pages() waiting for writeout
> for some xfs file page onto the md array ... which won't progress until
> the thread gets started.
>
> That would suggest that we need an async way to start threads...
>
> Thanks,
> NeilBrown
>

I suspect your guess is correct:

kthreadd        D c106ea4c     0     2      0 0x00000000
  e9d6db58 00000046 e9d6db4c c106ea4c ce493c00 00000001 1e9bb7bd 0001721a
  c17d6700 c17d6700 d3b6a880 e9d38510 f2cf4c00 00000000 f2e51c00 e9d6db60
  f3cec0b6 e9d38510 f2e51d14 f2e51d00 f2e51c00 00043132 00000964 0000a4b0
Call Trace:
  [<c106ea4c>] ? update_blocked_averages+0x1ec/0x700
  [<f3cec0b6>] ? xlog_cil_force_lsn+0xd6/0x1c0 [xfs]
  [<f3cc077b>] ? xfs_bmbt_get_all+0x2b/0x40 [xfs]
  [<c153e7f3>] schedule+0x23/0x60
  [<f3ceaa71>] _xfs_log_force_lsn+0x141/0x270 [xfs]
  [<c1069ca0>] ? wake_up_process+0x40/0x40
  [<f3ceabd8>] xfs_log_force_lsn+0x38/0x90 [xfs]
  [<f3cd7ee0>] __xfs_iunpin_wait+0x80/0x100 [xfs]
  [<f3cdb02d>] ? xfs_iunpin_wait+0x1d/0x30 [xfs]
  [<c10799d0>] ? autoremove_wake_function+0x40/0x40
  [<f3cdb02d>] xfs_iunpin_wait+0x1d/0x30 [xfs]
  [<f3c99938>] xfs_reclaim_inode+0x58/0x2f0 [xfs]
  [<f3c99e04>] xfs_reclaim_inodes_ag+0x234/0x330 [xfs]
  [<f3c9a6a1>] ? xfs_inode_set_reclaim_tag+0x91/0x150 [xfs]
  [<c115cc41>] ? fsnotify_clear_marks_by_inode+0x21/0xe0
  [<f3ca5ac5>] ? xfs_fs_destroy_inode+0xa5/0xd0 [xfs]
  [<c113b061>] ? destroy_inode+0x31/0x50
  [<c113b160>] ? evict+0xe0/0x160
  [<f3c9a7ad>] xfs_reclaim_inodes_nr+0x2d/0x40 [xfs]
  [<f3ca5103>] xfs_fs_free_cached_objects+0x13/0x20 [xfs]
  [<c11278ce>] super_cache_scan+0x12e/0x140
  [<c10f2bb5>] shrink_slab_node+0x125/0x280
  [<c1101a1c>] ? compact_zone+0x2c/0x450
  [<c10f3489>] shrink_slab+0xd9/0xf0
  [<c10f557d>] try_to_free_pages+0x25d/0x4f0
  [<c10eb96d>] __alloc_pages_nodemask+0x52d/0x820
  [<c103c452>] copy_process.part.47+0xd2/0x14e0
  [<c153e254>] ? __schedule+0x224/0x7a0
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
  [<c103da01>] do_fork+0xc1/0x320
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
  [<c103dc8d>] kernel_thread+0x2d/0x40
  [<c105b542>] kthreadd+0x122/0x170
  [<c1541837>] ret_from_kernel_thread+0x1b/0x28
  [<c105b420>] ? kthread_create_on_cpu+0x60/0x60



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29  4:28       ` BillStuff
@ 2014-09-29  4:43         ` NeilBrown
  2014-09-29 21:59         ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2014-09-29  4:43 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4929 bytes --]

On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> On 09/28/2014 11:08 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
> > wrote:
> >
> >> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
> >>> wrote:
> >>>
> >>>> Hi Neil,
> >>>>
> >>>> I found something that looks similar to the problem described in
> >>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>
> >>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>
> >>>> on this array:
> >>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>          104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>          bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>
> >>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>> disk add / remove / check loops,
> >>>> on both this array and a raid1 array.
> >>>>
> >>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>> but then after 5 days,
> >>>> the raid5 array hung up with this in dmesg:
> >>> I think this is different to the workqueue problem you mentioned, though as I
> >>> don't know exactly what caused either I cannot be certain.
> >>>
> >>>    From the data you provided it looks like everything is waiting on
> >>> get_active_stripe(), or on a process that is waiting on that.
> >>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>
> >>> The md3_raid5 task is listed as blocked, but not stack trace is given.
> >>> If the machine is still in the state, then
> >>>
> >>>    cat /proc/1698/stack
> >>>
> >>> might be useful.
> >>> (echo t > /proc/sysrq-trigger is always a good idea)
> >> Might this help? I believe the array was doing a "check" when things
> >> hung up.
> > It looks like it was trying to start doing a 'check'.
> > The 'resync' thread hadn't been started yet.
> > What is 'kthreadd' doing?
> > My guess is that it is in try_to_free_pages() waiting for writeout
> > for some xfs file page onto the md array ... which won't progress until
> > the thread gets started.
> >
> > That would suggest that we need an async way to start threads...
> >
> > Thanks,
> > NeilBrown
> >
> 
> I suspect your guess is correct:

Yes, looks like it is - thanks.

I'll probably get a workqueue to start the thread, so the md thread doesn't
block on it.

thanks,
NeilBrown


> 
> kthreadd        D c106ea4c     0     2      0 0x00000000
>   e9d6db58 00000046 e9d6db4c c106ea4c ce493c00 00000001 1e9bb7bd 0001721a
>   c17d6700 c17d6700 d3b6a880 e9d38510 f2cf4c00 00000000 f2e51c00 e9d6db60
>   f3cec0b6 e9d38510 f2e51d14 f2e51d00 f2e51c00 00043132 00000964 0000a4b0
> Call Trace:
>   [<c106ea4c>] ? update_blocked_averages+0x1ec/0x700
>   [<f3cec0b6>] ? xlog_cil_force_lsn+0xd6/0x1c0 [xfs]
>   [<f3cc077b>] ? xfs_bmbt_get_all+0x2b/0x40 [xfs]
>   [<c153e7f3>] schedule+0x23/0x60
>   [<f3ceaa71>] _xfs_log_force_lsn+0x141/0x270 [xfs]
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<f3ceabd8>] xfs_log_force_lsn+0x38/0x90 [xfs]
>   [<f3cd7ee0>] __xfs_iunpin_wait+0x80/0x100 [xfs]
>   [<f3cdb02d>] ? xfs_iunpin_wait+0x1d/0x30 [xfs]
>   [<c10799d0>] ? autoremove_wake_function+0x40/0x40
>   [<f3cdb02d>] xfs_iunpin_wait+0x1d/0x30 [xfs]
>   [<f3c99938>] xfs_reclaim_inode+0x58/0x2f0 [xfs]
>   [<f3c99e04>] xfs_reclaim_inodes_ag+0x234/0x330 [xfs]
>   [<f3c9a6a1>] ? xfs_inode_set_reclaim_tag+0x91/0x150 [xfs]
>   [<c115cc41>] ? fsnotify_clear_marks_by_inode+0x21/0xe0
>   [<f3ca5ac5>] ? xfs_fs_destroy_inode+0xa5/0xd0 [xfs]
>   [<c113b061>] ? destroy_inode+0x31/0x50
>   [<c113b160>] ? evict+0xe0/0x160
>   [<f3c9a7ad>] xfs_reclaim_inodes_nr+0x2d/0x40 [xfs]
>   [<f3ca5103>] xfs_fs_free_cached_objects+0x13/0x20 [xfs]
>   [<c11278ce>] super_cache_scan+0x12e/0x140
>   [<c10f2bb5>] shrink_slab_node+0x125/0x280
>   [<c1101a1c>] ? compact_zone+0x2c/0x450
>   [<c10f3489>] shrink_slab+0xd9/0xf0
>   [<c10f557d>] try_to_free_pages+0x25d/0x4f0
>   [<c10eb96d>] __alloc_pages_nodemask+0x52d/0x820
>   [<c103c452>] copy_process.part.47+0xd2/0x14e0
>   [<c153e254>] ? __schedule+0x224/0x7a0
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
>   [<c103da01>] do_fork+0xc1/0x320
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
>   [<c103dc8d>] kernel_thread+0x2d/0x40
>   [<c105b542>] kthreadd+0x122/0x170
>   [<c1541837>] ret_from_kernel_thread+0x1b/0x28
>   [<c105b420>] ? kthread_create_on_cpu+0x60/0x60
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29  4:28       ` BillStuff
  2014-09-29  4:43         ` NeilBrown
@ 2014-09-29 21:59         ` NeilBrown
  2014-09-30  4:19           ` BillStuff
  2014-09-30 21:21           ` BillStuff
  1 sibling, 2 replies; 14+ messages in thread
From: NeilBrown @ 2014-09-29 21:59 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5497 bytes --]

On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> On 09/28/2014 11:08 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
> > wrote:
> >
> >> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
> >>> wrote:
> >>>
> >>>> Hi Neil,
> >>>>
> >>>> I found something that looks similar to the problem described in
> >>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>
> >>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>
> >>>> on this array:
> >>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>          104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>          bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>
> >>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>> disk add / remove / check loops,
> >>>> on both this array and a raid1 array.
> >>>>
> >>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>> but then after 5 days,
> >>>> the raid5 array hung up with this in dmesg:
> >>> I think this is different to the workqueue problem you mentioned, though as I
> >>> don't know exactly what caused either I cannot be certain.
> >>>
> >>>    From the data you provided it looks like everything is waiting on
> >>> get_active_stripe(), or on a process that is waiting on that.
> >>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>
> >>> The md3_raid5 task is listed as blocked, but not stack trace is given.
> >>> If the machine is still in the state, then
> >>>
> >>>    cat /proc/1698/stack
> >>>
> >>> might be useful.
> >>> (echo t > /proc/sysrq-trigger is always a good idea)
> >> Might this help? I believe the array was doing a "check" when things
> >> hung up.
> > It looks like it was trying to start doing a 'check'.
> > The 'resync' thread hadn't been started yet.
> > What is 'kthreadd' doing?
> > My guess is that it is in try_to_free_pages() waiting for writeout
> > for some xfs file page onto the md array ... which won't progress until
> > the thread gets started.
> >
> > That would suggest that we need an async way to start threads...
> >
> > Thanks,
> > NeilBrown
> >
> 
> I suspect your guess is correct:

Thanks for the confirmation.

I'm thinking of something like that.  Very basic suggestion suggests it
instantly crash.

If you were to apply this patch and run your test for a week or two,  that
would increase my confidence (though of course testing doesn't prove the
absence of bugs....)

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index a79e51d15c2b..580d4b97696c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7770,6 +7770,33 @@ no_add:
 	return spares;
 }
 
+static void md_start_sync(struct work_struct *ws)
+{
+	struct mddev *mddev = container_of(ws, struct mddev, del_work);
+
+	mddev->sync_thread = md_register_thread(md_do_sync,
+						mddev,
+						"resync");
+	if (!mddev->sync_thread) {
+		printk(KERN_ERR "%s: could not start resync"
+		       " thread...\n",
+		       mdname(mddev));
+		/* leave the spares where they are, it shouldn't hurt */
+		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
+		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+		clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+		if (test_and_clear_bit(MD_RECOVERY_RECOVER,
+				       &mddev->recovery))
+			if (mddev->sysfs_action)
+				sysfs_notify_dirent_safe(mddev->sysfs_action);
+	} else
+		md_wakeup_thread(mddev->sync_thread);
+	sysfs_notify_dirent_safe(mddev->sysfs_action);
+	md_new_event(mddev);
+}
+
 /*
  * This routine is regularly called by all per-raid-array threads to
  * deal with generic issues like resync and super-block update.
@@ -7823,6 +7850,7 @@ void md_check_recovery(struct mddev *mddev)
 
 	if (mddev_trylock(mddev)) {
 		int spares = 0;
+		bool sync_starting = false;
 
 		if (mddev->ro) {
 			/* On a read-only array we can:
@@ -7921,28 +7949,14 @@ void md_check_recovery(struct mddev *mddev)
 				 */
 				bitmap_write_all(mddev->bitmap);
 			}
-			mddev->sync_thread = md_register_thread(md_do_sync,
-								mddev,
-								"resync");
-			if (!mddev->sync_thread) {
-				printk(KERN_ERR "%s: could not start resync"
-					" thread...\n", 
-					mdname(mddev));
-				/* leave the spares where they are, it shouldn't hurt */
-				clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-				clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
-				clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
-				clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
-				clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
-			} else
-				md_wakeup_thread(mddev->sync_thread);
-			sysfs_notify_dirent_safe(mddev->sysfs_action);
-			md_new_event(mddev);
+			INIT_WORK(&mddev->del_work, md_start_sync);
+			queue_work(md_misc_wq, &mddev->del_work);
+			sync_starting = true;
 		}
 	unlock:
 		wake_up(&mddev->sb_wait);
 
-		if (!mddev->sync_thread) {
+		if (!mddev->sync_thread && !sync_starting) {
 			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
 					       &mddev->recovery))


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29 21:59         ` NeilBrown
@ 2014-09-30  4:19           ` BillStuff
  2014-09-30 21:21           ` BillStuff
  1 sibling, 0 replies; 14+ messages in thread
From: BillStuff @ 2014-09-30  4:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 09/29/2014 04:59 PM, NeilBrown wrote:
> On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@sbcglobal.net>
> wrote:
>
>> On 09/28/2014 11:08 PM, NeilBrown wrote:
>>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
>>> wrote:
>>>
>>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
>>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Neil,
>>>>>>
>>>>>> I found something that looks similar to the problem described in
>>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
>>>>>>
>>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
>>>>>>
>>>>>> on this array:
>>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>>>>>>           104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>>>>>>           bitmap: 1/5 pages [4KB], 2048KB chunk
>>>>>>
>>>>>> I was running a test doing parallel kernel builds, read/write loops, and
>>>>>> disk add / remove / check loops,
>>>>>> on both this array and a raid1 array.
>>>>>>
>>>>>> I was trying to stress test your recent raid1 fixes, which went well,
>>>>>> but then after 5 days,
>>>>>> the raid5 array hung up with this in dmesg:
>>>>> I think this is different to the workqueue problem you mentioned, though as I
>>>>> don't know exactly what caused either I cannot be certain.
>>>>>
>>>>>     From the data you provided it looks like everything is waiting on
>>>>> get_active_stripe(), or on a process that is waiting on that.
>>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
>>>>>
>>>>> The md3_raid5 task is listed as blocked, but not stack trace is given.
>>>>> If the machine is still in the state, then
>>>>>
>>>>>     cat /proc/1698/stack
>>>>>
>>>>> might be useful.
>>>>> (echo t > /proc/sysrq-trigger is always a good idea)
>>>> Might this help? I believe the array was doing a "check" when things
>>>> hung up.
>>> It looks like it was trying to start doing a 'check'.
>>> The 'resync' thread hadn't been started yet.
>>> What is 'kthreadd' doing?
>>> My guess is that it is in try_to_free_pages() waiting for writeout
>>> for some xfs file page onto the md array ... which won't progress until
>>> the thread gets started.
>>>
>>> That would suggest that we need an async way to start threads...
>>>
>>> Thanks,
>>> NeilBrown
>>>
>> I suspect your guess is correct:
> Thanks for the confirmation.
>
> I'm thinking of something like that.  Very basic suggestion suggests it
> instantly crash.
>
> If you were to apply this patch and run your test for a week or two,  that
> would increase my confidence (though of course testing doesn't prove the
> absence of bugs....)
>
> Thanks,
> NeilBrown

Got it running. I'll let you know if anything interesting happens.

Thanks,
Bill
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index a79e51d15c2b..580d4b97696c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7770,6 +7770,33 @@ no_add:
>   	return spares;
>   }
>   
> +static void md_start_sync(struct work_struct *ws)
> +{
> +	struct mddev *mddev = container_of(ws, struct mddev, del_work);
> +
> +	mddev->sync_thread = md_register_thread(md_do_sync,
> +						mddev,
> +						"resync");
> +	if (!mddev->sync_thread) {
> +		printk(KERN_ERR "%s: could not start resync"
> +		       " thread...\n",
> +		       mdname(mddev));
> +		/* leave the spares where they are, it shouldn't hurt */
> +		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
> +		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
> +		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
> +		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
> +		clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
> +		if (test_and_clear_bit(MD_RECOVERY_RECOVER,
> +				       &mddev->recovery))
> +			if (mddev->sysfs_action)
> +				sysfs_notify_dirent_safe(mddev->sysfs_action);
> +	} else
> +		md_wakeup_thread(mddev->sync_thread);
> +	sysfs_notify_dirent_safe(mddev->sysfs_action);
> +	md_new_event(mddev);
> +}
> +
>   /*
>    * This routine is regularly called by all per-raid-array threads to
>    * deal with generic issues like resync and super-block update.
> @@ -7823,6 +7850,7 @@ void md_check_recovery(struct mddev *mddev)
>   
>   	if (mddev_trylock(mddev)) {
>   		int spares = 0;
> +		bool sync_starting = false;
>   
>   		if (mddev->ro) {
>   			/* On a read-only array we can:
> @@ -7921,28 +7949,14 @@ void md_check_recovery(struct mddev *mddev)
>   				 */
>   				bitmap_write_all(mddev->bitmap);
>   			}
> -			mddev->sync_thread = md_register_thread(md_do_sync,
> -								mddev,
> -								"resync");
> -			if (!mddev->sync_thread) {
> -				printk(KERN_ERR "%s: could not start resync"
> -					" thread...\n",
> -					mdname(mddev));
> -				/* leave the spares where they are, it shouldn't hurt */
> -				clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
> -				clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
> -				clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
> -				clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
> -				clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
> -			} else
> -				md_wakeup_thread(mddev->sync_thread);
> -			sysfs_notify_dirent_safe(mddev->sysfs_action);
> -			md_new_event(mddev);
> +			INIT_WORK(&mddev->del_work, md_start_sync);
> +			queue_work(md_misc_wq, &mddev->del_work);
> +			sync_starting = true;
>   		}
>   	unlock:
>   		wake_up(&mddev->sb_wait);
>   
> -		if (!mddev->sync_thread) {
> +		if (!mddev->sync_thread && !sync_starting) {
>   			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
>   			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
>   					       &mddev->recovery))
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-29 21:59         ` NeilBrown
  2014-09-30  4:19           ` BillStuff
@ 2014-09-30 21:21           ` BillStuff
  2014-09-30 22:54             ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: BillStuff @ 2014-09-30 21:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 09/29/2014 04:59 PM, NeilBrown wrote:
> On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@sbcglobal.net>
> wrote:
>
>> On 09/28/2014 11:08 PM, NeilBrown wrote:
>>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
>>> wrote:
>>>
>>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
>>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Neil,
>>>>>>
>>>>>> I found something that looks similar to the problem described in
>>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
>>>>>>
>>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
>>>>>>
>>>>>> on this array:
>>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>>>>>>           104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>>>>>>           bitmap: 1/5 pages [4KB], 2048KB chunk
>>>>>>
>>>>>> I was running a test doing parallel kernel builds, read/write loops, and
>>>>>> disk add / remove / check loops,
>>>>>> on both this array and a raid1 array.
>>>>>>
>>>>>> I was trying to stress test your recent raid1 fixes, which went well,
>>>>>> but then after 5 days,
>>>>>> the raid5 array hung up with this in dmesg:
>>>>> I think this is different to the workqueue problem you mentioned, though as I
>>>>> don't know exactly what caused either I cannot be certain.
>>>>>
>>>>>     From the data you provided it looks like everything is waiting on
>>>>> get_active_stripe(), or on a process that is waiting on that.
>>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
>>>>>
>>>>> The md3_raid5 task is listed as blocked, but not stack trace is given.
>>>>> If the machine is still in the state, then
>>>>>
>>>>>     cat /proc/1698/stack
>>>>>
>>>>> might be useful.
>>>>> (echo t > /proc/sysrq-trigger is always a good idea)
>>>> Might this help? I believe the array was doing a "check" when things
>>>> hung up.
>>> It looks like it was trying to start doing a 'check'.
>>> The 'resync' thread hadn't been started yet.
>>> What is 'kthreadd' doing?
>>> My guess is that it is in try_to_free_pages() waiting for writeout
>>> for some xfs file page onto the md array ... which won't progress until
>>> the thread gets started.
>>>
>>> That would suggest that we need an async way to start threads...
>>>
>>> Thanks,
>>> NeilBrown
>>>
>> I suspect your guess is correct:
> Thanks for the confirmation.
>
> I'm thinking of something like that.  Very basic suggestion suggests it
> instantly crash.
>
> If you were to apply this patch and run your test for a week or two,  that
> would increase my confidence (though of course testing doesn't prove the
> absence of bugs....)
>
> Thanks,
> NeilBrown
>

Neil,

It locked up already, but in a different way.
In this case it was triggered by a re-add in a fail / remove / add loop.

It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync 
before it got fully started.


INFO: task md3_raid5:1698 blocked for more than 120 seconds.
       Tainted: P           O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
  c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
  00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
Call Trace:
  [<c106d92d>] ? __enqueue_entity+0x6d/0x80
  [<c1072683>] ? enqueue_task_fair+0x2d3/0x660
  [<c153e893>] schedule+0x23/0x60
  [<c153dc25>] schedule_timeout+0x145/0x1c0
  [<c1069cb0>] ? default_wake_function+0x10/0x20
  [<c1065698>] ? update_rq_clock.part.92+0x18/0x50
  [<c1067a65>] ? check_preempt_curr+0x65/0x90
  [<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
  [<c153effb>] wait_for_common+0x9b/0x110
  [<c1069ca0>] ? wake_up_process+0x40/0x40
  [<c153f087>] wait_for_completion+0x17/0x20
  [<c105b241>] kthread_stop+0x41/0xb0
  [<c14540e1>] md_unregister_thread+0x31/0x40
  [<c145a799>] md_reap_sync_thread+0x19/0x140
  [<c145aba8>] md_check_recovery+0xe8/0x480
  [<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
  [<c104a022>] ? try_to_del_timer_sync+0x42/0x60
  [<c104a081>] ? del_timer_sync+0x41/0x50
  [<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
  [<c10497c0>] ? detach_if_pending+0xa0/0xa0
  [<c10797b4>] ? finish_wait+0x44/0x60
  [<c1454098>] md_thread+0xe8/0x100
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<c1453fb0>] ? md_start_sync+0xb0/0xb0
  [<c105ae21>] kthread+0xa1/0xc0
  [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
       Tainted: P           O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
  c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
  c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
Call Trace:
  [<c10795ff>] ? __wake_up+0x3f/0x50
  [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<c153e893>] schedule+0x23/0x60
  [<c153dc25>] schedule_timeout+0x145/0x1c0
  [<c12b1192>] ? blk_finish_plug+0x12/0x40
  [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
  [<c153effb>] wait_for_common+0x9b/0x110
  [<c1069ca0>] ? wake_up_process+0x40/0x40
  [<c153f087>] wait_for_completion+0x17/0x20
  [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
  [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
  [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
  [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
  [<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
  [<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
  [<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
  [<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
  [<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
  [<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
  [<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
  [<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
  [<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
  [<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
  [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
  [<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
  [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
  [<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
  [<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
  [<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
  [<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
  [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
  [<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
  [<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
  [<c105521c>] process_one_work+0x10c/0x340
  [<c1055d31>] worker_thread+0x101/0x330
  [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
  [<c105ae21>] kthread+0xa1/0xc0
  [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
       Tainted: P           O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
  c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
  c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
Call Trace:
  [<c10795ff>] ? __wake_up+0x3f/0x50
  [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
  [<c153e893>] schedule+0x23/0x60
  [<c153dc25>] schedule_timeout+0x145/0x1c0
  [<c12b1192>] ? blk_finish_plug+0x12/0x40
  [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
  [<c153effb>] wait_for_common+0x9b/0x110
  [<c1069ca0>] ? wake_up_process+0x40/0x40
  [<c153f087>] wait_for_completion+0x17/0x20
  [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
  [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
  [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
  [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
  [<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
  [<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
  [<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
  [<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
  [<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
  [<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
  [<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
  [<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
  [<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
  [<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
  [<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
  [<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
  [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
  [<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
  [<c106bf9b>] ? __update_cpu_load+0xab/0xd0
  [<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
  [<c12d4050>] ? timerqueue_add+0x50/0xb0
  [<c108d143>] ? ktime_get+0x53/0xe0
  [<c1094035>] ? clockevents_program_event+0x95/0x130
  [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
  [<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
  [<c1095699>] ? tick_program_event+0x29/0x30
  [<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
  [<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
  [<c1134c08>] SyS_ioctl+0x78/0x80
  [<c1540fa8>] syscall_call+0x7/0x7
INFO: task md3_resync:24817 blocked for more than 120 seconds.
       Tainted: P           O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
  c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
  00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
Call Trace:
  [<c12adbc5>] ? queue_unplugged+0x45/0x90
  [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
  [<c153e893>] schedule+0x23/0x60
  [<c145782d>] md_do_sync+0x8dd/0x1010
  [<c1079990>] ? __wake_up_sync+0x20/0x20
  [<c1453fb0>] ? md_start_sync+0xb0/0xb0
  [<c1454098>] md_thread+0xe8/0x100
  [<c107945f>] ? __wake_up_locked+0x1f/0x30
  [<c1453fb0>] ? md_start_sync+0xb0/0xb0
  [<c105ae21>] kthread+0xa1/0xc0
  [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
  [<c105ad80>] ? kthread_create_on_node+0x110/0x110

Again, the blocked task info is edited for space; hopefully these are 
the important ones,
but I've got more if it helps.

Thanks,
Bill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-30 21:21           ` BillStuff
@ 2014-09-30 22:54             ` NeilBrown
  2014-10-05 16:05               ` BillStuff
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-09-30 22:54 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 13903 bytes --]

On Tue, 30 Sep 2014 16:21:32 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> On 09/29/2014 04:59 PM, NeilBrown wrote:
> > On Sun, 28 Sep 2014 23:28:17 -0500 BillStuff <billstuff2001@sbcglobal.net>
> > wrote:
> >
> >> On 09/28/2014 11:08 PM, NeilBrown wrote:
> >>> On Sun, 28 Sep 2014 22:56:19 -0500 BillStuff <billstuff2001@sbcglobal.net>
> >>> wrote:
> >>>
> >>>> On 09/28/2014 09:25 PM, NeilBrown wrote:
> >>>>> On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff <billstuff2001@sbcglobal.net>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Neil,
> >>>>>>
> >>>>>> I found something that looks similar to the problem described in
> >>>>>> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
> >>>>>>
> >>>>>> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
> >>>>>>
> >>>>>> on this array:
> >>>>>> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
> >>>>>>           104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
> >>>>>>           bitmap: 1/5 pages [4KB], 2048KB chunk
> >>>>>>
> >>>>>> I was running a test doing parallel kernel builds, read/write loops, and
> >>>>>> disk add / remove / check loops,
> >>>>>> on both this array and a raid1 array.
> >>>>>>
> >>>>>> I was trying to stress test your recent raid1 fixes, which went well,
> >>>>>> but then after 5 days,
> >>>>>> the raid5 array hung up with this in dmesg:
> >>>>> I think this is different to the workqueue problem you mentioned, though as I
> >>>>> don't know exactly what caused either I cannot be certain.
> >>>>>
> >>>>>     From the data you provided it looks like everything is waiting on
> >>>>> get_active_stripe(), or on a process that is waiting on that.
> >>>>> That seems pretty common whenever anything goes wrong in raid5 :-(
> >>>>>
> >>>>> The md3_raid5 task is listed as blocked, but not stack trace is given.
> >>>>> If the machine is still in the state, then
> >>>>>
> >>>>>     cat /proc/1698/stack
> >>>>>
> >>>>> might be useful.
> >>>>> (echo t > /proc/sysrq-trigger is always a good idea)
> >>>> Might this help? I believe the array was doing a "check" when things
> >>>> hung up.
> >>> It looks like it was trying to start doing a 'check'.
> >>> The 'resync' thread hadn't been started yet.
> >>> What is 'kthreadd' doing?
> >>> My guess is that it is in try_to_free_pages() waiting for writeout
> >>> for some xfs file page onto the md array ... which won't progress until
> >>> the thread gets started.
> >>>
> >>> That would suggest that we need an async way to start threads...
> >>>
> >>> Thanks,
> >>> NeilBrown
> >>>
> >> I suspect your guess is correct:
> > Thanks for the confirmation.
> >
> > I'm thinking of something like that.  Very basic suggestion suggests it
> > instantly crash.
> >
> > If you were to apply this patch and run your test for a week or two,  that
> > would increase my confidence (though of course testing doesn't prove the
> > absence of bugs....)
> >
> > Thanks,
> > NeilBrown
> >
> 
> Neil,
> 
> It locked up already, but in a different way.
> In this case it was triggered by a re-add in a fail / remove / add loop.
> 
> It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync 
> before it got fully started.
> 
> 
> INFO: task md3_raid5:1698 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
>   c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
>   00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
> Call Trace:
>   [<c106d92d>] ? __enqueue_entity+0x6d/0x80
>   [<c1072683>] ? enqueue_task_fair+0x2d3/0x660
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c1069cb0>] ? default_wake_function+0x10/0x20
>   [<c1065698>] ? update_rq_clock.part.92+0x18/0x50
>   [<c1067a65>] ? check_preempt_curr+0x65/0x90
>   [<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<c105b241>] kthread_stop+0x41/0xb0
>   [<c14540e1>] md_unregister_thread+0x31/0x40
>   [<c145a799>] md_reap_sync_thread+0x19/0x140
>   [<c145aba8>] md_check_recovery+0xe8/0x480
>   [<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
>   [<c104a022>] ? try_to_del_timer_sync+0x42/0x60
>   [<c104a081>] ? del_timer_sync+0x41/0x50
>   [<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
>   [<c10497c0>] ? detach_if_pending+0xa0/0xa0
>   [<c10797b4>] ? finish_wait+0x44/0x60
>   [<c1454098>] md_thread+0xe8/0x100
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
>   c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
>   c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
> Call Trace:
>   [<c10795ff>] ? __wake_up+0x3f/0x50
>   [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c12b1192>] ? blk_finish_plug+0x12/0x40
>   [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
>   [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
>   [<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
>   [<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
>   [<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
>   [<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
>   [<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
>   [<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
>   [<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
>   [<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
>   [<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
>   [<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
>   [<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
>   [<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
>   [<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
>   [<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
>   [<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
>   [<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
>   [<c105521c>] process_one_work+0x10c/0x340
>   [<c1055d31>] worker_thread+0x101/0x330
>   [<c1055c30>] ? manage_workers.isra.27+0x250/0x250
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
>   c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
>   c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
> Call Trace:
>   [<c10795ff>] ? __wake_up+0x3f/0x50
>   [<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
>   [<c153e893>] schedule+0x23/0x60
>   [<c153dc25>] schedule_timeout+0x145/0x1c0
>   [<c12b1192>] ? blk_finish_plug+0x12/0x40
>   [<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
>   [<c153effb>] wait_for_common+0x9b/0x110
>   [<c1069ca0>] ? wake_up_process+0x40/0x40
>   [<c153f087>] wait_for_completion+0x17/0x20
>   [<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
>   [<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
>   [<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
>   [<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
>   [<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
>   [<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
>   [<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
>   [<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
>   [<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
>   [<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
>   [<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
>   [<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
>   [<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
>   [<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
>   [<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
>   [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
>   [<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
>   [<c106bf9b>] ? __update_cpu_load+0xab/0xd0
>   [<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
>   [<c12d4050>] ? timerqueue_add+0x50/0xb0
>   [<c108d143>] ? ktime_get+0x53/0xe0
>   [<c1094035>] ? clockevents_program_event+0x95/0x130
>   [<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
>   [<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
>   [<c1095699>] ? tick_program_event+0x29/0x30
>   [<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
>   [<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
>   [<c1134c08>] SyS_ioctl+0x78/0x80
>   [<c1540fa8>] syscall_call+0x7/0x7
> INFO: task md3_resync:24817 blocked for more than 120 seconds.
>        Tainted: P           O 3.14.19fe-dirty #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
>   c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
>   00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
> Call Trace:
>   [<c12adbc5>] ? queue_unplugged+0x45/0x90
>   [<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
>   [<c153e893>] schedule+0x23/0x60
>   [<c145782d>] md_do_sync+0x8dd/0x1010
>   [<c1079990>] ? __wake_up_sync+0x20/0x20
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c1454098>] md_thread+0xe8/0x100
>   [<c107945f>] ? __wake_up_locked+0x1f/0x30
>   [<c1453fb0>] ? md_start_sync+0xb0/0xb0
>   [<c105ae21>] kthread+0xa1/0xc0
>   [<c15418b7>] ret_from_kernel_thread+0x1b/0x28
>   [<c105ad80>] ? kthread_create_on_node+0x110/0x110
> 
> Again, the blocked task info is edited for space; hopefully these are 
> the important ones,
> but I've got more if it helps.

Thanks for the testing!  You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed do the right thing.  It doesn't.

If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.

This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 211a615ae415..3df888b80e76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7850,7 +7850,6 @@ void md_check_recovery(struct mddev *mddev)
 
 	if (mddev_trylock(mddev)) {
 		int spares = 0;
-		bool sync_starting = false;
 
 		if (mddev->ro) {
 			/* On a read-only array we can:
@@ -7914,7 +7913,7 @@ void md_check_recovery(struct mddev *mddev)
 
 		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
-			goto unlock;
+			goto not_running;
 		/* no recovery is running.
 		 * remove any failed drives, then
 		 * add spares if possible.
@@ -7926,7 +7925,7 @@ void md_check_recovery(struct mddev *mddev)
 			if (mddev->pers->check_reshape == NULL ||
 			    mddev->pers->check_reshape(mddev) != 0)
 				/* Cannot proceed */
-				goto unlock;
+				goto not_running;
 			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if ((spares = remove_and_add_spares(mddev, NULL))) {
@@ -7939,7 +7938,7 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 		} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 			/* nothing to be done ... */
-			goto unlock;
+			goto not_running;
 
 		if (mddev->pers->sync_request) {
 			if (spares) {
@@ -7951,18 +7950,18 @@ void md_check_recovery(struct mddev *mddev)
 			}
 			INIT_WORK(&mddev->del_work, md_start_sync);
 			queue_work(md_misc_wq, &mddev->del_work);
-			sync_starting = true;
+			goto unlock;
 		}
-	unlock:
-		wake_up(&mddev->sb_wait);
-
-		if (!mddev->sync_thread && !sync_starting) {
+	not_running:
+		if (!mddev->sync_thread) {
 			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
 					       &mddev->recovery))
 				if (mddev->sysfs_action)
 					sysfs_notify_dirent_safe(mddev->sysfs_action);
 		}
+	unlock:
+		wake_up(&mddev->sb_wait);
 		mddev_unlock(mddev);
 	}
 }

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-09-30 22:54             ` NeilBrown
@ 2014-10-05 16:05               ` BillStuff
  2014-10-14  1:42                 ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: BillStuff @ 2014-10-05 16:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 09/30/2014 05:54 PM, NeilBrown wrote:
> [removed alot of stuff about raid5 check hanging]
> Thanks for the testing!  You have included enough information.
> I didn't really like that 'sync_starting' variable when I wrote the patch,
> but it seemed do the right thing.  It doesn't.
>
> If md_check_recovery() runs again immediately after scheduling the sync
> thread to run, it will not have set sync_starting but will find ->sync_thread
> is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
> flag is still clear and ->sync_thread is not NULL so it will try to stop the
> thread, which deadlocks.
>
> This patch on top of what you have should fix it... but I might end up
> redoing the logic a bit.
>
Neil,

With the second patch, my test has been running well for close to 5 days 
now,
but something odd happened yesterday.

It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect 
it happened silently on my
raid5 test also.

md1 : active raid1 sdh3[0] sdg3[1]
       76959296 blocks [2/2] [UU]
       [=>...................]  check =  9.7% (7532800/76959296) 
finish=108.8min speed=10628K/sec
       bitmap: 0/1 pages [0KB], 65536KB chunk

Again this is running kernel builds, read / write loops, and remove/ 
(re)add / check loops.
At some point while removing a disk from the array, something bad 
happened, the errors below
appeared in the log, and the ext3 filesystem remounted readonly. xfs 
plowed right on through.
There's no evidence of any read or write errors to the member disks, 
either in logs or on the
disks themselves, and the raid1 checks came back with zero mismatches. 
fsck for the ext3 fs
complained a lot but the filesystem came back into a useable state and 
I've restarted my tests.
Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around 
12:27:04. This was
immediately after (or during?) removing sdh3 from md1.

Oct  4 12:26:01 xplane kernel: md: unbind<sdg3>
Oct  4 12:26:01 xplane kernel: md: export_rdev(sdg3)
Oct  4 12:26:11 xplane kernel: md: bind<sdg3>
Oct  4 12:26:11 xplane kernel: RAID1 conf printout:
Oct  4 12:26:11 xplane kernel:  --- wd:1 rd:2
Oct  4 12:26:11 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:26:11 xplane kernel:  disk 1, wo:1, o:1, dev:sdg3
Oct  4 12:26:12 xplane kernel: md: recovery of RAID array md1
Oct  4 12:26:12 xplane kernel: md: minimum _guaranteed_  speed: 10000 
KB/sec/disk.
Oct  4 12:26:12 xplane kernel: md: using maximum available idle IO 
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  4 12:26:12 xplane kernel: md: using 128k window, over a total of 
76959296k.
Oct  4 12:27:03 xplane kernel: md: md1: recovery done.
Oct  4 12:27:03 xplane kernel: RAID1 conf printout:
Oct  4 12:27:03 xplane kernel:  --- wd:2 rd:2
Oct  4 12:27:03 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:27:03 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3, 
disabling device.
Oct  4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1 
devices.
Oct  4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331468
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331469
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331470
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331471
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331472
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331473
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331474
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331475
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331476
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
block 6331477
Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct  4 12:27:05 xplane kernel: Aborting journal on device dm-1.
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: 
ext3_journal_start_sb: Detected aborted journal
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting 
filesystem read-only
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: 
ext3_journal_start_sb: Detected aborted journal
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in 
ext3_writeback_write_end: IO failure
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
Journal has aborted
Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in 
ext3_dirty_inode: IO failure
Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:05 xplane kernel:  disk 0, wo:1, o:0, dev:sdh3
Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:14 xplane kernel: md: unbind<sdh3>
Oct  4 12:27:14 xplane kernel: md: export_rdev(sdh3)
Oct  4 12:27:24 xplane kernel: md: bind<sdh3>
Oct  4 12:27:24 xplane kernel: RAID1 conf printout:
Oct  4 12:27:24 xplane kernel:  --- wd:1 rd:2
Oct  4 12:27:24 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
Oct  4 12:27:24 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:24 xplane kernel: md: recovery of RAID array md1
Oct  4 12:27:24 xplane kernel: md: minimum _guaranteed_  speed: 10000 
KB/sec/disk.
Oct  4 12:27:24 xplane kernel: md: using maximum available idle IO 
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  4 12:27:24 xplane kernel: md: using 128k window, over a total of 
76959296k.
Oct  4 12:27:33 xplane kernel: md: md1: recovery done.
Oct  4 12:27:33 xplane kernel: RAID1 conf printout:
Oct  4 12:27:33 xplane kernel:  --- wd:2 rd:2
Oct  4 12:27:33 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  4 12:27:33 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  4 12:27:35 xplane kernel: md: data-check of RAID array md1
Oct  4 12:27:35 xplane kernel: md: minimum _guaranteed_  speed: 10000 
KB/sec/disk.
Oct  4 12:27:35 xplane kernel: md: using maximum available idle IO 
bandwidth (but not more than 200000 KB/sec) for data-check.
Oct  4 12:27:35 xplane kernel: md: using 128k window, over a total of 
76959296k.
Oct  4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing 
b_committed_data

It seems like something got out of sync as the disk was being removed, 
but before the remove completed.

Again, this is 3.14.19 with these 8 patches:
       md/raid1: intialise start_next_window for READ case to avoid hang
       md/raid1:  be more cautious where we read-balance during resync.
       md/raid1: clean up request counts properly in close_sync()
       md/raid1: make sure resync waits for conflicting writes to complete.
       md/raid1: Don't use next_resync to determine how far resync has 
progressed
       md/raid1: update next_resync under resync_lock.
       md/raid1: count resync requests in nr_pending.
       md/raid1: fix_read_error should act on all non-faulty devices.

       and the two patches for the check start hang.

Any ideas on what happened here?

Thanks,
Bill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-10-05 16:05               ` BillStuff
@ 2014-10-14  1:42                 ` NeilBrown
  2014-10-14 14:56                   ` BillStuff
  2014-10-14 16:55                   ` BillStuff
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2014-10-14  1:42 UTC (permalink / raw)
  To: BillStuff; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 10266 bytes --]

On Sun, 05 Oct 2014 11:05:08 -0500 BillStuff <billstuff2001@sbcglobal.net>
wrote:

> On 09/30/2014 05:54 PM, NeilBrown wrote:
> > [removed alot of stuff about raid5 check hanging]
> > Thanks for the testing!  You have included enough information.
> > I didn't really like that 'sync_starting' variable when I wrote the patch,
> > but it seemed do the right thing.  It doesn't.
> >
> > If md_check_recovery() runs again immediately after scheduling the sync
> > thread to run, it will not have set sync_starting but will find ->sync_thread
> > is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
> > flag is still clear and ->sync_thread is not NULL so it will try to stop the
> > thread, which deadlocks.
> >
> > This patch on top of what you have should fix it... but I might end up
> > redoing the logic a bit.
> >
> Neil,
> 
> With the second patch, my test has been running well for close to 5 days 
> now,

That is good news - thanks.

> but something odd happened yesterday.
> 
> It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect 
> it happened silently on my
> raid5 test also.
> 
> md1 : active raid1 sdh3[0] sdg3[1]
>        76959296 blocks [2/2] [UU]
>        [=>...................]  check =  9.7% (7532800/76959296) 
> finish=108.8min speed=10628K/sec
>        bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> 
> 
> Again this is running kernel builds, read / write loops, and remove/ 
> (re)add / check loops.
> At some point while removing a disk from the array, something bad 
> happened, the errors below
> appeared in the log, and the ext3 filesystem remounted readonly. xfs 
> plowed right on through.
> There's no evidence of any read or write errors to the member disks, 
> either in logs or on the
> disks themselves, and the raid1 checks came back with zero mismatches. 
> fsck for the ext3 fs
> complained a lot but the filesystem came back into a useable state and 
> I've restarted my tests.
> Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around 
> 12:27:04. This was
> immediately after (or during?) removing sdh3 from md1.
> 
> Oct  4 12:26:01 xplane kernel: md: unbind<sdg3>
> Oct  4 12:26:01 xplane kernel: md: export_rdev(sdg3)
> Oct  4 12:26:11 xplane kernel: md: bind<sdg3>

Removed and re-added sdg3


> Oct  4 12:26:11 xplane kernel: RAID1 conf printout:
> Oct  4 12:26:11 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:26:11 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:26:11 xplane kernel:  disk 1, wo:1, o:1, dev:sdg3

->add_disk included sdg3 in the array not in-sync.

> Oct  4 12:26:12 xplane kernel: md: recovery of RAID array md1
> Oct  4 12:26:12 xplane kernel: md: minimum _guaranteed_  speed: 10000 
> KB/sec/disk.
> Oct  4 12:26:12 xplane kernel: md: using maximum available idle IO 
> bandwidth (but not more than 200000 KB/sec) for recovery.
> Oct  4 12:26:12 xplane kernel: md: using 128k window, over a total of 
> 76959296k.

Recovery starts


> Oct  4 12:27:03 xplane kernel: md: md1: recovery done.

Recovery finishes after 51 seconds.

> Oct  4 12:27:03 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:03 xplane kernel:  --- wd:2 rd:2
> Oct  4 12:27:03 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:27:03 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

->spare_active() marks sdg3 at fully in-sync (not long "wo" - write only).

> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3, 
> disabling device.

sdh3 fails a second later.  Presumably this was requested by your testing
harness.


> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1 
> devices.

raid.c:error() confirms the device will not be ignored.


> Oct  4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331468
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3

Write errors start happening.

You should only get a write error if no writes successfully completed to
in_sync, non-faulty devices.
It is possible that the write to sdg3 completed before it was marked in-sync,
and the write to sdh3 completed after it was marked as faulty.
How long after recovery completes do you fail the next device?
The logs suggest it is the next second, which could be anywhere from 1msec
to 1998 msecs.


NeilBrown



> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331469
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331470
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331471
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331472
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331473
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331474
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331475
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331476
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical 
> block 6331477
> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Oct  4 12:27:05 xplane kernel: Aborting journal on device dm-1.
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: 
> ext3_journal_start_sb: Detected aborted journal
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting 
> filesystem read-only
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: 
> ext3_journal_start_sb: Detected aborted journal
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in 
> ext3_writeback_write_end: IO failure
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode: 
> Journal has aborted
> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in 
> ext3_dirty_inode: IO failure
> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:05 xplane kernel:  disk 0, wo:1, o:0, dev:sdh3
> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

raid1_remove_disk is called to disconnect the device from the array.


> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

which it has now done.

> Oct  4 12:27:14 xplane kernel: md: unbind<sdh3>
> Oct  4 12:27:14 xplane kernel: md: export_rdev(sdh3)
> Oct  4 12:27:24 xplane kernel: md: bind<sdh3>
> Oct  4 12:27:24 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:24 xplane kernel:  --- wd:1 rd:2
> Oct  4 12:27:24 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
> Oct  4 12:27:24 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> Oct  4 12:27:24 xplane kernel: md: recovery of RAID array md1
> Oct  4 12:27:24 xplane kernel: md: minimum _guaranteed_  speed: 10000 
> KB/sec/disk.
> Oct  4 12:27:24 xplane kernel: md: using maximum available idle IO 
> bandwidth (but not more than 200000 KB/sec) for recovery.
> Oct  4 12:27:24 xplane kernel: md: using 128k window, over a total of 
> 76959296k.
> Oct  4 12:27:33 xplane kernel: md: md1: recovery done.
> Oct  4 12:27:33 xplane kernel: RAID1 conf printout:
> Oct  4 12:27:33 xplane kernel:  --- wd:2 rd:2
> Oct  4 12:27:33 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
> Oct  4 12:27:33 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> Oct  4 12:27:35 xplane kernel: md: data-check of RAID array md1
> Oct  4 12:27:35 xplane kernel: md: minimum _guaranteed_  speed: 10000 
> KB/sec/disk.
> Oct  4 12:27:35 xplane kernel: md: using maximum available idle IO 
> bandwidth (but not more than 200000 KB/sec) for data-check.
> Oct  4 12:27:35 xplane kernel: md: using 128k window, over a total of 
> 76959296k.
> Oct  4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing 
> b_committed_data
> 
> It seems like something got out of sync as the disk was being removed, 
> but before the remove completed.
> 
> Again, this is 3.14.19 with these 8 patches:
>        md/raid1: intialise start_next_window for READ case to avoid hang
>        md/raid1:  be more cautious where we read-balance during resync.
>        md/raid1: clean up request counts properly in close_sync()
>        md/raid1: make sure resync waits for conflicting writes to complete.
>        md/raid1: Don't use next_resync to determine how far resync has 
> progressed
>        md/raid1: update next_resync under resync_lock.
>        md/raid1: count resync requests in nr_pending.
>        md/raid1: fix_read_error should act on all non-faulty devices.
> 
>        and the two patches for the check start hang.
> 
> Any ideas on what happened here?
> 
> Thanks,
> Bill
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-10-14  1:42                 ` NeilBrown
@ 2014-10-14 14:56                   ` BillStuff
  2014-10-14 16:55                   ` BillStuff
  1 sibling, 0 replies; 14+ messages in thread
From: BillStuff @ 2014-10-14 14:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 10/13/2014 08:42 PM, NeilBrown wrote:
> On Sun, 05 Oct 2014 11:05:08 -0500 BillStuff <billstuff2001@sbcglobal.net>
> wrote:
>
>> On 09/30/2014 05:54 PM, NeilBrown wrote:
>>> [removed alot of stuff about raid5 check hanging]
>>> Thanks for the testing!  You have included enough information.
>>> I didn't really like that 'sync_starting' variable when I wrote the patch,
>>> but it seemed do the right thing.  It doesn't.
>>>
>>> If md_check_recovery() runs again immediately after scheduling the sync
>>> thread to run, it will not have set sync_starting but will find ->sync_thread
>>> is NULL and so will clear MD_RECOVERY_RUNNING.  The next time it runs, that
>>> flag is still clear and ->sync_thread is not NULL so it will try to stop the
>>> thread, which deadlocks.
>>>
>>> This patch on top of what you have should fix it... but I might end up
>>> redoing the logic a bit.
>>>
>> Neil,
>>
>> With the second patch, my test has been running well for close to 5 days
>> now,
> That is good news - thanks.

Now two weeks in, this seems to be working ok with the exception of this 
write error,
and another "write didn't make it to both disks during recovery" hiccup 
I'll describe below.

>> but something odd happened yesterday.
>>
>> It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect
>> it happened silently on my
>> raid5 test also.
>>
>> md1 : active raid1 sdh3[0] sdg3[1]
>>         76959296 blocks [2/2] [UU]
>>         [=>...................]  check =  9.7% (7532800/76959296)
>> finish=108.8min speed=10628K/sec
>>         bitmap: 0/1 pages [0KB], 65536KB chunk
>>
>>
>>
>> Again this is running kernel builds, read / write loops, and remove/
>> (re)add / check loops.
>> At some point while removing a disk from the array, something bad
>> happened, the errors below
>> appeared in the log, and the ext3 filesystem remounted readonly. xfs
>> plowed right on through.
>> There's no evidence of any read or write errors to the member disks,
>> either in logs or on the
>> disks themselves, and the raid1 checks came back with zero mismatches.
>> fsck for the ext3 fs
>> complained a lot but the filesystem came back into a useable state and
>> I've restarted my tests.
>> Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around
>> 12:27:04. This was
>> immediately after (or during?) removing sdh3 from md1.
>>
>> Oct  4 12:26:01 xplane kernel: md: unbind<sdg3>
>> Oct  4 12:26:01 xplane kernel: md: export_rdev(sdg3)
>> Oct  4 12:26:11 xplane kernel: md: bind<sdg3>
> Removed and re-added sdg3
>
>
>> Oct  4 12:26:11 xplane kernel: RAID1 conf printout:
>> Oct  4 12:26:11 xplane kernel:  --- wd:1 rd:2
>> Oct  4 12:26:11 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
>> Oct  4 12:26:11 xplane kernel:  disk 1, wo:1, o:1, dev:sdg3
> ->add_disk included sdg3 in the array not in-sync.
>
>> Oct  4 12:26:12 xplane kernel: md: recovery of RAID array md1
>> Oct  4 12:26:12 xplane kernel: md: minimum _guaranteed_  speed: 10000
>> KB/sec/disk.
>> Oct  4 12:26:12 xplane kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for recovery.
>> Oct  4 12:26:12 xplane kernel: md: using 128k window, over a total of
>> 76959296k.
> Recovery starts
>
>
>> Oct  4 12:27:03 xplane kernel: md: md1: recovery done.
> Recovery finishes after 51 seconds.
>
>> Oct  4 12:27:03 xplane kernel: RAID1 conf printout:
>> Oct  4 12:27:03 xplane kernel:  --- wd:2 rd:2
>> Oct  4 12:27:03 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
>> Oct  4 12:27:03 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> ->spare_active() marks sdg3 at fully in-sync (not long "wo" - write only).
>
>> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3,
>> disabling device.
> sdh3 fails a second later.  Presumably this was requested by your testing
> harness.
>
>
>> Oct  4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1
>> devices.
> raid.c:error() confirms the device will not be ignored.
>
>
>> Oct  4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331468
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
> Write errors start happening.
>
> You should only get a write error if no writes successfully completed to
> in_sync, non-faulty devices.
> It is possible that the write to sdg3 completed before it was marked in-sync,
> and the write to sdh3 completed after it was marked as faulty.
> How long after recovery completes do you fail the next device?
> The logs suggest it is the next second, which could be anywhere from 1msec
> to 1998 msecs.
>
>
> NeilBrown

The script sleeps for 1 second and then fails / removes / adds the
opposite disk. After doing all disks in the array it does a "check",
then starts over.  I'm sure there's some script housekeeping time in there,
but it should be right around 1 second.

The other hiccup that happened may be related; hopefully at least 
informative.

Half of my test is the fail / remove / add loop. The other half continually
writes files and saves a checksum. Later, when the file is deleted, it 
re-checks
the sum. A few days ago this second check failed:

sdg3 in sync, sdh3 (re)added at 10:48:48

Oct  8 10:48:48 xplane kernel: md: bind<sdh3>
Oct  8 10:48:49 xplane kernel: RAID1 conf printout:
Oct  8 10:48:49 xplane kernel:  --- wd:1 rd:2
Oct  8 10:48:49 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
Oct  8 10:48:49 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
Oct  8 10:48:49 xplane kernel: md: recovery of RAID array md1
Oct  8 10:48:49 xplane kernel: md: minimum _guaranteed_  speed: 10000 
KB/sec/disk.
Oct  8 10:48:49 xplane kernel: md: using maximum available idle IO 
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  8 10:48:49 xplane kernel: md: using 128k window, over a total of 
76959296k.

testfile /mnt/lvtest/tmp/tmpfile02518.0xd6.tmp written at 10:51:47

Oct  8 10:51:54 xplane kernel: md: md1: recovery done.
Oct  8 10:51:54 xplane kernel: RAID1 conf printout:
Oct  8 10:51:54 xplane kernel:  --- wd:2 rd:2
Oct  8 10:51:54 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
Oct  8 10:51:54 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3

md5sum for /mnt/lvtest/tmp/tmpfile02518.0xd6.tmp different at 11:30 
during QC check

Nothing bad showed up in the logs, for disks, filesystems or the array.

 From looking at the file, it appears a chunk of it didn't get written 
to /dev/sdh3.
When the second QC was run, the read was from /dev/sdh3 and the sum was 
different.

I did a "check" and I got mismatch_cnt = 256; looked at the differences 
between
/dev/sdg3 and /dev/sdh3 and they're around 50GB into the 73GB (I was 
expecting to find
them near the tail end, although this might be less meaningful with the 
bitmap.)

I'm now trying to reproduce these two failures, now on 3.14.21 with the 
2 resync start
patches, without LVM in the middle.

Thanks always for the help.
-Bill



>
>
>
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331469
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331470
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331471
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331472
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331473
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331474
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331475
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331476
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
>> block 6331477
>> Oct  4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
>> Oct  4 12:27:05 xplane kernel: Aborting journal on device dm-1.
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error:
>> ext3_journal_start_sb: Detected aborted journal
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting
>> filesystem read-only
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error:
>> ext3_journal_start_sb: Detected aborted journal
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
>> ext3_writeback_write_end: IO failure
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
>> Journal has aborted
>> Oct  4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
>> ext3_dirty_inode: IO failure
>> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
>> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
>> Oct  4 12:27:05 xplane kernel:  disk 0, wo:1, o:0, dev:sdh3
>> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> raid1_remove_disk is called to disconnect the device from the array.
>
>
>> Oct  4 12:27:05 xplane kernel: RAID1 conf printout:
>> Oct  4 12:27:05 xplane kernel:  --- wd:1 rd:2
>> Oct  4 12:27:05 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
> which it has now done.
>
>> Oct  4 12:27:14 xplane kernel: md: unbind<sdh3>
>> Oct  4 12:27:14 xplane kernel: md: export_rdev(sdh3)
>> Oct  4 12:27:24 xplane kernel: md: bind<sdh3>
>> Oct  4 12:27:24 xplane kernel: RAID1 conf printout:
>> Oct  4 12:27:24 xplane kernel:  --- wd:1 rd:2
>> Oct  4 12:27:24 xplane kernel:  disk 0, wo:1, o:1, dev:sdh3
>> Oct  4 12:27:24 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
>> Oct  4 12:27:24 xplane kernel: md: recovery of RAID array md1
>> Oct  4 12:27:24 xplane kernel: md: minimum _guaranteed_  speed: 10000
>> KB/sec/disk.
>> Oct  4 12:27:24 xplane kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for recovery.
>> Oct  4 12:27:24 xplane kernel: md: using 128k window, over a total of
>> 76959296k.
>> Oct  4 12:27:33 xplane kernel: md: md1: recovery done.
>> Oct  4 12:27:33 xplane kernel: RAID1 conf printout:
>> Oct  4 12:27:33 xplane kernel:  --- wd:2 rd:2
>> Oct  4 12:27:33 xplane kernel:  disk 0, wo:0, o:1, dev:sdh3
>> Oct  4 12:27:33 xplane kernel:  disk 1, wo:0, o:1, dev:sdg3
>> Oct  4 12:27:35 xplane kernel: md: data-check of RAID array md1
>> Oct  4 12:27:35 xplane kernel: md: minimum _guaranteed_  speed: 10000
>> KB/sec/disk.
>> Oct  4 12:27:35 xplane kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for data-check.
>> Oct  4 12:27:35 xplane kernel: md: using 128k window, over a total of
>> 76959296k.
>> Oct  4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing
>> b_committed_data
>>
>> It seems like something got out of sync as the disk was being removed,
>> but before the remove completed.
>>
>> Again, this is 3.14.19 with these 8 patches:
>>         md/raid1: intialise start_next_window for READ case to avoid hang
>>         md/raid1:  be more cautious where we read-balance during resync.
>>         md/raid1: clean up request counts properly in close_sync()
>>         md/raid1: make sure resync waits for conflicting writes to complete.
>>         md/raid1: Don't use next_resync to determine how far resync has
>> progressed
>>         md/raid1: update next_resync under resync_lock.
>>         md/raid1: count resync requests in nr_pending.
>>         md/raid1: fix_read_error should act on all non-faulty devices.
>>
>>         and the two patches for the check start hang.
>>
>> Any ideas on what happened here?
>>
>> Thanks,
>> Bill
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Raid5 hang in 3.14.19
  2014-10-14  1:42                 ` NeilBrown
  2014-10-14 14:56                   ` BillStuff
@ 2014-10-14 16:55                   ` BillStuff
  1 sibling, 0 replies; 14+ messages in thread
From: BillStuff @ 2014-10-14 16:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[snip]
On 10/13/2014 08:42 PM, NeilBrown wrote:
> Write errors start happening.
>
> You should only get a write error if no writes successfully completed to
> in_sync, non-faulty devices.
> It is possible that the write to sdg3 completed before it was marked in-sync,
> and the write to sdh3 completed after it was marked as faulty.
> How long after recovery completes do you fail the next device?
> The logs suggest it is the next second, which could be anywhere from 1msec
> to 1998 msecs.
>
>
> NeilBrown
>

FYI Neil: Running through my logs, I noticed 6 of these failures in my 
testing
over the past few days. All "recovery completed, wait a second, fail the 
other
drive" cases. Same signature in the logs. Apparently things kept working 
until
the filesystem tripped on its journal and fell over.

So the upshot is this seems reasonably reproducible.

-Bill

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-10-14 16:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-26 22:33 Raid5 hang in 3.14.19 BillStuff
2014-09-29  2:25 ` NeilBrown
2014-09-29  3:56   ` BillStuff
2014-09-29  4:08     ` NeilBrown
2014-09-29  4:28       ` BillStuff
2014-09-29  4:43         ` NeilBrown
2014-09-29 21:59         ` NeilBrown
2014-09-30  4:19           ` BillStuff
2014-09-30 21:21           ` BillStuff
2014-09-30 22:54             ` NeilBrown
2014-10-05 16:05               ` BillStuff
2014-10-14  1:42                 ` NeilBrown
2014-10-14 14:56                   ` BillStuff
2014-10-14 16:55                   ` BillStuff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).