From: NeilBrown
Subject: Re: Raid5 hang in 3.14.19
Date: Mon, 29 Sep 2014 12:25:33 +1000
Message-ID: <20140929122533.3b91a543@notabene.brown>
In-Reply-To: <5425E9D6.1050102@sbcglobal.net>
References: <5425E9D6.1050102@sbcglobal.net>
To: BillStuff
Cc: linux-raid

On Fri, 26 Sep 2014 17:33:58 -0500 BillStuff wrote:

> Hi Neil,
>
> I found something that looks similar to the problem described in
> "Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
>
> It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs,
> on this array:
>
> md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
>       104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
>       bitmap: 1/5 pages [4KB], 2048KB chunk
>
> I was running a test doing parallel kernel builds, read/write loops,
> and disk add / remove / check loops, on both this array and a raid1
> array.
>
> I was trying to stress test your recent raid1 fixes, which went well,
> but then after 5 days the raid5 array hung up with this in dmesg:

I think this is different to the workqueue problem you mentioned, though
as I don't know exactly what caused either, I cannot be certain.

From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that. That
seems pretty common whenever anything goes wrong in raid5 :-(

The md3_raid5 task is listed as blocked, but no stack trace is given
for it. If the machine is still in that state, then "cat
/proc/1698/stack" might be useful. (echo t > /proc/sysrq-trigger is
always a good idea.)
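In case it helps, a minimal capture sequence might look like the
following (assuming your kernel exposes /proc/<pid>/stack, i.e. it was
built with CONFIG_STACKTRACE; 1698 is the md3_raid5 pid from your sysrq
output, and the output file name is just an example):

    # Stack of the blocked md3_raid5 thread (pid 1698 in this report).
    cat /proc/1698/stack

    # Dump the stack of every task to the kernel log, then save it.
    echo t > /proc/sysrq-trigger
    dmesg > md3-hang-tasks.txt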
NeilBrown

> INFO: task kworker/7:2:15726 blocked for more than 120 seconds.
>       Tainted: P W O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/7:2     D 00000001     0 15726      2 0x00000000
> Workqueue: xfs-log/md3 xfs_log_worker [xfs]
>  d1809b1c 00000046 e95e0a80 00000001 c15b56a0 00000008 cf92b77a 0001721a
>  c17d6700 c17d6700 d198bcc0 d3b6a880 c236dc00 c8290fb0 c2328468 00000352
>  000000e2 c209c000 00000001 e9606000 c209c000 d1809b1c c14b05f4 c12d0c4a
> Call Trace:
>  [] ? sch_direct_xmit+0x54/0x180
>  [] ? radix_tree_lookup+0xa/0x10
>  [] ? prepare_to_wait_event+0x6b/0xd0
>  [] schedule+0x23/0x60
>  [] md_flush_request+0x51/0xd0
>  [] ? __wake_up_sync+0x20/0x20
>  [] make_request+0xb55/0xc30 [raid456]
>  [] ? ip_finish_output+0x1e3/0x440
>  [] ? ip_output+0x77/0x80
>  [] ? ip_local_out+0x20/0x30
>  [] ? generic_make_request_checks+0x343/0x400
>  [] ? internal_add_timer+0x13/0x40
>  [] md_make_request+0xb1/0x1d0
>  [] ? mempool_alloc_slab+0x13/0x20
>  [] generic_make_request+0x89/0xc0
>  [] submit_bio+0x5a/0x120
>  [] ? bio_add_page+0x38/0x40
>  [] _xfs_buf_ioapply+0x250/0x300 [xfs]
>  [] xfs_buf_iorequest+0x44/0x80 [xfs]
>  [] ? xlog_bdstrat+0x1e/0x50 [xfs]
>  [] xlog_bdstrat+0x1e/0x50 [xfs]
>  [] xlog_sync+0x2a0/0x460 [xfs]
>  [] xlog_state_release_iclog+0x74/0xa0 [xfs]
>  [] xlog_write+0x4af/0x630 [xfs]
>  [] ? kmem_zone_alloc+0x66/0xe0 [xfs]
>  [] xlog_cil_push+0x1e7/0x3c0 [xfs]
>  [] ? flush_work+0x16/0x100
>  [] xlog_cil_force_lsn+0x79/0x1c0 [xfs]
>  [] ? sched_clock_cpu+0x104/0x110
>  [] ? __dequeue_entity+0x3c/0x40
>  [] _xfs_log_force+0x50/0x230 [xfs]
>  [] ? __schedule+0x224/0x7a0
>  [] xfs_log_force+0x26/0x90 [xfs]
>  [] ? xlog_cil_empty+0x17/0x30 [xfs]
>  [] xfs_log_worker+0x20/0x50 [xfs]
>  [] process_one_work+0x10c/0x340
>  [] worker_thread+0x101/0x330
>  [] ? manage_workers.isra.27+0x250/0x250
>  [] kthread+0xa1/0xc0
>  [] ret_from_kernel_thread+0x1b/0x28
>  [] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/u16:2:24280 blocked for more than 120 seconds.
>       Tainted: P W O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u16:2   D d56879e0     0 24280      2 0x00000000
> Workqueue: writeback bdi_writeback_workfn (flush-9:3)
>  d5687a3c 00000046 00000000 d56879e0 f2fbd600 f3dbc81b 2e86d607 0001721c
>  c17d6700 c17d6700 c49cbcc0 e601c6e0 f3dbc81b 00000002 00000001 e9e0fc00
>  c2328468 0000003b d5687a28 d5687d74 00000001 d5687a3c c12b0db6 f3dba382
> Call Trace:
>  [] ? raid5_unplug+0xbb/0x120 [raid456]
>  [] ? raid5_unplug+0xbb/0x120 [raid456]
>  [] ? blk_flush_plug_list+0x86/0x1b0
>  [] ? stripe_set_idx+0xa2/0xc0 [raid456]
>  [] ? prepare_to_wait_event+0x6b/0xd0
>  [] schedule+0x23/0x60
>  [] get_active_stripe+0x19b/0x5a0 [raid456]
>  [] ? bitmap_startwrite+0x5b/0x170
>  [] ? __wake_up_sync+0x20/0x20
>  [] make_request+0x170/0xc30 [raid456]
>  [] ? delayacct_end+0x93/0xa0
>  [] ? __wake_up_sync+0x20/0x20
>  [] md_make_request+0xb1/0x1d0
>  [] ? mempool_alloc_slab+0x13/0x20
>  [] generic_make_request+0x89/0xc0
>  [] submit_bio+0x5a/0x120
>  [] ? __bio_add_page+0xd9/0x220
>  [] ? bvec_alloc+0x78/0xe0
>  [] xfs_submit_ioend_bio.isra.9+0x32/0x40 [xfs]
>  [] xfs_submit_ioend+0xec/0x120 [xfs]
>  [] xfs_vm_writepage+0x241/0x580 [xfs]
>  [] __writepage+0x10/0x40
>  [] ? mapping_tagged+0x20/0x20
>  [] write_cache_pages+0x170/0x3d0
>  [] ? raid5_unplug+0xbb/0x120 [raid456]
>  [] ? mapping_tagged+0x20/0x20
>  [] generic_writepages+0x33/0x60
>  [] xfs_vm_writepages+0x33/0x40 [xfs]
>  [] do_writepages+0x1a/0x40
>  [] __writeback_single_inode+0x42/0x1f0
>  [] ? wake_up_bit+0x1c/0x20
>  [] writeback_sb_inodes+0x193/0x2e0
>  [] __writeback_inodes_wb+0x7c/0xb0
>  [] wb_writeback+0x1d2/0x270
>  [] bdi_writeback_workfn+0x151/0x380
>  [] ? __schedule+0x224/0x7a0
>  [] process_one_work+0x10c/0x340
>  [] ? default_wake_function+0x10/0x20
>  [] worker_thread+0x101/0x330
>  [] ? manage_workers.isra.27+0x250/0x250
>  [] kthread+0xa1/0xc0
>  [] ret_from_kernel_thread+0x1b/0x28
>  [] ? kthread_create_on_node+0x110/0x110
> INFO: task kworker/3:3:26958 blocked for more than 120 seconds.
>       Tainted: P W O 3.14.19fe-dirty #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/3:3     D 00000000     0 26958      2 0x00000000
> Workqueue: md md_submit_flush_data
>  c6d41dd8 00000046 d128391d 00000000 00000000 00000000 2d875421 0001721c
>  c17d6700 c17d6700 e9d3bcc0 d4bf5610 00000000 00000000 00000000 ea469748
>  df1132aa 0000c335 d4bf5654 d4bf5654 de5a17aa c6d41dec c106fe31 57d98c16
> Call Trace:
>  [] ? update_curr+0xf1/0x1a0
>  [] ? prepare_to_wait_event+0x6b/0xd0
>  [] schedule+0x23/0x60
>  [] get_active_stripe+0x19b/0x5a0 [raid456]
>  [] ? dequeue_entity+0x11f/0x510
>  [] ? __wake_up_sync+0x20/0x20
>  [] make_request+0x170/0xc30 [raid456]
>  [] ? __schedule+0x224/0x7a0
>  [] ? kmem_cache_free+0xe5/0x110
>  [] ? __wake_up_sync+0x20/0x20
>  [] md_submit_flush_data+0x30/0x70
>  [] process_one_work+0x10c/0x340
>  [] ? mod_timer+0xef/0x1d0
>  [] worker_thread+0x101/0x330
>  [] ? manage_workers.isra.27+0x250/0x250
>  [] kthread+0xa1/0xc0
>  [] ret_from_kernel_thread+0x1b/0x28
>  [] ? kthread_create_on_node+0x110/0x110
>
> There's more, mostly xfs threads complaining about being blocked, and
> some user processes.
>
> Thinking it was an xfs problem, I did "echo w > /proc/sysrq-trigger"
> and got this in syslog:
>
> Sep 26 12:33:06 xplane kernel: SysRq : Show Blocked State
> Sep 26 12:33:06 xplane kernel:   task          PC    stack  pid father
> Sep 26 12:33:06 xplane kernel: kthreadd      D c106ea4c  0     2     0 0x00000000
> Sep 26 12:33:06 xplane kernel: kswapd0       D 00017219  0   711     2 0x00000000
> Sep 26 12:33:06 xplane kernel: md3_raid5     D ea49d770  0  1698     2 0x00000000
> Sep 26 12:33:06 xplane kernel: xfsaild/md3   D e7d72700  0  2234     2 0x00000000
> Sep 26 12:33:06 xplane kernel: kworker/7:2   D 00000001  0 15726     2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: xfs-log/md3 xfs_log_worker [xfs]
> Sep 26 12:33:06 xplane kernel: kworker/u16:2 D d56879e0  0 24280     2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: writeback bdi_writeback_workfn (flush-9:3)
> Sep 26 12:33:06 xplane kernel: kworker/3:3   D 00000000  0 26958     2 0x00000000
> Sep 26 12:33:06 xplane kernel: Workqueue: md md_submit_flush_data
> Sep 26 12:33:06 xplane kernel: cat           D 00000000  0 11412 11410 0x00000000
> Sep 26 12:33:06 xplane kernel: cc1           D d00c3bac  0 12102 12101 0x00000000
> Sep 26 12:33:06 xplane kernel: ld            D c16ef900  0 12132 12130 0x00000000
> Sep 26 12:33:06 xplane kernel: make          D 00000000  0 12147 12047 0x00000000
> Sep 26 12:33:06 xplane kernel: make          D d896ea80  0 12374 11263 0x00000000
> Sep 26 12:33:06 xplane kernel: cat           D f2e51d0c  0 23219 23217 0x00000000
> Sep 26 12:33:06 xplane kernel: gcc           D f2e51d0c  0 23564 23563 0x00000000
> Sep 26 12:33:06 xplane kernel: sh            D f2e51d0c  0 25753 23615 0x00000000
>
> which looks similar to the problem from the above-mentioned list post.
>
> I did a few test writes to the filesystem, and it looked like xfs
> created the files and could read and delete them, but there was never
> any disk activity, and access to /sys/block/md3/md/{anything} hung.
>
> "top" showed consistent I/O wait time, first 25%, then 50%, with no
> disk activity.
>
> In the end I had to reboot, and I'm running the test again.
>
> Any thoughts on what's happening here, or ideas to help debug it?
>
> Thanks,
> Bill
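For anyone wanting to reproduce this, here is a rough sketch of the
sort of disk add / remove / check loop Bill describes. It is a guess at
his setup, not his actual script: the array and partition names come
from his /proc/mdstat output, and the sleep intervals are arbitrary.

    #!/bin/sh
    # Hypothetical add/remove/check stress loop against the raid5 array.
    MD=/dev/md3
    PART=/dev/sdf1

    while true; do
        # Fail and remove one member, then add it back.
        mdadm $MD --fail $PART --remove $PART
        mdadm $MD --add $PART

        # Wait for the recovery triggered by the re-add to finish.
        while grep -q recovery /proc/mdstat; do sleep 10; done

        # Start a parity check and wait until the array goes idle.
        echo check > /sys/block/md3/md/sync_action
        while [ "$(cat /sys/block/md3/md/sync_action)" != idle ]; do
            sleep 10
        done
    done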