From: Xiao Ni <xni@redhat.com>
To: linux-raid <linux-raid@vger.kernel.org>
Cc: neilb@suse.com, shli@kernel.org
Subject: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
Date: Sat, 2 Sep 2017 04:01:35 -0400 (EDT)
Message-ID: <546311999.4473128.1504339295016.JavaMail.zimbra@redhat.com>
In-Reply-To: <221835411.4473056.1504338574607.JavaMail.zimbra@redhat.com>
Hi Neil and Shaohua,
I'm trying to reproduce the problem "raid5_finish_reshape is stuck at revalidate_disk",
but I hit a new hang instead. I tested with 4.13.0-rc5 and the latest mdadm. The steps are:
#!/bin/bash
# Assumes /dev/loop0 through /dev/loop6 and /mnt/md_test already exist.
num=0
while [ 1 ]
do
    echo "*************************$num"
    # Stop any array left over from the previous iteration
    mdadm -Ss
    mdadm --create --run /dev/md0 --level 5 --metadata 1.2 --raid-devices 5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 --spare-devices 1 /dev/loop5 --chunk 512
    mdadm --wait /dev/md0
    mkfs -t ext4 /dev/md0
    mount -t ext4 /dev/md0 /mnt/md_test
    dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100
    # Add a seventh device and grow the array from 5 to 6 raid devices
    mdadm --add /dev/md0 /dev/loop6
    mdadm --grow --raid-devices 6 /dev/md0
    # Write to the filesystem while the reshape is running
    dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
    mdadm --wait /dev/md0
    ((num++))
    umount /dev/md0
done
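The script uses /dev/loop0 through /dev/loop6 and /mnt/md_test without creating them. A minimal sketch of one way to prepare them is shown here. The backing-file directory, the 500M size, and the names IMG_DIR, IMG_SIZE, DRY_RUN, run and setup_loops are all assumptions for illustration, not taken from the original test setup.

```shell
#!/bin/bash
# Hypothetical setup helper. IMG_DIR, IMG_SIZE and DRY_RUN are illustrative
# names and values; adjust them to the machine under test.
IMG_DIR=${IMG_DIR:-/tmp/md_test_img}
IMG_SIZE=${IMG_SIZE:-500M}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create seven sparse backing files and bind them to /dev/loop0../dev/loop6.
setup_loops() {
    run mkdir -p "$IMG_DIR" /mnt/md_test
    for i in 0 1 2 3 4 5 6; do
        run truncate -s "$IMG_SIZE" "$IMG_DIR/loop$i.img"
        run losetup "/dev/loop$i" "$IMG_DIR/loop$i.img"
    done
}
```

By default the helper only prints the commands it would run. Calling "DRY_RUN=0 setup_loops" as root would actually create the files and attach the loop devices.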
The call traces are:
Sep 1 13:57:25 localhost kernel: INFO: task kworker/u8:4:21401 blocked for more than 120 seconds.
Sep 1 13:57:25 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:25 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:25 localhost kernel: kworker/u8:4 D 0 21401 2 0x00000080
Sep 1 13:57:25 localhost kernel: Workqueue: writeback wb_workfn (flush-9:0)
Sep 1 13:57:25 localhost kernel: Call Trace:
Sep 1 13:57:25 localhost kernel: __schedule+0x28d/0x890
Sep 1 13:57:25 localhost kernel: schedule+0x36/0x80
Sep 1 13:57:25 localhost kernel: md_write_start+0xf0/0x220 [md_mod]
Sep 1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep 1 13:57:25 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Sep 1 13:57:25 localhost kernel: ? bio_split+0x5d/0x90
Sep 1 13:57:25 localhost kernel: ? blk_queue_split+0xd2/0x630
Sep 1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep 1 13:57:25 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
....
All the processes that send bios to md are stuck in md_write_start, at:
    wait_event(mddev->sb_wait,
               !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);
Sep 1 13:57:26 localhost kernel: INFO: task md0_reshape:23605 blocked for more than 120 seconds.
Sep 1 13:57:26 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:26 localhost kernel: md0_reshape D 0 23605 2 0x00000080
Sep 1 13:57:26 localhost kernel: Call Trace:
Sep 1 13:57:26 localhost kernel: __schedule+0x28d/0x890
Sep 1 13:57:26 localhost kernel: schedule+0x36/0x80
Sep 1 13:57:26 localhost kernel: raid5_sync_request+0x2cf/0x370 [raid456]
Sep 1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel: md_do_sync+0xafe/0xee0 [md_mod]
Sep 1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel: md_thread+0x132/0x180 [md_mod]
Sep 1 13:57:26 localhost kernel: kthread+0x109/0x140
Sep 1 13:57:26 localhost kernel: ? find_pers+0x70/0x70 [md_mod]
Sep 1 13:57:26 localhost kernel: ? kthread_park+0x60/0x60
Sep 1 13:57:26 localhost kernel: ? do_syscall_64+0x67/0x150
Sep 1 13:57:26 localhost kernel: ret_from_fork+0x25/0x30
The md0_reshape thread is stuck at:
    /* Allow raid5_quiesce to complete */
    wait_event(conf->wait_for_overlap, conf->quiesce != 2);
Sep 1 13:57:26 localhost kernel: INFO: task mdadm:23613 blocked for more than 120 seconds.
Sep 1 13:57:26 localhost kernel: Tainted: G OE 4.13.0-rc5 #2
Sep 1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 1 13:57:26 localhost kernel: mdadm D 0 23613 1 0x00000080
Sep 1 13:57:26 localhost kernel: Call Trace:
Sep 1 13:57:26 localhost kernel: __schedule+0x28d/0x890
Sep 1 13:57:26 localhost kernel: schedule+0x36/0x80
Sep 1 13:57:26 localhost kernel: raid5_quiesce+0x274/0x2b0 [raid456]
Sep 1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep 1 13:57:26 localhost kernel: suspend_lo_store+0x82/0xe0 [md_mod]
Sep 1 13:57:26 localhost kernel: md_attr_store+0x80/0xc0 [md_mod]
Sep 1 13:57:26 localhost kernel: sysfs_kf_write+0x3a/0x50
Sep 1 13:57:26 localhost kernel: kernfs_fop_write+0xff/0x180
Sep 1 13:57:26 localhost kernel: __vfs_write+0x37/0x170
Sep 1 13:57:26 localhost kernel: ? selinux_file_permission+0xe5/0x120
Sep 1 13:57:26 localhost kernel: ? security_file_permission+0x3b/0xc0
Sep 1 13:57:26 localhost kernel: vfs_write+0xb2/0x1b0
Sep 1 13:57:26 localhost kernel: ? syscall_trace_enter+0x1d0/0x2b0
Sep 1 13:57:26 localhost kernel: SyS_write+0x55/0xc0
Sep 1 13:57:26 localhost kernel: do_syscall_64+0x67/0x150
mdadm is stuck in raid5_quiesce, at:
    conf->quiesce = 2;
    wait_event_cmd(conf->wait_for_quiescent,
                   atomic_read(&conf->active_stripes) == 0 &&
                   atomic_read(&conf->active_aligned_reads) == 0,
                   unlock_all_device_hash_locks_irq(conf),
                   lock_all_device_hash_locks_irq(conf));
    conf->quiesce = 1;
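Besides waiting for the hung-task detector, the stacks of the md-related tasks can be read directly from /proc/<pid>/stack (root only, and only on kernels that expose it). The following is a rough sketch. The names PROC_ROOT and dump_md_stacks are hypothetical, and the task-name patterns are taken from the traces above plus an assumed md0_raid5 thread name.

```shell
#!/bin/bash
# Hypothetical helper. PROC_ROOT exists only so the function can be pointed at
# a different directory for testing; on a real system it stays at /proc.
PROC_ROOT=${PROC_ROOT:-/proc}

# Print the kernel stack of every task whose name looks md-related.
dump_md_stacks() {
    for comm_file in "$PROC_ROOT"/[0-9]*/comm; do
        [ -r "$comm_file" ] || continue
        name=$(cat "$comm_file" 2>/dev/null) || continue
        case "$name" in
            md*_reshape|md*_raid*|mdadm)
                dir=${comm_file%/comm}
                echo "== ${dir##*/} $name"
                # Reading the stack file needs root; errors are ignored.
                cat "$dir/stack" 2>/dev/null
                ;;
        esac
    done
}
```

Running "dump_md_stacks" as root while the array is hung should list the same wait points as the traces above, without the 120-second delay.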
[root@dell-pr1700-02 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
[>....................] reshape = 0.5% (2560/509952) finish=162985.0min speed=0K/sec
unused devices: <none>
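To spot a stalled reshape from a script rather than by eye, the progress line can be parsed. This is a small sketch; reshape_progress is a hypothetical helper name, and the pattern assumes the mdstat line format shown above.

```shell
#!/bin/bash
# Hypothetical helper: reads mdstat-formatted text on stdin and prints
# "<percent> <speed in K/sec>" for the first reshape progress line found.
# Prints nothing if no reshape is in progress.
reshape_progress() {
    sed -n 's/.*reshape = *\([0-9.]*\)%.*speed=\([0-9]*\)K\/sec.*/\1 \2/p' | head -n 1
}
```

For example, "reshape_progress < /proc/mdstat" could be sampled every few seconds. A percentage that does not move while the speed stays at 0 suggests the reshape is stuck rather than slow.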
I can reproduce this, so I added some printk logging to check MD_SB_CHANGE_PENDING in
mddev->sb_flags, mddev->suspended and conf->quiesce:
mddev->suspended : 0
mddev->sb_flags : MD_SB_CHANGE_PENDING is set
conf->quiesce : 2
Then I ran "echo active > /sys/block/md0/md/array_state", and the reshape could start and finish successfully.
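A minimal sketch of that workaround as a helper follows. It relies on an assumption about md's sysfs behaviour that is worth checking against the kernel in use: array_state should read "write-pending" while MD_SB_CHANGE_PENDING is set. MD_SYSFS and kick_array are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper. MD_SYSFS defaults to the md0 array used in this report
# and can be overridden for other arrays or for testing.
MD_SYSFS=${MD_SYSFS:-/sys/block/md0/md}

# If the array reports write-pending, write "active" to array_state.
kick_array() {
    local state
    state=$(cat "$MD_SYSFS/array_state") || return 1
    echo "array_state before: $state"
    if [ "$state" = "write-pending" ]; then
        echo active > "$MD_SYSFS/array_state"
    fi
}
```

This only works around the hang. It does not explain why the pending superblock update is never completed.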
I notice there are some fixes for raid5 hang problems, and I'm not sure whether this
hang was introduced by those patches.
Best Regards
Xiao