All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xiao Ni <xni@redhat.com>
To: linux-raid <linux-raid@vger.kernel.org>
Cc: neilb@suse.com, shli@kernel.org
Subject: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
Date: Sat, 2 Sep 2017 04:01:35 -0400 (EDT)	[thread overview]
Message-ID: <546311999.4473128.1504339295016.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <221835411.4473056.1504338574607.JavaMail.zimbra@redhat.com>

Hi Neil and Shaohua

I'm trying to reproduce the problem the problem "raid5_finish_reshape is stuck at revalidate_disk".
But there is a new stuck. I tried with 4.13.0-rc5 and latest mdadm. The steps are:

#!/bin/bash
num=0
while [ 1 ]
do
	echo "*************************$num"
	mdadm -Ss
	mdadm --create --run /dev/md0 --level 5 --metadata 1.2 --raid-devices 5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 --spare-devices 1 /dev/loop5 --chunk 512
	mdadm --wait /dev/md0
	mkfs -t ext4 /dev/md0
	mount -t ext4 /dev/md0 /mnt/md_test
	dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=100
	mdadm --add /dev/md0 /dev/loop6
	mdadm --grow --raid-devices 6 /dev/md0
	dd if=/dev/urandom of=/mnt/md_test/testfile bs=1M count=1000
	mdadm --wait /dev/md0
	((num++))
	umount /dev/md0
done

The calltrace messages are:
Sep  1 13:57:25 localhost kernel: INFO: task kworker/u8:4:21401 blocked for more than 120 seconds.
Sep  1 13:57:25 localhost kernel:      Tainted: G           OE   4.13.0-rc5 #2
Sep  1 13:57:25 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 13:57:25 localhost kernel: kworker/u8:4    D    0 21401      2 0x00000080
Sep  1 13:57:25 localhost kernel: Workqueue: writeback wb_workfn (flush-9:0)
Sep  1 13:57:25 localhost kernel: Call Trace:
Sep  1 13:57:25 localhost kernel: __schedule+0x28d/0x890
Sep  1 13:57:25 localhost kernel: schedule+0x36/0x80
Sep  1 13:57:25 localhost kernel: md_write_start+0xf0/0x220 [md_mod]
Sep  1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep  1 13:57:25 localhost kernel: raid5_make_request+0x89/0x8b0 [raid456]
Sep  1 13:57:25 localhost kernel: ? bio_split+0x5d/0x90
Sep  1 13:57:25 localhost kernel: ? blk_queue_split+0xd2/0x630
Sep  1 13:57:25 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep  1 13:57:25 localhost kernel: md_make_request+0xf5/0x260 [md_mod]
....

All the processes send bio to md are stuck at md_write_start.
        wait_event(mddev->sb_wait,
                   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) && !mddev->suspended);


Sep  1 13:57:26 localhost kernel: INFO: task md0_reshape:23605 blocked for more than 120 seconds.
Sep  1 13:57:26 localhost kernel:      Tainted: G           OE   4.13.0-rc5 #2
Sep  1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 13:57:26 localhost kernel: md0_reshape     D    0 23605      2 0x00000080
Sep  1 13:57:26 localhost kernel: Call Trace:
Sep  1 13:57:26 localhost kernel: __schedule+0x28d/0x890
Sep  1 13:57:26 localhost kernel: schedule+0x36/0x80
Sep  1 13:57:26 localhost kernel: raid5_sync_request+0x2cf/0x370 [raid456]
Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep  1 13:57:26 localhost kernel: md_do_sync+0xafe/0xee0 [md_mod]
Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep  1 13:57:26 localhost kernel: md_thread+0x132/0x180 [md_mod]
Sep  1 13:57:26 localhost kernel: kthread+0x109/0x140
Sep  1 13:57:26 localhost kernel: ? find_pers+0x70/0x70 [md_mod]
Sep  1 13:57:26 localhost kernel: ? kthread_park+0x60/0x60
Sep  1 13:57:26 localhost kernel: ? do_syscall_64+0x67/0x150
Sep  1 13:57:26 localhost kernel: ret_from_fork+0x25/0x30

It stucks at:
        /* Allow raid5_quiesce to complete */
        wait_event(conf->wait_for_overlap, conf->quiesce != 2);

Sep  1 13:57:26 localhost kernel: INFO: task mdadm:23613 blocked for more than 120 seconds.
Sep  1 13:57:26 localhost kernel:      Tainted: G           OE   4.13.0-rc5 #2
Sep  1 13:57:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 13:57:26 localhost kernel: mdadm           D    0 23613      1 0x00000080
Sep  1 13:57:26 localhost kernel: Call Trace:
Sep  1 13:57:26 localhost kernel: __schedule+0x28d/0x890
Sep  1 13:57:26 localhost kernel: schedule+0x36/0x80
Sep  1 13:57:26 localhost kernel: raid5_quiesce+0x274/0x2b0 [raid456]
Sep  1 13:57:26 localhost kernel: ? remove_wait_queue+0x60/0x60
Sep  1 13:57:26 localhost kernel: suspend_lo_store+0x82/0xe0 [md_mod]
Sep  1 13:57:26 localhost kernel: md_attr_store+0x80/0xc0 [md_mod]
Sep  1 13:57:26 localhost kernel: sysfs_kf_write+0x3a/0x50
Sep  1 13:57:26 localhost kernel: kernfs_fop_write+0xff/0x180
Sep  1 13:57:26 localhost kernel: __vfs_write+0x37/0x170
Sep  1 13:57:26 localhost kernel: ? selinux_file_permission+0xe5/0x120
Sep  1 13:57:26 localhost kernel: ? security_file_permission+0x3b/0xc0
Sep  1 13:57:26 localhost kernel: vfs_write+0xb2/0x1b0
Sep  1 13:57:26 localhost kernel: ? syscall_trace_enter+0x1d0/0x2b0
Sep  1 13:57:26 localhost kernel: SyS_write+0x55/0xc0
Sep  1 13:57:26 localhost kernel: do_syscall_64+0x67/0x150

It's stuck at:
conf->quiesce = 2;
wait_event_cmd(conf->wait_for_quiescent,
   atomic_read(&conf->active_stripes) == 0 &&
   atomic_read(&conf->active_aligned_reads) == 0,
   unlock_all_device_hash_locks_irq(conf),
   lock_all_device_hash_locks_irq(conf));
conf->quiesce = 1;

[root@dell-pr1700-02 ~]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 loop6[7] loop4[6] loop5[5](S) loop3[3] loop2[2] loop1[1] loop0[0]
      2039808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  reshape =  0.5% (2560/509952) finish=162985.0min speed=0K/sec
      
unused devices: <none>


I can reproduce this. So I add some logs by printk to check MD_SB_CHANGE_PENDING of mddev->flags
and mddev->suspended.
mddev->suspend : 0
mddev->flags   : SP CHANGE PENDING is set
conf->quiesce  : 2

The I echo active > /sys/block/md0/md/array_state, reshape can start and finish successfully.

I notice there are some fixes for raid5 stuck problem. Not sure whether it's introduced by those
patches. 

Best Regards
Xiao




       reply	other threads:[~2017-09-02  8:01 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <221835411.4473056.1504338574607.JavaMail.zimbra@redhat.com>
2017-09-02  8:01 ` Xiao Ni [this message]
2017-09-04  2:16   ` Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared NeilBrown
2017-09-04  2:45     ` Xiao Ni
2017-09-04  3:52       ` Xiao Ni
2017-09-04  5:34         ` NeilBrown
2017-09-04  7:36           ` Xiao Ni
2017-09-05  1:36             ` NeilBrown
2017-09-05  2:15               ` Xiao Ni
2017-09-07  1:37                 ` Xiao Ni
2017-09-07  5:37                   ` NeilBrown
2017-09-11  0:14                     ` Xiao Ni
2017-09-11  3:36                       ` NeilBrown
2017-09-11  5:03                         ` Xiao Ni
2017-09-30  9:44               ` Xiao Ni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=546311999.4473128.1504339295016.JavaMail.zimbra@redhat.com \
    --to=xni@redhat.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.com \
    --cc=shli@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.