From: Xiao Ni <xni@redhat.com>
To: NeilBrown <neilb@suse.com>
Cc: linux-raid <linux-raid@vger.kernel.org>, shli@kernel.org
Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
Date: Mon, 11 Sep 2017 01:03:11 -0400 (EDT) [thread overview]
Message-ID: <55785648.9562287.1505106191546.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <871snddhza.fsf@notabene.neil.brown.name>
----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "linux-raid" <linux-raid@vger.kernel.org>, shli@kernel.org
> Sent: Monday, September 11, 2017 11:36:25 AM
> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared
>
> On Sun, Sep 10 2017, Xiao Ni wrote:
>
> > ----- Original Message -----
> >> From: "NeilBrown" <neilb@suse.com>
> >> To: "Xiao Ni" <xni@redhat.com>, "linux-raid" <linux-raid@vger.kernel.org>
> >> Cc: shli@kernel.org
> >> Sent: Thursday, September 7, 2017 1:37:45 PM
> >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be
> >> cleared
> >>
> >> On Wed, Sep 06 2017, Xiao Ni wrote:
> >>
> >> > ----- Original Message -----
> >> >> From: "Xiao Ni" <xni@redhat.com>
> >> >> To: "NeilBrown" <neilb@suse.com>, "linux-raid"
> >> >> <linux-raid@vger.kernel.org>
> >> >> Cc: shli@kernel.org
> >> >> Sent: Tuesday, September 5, 2017 10:15:00 AM
> >> >> Subject: Re: Stuck in md_write_start because MD_SB_CHANGE_PENDING can't
> >> >> be
> >> >> cleared
> >> >>
> >> >>
> >> >>
> >> >> On 09/05/2017 09:36 AM, NeilBrown wrote:
> >> >> > On Mon, Sep 04 2017, Xiao Ni wrote:
> >> >> >
> >> >> >>
> >> >> >> In function handle_stripe:
> >> >> >> 4697 if (s.handle_bad_blocks ||
> >> >> >> 4698 test_bit(MD_SB_CHANGE_PENDING,
> >> >> >> &conf->mddev->sb_flags)) {
> >> >> >> 4699 set_bit(STRIPE_HANDLE, &sh->state);
> >> >> >> 4700 goto finish;
> >> >> >> 4701 }
> >> >> >>
> >> >> >> Because MD_SB_CHANGE_PENDING is set, so the stripes can't be
> >> >> >> handled.
> >> >> >>
> >> >> > Right, of course. I see what is happening now.
> >> >> >
> >> >> > - raid5d cannot complete stripes until the metadata is written
> >> >> > - the metadata cannot be written until raid5d gets the mddev_lock
> >> >> > - mddev_lock is held by the write to suspend_hi
> >> >> > - the write to suspend_hi is waiting for raid5_quiesce
> >> >> > - raid5_quiesce is waiting for some stripes to complete.
> >> >> >
> >> >> > We could declare that ->quiesce(, 1) cannot be called while holding
> >> >> > the
> >> >> > lock.
> >> >> > We could possible allow it but only if md_update_sb() is called
> >> >> > first,
> >> >> > though that might still be racy.
> >> >> >
> >> >> > ->quiesce(, 1) is currently called from:
> >> >> > mddev_suspend
> >> >> > suspend_lo_store
> >> >> > suspend_hi_store
> >> >> > __md_stop_writes
> >> >> > mddev_detach
> >> >> > set_bitmap_file
> >> >> > update_array_info (when setting/removing internal bitmap)
> >> >> > md_do_sync
> >> >> >
> >> >> > and most of those are call with the lock held, or take the lock.
> >> >> >
> >> >> > Maybe we should *require* that mddev_lock is held when calling
> >> >> > ->quiesce() and have ->quiesce() do the metadata update.
> >> >> >
> >> >> > Something like the following maybe. Can you test it?
> >> >>
> >> >> Hi Neil
> >> >>
> >> >> Thanks for the analysis. I need to thing for a while :)
> >> >> I already added the patch and the test is running now. It usually needs
> >> >> more than 5
> >> >> hours to reproduce this problem. I'll let it run more than 24 hours.
> >> >> I'll update the test
> >> >> result later.
> >> >
> >> > Hi Neil
> >> >
> >> > The problem still exists. But it doesn't show calltrace this time. It
> >> > was stuck yesterday. I didn't notice that because there has no
> >> > calltrace.
> >> >
> >> > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
> >> >
> >> > It shows that raid5d is still spinning.
> >>
> >> Thanks for testing. I've thought some more and I think there is a better
> >> approach.
> >> The fact that we need to take the mutex to write the super block has
> >> caused problems several times before and is a key part of the problem
> >> now.
> >> Maybe we should relax that. Obviously we don't want two threads
> >> updating the metadata at the same time, but it should be safe to
> >> update it in parallel with other uses of reconfix_mutex.
> >>
> >> Holding mddev->lock while copying data from the struct mddev to the
> >> superblock (which we do) should ensure that the superblock is internally
> >> consistent, and that should be enough.
> >>
> >> So I propose the following patch. It certainly needs review and
> >> testing, but I think it should make a big improvement.
> >>
> >> Thanks,
> >> NeilBrown
> >>
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index b01e458d31e9..414a4c1a052d 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -2388,6 +2388,15 @@ void md_update_sb(struct mddev *mddev, int
> >> force_change)
> >> return;
> >> }
> >>
> >> + if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE)))
> >> + return;
> >> +
> >> + wait_event(mddev->sb_wait,
> >> + !test_and_set_bit(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags));
> >> +
> >> + if (!force_change && !(mddev->sb_flags & ~BIT(MD_SB_UPDATE_ACTIVE)))
> >> + goto out;
> >> +
> >> repeat:
> >> if (mddev_is_clustered(mddev)) {
> >> if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
> >> @@ -2402,7 +2411,7 @@ void md_update_sb(struct mddev *mddev, int
> >> force_change)
> >> bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
> >> BIT(MD_SB_CHANGE_DEVS) |
> >> BIT(MD_SB_CHANGE_CLEAN));
> >> - return;
> >> + goto out;
> >> }
> >> }
> >>
> >> @@ -2432,8 +2441,7 @@ void md_update_sb(struct mddev *mddev, int
> >> force_change)
> >> wake_up(&rdev->blocked_wait);
> >> }
> >> }
> >> - wake_up(&mddev->sb_wait);
> >> - return;
> >> + goto out;
> >> }
> >>
> >> spin_lock(&mddev->lock);
> >> @@ -2544,6 +2552,9 @@ void md_update_sb(struct mddev *mddev, int
> >> force_change)
> >> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN)))
> >> /* have to write it out again */
> >> goto repeat;
> >> +
> >> +out:
> >> + clear_bit_unlock(MD_SB_UPDATE_ACTIVE, &mddev->sb_flags);
> >> wake_up(&mddev->sb_wait);
> >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >> sysfs_notify(&mddev->kobj, NULL, "sync_completed");
> >> @@ -5606,8 +5617,7 @@ int md_run(struct mddev *mddev)
> >> set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >> set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >>
> >> - if (mddev->sb_flags)
> >> - md_update_sb(mddev, 0);
> >> + md_update_sb(mddev, 0);
> >>
> >> md_new_event(mddev);
> >> sysfs_notify_dirent_safe(mddev->sysfs_state);
> >> @@ -8643,17 +8653,14 @@ void md_check_recovery(struct mddev *mddev)
> >>
> >> if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
> >> return;
> >> - if ( ! (
> >> - (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
> >> + if ((
> >> test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> >> test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> >> (mddev->external == 0 && mddev->safemode == 1) ||
> >> (mddev->safemode == 2
> >> && !mddev->in_sync && mddev->recovery_cp == MaxSector)
> >> - ))
> >> - return;
> >> -
> >> - if (mddev_trylock(mddev)) {
> >> + ) &&
> >> + mddev_trylock(mddev)) {
> >> int spares = 0;
> >>
> >> if (!mddev->external && mddev->safemode == 1)
> >> @@ -8706,9 +8713,6 @@ void md_check_recovery(struct mddev *mddev)
> >> spin_unlock(&mddev->lock);
> >> }
> >>
> >> - if (mddev->sb_flags)
> >> - md_update_sb(mddev, 0);
> >> -
> >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >> !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> >> /* resync/recovery still happening */
> >> @@ -8786,6 +8790,7 @@ void md_check_recovery(struct mddev *mddev)
> >> wake_up(&mddev->sb_wait);
> >> mddev_unlock(mddev);
> >> }
> >> + md_update_sb(mddev, 0);
> >> }
> >> EXPORT_SYMBOL(md_check_recovery);
> >>
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 09db03455801..bc8633cf7c40 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -243,6 +243,7 @@ enum mddev_sb_flags {
> >> MD_SB_CHANGE_CLEAN, /* transition to or from 'clean' */
> >> MD_SB_CHANGE_PENDING, /* switch from 'clean' to 'active' in progress */
> >> MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */
> >> + MD_SB_UPDATE_ACTIVE, /* A thread is running in md_update_sb */
> >> };
> >>
> >> struct mddev {
> >> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> >> index 2dcbafa8e66c..76169dd8ff7c 100644
> >> --- a/drivers/md/raid5-cache.c
> >> +++ b/drivers/md/raid5-cache.c
> >> @@ -1334,21 +1334,10 @@ static void
> >> r5l_write_super_and_discard_space(struct
> >> r5l_log *log,
> >> mddev = log->rdev->mddev;
> >> /*
> >> * Discard could zero data, so before discard we must make sure
> >> - * superblock is updated to new log tail. Updating superblock (either
> >> - * directly call md_update_sb() or depend on md thread) must hold
> >> - * reconfig mutex. On the other hand, raid5_quiesce is called with
> >> - * reconfig_mutex hold. The first step of raid5_quiesce() is waitting
> >> - * for all IO finish, hence waitting for reclaim thread, while reclaim
> >> - * thread is calling this function and waitting for reconfig mutex. So
> >> - * there is a deadlock. We workaround this issue with a trylock.
> >> - * FIXME: we could miss discard if we can't take reconfig mutex
> >> + * superblock is updated to new log tail.
> >> */
> >> - set_mask_bits(&mddev->sb_flags, 0,
> >> - BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> >> - if (!mddev_trylock(mddev))
> >> - return;
> >> +
> >> md_update_sb(mddev, 1);
> >> - mddev_unlock(mddev);
> >>
> >> /* discard IO error really doesn't matter, ignore it */
> >> if (log->last_checkpoint < end) {
> >>
> >
> > Hi Neil
> >
> > The test have run for three days and the problem is fixed by this patch.
> > Thanks for the help.
>
> Thanks for testing. I'll look over the patch again and see if there is
> any chance that the locking change could introduce other problems.
Ok, if you have new patches I can do the test.
>
> >
> > Could you help to look at https://www.spinics.net/lists/raid/msg58918.html.
> > The bug which is fixed by your patch was found when I try to reproduce that
> > bug. I did a simply analysis, but I'm not sure whether I'm right or not.
>
> It might be the same bug, but if it is there should be other processes
> stuck in a D wait, one of them holding the reconfig_mutex and waiting
> for the array to quiesce.
Hmm but raid5d is stuck in this problem. It's stuck at md_check_recovery.
And conf->quiesce is 0 in this problem.
>
> Where there any other processes in D wait?
Yes, there is another process is in D state:
[440793.893432] INFO: task jbd2/md0-8:27647 blocked for more than 120 seconds.
[440793.900387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[440793.908299] jbd2/md0-8 D ffff8801fe815e00 0 27647 2 0x00000080
[440793.908300] ffff88021ae9fc90 0000000000000046 ffff88022084edd0 ffff88021ae9ffd8
[440793.915846] ffff88021ae9ffd8 ffff88021ae9ffd8 ffff88022084edd0 ffff88021ae9fda0
[440793.923390] ffff88005399e0c0 ffff88022084edd0 ffff88021ae9fd88 ffff8801fe815e00
[440793.930931] Call Trace:
[440793.933467] [<ffffffff81689d39>] schedule+0x29/0x70
[440793.938522] [<ffffffffa0a15238>] jbd2_journal_commit_transaction+0x248/0x19a0 [jbd2]
[440793.946434] [<ffffffff810c780e>] ? account_entity_dequeue+0xae/0xd0
[440793.952876] [<ffffffff810cb2fc>] ? dequeue_entity+0x11c/0x5d0
[440793.958797] [<ffffffff81028569>] ? __switch_to+0xd9/0x4c0
[440793.964372] [<ffffffff810af390>] ? wake_up_atomic_t+0x30/0x30
[440793.970293] [<ffffffff810963ce>] ? try_to_del_timer_sync+0x5e/0x90
[440793.976645] [<ffffffffa0a1b849>] kjournald2+0xc9/0x260 [jbd2]
[440793.982565] [<ffffffff810af390>] ? wake_up_atomic_t+0x30/0x30
[440793.988485] [<ffffffffa0a1b780>] ? commit_timeout+0x10/0x10 [jbd2]
[440793.994839] [<ffffffff810ae2bf>] kthread+0xcf/0xe0
[440793.999803] [<ffffffff810ae1f0>] ? kthread_create_on_node+0x140/0x140
[440794.006419] [<ffffffff81695598>] ret_from_fork+0x58/0x90
[440794.011906] [<ffffffff810ae1f0>] ? kthread_create_on_node+0x140/0x140
Regards
Xiao
next prev parent reply other threads:[~2017-09-11 5:03 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <221835411.4473056.1504338574607.JavaMail.zimbra@redhat.com>
2017-09-02 8:01 ` Stuck in md_write_start because MD_SB_CHANGE_PENDING can't be cleared Xiao Ni
2017-09-04 2:16 ` NeilBrown
2017-09-04 2:45 ` Xiao Ni
2017-09-04 3:52 ` Xiao Ni
2017-09-04 5:34 ` NeilBrown
2017-09-04 7:36 ` Xiao Ni
2017-09-05 1:36 ` NeilBrown
2017-09-05 2:15 ` Xiao Ni
2017-09-07 1:37 ` Xiao Ni
2017-09-07 5:37 ` NeilBrown
2017-09-11 0:14 ` Xiao Ni
2017-09-11 3:36 ` NeilBrown
2017-09-11 5:03 ` Xiao Ni [this message]
2017-09-30 9:44 ` Xiao Ni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55785648.9562287.1505106191546.JavaMail.zimbra@redhat.com \
--to=xni@redhat.com \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.com \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.