From: NeilBrown <neilb@suse.com>
To: Nix <nix@esperi.org.uk>
Cc: linux-raid@vger.kernel.org
Subject: Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
Date: Tue, 23 May 2017 11:07:21 +1000 [thread overview]
Message-ID: <874lwcjs8m.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <87o9ukykmk.fsf@esperi.org.uk>
[-- Attachment #1: Type: text/plain, Size: 5211 bytes --]
On Mon, May 22 2017, Nix wrote:
> [linux-bcache removed, not a bcache bug]
>
> On 22 May 2017, NeilBrown said this:
>
>> Congratulations. You have found a bug that dates from 2011.
>
> Oh so not that old then! (My personal record, in a previous job, was a
> bug in allegedly critical functionality that entirely stopped it working
> and had been broken for around fourteen years. Like hell it was
> critical.)
>
>> Commit: 68866e425be2 ("MD: no sync IO while suspended")
>>
>> (I think).
>>
>> A write request gets to raid5_make_request() before mddev_suspend() sets
>> mddev->suspended.
>> raid5_make_request() needs md_write_start() to mark the array "dirty"
>> which it asks the md thread to do, but before the thread gets to the
>> task, mddev->suspend has been set, so md_check_recovery() doesn't update
>> the superblock, so md_write_start() doesn't proceed, so the request
>> never completes, so mddev_suspend() blocks indefinitely.
>
> Ooof. Looks like it's fsync()ing something -- not sure what, though.
> The previous time it was a metadata update...
>
> Hm. I note that ->events is meant to be for things like device additions
> and array starts. RAID5 -> RAID6 reshaping appears to push this up a
> lot:
>
> Events : 1475948
>
> (it was at around 4000 before this -- itself rather mystifying, given
> that I've only rebooted this machine 27 times, and the array's probably
> been assembled less than 20 times, and was created with the same number
> of devices it has now.)
Originally, Events was for any metadata change, including
clean <-> dirty
transitions. This kept waking up spares though, so I change to
change by -1 when changing from dirty to clean, and not bother with
telling the spares. That was a little problematic too, so no
it always increments except f or dirty->clean transitions where there
are spares.
For a reshape, the reshape is checkpointed quite frequently, and each
checkpoint increases the event counter.
>
> The other array has only 92 events (it was at 30-something before I
> tried this reshape).
>
>> I think that md_check_recovery() need to test for mddev->suspended
>> somewhere else.
>> It needs to allow superblock updates, and probably needs to reap the
>> recovery thread, but must not start a new recovery thread.
>
> My question would be, is all the intricate stuff md_check_recovery() is
> doing valid on an mddev that's not only suspended but detached? Because
> we detach right after the suspension in level_store()...
mddev_detach() kills the thread. After it completes,
md_check_recovery() cannot possibly run.
However I now see that my patch was insufficient.
mddev_suspend() is called while mddev_lock() is held, and that prevents
md_check_recovery() from writing the superblock, so it can still
deadlock.
This will require more careful analysis.
>
> (btw, you probably want to remove the comment above enum array_state
> that says that "suspended" is "not supported yet", too.)
Yeah.... though it isn't supported by all personalities I think.
>
>> Probably something like this:
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f6ae1d67bcd0..dbca31be22a1 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>> */
>> void md_check_recovery(struct mddev *mddev)
>> {
>> - if (mddev->suspended)
>> - return;
>>
>> if (mddev->bitmap)
>> bitmap_daemon_work(mddev);
>> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>> clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>>
>> if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>> + mddev->suspended ||
>> test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>> goto not_running;
>> /* no recovery is running.
>>
>> though it's late so don't trust anything I write.
>
> This may end up clearing MD_RECOVERY_INTR and MD_RECOVERY_DONE, but I
> guess in this context it's safe to do so... looks OK otherwise to the
> totally ignorant fool reading this patch!
>
> Hm, being picky now, md_check_recovery() might be slightly easier to
> read if that governing mutex_trylock() conditional was inverted:
>
> if (mddev_trylock(mddev))
> return;
>
> Then you could de-indent most of the function by one indentation
> level...
You could send a patch if you liked.... but as it looks like I have more
work to do there, I'll probably do it.
>
>> If you try again it will almost certainly succeed. I suspect this is a
>> hard race to hit - well done!!!
>
> I'll give it a try -- I hit it twice in succession, once with a
> --backup-file, once without. Since mdadm does not warn about the lack of
> a --backup-file, I guess the statement in the manual that it is
> essential to provide one when changing RAID levels is untrue: I suspect
> that it's necessary *if* you're not increasing the number of disks at
> the same time, but since I'm growing into a spare, adding a
> --backup-file only slows it down?
>
> I might run a backup first though. :)
Always a good idea, if you can.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
next prev parent reply other threads:[~2017-05-23 1:07 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-05-22 9:13 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request Nix
2017-05-22 11:35 ` NeilBrown
2017-05-22 15:30 ` Nix
2017-05-22 19:07 ` Wols Lists
2017-05-22 20:43 ` Nix
2017-05-23 1:20 ` NeilBrown
2017-05-23 10:10 ` Nix
2017-05-23 1:39 ` NeilBrown
2017-05-23 14:47 ` Wols Lists
2017-05-24 1:50 ` NeilBrown
2017-05-23 1:07 ` NeilBrown [this message]
2017-05-22 21:38 ` Nix
2017-05-23 14:16 ` Wols Lists
2017-05-23 15:00 ` Nix
2017-05-24 1:24 ` NeilBrown
2017-05-24 13:28 ` Nix
2017-05-25 1:31 ` NeilBrown
2017-05-25 12:14 ` Nix
2017-05-24 19:42 ` Nix
2017-05-24 22:57 ` Shaohua Li
2017-05-25 1:30 ` NeilBrown
2017-05-25 1:46 ` Shaohua Li
2017-05-26 3:23 ` NeilBrown
2017-05-26 16:40 ` Shaohua Li
2017-05-28 23:17 ` NeilBrown
2017-05-30 17:41 ` Shaohua Li
2017-06-05 6:49 ` [PATCH] md: fix deadlock between mddev_suspend() and md_write_start() NeilBrown
2017-06-06 0:01 ` Shaohua Li
2017-06-07 1:45 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874lwcjs8m.fsf@notabene.neil.brown.name \
--to=neilb@suse.com \
--cc=linux-raid@vger.kernel.org \
--cc=nix@esperi.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).