Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: NeilBrown <neilb@suse.com>
To: Nix <nix@esperi.org.uk>
Cc: linux-raid@vger.kernel.org
Subject: Re: 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request
Date: Tue, 23 May 2017 11:07:21 +1000	[thread overview]
Message-ID: <874lwcjs8m.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <87o9ukykmk.fsf@esperi.org.uk>

[-- Attachment #1: Type: text/plain, Size: 5211 bytes --]

On Mon, May 22 2017, Nix wrote:

> [linux-bcache removed, not a bcache bug]
>
> On 22 May 2017, NeilBrown said this:
>
>> Congratulations.  You have found a bug that dates from 2011.
>
> Oh so not that old then! (My personal record, in a previous job, was a
> bug in allegedly critical functionality that entirely stopped it working
> and had been broken for around fourteen years. Like hell it was
> critical.)
>
>> Commit: 68866e425be2 ("MD: no sync IO while suspended")
>>
>> (I think).
>>
>> A write request gets to raid5_make_request() before mddev_suspend() sets
>> mddev->suspended.
>> raid5_make_request() needs md_write_start() to mark the array "dirty"
>> which it asks the md thread to do, but before the thread gets to the
>> task, mddev->suspend has been set, so md_check_recovery() doesn't update
>> the superblock, so md_write_start() doesn't proceed, so the request
>> never completes, so mddev_suspend() blocks indefinitely.
>
> Ooof. Looks like it's fsync()ing something -- not sure what, though.
> The previous time it was a metadata update...
>
> Hm. I note that ->events is meant to be for things like device additions
> and array starts. RAID5 -> RAID6 reshaping appears to push this up a
> lot:
>
>          Events : 1475948
>
> (it was at around 4000 before this -- itself rather mystifying, given
> that I've only rebooted this machine 27 times, and the array's probably
> been assembled less than 20 times, and was created with the same number
> of devices it has now.)

Originally, Events was for any metadata change, including
  clean <-> dirty
transitions.  This kept waking up spares though, so I change to
change by -1 when changing from dirty to clean, and not bother with
telling the spares.  That was a little problematic too, so no
it always increments except f or dirty->clean transitions where there
are spares.

For a reshape, the reshape is checkpointed quite frequently, and each
checkpoint increases the event counter.

>
> The other array has only 92 events (it was at 30-something before I
> tried this reshape).
>
>> I think that md_check_recovery() need to test for mddev->suspended
>> somewhere else.
>> It needs to allow superblock updates, and probably needs to reap the
>> recovery thread, but must not start a new recovery thread.
>
> My question would be, is all the intricate stuff md_check_recovery() is
> doing valid on an mddev that's not only suspended but detached? Because
> we detach right after the suspension in level_store()...

mddev_detach() kills the thread.  After it completes,
md_check_recovery() cannot possibly run.

However I now see that my patch was insufficient.
mddev_suspend() is called while mddev_lock() is held, and that prevents
md_check_recovery() from writing the superblock, so it can still
deadlock.
This will require more careful analysis.

>
> (btw, you probably want to remove the comment above enum array_state
> that says that "suspended" is "not supported yet", too.)

Yeah.... though it isn't supported by all personalities I think.

>
>> Probably something like this:
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f6ae1d67bcd0..dbca31be22a1 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8364,8 +8364,6 @@ static void md_start_sync(struct work_struct *ws)
>>   */
>>  void md_check_recovery(struct mddev *mddev)
>>  {
>> -	if (mddev->suspended)
>> -		return;
>>  
>>  	if (mddev->bitmap)
>>  		bitmap_daemon_work(mddev);
>> @@ -8484,6 +8482,7 @@ void md_check_recovery(struct mddev *mddev)
>>  		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
>>  
>>  		if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>> +		    mddev->suspended ||
>>  		    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
>>  			goto not_running;
>>  		/* no recovery is running.
>>
>> though it's late so don't trust anything I write.
>
> This may end up clearing MD_RECOVERY_INTR and MD_RECOVERY_DONE, but I
> guess in this context it's safe to do so... looks OK otherwise to the
> totally ignorant fool reading this patch!
>
> Hm, being picky now, md_check_recovery() might be slightly easier to
> read if that governing mutex_trylock() conditional was inverted:
>
> 	if (mddev_trylock(mddev))
> 		return;
>
> Then you could de-indent most of the function by one indentation
> level...

You could send a patch if you liked.... but as it looks like I have more
work to do there, I'll probably do it.

>
>> If you try again it will almost certainly succeed.  I suspect this is a
>> hard race to hit - well done!!!
>
> I'll give it a try -- I hit it twice in succession, once with a
> --backup-file, once without. Since mdadm does not warn about the lack of
> a --backup-file, I guess the statement in the manual that it is
> essential to provide one when changing RAID levels is untrue: I suspect
> that it's necessary *if* you're not increasing the number of disks at
> the same time, but since I'm growing into a spare, adding a
> --backup-file only slows it down?
>
> I might run a backup first though. :)

Always a good idea, if you can.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

next prev parent reply	other threads:[~2017-05-23  1:07 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-22  9:13 4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request Nix
2017-05-22 11:35 ` NeilBrown
2017-05-22 15:30   ` Nix
2017-05-22 19:07     ` Wols Lists
2017-05-22 20:43       ` Nix
2017-05-23  1:20         ` NeilBrown
2017-05-23 10:10           ` Nix
2017-05-23  1:39       ` NeilBrown
2017-05-23 14:47         ` Wols Lists
2017-05-24  1:50           ` NeilBrown
2017-05-23  1:07     ` NeilBrown [this message]
2017-05-22 21:38   ` Nix
2017-05-23 14:16     ` Wols Lists
2017-05-23 15:00       ` Nix
2017-05-24  1:24     ` NeilBrown
2017-05-24 13:28       ` Nix
2017-05-25  1:31         ` NeilBrown
2017-05-25 12:14           ` Nix
2017-05-24 19:42       ` Nix
2017-05-24 22:57       ` Shaohua Li
2017-05-25  1:30         ` NeilBrown
2017-05-25  1:46           ` Shaohua Li
2017-05-26  3:23             ` NeilBrown
2017-05-26 16:40               ` Shaohua Li
2017-05-28 23:17         ` NeilBrown
2017-05-30 17:41           ` Shaohua Li
2017-06-05  6:49             ` [PATCH] md: fix deadlock between mddev_suspend() and md_write_start() NeilBrown
2017-06-06  0:01               ` Shaohua Li
2017-06-07  1:45                 ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874lwcjs8m.fsf@notabene.neil.brown.name \
    --to=neilb@suse.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=nix@esperi.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).