Linux RAID subsystem development
 help / color / mirror / Atom feed
From: bingjingc <bingjingc@synology.com>
To: Guoqing Jiang <gqjiang@suse.com>
Cc: linux-raid@vger.kernel.org, shli@kernel.org
Subject: Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
Date: Fri, 29 Dec 2017 14:02:44 +0800	[thread overview]
Message-ID: <4f1dbb814cafc8dc21cc7c2d7f0af30c@synology.com> (raw)
In-Reply-To: <3321c473-5d17-3065-08bc-b5e51f906383@suse.com>


Guoqing Jiang 於 2017-12-28 15:59 寫到:
> On 12/27/2017 01:47 PM, bingjingc wrote:
>> 
>> There is a potential deadlock if mount/umount happens when
>> raid5_finish_reshape() tries to grow the size of emulated disk.
>> 
>> How the deadlock happens?
>> 1) The raid5 resync thread finished reshape (expanding array).
>> 2) The mount or umount thread holds VFS sb->s_umount lock and tries to
>>    write through critical data into raid5 emulated block device. So it
>>    waits for raid5 kernel thread handling stripes in order to finish 
>> it
>>    I/Os.
>> 3) In the routine of raid5 kernel thread, md_check_recovery() will be
>>    called first in order to reap the raid5 resync thread. That is,
>>    raid5_finish_reshape() will be called. In this function, it will 
>> try
>>    to update conf and call VFS revalidate_disk() to grow the raid5
>>    emulated block device. It will try to acquire VFS sb->s_umount 
>> lock.
>> The raid5 kernel thread cannot continue, so no one can handle mount/
>> umount I/Os (stripes). Once the write-through I/Os cannot be finished,
>> mount/umount will not release sb->s_umount lock. The deadlock happens.
>> 
>> The raid5 kernel thread is an emulated block device. It is responible 
>> to
>> handle I/Os (stripes) from upper layers. The emulated block device
>> should not request any I/Os on itself. That is, it should not call VFS
>> layer functions. (If it did, it will try to acquire VFS locks to
>> guarantee the I/Os sequence.) So we have the resync thread to send
>> resync I/O requests and to wait for the results.
>> 
>> For solving this potential deadlock, we can put the size growth of the
>> emulated block device as the final step of reshape thread.
> 
> Maybe another way to avoid the deadlock is to ensure the reshaping 
> array
> can't be mount at all. One reason is, if the reshaping is finished, but 
> fs still
> don't know about  the new array size since metadata is not refreshed. 
> Just
> my $0.02.
> 

It may take hours or couples of days to expand an array. So I bet that 
it's a
feature that an expanding array can still serve fs requests. Once it 
finished,
the new array size for metadata can be passed to fs layer via 
revalidate_disk().
 From then, fs is able to be expanded, too. There are no needs to restart 
that
array. Based on this point, I don't want to break this feature.

>> 
>> Reported-by: Alex Wu <alexwu@synology.com>
>> Reviewed-by: Alex Wu <alexwu@synology.com>
>> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
>> Signed-off-by: BingJing Chang <bingjingc@synology.com>
>> ---
>>  drivers/md/md.c    | 13 +++++++++++++
>>  drivers/md/raid5.c |  8 +-------
>>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> I think raid10 has the similar issue, so at least you need to change
> raid10 either.

Thank you for your notice. It should be a similar issue. Let me verify
whether it is/isn't really reproducible in raid10 first. And we will 
provide
a fix for it.

Thanks,
BingJing

> 
> Thanks,
> Guoqing
> 
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4e4dee0..77d6d85 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8527,6 +8527,19 @@ void md_do_sync(struct md_thread *thread)
>>       * raid */
>>      set_mask_bits(&mddev->sb_flags, 0,
>>                BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>> +
>> +    if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> +            !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>> +            mddev->delta_disks > 0 &&
>> +            mddev->pers->finish_reshape &&
>> +            mddev->pers->size &&
>> +            mddev->queue) {
>> +        mddev_lock_nointr(mddev);
>> +        md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
>> +        mddev_unlock(mddev);
>> +        set_capacity(mddev->gendisk, mddev->array_sectors);
>> +        revalidate_disk(mddev->gendisk);
>> +    }
>> 
>>      spin_lock(&mddev->lock);
>>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 98ce427..0f68faa 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev 
>> *mddev)
>> 
>>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>> 
>> -        if (mddev->delta_disks > 0) {
>> -            md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
>> -            if (mddev->queue) {
>> -                set_capacity(mddev->gendisk, mddev->array_sectors);
>> -                revalidate_disk(mddev->gendisk);
>> -            }
>> -        } else {
>> +        if (mddev->delta_disks <= 0) {
>>              int d;
>>              spin_lock_irq(&conf->device_lock);
>>              mddev->degraded = raid5_calc_degraded(conf);


  reply	other threads:[~2017-12-29  6:02 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-27  5:47 [PATCH] md/raid5: fix a potential deadlock of raid5 reshape bingjingc
2017-12-28  7:59 ` Guoqing Jiang
2017-12-29  6:02   ` bingjingc [this message]
2017-12-29  6:42     ` Guoqing Jiang
2017-12-29  9:44       ` bingjingc
2018-02-17 20:52         ` Shaohua Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4f1dbb814cafc8dc21cc7c2d7f0af30c@synology.com \
    --to=bingjingc@synology.com \
    --cc=gqjiang@suse.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=shli@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox