Linux RAID subsystem development
 help / color / mirror / Atom feed
* [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
@ 2017-12-27  5:47 bingjingc
  2017-12-28  7:59 ` Guoqing Jiang
  0 siblings, 1 reply; 6+ messages in thread
From: bingjingc @ 2017-12-27  5:47 UTC (permalink / raw)
  To: linux-raid; +Cc: shli


There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
    write through critical data into raid5 emulated block device. So it
    waits for raid5 kernel thread handling stripes in order to finish it
    I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
    called first in order to reap the raid5 resync thread. That is,
    raid5_finish_reshape() will be called. In this function, it will try
    to update conf and call VFS revalidate_disk() to grow the raid5
    emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
---
  drivers/md/md.c    | 13 +++++++++++++
  drivers/md/raid5.c |  8 +-------
  2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e4dee0..77d6d85 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8527,6 +8527,19 @@ void md_do_sync(struct md_thread *thread)
  	 * raid */
  	set_mask_bits(&mddev->sb_flags, 0,
  		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+			!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+			mddev->delta_disks > 0 &&
+			mddev->pers->finish_reshape &&
+			mddev->pers->size &&
+			mddev->queue) {
+		mddev_lock_nointr(mddev);
+		md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
+		mddev_unlock(mddev);
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		revalidate_disk(mddev->gendisk);
+	}

  	spin_lock(&mddev->lock);
  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 98ce427..0f68faa 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev 
*mddev)

  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {

-		if (mddev->delta_disks > 0) {
-			md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
-			if (mddev->queue) {
-				set_capacity(mddev->gendisk, mddev->array_sectors);
-				revalidate_disk(mddev->gendisk);
-			}
-		} else {
+		if (mddev->delta_disks <= 0) {
  			int d;
  			spin_lock_irq(&conf->device_lock);
  			mddev->degraded = raid5_calc_degraded(conf);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
  2017-12-27  5:47 [PATCH] md/raid5: fix a potential deadlock of raid5 reshape bingjingc
@ 2017-12-28  7:59 ` Guoqing Jiang
  2017-12-29  6:02   ` bingjingc
  0 siblings, 1 reply; 6+ messages in thread
From: Guoqing Jiang @ 2017-12-28  7:59 UTC (permalink / raw)
  To: bingjingc, linux-raid; +Cc: shli



On 12/27/2017 01:47 PM, bingjingc wrote:
>
> There is a potential deadlock if mount/umount happens when
> raid5_finish_reshape() tries to grow the size of emulated disk.
>
> How the deadlock happens?
> 1) The raid5 resync thread finished reshape (expanding array).
> 2) The mount or umount thread holds VFS sb->s_umount lock and tries to
>    write through critical data into raid5 emulated block device. So it
>    waits for raid5 kernel thread handling stripes in order to finish it
>    I/Os.
> 3) In the routine of raid5 kernel thread, md_check_recovery() will be
>    called first in order to reap the raid5 resync thread. That is,
>    raid5_finish_reshape() will be called. In this function, it will try
>    to update conf and call VFS revalidate_disk() to grow the raid5
>    emulated block device. It will try to acquire VFS sb->s_umount lock.
> The raid5 kernel thread cannot continue, so no one can handle mount/
> umount I/Os (stripes). Once the write-through I/Os cannot be finished,
> mount/umount will not release sb->s_umount lock. The deadlock happens.
>
> The raid5 kernel thread is an emulated block device. It is responible to
> handle I/Os (stripes) from upper layers. The emulated block device
> should not request any I/Os on itself. That is, it should not call VFS
> layer functions. (If it did, it will try to acquire VFS locks to
> guarantee the I/Os sequence.) So we have the resync thread to send
> resync I/O requests and to wait for the results.
>
> For solving this potential deadlock, we can put the size growth of the
> emulated block device as the final step of reshape thread.

Maybe another way to avoid the deadlock is to ensure the reshaping array
can't be mount at all. One reason is, if the reshaping is finished, but 
fs still
don't know about  the new array size since metadata is not refreshed. Just
my $0.02.

>
> Reported-by: Alex Wu <alexwu@synology.com>
> Reviewed-by: Alex Wu <alexwu@synology.com>
> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
> Signed-off-by: BingJing Chang <bingjingc@synology.com>
> ---
>  drivers/md/md.c    | 13 +++++++++++++
>  drivers/md/raid5.c |  8 +-------
>  2 files changed, 14 insertions(+), 7 deletions(-)

I think raid10 has the similar issue, so at least you need to change 
raid10 either.

Thanks,
Guoqing

>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4e4dee0..77d6d85 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8527,6 +8527,19 @@ void md_do_sync(struct md_thread *thread)
>       * raid */
>      set_mask_bits(&mddev->sb_flags, 0,
>                BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
> +
> +    if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> +            !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> +            mddev->delta_disks > 0 &&
> +            mddev->pers->finish_reshape &&
> +            mddev->pers->size &&
> +            mddev->queue) {
> +        mddev_lock_nointr(mddev);
> +        md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
> +        mddev_unlock(mddev);
> +        set_capacity(mddev->gendisk, mddev->array_sectors);
> +        revalidate_disk(mddev->gendisk);
> +    }
>
>      spin_lock(&mddev->lock);
>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 98ce427..0f68faa 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev 
> *mddev)
>
>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>
> -        if (mddev->delta_disks > 0) {
> -            md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
> -            if (mddev->queue) {
> -                set_capacity(mddev->gendisk, mddev->array_sectors);
> -                revalidate_disk(mddev->gendisk);
> -            }
> -        } else {
> +        if (mddev->delta_disks <= 0) {
>              int d;
>              spin_lock_irq(&conf->device_lock);
>              mddev->degraded = raid5_calc_degraded(conf);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
  2017-12-28  7:59 ` Guoqing Jiang
@ 2017-12-29  6:02   ` bingjingc
  2017-12-29  6:42     ` Guoqing Jiang
  0 siblings, 1 reply; 6+ messages in thread
From: bingjingc @ 2017-12-29  6:02 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli


Guoqing Jiang 於 2017-12-28 15:59 寫到:
> On 12/27/2017 01:47 PM, bingjingc wrote:
>> 
>> There is a potential deadlock if mount/umount happens when
>> raid5_finish_reshape() tries to grow the size of emulated disk.
>> 
>> How the deadlock happens?
>> 1) The raid5 resync thread finished reshape (expanding array).
>> 2) The mount or umount thread holds VFS sb->s_umount lock and tries to
>>    write through critical data into raid5 emulated block device. So it
>>    waits for raid5 kernel thread handling stripes in order to finish 
>> it
>>    I/Os.
>> 3) In the routine of raid5 kernel thread, md_check_recovery() will be
>>    called first in order to reap the raid5 resync thread. That is,
>>    raid5_finish_reshape() will be called. In this function, it will 
>> try
>>    to update conf and call VFS revalidate_disk() to grow the raid5
>>    emulated block device. It will try to acquire VFS sb->s_umount 
>> lock.
>> The raid5 kernel thread cannot continue, so no one can handle mount/
>> umount I/Os (stripes). Once the write-through I/Os cannot be finished,
>> mount/umount will not release sb->s_umount lock. The deadlock happens.
>> 
>> The raid5 kernel thread is an emulated block device. It is responible 
>> to
>> handle I/Os (stripes) from upper layers. The emulated block device
>> should not request any I/Os on itself. That is, it should not call VFS
>> layer functions. (If it did, it will try to acquire VFS locks to
>> guarantee the I/Os sequence.) So we have the resync thread to send
>> resync I/O requests and to wait for the results.
>> 
>> For solving this potential deadlock, we can put the size growth of the
>> emulated block device as the final step of reshape thread.
> 
> Maybe another way to avoid the deadlock is to ensure the reshaping 
> array
> can't be mount at all. One reason is, if the reshaping is finished, but 
> fs still
> don't know about  the new array size since metadata is not refreshed. 
> Just
> my $0.02.
> 

It may take hours or couples of days to expand an array. So I bet that 
it's a
feature that an expanding array can still serve fs requests. Once it 
finished,
the new array size for metadata can be passed to fs layer via 
revalidate_disk().
 From then, fs is able to be expanded, too. There are no needs to restart 
that
array. Based on this point, I don't want to break this feature.

>> 
>> Reported-by: Alex Wu <alexwu@synology.com>
>> Reviewed-by: Alex Wu <alexwu@synology.com>
>> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
>> Signed-off-by: BingJing Chang <bingjingc@synology.com>
>> ---
>>  drivers/md/md.c    | 13 +++++++++++++
>>  drivers/md/raid5.c |  8 +-------
>>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> I think raid10 has the similar issue, so at least you need to change
> raid10 either.

Thank you for your notice. It should be a similar issue. Let me verify
whether it is/isn't really reproducible in raid10 first. And we will 
provide
a fix for it.

Thanks,
BingJing

> 
> Thanks,
> Guoqing
> 
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4e4dee0..77d6d85 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8527,6 +8527,19 @@ void md_do_sync(struct md_thread *thread)
>>       * raid */
>>      set_mask_bits(&mddev->sb_flags, 0,
>>                BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
>> +
>> +    if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> +            !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>> +            mddev->delta_disks > 0 &&
>> +            mddev->pers->finish_reshape &&
>> +            mddev->pers->size &&
>> +            mddev->queue) {
>> +        mddev_lock_nointr(mddev);
>> +        md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
>> +        mddev_unlock(mddev);
>> +        set_capacity(mddev->gendisk, mddev->array_sectors);
>> +        revalidate_disk(mddev->gendisk);
>> +    }
>> 
>>      spin_lock(&mddev->lock);
>>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 98ce427..0f68faa 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev 
>> *mddev)
>> 
>>      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>> 
>> -        if (mddev->delta_disks > 0) {
>> -            md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
>> -            if (mddev->queue) {
>> -                set_capacity(mddev->gendisk, mddev->array_sectors);
>> -                revalidate_disk(mddev->gendisk);
>> -            }
>> -        } else {
>> +        if (mddev->delta_disks <= 0) {
>>              int d;
>>              spin_lock_irq(&conf->device_lock);
>>              mddev->degraded = raid5_calc_degraded(conf);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
  2017-12-29  6:02   ` bingjingc
@ 2017-12-29  6:42     ` Guoqing Jiang
  2017-12-29  9:44       ` bingjingc
  0 siblings, 1 reply; 6+ messages in thread
From: Guoqing Jiang @ 2017-12-29  6:42 UTC (permalink / raw)
  To: bingjingc, Guoqing Jiang; +Cc: linux-raid, shli



On 12/29/2017 02:02 PM, bingjingc wrote:
>
> Guoqing Jiang 於 2017-12-28 15:59 寫到:
>> On 12/27/2017 01:47 PM, bingjingc wrote:
>>>
>>> There is a potential deadlock if mount/umount happens when
>>> raid5_finish_reshape() tries to grow the size of emulated disk.
>>>
>>> How the deadlock happens?
>>> 1) The raid5 resync thread finished reshape (expanding array).
>>> 2) The mount or umount thread holds VFS sb->s_umount lock and tries to
>>>    write through critical data into raid5 emulated block device. So it
>>>    waits for raid5 kernel thread handling stripes in order to finish it
>>>    I/Os.
>>> 3) In the routine of raid5 kernel thread, md_check_recovery() will be
>>>    called first in order to reap the raid5 resync thread. That is,
>>>    raid5_finish_reshape() will be called. In this function, it will try
>>>    to update conf and call VFS revalidate_disk() to grow the raid5
>>>    emulated block device. It will try to acquire VFS sb->s_umount lock.
>>> The raid5 kernel thread cannot continue, so no one can handle mount/
>>> umount I/Os (stripes). Once the write-through I/Os cannot be finished,
>>> mount/umount will not release sb->s_umount lock. The deadlock happens.
>>>
>>> The raid5 kernel thread is an emulated block device. It is 
>>> responible to
>>> handle I/Os (stripes) from upper layers. The emulated block device
>>> should not request any I/Os on itself. That is, it should not call VFS
>>> layer functions. (If it did, it will try to acquire VFS locks to
>>> guarantee the I/Os sequence.) So we have the resync thread to send
>>> resync I/O requests and to wait for the results.
>>>
>>> For solving this potential deadlock, we can put the size growth of the
>>> emulated block device as the final step of reshape thread.
>>
>> Maybe another way to avoid the deadlock is to ensure the reshaping array
>> can't be mount at all. One reason is, if the reshaping is finished, 
>> but fs still
>> don't know about  the new array size since metadata is not refreshed. 
>> Just
>> my $0.02.
>>
>
> It may take hours or couples of days to expand an array. So I bet that 
> it's a
> feature that an expanding array can still serve fs requests. Once it 
> finished,
> the new array size for metadata can be passed to fs layer via 
> revalidate_disk().
> From then, fs is able to be expanded, too. There are no needs to 
> restart that
> array. Based on this point, I don't want to break this feature.

Sorry, the metadata which I mentioned before is the metadata for file 
system.
How fs know about the new size without update fs's metadata? Pls see the 
wiki:
https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
  2017-12-29  6:42     ` Guoqing Jiang
@ 2017-12-29  9:44       ` bingjingc
  2018-02-17 20:52         ` Shaohua Li
  0 siblings, 1 reply; 6+ messages in thread
From: bingjingc @ 2017-12-29  9:44 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Guoqing Jiang, linux-raid, shli

Subject: [PATCH] md: fix a potential deadlock of raid5/raid10 reshape

There is a potential deadlock if mount/umount happens when
raid5_finish_reshape() tries to grow the size of emulated disk.

How the deadlock happens?
1) The raid5 resync thread finished reshape (expanding array).
2) The mount or umount thread holds VFS sb->s_umount lock and tries to
    write through critical data into raid5 emulated block device. So it
    waits for raid5 kernel thread handling stripes in order to finish it
    I/Os.
3) In the routine of raid5 kernel thread, md_check_recovery() will be
    called first in order to reap the raid5 resync thread. That is,
    raid5_finish_reshape() will be called. In this function, it will try
    to update conf and call VFS revalidate_disk() to grow the raid5
    emulated block device. It will try to acquire VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle mount/
umount I/Os (stripes). Once the write-through I/Os cannot be finished,
mount/umount will not release sb->s_umount lock. The deadlock happens.

The raid5 kernel thread is an emulated block device. It is responible to
handle I/Os (stripes) from upper layers. The emulated block device
should not request any I/Os on itself. That is, it should not call VFS
layer functions. (If it did, it will try to acquire VFS locks to
guarantee the I/Os sequence.) So we have the resync thread to send
resync I/O requests and to wait for the results.

For solving this potential deadlock, we can put the size growth of the
emulated block device as the final step of reshape thread.

2017/12/29:
Thanks to Guoqing Jiang <gqjiang@suse.com>,
we confirmed that there is the same deadlock issue in raid10. It's
reproducible and can be fixed by this patch. For raid10.c, we can remove
the similar code to prevent deadlock as well since they has been called
before.

Reported-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
---
  drivers/md/md.c     | 13 +++++++++++++
  drivers/md/raid10.c |  8 +-------
  drivers/md/raid5.c  |  8 +-------
  3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e4dee0..52092de 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8528,6 +8528,19 @@ void md_do_sync(struct md_thread *thread)
  	set_mask_bits(&mddev->sb_flags, 0,
  		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));

+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+			!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+			mddev->delta_disks > 0 &&
+			mddev->pers->finish_reshape &&
+			mddev->pers->size &&
+			mddev->queue) {
+		mddev_lock_nointr(mddev);
+		md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
+		mddev_unlock(mddev);
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		revalidate_disk(mddev->gendisk);
+	}
+
  	spin_lock(&mddev->lock);
  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
  		/* We completed so min/max setting can be forgotten if used. */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c131835..8a46473 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4818,17 +4818,11 @@ static void raid10_finish_reshape(struct mddev 
*mddev)
  		return;

  	if (mddev->delta_disks > 0) {
-		sector_t size = raid10_size(mddev, 0, 0);
-		md_set_array_sectors(mddev, size);
  		if (mddev->recovery_cp > mddev->resync_max_sectors) {
  			mddev->recovery_cp = mddev->resync_max_sectors;
  			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
  		}
-		mddev->resync_max_sectors = size;
-		if (mddev->queue) {
-			set_capacity(mddev->gendisk, mddev->array_sectors);
-			revalidate_disk(mddev->gendisk);
-		}
+		mddev->resync_max_sectors = mddev->array_sectors;
  	} else {
  		int d;
  		rcu_read_lock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 98ce427..0f68faa 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev 
*mddev)

  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {

-		if (mddev->delta_disks > 0) {
-			md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
-			if (mddev->queue) {
-				set_capacity(mddev->gendisk, mddev->array_sectors);
-				revalidate_disk(mddev->gendisk);
-			}
-		} else {
+		if (mddev->delta_disks <= 0) {
  			int d;
  			spin_lock_irq(&conf->device_lock);
  			mddev->degraded = raid5_calc_degraded(conf);
-- 
2.7.4



Guoqing Jiang 於 2017-12-29 14:42 寫到:
> On 12/29/2017 02:02 PM, bingjingc wrote:
>> 
>> Guoqing Jiang 於 2017-12-28 15:59 寫到:
>>> On 12/27/2017 01:47 PM, bingjingc wrote:
>>>> 
>>>> There is a potential deadlock if mount/umount happens when
>>>> raid5_finish_reshape() tries to grow the size of emulated disk.
>>>> 
>>>> How the deadlock happens?
>>>> 1) The raid5 resync thread finished reshape (expanding array).
>>>> 2) The mount or umount thread holds VFS sb->s_umount lock and tries 
>>>> to
>>>>    write through critical data into raid5 emulated block device. So 
>>>> it
>>>>    waits for raid5 kernel thread handling stripes in order to finish 
>>>> it
>>>>    I/Os.
>>>> 3) In the routine of raid5 kernel thread, md_check_recovery() will 
>>>> be
>>>>    called first in order to reap the raid5 resync thread. That is,
>>>>    raid5_finish_reshape() will be called. In this function, it will 
>>>> try
>>>>    to update conf and call VFS revalidate_disk() to grow the raid5
>>>>    emulated block device. It will try to acquire VFS sb->s_umount 
>>>> lock.
>>>> The raid5 kernel thread cannot continue, so no one can handle mount/
>>>> umount I/Os (stripes). Once the write-through I/Os cannot be 
>>>> finished,
>>>> mount/umount will not release sb->s_umount lock. The deadlock 
>>>> happens.
>>>> 
>>>> The raid5 kernel thread is an emulated block device. It is 
>>>> responible to
>>>> handle I/Os (stripes) from upper layers. The emulated block device
>>>> should not request any I/Os on itself. That is, it should not call 
>>>> VFS
>>>> layer functions. (If it did, it will try to acquire VFS locks to
>>>> guarantee the I/Os sequence.) So we have the resync thread to send
>>>> resync I/O requests and to wait for the results.
>>>> 
>>>> For solving this potential deadlock, we can put the size growth of 
>>>> the
>>>> emulated block device as the final step of reshape thread.
>>> 
>>> Maybe another way to avoid the deadlock is to ensure the reshaping 
>>> array
>>> can't be mount at all. One reason is, if the reshaping is finished, 
>>> but fs still
>>> don't know about  the new array size since metadata is not refreshed. 
>>> Just
>>> my $0.02.
>>> 
>> 
>> It may take hours or couples of days to expand an array. So I bet that 
>> it's a
>> feature that an expanding array can still serve fs requests. Once it 
>> finished,
>> the new array size for metadata can be passed to fs layer via 
>> revalidate_disk().
>> From then, fs is able to be expanded, too. There are no needs to 
>> restart that
>> array. Based on this point, I don't want to break this feature.
> 
> Sorry, the metadata which I mentioned before is the metadata for file 
> system.
> How fs know about the new size without update fs's metadata? Pls see 
> the wiki:
> https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem

Sorry, you're right. It requires few more commands to grow the
filesystem's size. I used wrong wordings. In fact, what I want to
mention is the metadata of the emulated disk. In fs/block_dev.c,
revalidate_disk() will call check_disk_size_change(disk, bdev) to
adjust the monitored block device size. I guess that these structures
belong to VFS. So for MD, it keeps the same behavior to issue the
upper layer about the size change.

Thanks,
BingJing

> 
> Thanks,
> Guoqing

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape
  2017-12-29  9:44       ` bingjingc
@ 2018-02-17 20:52         ` Shaohua Li
  0 siblings, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2018-02-17 20:52 UTC (permalink / raw)
  To: bingjingc; +Cc: Guoqing Jiang, Guoqing Jiang, linux-raid

On Fri, Dec 29, 2017 at 05:44:23PM +0800, bingjingc wrote:
> Subject: [PATCH] md: fix a potential deadlock of raid5/raid10 reshape
> 
> There is a potential deadlock if mount/umount happens when
> raid5_finish_reshape() tries to grow the size of emulated disk.
> 
> How the deadlock happens?
> 1) The raid5 resync thread finished reshape (expanding array).
> 2) The mount or umount thread holds VFS sb->s_umount lock and tries to
>    write through critical data into raid5 emulated block device. So it
>    waits for raid5 kernel thread handling stripes in order to finish it
>    I/Os.
> 3) In the routine of raid5 kernel thread, md_check_recovery() will be
>    called first in order to reap the raid5 resync thread. That is,
>    raid5_finish_reshape() will be called. In this function, it will try
>    to update conf and call VFS revalidate_disk() to grow the raid5
>    emulated block device. It will try to acquire VFS sb->s_umount lock.
> The raid5 kernel thread cannot continue, so no one can handle mount/
> umount I/Os (stripes). Once the write-through I/Os cannot be finished,
> mount/umount will not release sb->s_umount lock. The deadlock happens.
> 
> The raid5 kernel thread is an emulated block device. It is responible to
> handle I/Os (stripes) from upper layers. The emulated block device
> should not request any I/Os on itself. That is, it should not call VFS
> layer functions. (If it did, it will try to acquire VFS locks to
> guarantee the I/Os sequence.) So we have the resync thread to send
> resync I/O requests and to wait for the results.
> 
> For solving this potential deadlock, we can put the size growth of the
> emulated block device as the final step of reshape thread.
> 
> 2017/12/29:
> Thanks to Guoqing Jiang <gqjiang@suse.com>,
> we confirmed that there is the same deadlock issue in raid10. It's
> reproducible and can be fixed by this patch. For raid10.c, we can remove
> the similar code to prevent deadlock as well since they has been called
> before.
> 
> Reported-by: Alex Wu <alexwu@synology.com>
> Reviewed-by: Alex Wu <alexwu@synology.com>
> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
> Signed-off-by: BingJing Chang <bingjingc@synology.com>

can you resend this patch? it doesn't apply.

> ---
>  drivers/md/md.c     | 13 +++++++++++++
>  drivers/md/raid10.c |  8 +-------
>  drivers/md/raid5.c  |  8 +-------
>  3 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4e4dee0..52092de 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8528,6 +8528,19 @@ void md_do_sync(struct md_thread *thread)
>  	set_mask_bits(&mddev->sb_flags, 0,
>  		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
> 
> +	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> +			!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> +			mddev->delta_disks > 0 &&
> +			mddev->pers->finish_reshape &&
> +			mddev->pers->size &&
> +			mddev->queue) {
> +		mddev_lock_nointr(mddev);
> +		md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
> +		mddev_unlock(mddev);
> +		set_capacity(mddev->gendisk, mddev->array_sectors);
> +		revalidate_disk(mddev->gendisk);
> +	}
> +
>  	spin_lock(&mddev->lock);
>  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>  		/* We completed so min/max setting can be forgotten if used. */
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c131835..8a46473 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4818,17 +4818,11 @@ static void raid10_finish_reshape(struct mddev
> *mddev)
>  		return;
> 
>  	if (mddev->delta_disks > 0) {
> -		sector_t size = raid10_size(mddev, 0, 0);
> -		md_set_array_sectors(mddev, size);
>  		if (mddev->recovery_cp > mddev->resync_max_sectors) {
>  			mddev->recovery_cp = mddev->resync_max_sectors;
>  			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  		}
> -		mddev->resync_max_sectors = size;
> -		if (mddev->queue) {
> -			set_capacity(mddev->gendisk, mddev->array_sectors);
> -			revalidate_disk(mddev->gendisk);
> -		}
> +		mddev->resync_max_sectors = mddev->array_sectors;
>  	} else {
>  		int d;
>  		rcu_read_lock();
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 98ce427..0f68faa 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev *mddev)
> 
>  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> 
> -		if (mddev->delta_disks > 0) {
> -			md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
> -			if (mddev->queue) {
> -				set_capacity(mddev->gendisk, mddev->array_sectors);
> -				revalidate_disk(mddev->gendisk);
> -			}
> -		} else {
> +		if (mddev->delta_disks <= 0) {
>  			int d;
>  			spin_lock_irq(&conf->device_lock);
>  			mddev->degraded = raid5_calc_degraded(conf);
> -- 
> 2.7.4
> 
> 
> 
> Guoqing Jiang 於 2017-12-29 14:42 寫到:
> > On 12/29/2017 02:02 PM, bingjingc wrote:
> > > 
> > > Guoqing Jiang 於 2017-12-28 15:59 寫到:
> > > > On 12/27/2017 01:47 PM, bingjingc wrote:
> > > > > 
> > > > > There is a potential deadlock if mount/umount happens when
> > > > > raid5_finish_reshape() tries to grow the size of emulated disk.
> > > > > 
> > > > > How the deadlock happens?
> > > > > 1) The raid5 resync thread finished reshape (expanding array).
> > > > > 2) The mount or umount thread holds VFS sb->s_umount lock
> > > > > and tries to
> > > > >    write through critical data into raid5 emulated block
> > > > > device. So it
> > > > >    waits for raid5 kernel thread handling stripes in order
> > > > > to finish it
> > > > >    I/Os.
> > > > > 3) In the routine of raid5 kernel thread,
> > > > > md_check_recovery() will be
> > > > >    called first in order to reap the raid5 resync thread. That is,
> > > > >    raid5_finish_reshape() will be called. In this function,
> > > > > it will try
> > > > >    to update conf and call VFS revalidate_disk() to grow the raid5
> > > > >    emulated block device. It will try to acquire VFS
> > > > > sb->s_umount lock.
> > > > > The raid5 kernel thread cannot continue, so no one can handle mount/
> > > > > umount I/Os (stripes). Once the write-through I/Os cannot be
> > > > > finished,
> > > > > mount/umount will not release sb->s_umount lock. The
> > > > > deadlock happens.
> > > > > 
> > > > > The raid5 kernel thread is an emulated block device. It is
> > > > > responible to
> > > > > handle I/Os (stripes) from upper layers. The emulated block device
> > > > > should not request any I/Os on itself. That is, it should
> > > > > not call VFS
> > > > > layer functions. (If it did, it will try to acquire VFS locks to
> > > > > guarantee the I/Os sequence.) So we have the resync thread to send
> > > > > resync I/O requests and to wait for the results.
> > > > > 
> > > > > For solving this potential deadlock, we can put the size
> > > > > growth of the
> > > > > emulated block device as the final step of reshape thread.
> > > > 
> > > > Maybe another way to avoid the deadlock is to ensure the
> > > > reshaping array
> > > > can't be mount at all. One reason is, if the reshaping is
> > > > finished, but fs still
> > > > don't know about  the new array size since metadata is not
> > > > refreshed. Just
> > > > my $0.02.
> > > > 
> > > 
> > > It may take hours or couples of days to expand an array. So I bet
> > > that it's a
> > > feature that an expanding array can still serve fs requests. Once it
> > > finished,
> > > the new array size for metadata can be passed to fs layer via
> > > revalidate_disk().
> > > From then, fs is able to be expanded, too. There are no needs to
> > > restart that
> > > array. Based on this point, I don't want to break this feature.
> > 
> > Sorry, the metadata which I mentioned before is the metadata for file
> > system.
> > How fs know about the new size without update fs's metadata? Pls see the
> > wiki:
> > https://raid.wiki.kernel.org/index.php/Growing#Extending_the_filesystem
> 
> Sorry, you're right. It requires few more commands to grow the
> filesystem's size. I used wrong wordings. In fact, what I want to
> mention is the metadata of the emulated disk. In fs/block_dev.c,
> revalidate_disk() will call check_disk_size_change(disk, bdev) to
> adjust the monitored block device size. I guess that these structures
> belong to VFS. So for MD, it keeps the same behavior to issue the
> upper layer about the size change.
> 
> Thanks,
> BingJing
> 
> > 
> > Thanks,
> > Guoqing

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-02-17 20:52 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-27  5:47 [PATCH] md/raid5: fix a potential deadlock of raid5 reshape bingjingc
2017-12-28  7:59 ` Guoqing Jiang
2017-12-29  6:02   ` bingjingc
2017-12-29  6:42     ` Guoqing Jiang
2017-12-29  9:44       ` bingjingc
2018-02-17 20:52         ` Shaohua Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox