* [PATCH 0/2] md: regression fix
@ 2022-06-07 2:03 Guoqing Jiang
2022-06-07 2:03 ` [PATCH 1/2] Revert "md: don't unregister sync_thread with reconfig_mutex held" Guoqing Jiang
2022-06-07 2:03 ` [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store Guoqing Jiang
0 siblings, 2 replies; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-07 2:03 UTC (permalink / raw)
To: song; +Cc: buczek, logang, linux-raid
Hi,
The first one reverts commit 8b48ec23cc51a ("md: don't unregister
sync_thread with reconfig_mutex held") since the 07reshape5intr test is
broken because of it.
The second one tries another way to fix the original problem.
Thanks,
Guoqing
Guoqing Jiang (2):
Revert "md: don't unregister sync_thread with reconfig_mutex held"
md: unlock mddev before reap sync_thread in action_store
drivers/md/dm-raid.c | 3 ++-
drivers/md/md.c | 26 +++++++++++++++-----------
drivers/md/md.h | 2 +-
3 files changed, 18 insertions(+), 13 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/2] Revert "md: don't unregister sync_thread with reconfig_mutex held"
2022-06-07 2:03 [PATCH 0/2] md: regression fix Guoqing Jiang
@ 2022-06-07 2:03 ` Guoqing Jiang
2022-06-07 2:03 ` [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store Guoqing Jiang
1 sibling, 0 replies; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-07 2:03 UTC (permalink / raw)
To: song; +Cc: buczek, logang, linux-raid
The 07reshape5intr test is broken because of the path below.
md_reap_sync_thread
-> mddev_unlock
-> md_unregister_thread(&mddev->sync_thread)
And md_check_recovery is triggered by
mddev_unlock -> md_wakeup_thread(mddev->thread)
Then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and the superblock's
reshape_position can't be updated accordingly.
Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
drivers/md/dm-raid.c | 2 +-
drivers/md/md.c | 14 +++++---------
drivers/md/md.h | 2 +-
3 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5e41fbae3f6b..9526ccbedafb 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3725,7 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev, false);
+ md_reap_sync_thread(mddev);
}
} else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
return -EBUSY;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5c8efef13881..2e83a19e3aba 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4831,7 +4831,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev, true);
+ md_reap_sync_thread(mddev);
}
mddev_unlock(mddev);
}
@@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev, true);
+ md_reap_sync_thread(mddev);
}
del_timer_sync(&mddev->safemode_timer);
@@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
* ->spare_active and clear saved_raid_disk
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev, true);
+ md_reap_sync_thread(mddev);
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
@@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
goto unlock;
}
if (mddev->sync_thread) {
- md_reap_sync_thread(mddev, true);
+ md_reap_sync_thread(mddev);
goto unlock;
}
/* Set RUNNING before clearing NEEDED to avoid
@@ -9411,18 +9411,14 @@ void md_check_recovery(struct mddev *mddev)
}
EXPORT_SYMBOL(md_check_recovery);
-void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
+void md_reap_sync_thread(struct mddev *mddev)
{
struct md_rdev *rdev;
sector_t old_dev_sectors = mddev->dev_sectors;
bool is_reshaped = false;
- if (reconfig_mutex_held)
- mddev_unlock(mddev);
/* resync has finished, collect result */
md_unregister_thread(&mddev->sync_thread);
- if (reconfig_mutex_held)
- mddev_lock_nointr(mddev);
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 82465a4b1588..ba8224fd68e9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -719,7 +719,7 @@ extern struct md_thread *md_register_thread(
extern void md_unregister_thread(struct md_thread **threadp);
extern void md_wakeup_thread(struct md_thread *thread);
extern void md_check_recovery(struct mddev *mddev);
-extern void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held);
+extern void md_reap_sync_thread(struct mddev *mddev);
extern int mddev_init_writes_pending(struct mddev *mddev);
extern bool md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_inc(struct mddev *mddev, struct bio *bi);
--
2.31.1
* [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-07 2:03 [PATCH 0/2] md: regression fix Guoqing Jiang
2022-06-07 2:03 ` [PATCH 1/2] Revert "md: don't unregister sync_thread with reconfig_mutex held" Guoqing Jiang
@ 2022-06-07 2:03 ` Guoqing Jiang
2022-06-07 9:46 ` Guoqing Jiang
2022-06-08 17:23 ` Logan Gunthorpe
1 sibling, 2 replies; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-07 2:03 UTC (permalink / raw)
To: song; +Cc: buczek, logang, linux-raid
Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") is related to the action_store path, the other
callers which reap sync_thread don't need to be changed.
Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
previous bug as follows.
1. Unlock mddev before md_reap_sync_thread in action_store.
2. Save reshape_position before unlocking, then restore it to ensure the
position is not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
drivers/md/dm-raid.c | 1 +
drivers/md/md.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 9526ccbedafb..d43b8075c055 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3725,6 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}
} else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2e83a19e3aba..4d70672f8ea8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4830,6 +4830,12 @@ action_store(struct mddev *mddev, const char *page, size_t len)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
+ sector_t save_rp = mddev->reshape_position;
+
+ mddev_unlock(mddev);
+ md_unregister_thread(&mddev->sync_thread);
+ mddev_lock_nointr(mddev);
+ mddev->reshape_position = save_rp;
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
@@ -6197,6 +6203,7 @@ static void __md_stop_writes(struct mddev *mddev)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}
@@ -9303,6 +9310,7 @@ void md_check_recovery(struct mddev *mddev)
* ->spare_active and clear saved_raid_disk
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -9338,6 +9346,7 @@ void md_check_recovery(struct mddev *mddev)
goto unlock;
}
if (mddev->sync_thread) {
+ md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
goto unlock;
}
@@ -9417,8 +9426,7 @@ void md_reap_sync_thread(struct mddev *mddev)
sector_t old_dev_sectors = mddev->dev_sectors;
bool is_reshaped = false;
- /* resync has finished, collect result */
- md_unregister_thread(&mddev->sync_thread);
+ /* sync_thread should be unregistered, collect result */
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
--
2.31.1
* Re: [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-07 2:03 ` [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store Guoqing Jiang
@ 2022-06-07 9:46 ` Guoqing Jiang
2022-06-07 22:59 ` Song Liu
2022-06-08 17:23 ` Logan Gunthorpe
1 sibling, 1 reply; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-07 9:46 UTC (permalink / raw)
To: song; +Cc: buczek, logang, linux-raid
Please hold on, I will verify it with the latest kernel.
Thanks,
Guoqing
On 6/7/22 10:03 AM, Guoqing Jiang wrote:
> Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
> with reconfig_mutex held") is related to the action_store path, the other
> callers which reap sync_thread don't need to be changed.
>
> Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
> previous bug as follows.
>
> 1. Unlock mddev before md_reap_sync_thread in action_store.
> 2. Save reshape_position before unlocking, then restore it to ensure the
> position is not changed accidentally by others.
>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
> drivers/md/dm-raid.c | 1 +
> drivers/md/md.c | 12 ++++++++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 9526ccbedafb..d43b8075c055 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3725,6 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
> if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
> if (mddev->sync_thread) {
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> }
> } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2e83a19e3aba..4d70672f8ea8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4830,6 +4830,12 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> if (work_pending(&mddev->del_work))
> flush_workqueue(md_misc_wq);
> if (mddev->sync_thread) {
> + sector_t save_rp = mddev->reshape_position;
> +
> + mddev_unlock(mddev);
> + md_unregister_thread(&mddev->sync_thread);
> + mddev_lock_nointr(mddev);
> + mddev->reshape_position = save_rp;
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> md_reap_sync_thread(mddev);
> }
> @@ -6197,6 +6203,7 @@ static void __md_stop_writes(struct mddev *mddev)
> flush_workqueue(md_misc_wq);
> if (mddev->sync_thread) {
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> }
>
> @@ -9303,6 +9310,7 @@ void md_check_recovery(struct mddev *mddev)
> * ->spare_active and clear saved_raid_disk
> */
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> @@ -9338,6 +9346,7 @@ void md_check_recovery(struct mddev *mddev)
> goto unlock;
> }
> if (mddev->sync_thread) {
> + md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> goto unlock;
> }
> @@ -9417,8 +9426,7 @@ void md_reap_sync_thread(struct mddev *mddev)
> sector_t old_dev_sectors = mddev->dev_sectors;
> bool is_reshaped = false;
>
> - /* resync has finished, collect result */
> - md_unregister_thread(&mddev->sync_thread);
> + /* sync_thread should be unregistered, collect result */
> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> mddev->degraded != mddev->raid_disks) {
* Re: [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-07 9:46 ` Guoqing Jiang
@ 2022-06-07 22:59 ` Song Liu
2022-06-07 23:45 ` Guoqing Jiang
0 siblings, 1 reply; 8+ messages in thread
From: Song Liu @ 2022-06-07 22:59 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: Donald Buczek, Logan Gunthorpe, linux-raid
On Tue, Jun 7, 2022 at 2:46 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> Pls hold on, I will verify it with latest kernel.
>
> Thanks,
> Guoqing
>
> On 6/7/22 10:03 AM, Guoqing Jiang wrote:
> > Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
> > with reconfig_mutex held") is related to the action_store path, the other
> > callers which reap sync_thread don't need to be changed.
> >
> > Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
> > previous bug as follows.
> >
> > 1. Unlock mddev before md_reap_sync_thread in action_store.
> > 2. Save reshape_position before unlocking, then restore it to ensure the
> > position is not changed accidentally by others.
> >
> > Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
This version doesn't really work. Maybe we should ship the revert first?
Thanks,
Song
* Re: [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-07 22:59 ` Song Liu
@ 2022-06-07 23:45 ` Guoqing Jiang
0 siblings, 0 replies; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-07 23:45 UTC (permalink / raw)
To: Song Liu; +Cc: Donald Buczek, Logan Gunthorpe, linux-raid
On 6/8/22 6:59 AM, Song Liu wrote:
> On Tue, Jun 7, 2022 at 2:46 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>> Pls hold on, I will verify it with latest kernel.
>>
>> Thanks,
>> Guoqing
>>
>> On 6/7/22 10:03 AM, Guoqing Jiang wrote:
>>> Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
>>> with reconfig_mutex held") is related to the action_store path, the other
>>> callers which reap sync_thread don't need to be changed.
>>>
>>> Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
>>> previous bug as follows.
>>>
>>> 1. Unlock mddev before md_reap_sync_thread in action_store.
>>> 2. Save reshape_position before unlocking, then restore it to ensure the
>>> position is not changed accidentally by others.
>>>
>>> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> This version doesn't really work. Maybe we should ship the revert first?
Agreed, please apply the revert first.
Thanks,
Guoqing
* Re: [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-07 2:03 ` [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store Guoqing Jiang
2022-06-07 9:46 ` Guoqing Jiang
@ 2022-06-08 17:23 ` Logan Gunthorpe
2022-06-09 7:20 ` Guoqing Jiang
1 sibling, 1 reply; 8+ messages in thread
From: Logan Gunthorpe @ 2022-06-08 17:23 UTC (permalink / raw)
To: Guoqing Jiang, song; +Cc: buczek, linux-raid
Hey,
On 2022-06-06 20:03, Guoqing Jiang wrote:
> Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
> with reconfig_mutex held") is related to the action_store path, the other
> callers which reap sync_thread don't need to be changed.
>
> Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
> previous bug as follows.
>
> 1. Unlock mddev before md_reap_sync_thread in action_store.
> 2. Save reshape_position before unlocking, then restore it to ensure the
> position is not changed accidentally by others.
>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
> drivers/md/dm-raid.c | 1 +
> drivers/md/md.c | 12 ++++++++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 9526ccbedafb..d43b8075c055 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3725,6 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
> if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
> if (mddev->sync_thread) {
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> }
> } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2e83a19e3aba..4d70672f8ea8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4830,6 +4830,12 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> if (work_pending(&mddev->del_work))
> flush_workqueue(md_misc_wq);
> if (mddev->sync_thread) {
> + sector_t save_rp = mddev->reshape_position;
> +
> + mddev_unlock(mddev);
> + md_unregister_thread(&mddev->sync_thread);
> + mddev_lock_nointr(mddev);
> + mddev->reshape_position = save_rp;
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> md_reap_sync_thread(mddev);
> }
So with this patch, with 07revert-grow, I see the warning at the end of
this email. It seems there's a new bug in the hunk above: MD_RECOVERY_INTR
should be set before md_unregister_thread().
When I make that change, 07revert-grow fails in the same way it did with
the previous patch.
Logan
--
[ 177.804352] ------------[ cut here ]------------
[ 177.805200] WARNING: CPU: 2 PID: 1382 at drivers/md/md.c:490
mddev_suspend+0x26c/0x330
[ 177.806541] Modules linked in:
[ 177.807078] CPU: 2 PID: 1382 Comm: md0_raid5 Not tainted
5.18.0-rc3-eid-vmlocalyes-dbg-00035-gd1d91cae9ac1 #2237
[ 177.809122] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.14.0-2 04/01/2014
[ 177.810649] RIP: 0010:mddev_suspend+0x26c/0x330
[ 177.811654] Code: ab 5c 13 00 00 e9 a3 fe ff ff 48 8d bb d8 02 00 00
be ff ff ff ff e8 d3 97 5f 00 85 c0 0f 85 74 fe ff ff 0f 0b e9 6d fe ff
ff <0f> 0b e9 4c fe ff ff 4c 8d bd 68 ff ff ff 31 f6 4c 89 ff e8 1c f7
[ 177.815028] RSP: 0018:ffff88828656faa8 EFLAGS: 00010246
[ 177.815995] RAX: 0000000000000000 RBX: ffff8881765e8000 RCX:
ffffffff9b3b1ed4
[ 177.817306] RDX: dffffc0000000000 RSI: ffff88817333e478 RDI:
ffff88816e861670
[ 177.818555] RBP: ffff88828656fb78 R08: ffffffff9b38e442 R09:
ffffffff9e94db07
[ 177.819757] R10: fffffbfff3d29b60 R11: 0000000000000001 R12:
ffff88816e861600
[ 177.821043] R13: ffff88829c34cf00 R14: ffff8881765e8000 R15:
ffff88817333e248
[ 177.822307] FS: 0000000000000000(0000) GS:ffff888472e00000(0000)
knlGS:0000000000000000
[ 177.823699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 177.824701] CR2: 00007faa60ada843 CR3: 000000028e9ca002 CR4:
0000000000370ee0
[ 177.825889] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 177.827111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 177.828346] Call Trace:
[ 177.828785] <TASK>
[ 177.829275] ? rdev_read_only+0x150/0x150
[ 177.830180] ? lock_is_held_type+0xf3/0x150
[ 177.830888] ? raid5_calc_degraded+0x1f9/0x650
[ 177.831665] check_reshape+0x289/0x500
[ 177.832360] raid5_check_reshape+0x93/0x150
[ 177.833060] md_check_recovery+0x864/0xa80
[ 177.833757] raid5d+0xc6/0x930
[ 177.834272] ? __this_cpu_preempt_check+0x13/0x20
[ 177.835060] ? lock_downgrade+0x3f0/0x3f0
[ 177.835727] ? raid5_do_work+0x330/0x330
[ 177.836392] ? __this_cpu_preempt_check+0x13/0x20
[ 177.837313] ? lockdep_hardirqs_on+0x82/0x110
[ 177.838156] ? _raw_spin_unlock_irqrestore+0x31/0x60
[ 177.838977] ? trace_hardirqs_on+0x2b/0x110
[ 177.839678] ? preempt_count_sub+0x18/0xc0
[ 177.840383] ? _raw_spin_unlock_irqrestore+0x41/0x60
[ 177.841241] md_thread+0x1a2/0x2c0
[ 177.841836] ? suspend_lo_store+0x140/0x140
[ 177.842515] ? __this_cpu_preempt_check+0x13/0x20
[ 177.843290] ? lockdep_hardirqs_on+0x82/0x110
[ 177.844111] ? destroy_sched_domains_rcu+0x40/0x40
[ 177.844979] ? preempt_count_sub+0x18/0xc0
[ 177.845769] ? __kasan_check_read+0x11/0x20
[ 177.846514] ? suspend_lo_store+0x140/0x140
[ 177.847210] kthread+0x177/0x1b0
[ 177.847756] ? kthread_complete_and_exit+0x30/0x30
[ 177.848570] ret_from_fork+0x22/0x30
[ 177.849204] </TASK>
[ 177.849584] irq event stamp: 609927
[ 177.850163] hardirqs last enabled at (609935): [<ffffffff9a216ea4>]
__up_console_sem+0x64/0x70
[ 177.851601] hardirqs last disabled at (609942): [<ffffffff9a216e89>]
__up_console_sem+0x49/0x70
[ 177.853048] softirqs last enabled at (609546): [<ffffffff9a14751e>]
__irq_exit_rcu+0xfe/0x150
[ 177.854706] softirqs last disabled at (609499): [<ffffffff9a14751e>]
__irq_exit_rcu+0xfe/0x150
[ 177.856240] ---[ end trace 0000000000000000 ]---
* Re: [PATCH 2/2] md: unlock mddev before reap sync_thread in action_store
2022-06-08 17:23 ` Logan Gunthorpe
@ 2022-06-09 7:20 ` Guoqing Jiang
0 siblings, 0 replies; 8+ messages in thread
From: Guoqing Jiang @ 2022-06-09 7:20 UTC (permalink / raw)
To: Logan Gunthorpe, song; +Cc: buczek, linux-raid
On 6/9/22 1:23 AM, Logan Gunthorpe wrote:
> Hey,
>
> On 2022-06-06 20:03, Guoqing Jiang wrote:
>> Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread
>> with reconfig_mutex held") is related to the action_store path, the other
>> callers which reap sync_thread don't need to be changed.
>>
>> Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
>> previous bug as follows.
>>
>> 1. Unlock mddev before md_reap_sync_thread in action_store.
>> 2. Save reshape_position before unlocking, then restore it to ensure the
>> position is not changed accidentally by others.
>>
>> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
>> ---
>> drivers/md/dm-raid.c | 1 +
>> drivers/md/md.c | 12 ++++++++++--
>> 2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>> index 9526ccbedafb..d43b8075c055 100644
>> --- a/drivers/md/dm-raid.c
>> +++ b/drivers/md/dm-raid.c
>> @@ -3725,6 +3725,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
>> if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
>> if (mddev->sync_thread) {
>> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> + md_unregister_thread(&mddev->sync_thread);
>> md_reap_sync_thread(mddev);
>> }
>> } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2e83a19e3aba..4d70672f8ea8 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -4830,6 +4830,12 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>> if (work_pending(&mddev->del_work))
>> flush_workqueue(md_misc_wq);
>> if (mddev->sync_thread) {
>> + sector_t save_rp = mddev->reshape_position;
>> +
>> + mddev_unlock(mddev);
>> + md_unregister_thread(&mddev->sync_thread);
>> + mddev_lock_nointr(mddev);
>> + mddev->reshape_position = save_rp;
>> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> md_reap_sync_thread(mddev);
>> }
> So with this patch, with 07revert-grow, I see the warning at the end of
> this email. It seems there's a new bug in the hunk above: MD_RECOVERY_INTR
> should be set before md_unregister_thread().
Yes, it was missed by mistake; the incremental change below is needed.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e4d9a8b0efac..c29ef9505e03 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4833,6 +4833,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
sector_t save_rp = mddev->reshape_position;
mddev_unlock(mddev);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_unregister_thread(&mddev->sync_thread);
mddev_lock_nointr(mddev);
mddev->reshape_position = save_rp;
And we may not need to save/restore reshape_position; I will run more tests
before sending a new version. Xiao's suggestion might be safer if it
fixes the original issue.
> When I make that change, 07revert-grow fails in the same way it did with
> the previous patch.
>
> Logan
>
> --
>
>
>
> [ 177.804352] ------------[ cut here ]------------
> [ 177.805200] WARNING: CPU: 2 PID: 1382 at drivers/md/md.c:490
> mddev_suspend+0x26c/0x330
> [ 177.806541] Modules linked in:
> [ 177.807078] CPU: 2 PID: 1382 Comm: md0_raid5 Not tainted
> 5.18.0-rc3-eid-vmlocalyes-dbg-00035-gd1d91cae9ac1 #2237
> [ 177.809122] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.14.0-2 04/01/2014
> [ 177.810649] RIP: 0010:mddev_suspend+0x26c/0x330
> [ 177.811654] Code: ab 5c 13 00 00 e9 a3 fe ff ff 48 8d bb d8 02 00 00
> be ff ff ff ff e8 d3 97 5f 00 85 c0 0f 85 74 fe ff ff 0f 0b e9 6d fe ff
> ff <0f> 0b e9 4c fe ff ff 4c 8d bd 68 ff ff ff 31 f6 4c 89 ff e8 1c f7
> [ 177.815028] RSP: 0018:ffff88828656faa8 EFLAGS: 00010246
> [ 177.815995] RAX: 0000000000000000 RBX: ffff8881765e8000 RCX:
> ffffffff9b3b1ed4
> [ 177.817306] RDX: dffffc0000000000 RSI: ffff88817333e478 RDI:
> ffff88816e861670
> [ 177.818555] RBP: ffff88828656fb78 R08: ffffffff9b38e442 R09:
> ffffffff9e94db07
> [ 177.819757] R10: fffffbfff3d29b60 R11: 0000000000000001 R12:
> ffff88816e861600
> [ 177.821043] R13: ffff88829c34cf00 R14: ffff8881765e8000 R15:
> ffff88817333e248
> [ 177.822307] FS: 0000000000000000(0000) GS:ffff888472e00000(0000)
> knlGS:0000000000000000
> [ 177.823699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 177.824701] CR2: 00007faa60ada843 CR3: 000000028e9ca002 CR4:
> 0000000000370ee0
> [ 177.825889] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 177.827111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 177.828346] Call Trace:
> [ 177.828785] <TASK>
> [ 177.829275] ? rdev_read_only+0x150/0x150
> [ 177.830180] ? lock_is_held_type+0xf3/0x150
> [ 177.830888] ? raid5_calc_degraded+0x1f9/0x650
> [ 177.831665] check_reshape+0x289/0x500
> [ 177.832360] raid5_check_reshape+0x93/0x150
> [ 177.833060] md_check_recovery+0x864/0xa80
> [ 177.833757] raid5d+0xc6/0x930
I saw exactly the same trace when I replied with "hold on" 😁.
Thanks,
Guoqing