public inbox for linux-raid@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] A couple more bug fixes
@ 2022-09-08 16:15 Logan Gunthorpe
  2022-09-08 16:15 ` [PATCH v2 1/3] md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() Logan Gunthorpe
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 16:15 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Hey,

The first two patches are a resend of the two sent earlier, with tags
collected and a couple of minor typos fixed.

The third patch fixes the deadlock issue I brought up in another email.

These patches are based on the current md/md-next branch (526bd69b9d3).

Thanks,

Logan

--

David Sloan (1):
  md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()

Logan Gunthorpe (2):
  md: Remove extra mddev_get() in md_seq_start()
  md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d

 drivers/md/md.c    | 1 -
 drivers/md/raid5.c | 5 ++++-
 2 files changed, 4 insertions(+), 2 deletions(-)


base-commit: 526bd69b9d330eed1db59b2cf6a7d18caf866847
--
2.30.2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/3] md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()
  2022-09-08 16:15 [PATCH v2 0/3] A couple more bug fixes Logan Gunthorpe
@ 2022-09-08 16:15 ` Logan Gunthorpe
  2022-09-08 16:15 ` [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start() Logan Gunthorpe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 16:15 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, David Sloan, Logan Gunthorpe, Christoph Hellwig,
	Guoqing Jiang

From: David Sloan <david.sloan@eideticom.com>

When running chunk-sized reads on disks with badblocks, duplicate bio
free/puts are observed:

   =============================================================================
   BUG bio-200 (Not tainted): Object already free
   -----------------------------------------------------------------------------
   Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
    __slab_alloc.constprop.0+0x5a/0xb0
    kmem_cache_alloc+0x31e/0x330
    mempool_alloc_slab+0x17/0x20
    mempool_alloc+0x100/0x2b0
    bio_alloc_bioset+0x181/0x460
    do_mpage_readpage+0x776/0xd00
    mpage_readahead+0x166/0x320
    blkdev_readahead+0x15/0x20
    read_pages+0x13f/0x5f0
    page_cache_ra_unbounded+0x18d/0x220
    force_page_cache_ra+0x181/0x1c0
    page_cache_sync_ra+0x65/0xb0
    filemap_get_pages+0x1df/0xaf0
    filemap_read+0x1e1/0x700
    blkdev_read_iter+0x1e5/0x330
    vfs_read+0x42a/0x570
   Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
    kmem_cache_free+0x46d/0x490
    mempool_free_slab+0x17/0x20
    mempool_free+0x66/0x190
    bio_free+0x78/0x90
    bio_put+0x100/0x1a0
    raid5_make_request+0x2259/0x2450
    md_handle_request+0x402/0x600
    md_submit_bio+0xd9/0x120
    __submit_bio+0x11f/0x1b0
    submit_bio_noacct_nocheck+0x204/0x480
    submit_bio_noacct+0x32e/0xc70
    submit_bio+0x98/0x1a0
    mpage_readahead+0x250/0x320
    blkdev_readahead+0x15/0x20
    read_pages+0x13f/0x5f0
    page_cache_ra_unbounded+0x18d/0x220
   Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
   CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
   Workqueue: raid5wq raid5_do_work
   Call Trace:
    <TASK>
    dump_stack_lvl+0x5a/0x78
    dump_stack+0x10/0x16
    print_trailer+0x158/0x165
    object_err+0x35/0x50
    free_debug_processing.cold+0xb7/0xbe
    __slab_free+0x1ae/0x330
    kmem_cache_free+0x46d/0x490
    mempool_free_slab+0x17/0x20
    mempool_free+0x66/0x190
    bio_free+0x78/0x90
    bio_put+0x100/0x1a0
    mpage_end_io+0x36/0x150
    bio_endio+0x2fd/0x360
    md_end_io_acct+0x7e/0x90
    bio_endio+0x2fd/0x360
    handle_failed_stripe+0x960/0xb80
    handle_stripe+0x1348/0x3760
    handle_active_stripes.constprop.0+0x72a/0xaf0
    raid5_do_work+0x177/0x330
    process_one_work+0x616/0xb20
    worker_thread+0x2bd/0x6f0
    kthread+0x179/0x1b0
    ret_from_fork+0x22/0x30
    </TASK>

The double free is caused by an unnecessary bio_put() in the
if(is_badblock(...)) error path in raid5_read_one_chunk().

The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c
("md/raid5: move checking badblock before clone bio in
raid5_read_one_chunk"). The previous code checked and freed align_bio,
which required a bio_put(). After the move, that is no longer needed:
raid_bio is returned to the control of the common io path, which
performs its own endio, resulting in a double free on bad device
blocks.
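
The ownership rule at play can be sketched outside the kernel. The
following is a hypothetical miniature of the bio refcount model (the
names obj_put, error_path_buggy and the frees counter are invented for
illustration, not kernel API); it shows how an extra put in an error
path turns into a second free once the real owner's endio also puts:

```c
#include <assert.h>

/* Hypothetical miniature of the bio refcount model: a put drops one
 * reference, and a put after the count has already reached zero is a
 * double free (counted here instead of crashing). */
struct ref_obj {
	int refs;
	int frees;               /* number of times "free" ran */
};

static void obj_put(struct ref_obj *o)
{
	if (--o->refs <= 0)
		o->frees++;      /* stand-in for freeing the bio */
}

/* Before the fix: the badblock error path dropped a reference it did
 * not own; raid_bio still belongs to the common io path. */
static int error_path_buggy(struct ref_obj *raid_bio)
{
	obj_put(raid_bio);       /* the extra bio_put() */
	return 0;
}

static int error_path_fixed(struct ref_obj *raid_bio)
{
	(void)raid_bio;          /* ownership left with the common io path */
	return 0;
}

/* Run one error path, then the common io path's own endio put.
 * Returns how many times the object was freed. */
static int simulate(int (*error_path)(struct ref_obj *))
{
	struct ref_obj raid_bio = { .refs = 1, .frees = 0 };

	error_path(&raid_bio);
	obj_put(&raid_bio);      /* endio in the common io path */
	return raid_bio.frees;
}
```

In the fixed variant the error path simply returns, leaving the single
reference for the common io path's endio to drop.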

Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
Signed-off-by: David Sloan <david.sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4e6d865a6456..734f92e75f85 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5538,7 +5538,6 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 
 	if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad,
 			&bad_sectors)) {
-		bio_put(raid_bio);
 		rdev_dec_pending(rdev, mddev);
 		return 0;
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start()
  2022-09-08 16:15 [PATCH v2 0/3] A couple more bug fixes Logan Gunthorpe
  2022-09-08 16:15 ` [PATCH v2 1/3] md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() Logan Gunthorpe
@ 2022-09-08 16:15 ` Logan Gunthorpe
  2022-09-09  4:53   ` Paul Menzel
  2022-09-08 16:15 ` [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Logan Gunthorpe
  2022-09-09 14:41 ` [PATCH v2 0/3] A couple more bug fixes Song Liu
  3 siblings, 1 reply; 8+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 16:15 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig, Guoqing Jiang

A regression is seen where mddev devices persist permanently after
they are stopped, due to an elevated reference count.

This was tracked down to an extra mddev_get() in md_seq_start().

It only happened rarely because, most of the time, md_seq_start() is
called with a zero offset; the path with the extra mddev_get() is taken
only when it starts with a non-zero offset.

The commit noted below changed an mddev_get() to check its success
but inadvertently left the original call in. Remove the extra call.
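
The shape of the bug can be sketched in plain C. This is a hypothetical
stand-in (the names mini_get, mini_put and seq_* are invented for
illustration), showing how a leftover unconditional get in front of the
new checked get leaves the count elevated after the matching put:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mddev refcount; mini_get() mirrors the
 * "returns NULL on failure" style of the checked mddev_get(). */
struct mini_mddev { int refs; };

static struct mini_mddev *mini_get(struct mini_mddev *m)
{
	if (m == NULL)
		return NULL;
	m->refs++;
	return m;
}

static void mini_put(struct mini_mddev *m) { m->refs--; }

/* Buggy shape: the old unconditional get was left in front of the new
 * checked get, so the iterator takes two references... */
static int seq_buggy(struct mini_mddev *m)
{
	mini_get(m);             /* leftover original call */
	if (!mini_get(m))
		return -1;
	mini_put(m);             /* ...but the stop path drops only one */
	return m->refs;          /* leaked ref keeps the device alive */
}

static int seq_fixed(struct mini_mddev *m)
{
	if (!mini_get(m))
		return -1;
	mini_put(m);
	return m->refs;          /* back to zero: device can go away */
}
```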

Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
---
 drivers/md/md.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index afaf36b2f6ab..9dc0175280b4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8154,7 +8154,6 @@ static void *md_seq_start(struct seq_file *seq, loff_t *pos)
 	list_for_each(tmp,&all_mddevs)
 		if (!l--) {
 			mddev = list_entry(tmp, struct mddev, all_mddevs);
-			mddev_get(mddev);
 			if (!mddev_get(mddev))
 				continue;
 			spin_unlock(&all_mddevs_lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
  2022-09-08 16:15 [PATCH v2 0/3] A couple more bug fixes Logan Gunthorpe
  2022-09-08 16:15 ` [PATCH v2 1/3] md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() Logan Gunthorpe
  2022-09-08 16:15 ` [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start() Logan Gunthorpe
@ 2022-09-08 16:15 ` Logan Gunthorpe
  2022-09-15 15:15   ` Logan Gunthorpe
  2022-09-09 14:41 ` [PATCH v2 0/3] A couple more bug fixes Song Liu
  3 siblings, 1 reply; 8+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 16:15 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

A complicated deadlock exists when using the journal and an elevated
group_thread_cnt. It was found with loop devices, but it's not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.

When the deadlock occurs, multiple threads will hang in different ways:

 1) The group threads will hang in the blk-wbt code with bios waiting to
    be submitted to the block layer:

        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        ops_run_io+0x46b/0x1a30
        handle_stripe+0xcd3/0x36b0
        handle_active_stripes.constprop.0+0x6f6/0xa60
        raid5_do_work+0x177/0x330

    Or:
        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        flush_deferred_bios+0x136/0x170
        raid5_do_work+0x262/0x330

 2) The r5l_reclaim thread will hang in the same way, submitting a
    bio to the block layer:

        io_schedule+0x70/0xb0
        rq_qos_wait+0x153/0x210
        wbt_wait+0x115/0x1b0
        __rq_qos_throttle+0x38/0x60
        blk_mq_submit_bio+0x589/0xcd0
        __submit_bio+0xe6/0x100
        submit_bio_noacct_nocheck+0x42e/0x470
        submit_bio_noacct+0x4c2/0xbb0
        submit_bio+0x3f/0xf0
        md_super_write+0x12f/0x1b0
        md_update_sb.part.0+0x7c6/0xff0
        md_update_sb+0x30/0x60
        r5l_do_reclaim+0x4f9/0x5e0
        r5l_reclaim_thread+0x69/0x30b

    However, before hanging, the MD_SB_CHANGE_PENDING flag will be
    set for sb_flags in r5l_write_super_and_discard_space(). This
    flag will never be cleared because the submit_bio() call never
    returns.

 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
    will do no processing on any pending stripes and re-set
    STRIPE_HANDLE. This will cause the raid5d thread to enter an
    infinite loop, constantly trying to handle the same stripes
    stuck in the queue.

    The raid5d thread has a blk_plug that holds a number of bios
    that are also stuck waiting, since the thread is in a loop
    that never schedules. These bios have been accounted for by
    blk-wbt, thus preventing the other threads above from
    continuing when they try to submit bios: deadlock.

To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
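
The livelock-vs-wait behavior can be sketched single-threaded. This is
an illustrative model, not kernel code (all names here are invented);
it shows that while the pending flag is set the handling loop makes no
progress, and that sleeping, which as a side effect flushes the plug
and lets the flag get cleared, breaks the cycle:

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded sketch of why raid5d spins: while the change-pending
 * flag is set, the stripe handler refuses to process the stripe and
 * just re-queues it, so the loop makes no progress on its own. */
static bool change_pending;
static int plugged_bios;        /* bios held in the thread's blk_plug */

/* Stand-in for handle_stripe(): true if the stripe was handled. */
static bool try_handle_stripe(void)
{
	if (change_pending)
		return false;   /* re-set STRIPE_HANDLE, try again */
	return true;
}

/* Stand-in for the schedule inside wait_event(): sleeping flushes the
 * plug, which lets the reclaim thread's superblock write complete and
 * clear the flag. */
static void wait_for_sb_clear(void)
{
	plugged_bios = 0;       /* plug flushed on schedule */
	change_pending = false; /* reclaim thread makes progress */
}

/* Returns the iteration on which the stripe was handled, or -1 for
 * "livelock" (capped at max_iters). */
static int raid5d_loop(bool with_wait, int max_iters)
{
	int i;

	change_pending = true;
	plugged_bios = 3;
	for (i = 1; i <= max_iters; i++) {
		if (try_handle_stripe())
			return i;
		if (with_wait)
			wait_for_sb_clear();
	}
	return -1;
}
```

With the wait in place the loop finishes on the second pass; without
it, the model spins until the iteration cap, mirroring the hang.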

It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:

    16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
                   is updated.")

Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 734f92e75f85..7d1b7e361ef2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6783,6 +6783,10 @@ static void raid5d(struct md_thread *thread)
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
 		}
+
+		wait_event_lock_irq(mddev->sb_wait,
+			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
+			conf->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start()
  2022-09-08 16:15 ` [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start() Logan Gunthorpe
@ 2022-09-09  4:53   ` Paul Menzel
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Menzel @ 2022-09-09  4:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Song Liu, linux-kernel, linux-raid, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig

Dear Logan,


Thank you for the updated series. One nit below, which could be 
corrected by the maintainer before applying.

Am 08.09.22 um 18:15 schrieb Logan Gunthorpe:
> A regression is seen where mddev devices stay permanently after they
> are stopped due to an elevated reference count.
> 
> This was tracked down to an extra mddev_get() in md_seq_start().
> 
> It only happened rarely because most of the time the md_seq_start()
> is called with a zero offset. The path with an extra mddev_get() only
> happens when it starts with a non-zero offset.
> 
> The commit noted below changed an mddev_get() to check its success
> but inadevrtantly left the original call in. Remove the extra call.

inadvertently

> Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed")
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
> ---
>   drivers/md/md.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index afaf36b2f6ab..9dc0175280b4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8154,7 +8154,6 @@ static void *md_seq_start(struct seq_file *seq, loff_t *pos)
>   	list_for_each(tmp,&all_mddevs)
>   		if (!l--) {
>   			mddev = list_entry(tmp, struct mddev, all_mddevs);
> -			mddev_get(mddev);
>   			if (!mddev_get(mddev))
>   				continue;
>   			spin_unlock(&all_mddevs_lock);


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 0/3] A couple more bug fixes
  2022-09-08 16:15 [PATCH v2 0/3] A couple more bug fixes Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2022-09-08 16:15 ` [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Logan Gunthorpe
@ 2022-09-09 14:41 ` Song Liu
  3 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-09-09 14:41 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: open list, linux-raid, Christoph Hellwig, Guoqing Jiang,
	Stephen Bates, Martin Oliveira, David Sloan

On Thu, Sep 8, 2022 at 9:15 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hey,
>
> The first two patches are a resend of the two sent earlier with tags
> collected and a couple minor typos fixed.
>
> The third patch fixes the deadlock issue I brought up in another email.
>
> These patches are based on the current md/md-next branch (526bd69b9d3).
>
> Thanks,
>
> Logan

Applied to md-next. Thanks!

Song

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
  2022-09-08 16:15 ` [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Logan Gunthorpe
@ 2022-09-15 15:15   ` Logan Gunthorpe
  2022-09-19 18:31     ` Song Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Logan Gunthorpe @ 2022-09-15 15:15 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan


Hi Song,

On 2022-09-08 10:15, Logan Gunthorpe wrote:
> A complicated deadlock exists when using the journal and an elevated
> group_thrtead_cnt. It was found with loop devices, but its not clear
> whether it can be seen with real disks. The deadlock can occur simply
> by writing data with an fio script.
> 
> [...]

It seems this patch of mine in md-next is causing roughly a 1-in-30
failure on the mdadm test 13imsm-r0_r5_3d-grow-r0_r5_4d.

I'm looking into it and will try to send an updated patch when I have a fix.

Logan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
  2022-09-15 15:15   ` Logan Gunthorpe
@ 2022-09-19 18:31     ` Song Liu
  0 siblings, 0 replies; 8+ messages in thread
From: Song Liu @ 2022-09-19 18:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: open list, linux-raid, Christoph Hellwig, Guoqing Jiang,
	Stephen Bates, Martin Oliveira, David Sloan

On Thu, Sep 15, 2022 at 8:15 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> Hi Song,
>
> On 2022-09-08 10:15, Logan Gunthorpe wrote:
> > [...]
>
> Seems this patch of mine that is in md-next is causing an 1 in ~30
> failure on the mdadm test 13imsm-r0_r5_3d-grow-r0_r5_4d.
>
> I'm looking into it and will try to send an updated patch when I have a fix.

Thanks for the heads-up! I will remove this one from md-next for now.

Song

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-09-19 18:32 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 16:15 [PATCH v2 0/3] A couple more bug fixes Logan Gunthorpe
2022-09-08 16:15 ` [PATCH v2 1/3] md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() Logan Gunthorpe
2022-09-08 16:15 ` [PATCH v2 2/3] md: Remove extra mddev_get() in md_seq_start() Logan Gunthorpe
2022-09-09  4:53   ` Paul Menzel
2022-09-08 16:15 ` [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Logan Gunthorpe
2022-09-15 15:15   ` Logan Gunthorpe
2022-09-19 18:31     ` Song Liu
2022-09-09 14:41 ` [PATCH v2 0/3] A couple more bug fixes Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox