* [PATCH AUTOSEL 6.0 36/67] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
[not found] <20221013001554.1892206-1-sashal@kernel.org>
@ 2022-10-13 0:15 ` Sasha Levin
2022-10-13 0:15 ` [PATCH AUTOSEL 6.0 56/67] block: replace blk_queue_nowait with bdev_nowait Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2022-10-13 0:15 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Logan Gunthorpe, Song Liu, Sasha Levin, linux-raid
From: Logan Gunthorpe <logang@deltatee.com>
[ Upstream commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74 ]
A complicated deadlock exists when using the journal and an elevated
group_thrtead_cnt. It was found with loop devices, but its not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
When the deadlock occurs, multiple threads will hang in different ways:
1) The group threads will hang in the blk-wbt code with bios waiting to
be submitted to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
ops_run_io+0x46b/0x1a30
handle_stripe+0xcd3/0x36b0
handle_active_stripes.constprop.0+0x6f6/0xa60
raid5_do_work+0x177/0x330
Or:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
flush_deferred_bios+0x136/0x170
raid5_do_work+0x262/0x330
2) The r5l_reclaim thread will hang in the same way, submitting a
bio to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
submit_bio+0x3f/0xf0
md_super_write+0x12f/0x1b0
md_update_sb.part.0+0x7c6/0xff0
md_update_sb+0x30/0x60
r5l_do_reclaim+0x4f9/0x5e0
r5l_reclaim_thread+0x69/0x30b
However, before hanging, the MD_SB_CHANGE_PENDING flag will be
set for sb_flags in r5l_write_super_and_discard_space(). This
flag will never be cleared because the submit_bio() call never
returns.
3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
will do no processing on any pending stripes and re-set
STRIPE_HANDLE. This will cause the raid5d thread to enter an
infinite loop, constantly trying to handle the same stripes
stuck in the queue.
The raid5d thread has a blk_plug that holds a number of bios
that are also stuck waiting seeing the thread is in a loop
that never schedules. These bios have been accounted for by
blk-wbt thus preventing the other threads above from
continuing when they try to submit bios. --Deadlock.
To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
from the same thread and can thus deadlock if the thread is put to
sleep. So avoid waiting if md_check_recovery() is being called in the
loop.
It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:
16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
is updated.")
Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/md/raid5.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 31a0cbf63384..ffae21aaf306 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -36,6 +36,7 @@
*/
#include <linux/blkdev.h>
+#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/raid/pq.h>
#include <linux/async_tx.h>
@@ -6781,7 +6782,18 @@ static void raid5d(struct md_thread *thread)
spin_unlock_irq(&conf->device_lock);
md_check_recovery(mddev);
spin_lock_irq(&conf->device_lock);
+
+ /*
+ * Waiting on MD_SB_CHANGE_PENDING below may deadlock
+ * seeing md_check_recovery() is needed to clear
+ * the flag when using mdmon.
+ */
+ continue;
}
+
+ wait_event_lock_irq(mddev->sb_wait,
+ !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
+ conf->device_lock);
}
pr_debug("%d stripes handled\n", handled);
--
2.35.1
^ permalink raw reply related [flat|nested] 2+ messages in thread
* [PATCH AUTOSEL 6.0 56/67] block: replace blk_queue_nowait with bdev_nowait
[not found] <20221013001554.1892206-1-sashal@kernel.org>
2022-10-13 0:15 ` [PATCH AUTOSEL 6.0 36/67] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Sasha Levin
@ 2022-10-13 0:15 ` Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2022-10-13 0:15 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Christoph Hellwig, Pankaj Raghav, Jens Axboe, Sasha Levin, agk,
snitzer, dm-devel, song, linux-block, linux-raid, io-uring
From: Christoph Hellwig <hch@lst.de>
[ Upstream commit 568ec936bf1384fc15873908c96a9aeb62536edb ]
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
block/blk-core.c | 2 +-
drivers/md/dm-table.c | 4 +---
drivers/md/md.c | 4 ++--
include/linux/blkdev.h | 6 +++++-
io_uring/io_uring.c | 2 +-
5 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 651057c4146b..4ec669b0eadc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -717,7 +717,7 @@ void submit_bio_noacct(struct bio *bio)
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
* if queue does not support NOWAIT.
*/
- if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
+ if ((bio->bi_opf & REQ_NOWAIT) && !bdev_nowait(bdev))
goto not_supported;
if (should_fail_bio(bio))
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 332f96b58252..d8034ff0cb24 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1856,9 +1856,7 @@ static bool dm_table_supports_write_zeroes(struct dm_table *t)
static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
- struct request_queue *q = bdev_get_queue(dev->bdev);
-
- return !blk_queue_nowait(q);
+ return !bdev_nowait(dev->bdev);
}
static bool dm_table_supports_nowait(struct dm_table *t)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 729be2c5296c..72ad352cb41a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5845,7 +5845,7 @@ int md_run(struct mddev *mddev)
}
}
sysfs_notify_dirent_safe(rdev->sysfs_state);
- nowait = nowait && blk_queue_nowait(bdev_get_queue(rdev->bdev));
+ nowait = nowait && bdev_nowait(rdev->bdev);
}
if (!bioset_initialized(&mddev->bio_set)) {
@@ -6982,7 +6982,7 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
* If the new disk does not support REQ_NOWAIT,
* disable on the whole MD.
*/
- if (!blk_queue_nowait(bdev_get_queue(rdev->bdev))) {
+ if (!bdev_nowait(rdev->bdev)) {
pr_info("%s: Disabling nowait because %pg does not support nowait\n",
mdname(mddev), rdev->bdev);
blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, mddev->queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 84b13fdd34a7..4750772ef228 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -618,7 +618,6 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
#define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
#define blk_queue_pm_only(q) atomic_read(&(q)->pm_only)
#define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
-#define blk_queue_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
#define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
extern void blk_set_pm_only(struct request_queue *q);
@@ -1280,6 +1279,11 @@ static inline bool bdev_fua(struct block_device *bdev)
return test_bit(QUEUE_FLAG_FUA, &bdev_get_queue(bdev)->queue_flags);
}
+static inline bool bdev_nowait(struct block_device *bdev)
+{
+ return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags);
+}
+
static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
{
struct request_queue *q = bdev_get_queue(bdev);
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 13af6b56ebd2..c13122a87c40 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1384,7 +1384,7 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
static bool io_bdev_nowait(struct block_device *bdev)
{
- return !bdev || blk_queue_nowait(bdev_get_queue(bdev));
+ return !bdev || bdev_nowait(bdev);
}
/*
--
2.35.1
^ permalink raw reply related [flat|nested] 2+ messages in thread
end of thread, other threads:[~2022-10-13 0:22 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20221013001554.1892206-1-sashal@kernel.org>
2022-10-13 0:15 ` [PATCH AUTOSEL 6.0 36/67] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d Sasha Levin
2022-10-13 0:15 ` [PATCH AUTOSEL 6.0 56/67] block: replace blk_queue_nowait with bdev_nowait Sasha Levin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).