* [PATCH v3 0/4] Zone write plugging fixes
@ 2024-12-09 12:23 Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Damien Le Moal @ 2024-12-09 12:23 UTC (permalink / raw)
To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
Cc: Christoph Hellwig, Bart Van Assche
Jens,
These patches address potential issues with zone write plugging.
The first 2 patches fix handling of REQ_NOWAIT BIOs as these can be
"failed" after going through the zone write plugging and changing the
target zone plug zone write pointer offset.
Patch 3 is a bigger fix and address a potential deadlock issue due to
the zone write plugging internally issuing zone report operations to
recover from write errors. This zone report operation is removed by this
patch and replaced with an automatic recovery when the BIO issuer
execute a zone report. This change in behavior results in a problem with
REQ_OP_WRITE_ZEROES handling and failures in the dm-zoned device mapper.
That is fixed in patch 4.
I will followup these fixes with a cleanup of the report zones API and
its callback function interface to clean it up as patch 4 introduces an
indirect user callback call that is not very pretty.
Changes from v2:
- Added review tags from Christoph to patch 1 and 2.
- Reversed the order of patch 3 and 4. Patch 3 now includes some code
that was implemented in "block: Prevent potential deadlocks in zone
write plug error recovery" in v2.
- Adjusted and improved the commit messages of patch 3 and 4.
Changes from v1:
- Fixed kdoc comment for blkdev_issue_zone_zeroout() in patch 4
Damien Le Moal (4):
block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
block: Ignore REQ_NOWAIT for zone reset and zone finish operations
dm: Fix dm-zoned-reclaim zone write pointer alignment
block: Prevent potential deadlocks in zone write plug error recovery
block/blk-zoned.c | 506 +++++++++++++++-------------------
drivers/md/dm-zoned-reclaim.c | 6 +-
include/linux/blkdev.h | 5 +-
3 files changed, 229 insertions(+), 288 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
@ 2024-12-09 12:23 ` Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2024-12-09 12:23 UTC (permalink / raw)
To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
Cc: Christoph Hellwig, Bart Van Assche
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
block/blk-zoned.c | 62 ++++++++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 20 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 263e28b72053..7982b9494d63 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -746,9 +746,25 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
return false;
}
-static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
- struct bio *bio, unsigned int nr_segs)
+static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ /*
+ * Take a reference on the zone write plug and schedule the submission
+ * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
+ * reference we take here.
+ */
+ WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+ refcount_inc(&zwplug->ref);
+ queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+}
+
+static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug,
+ struct bio *bio, unsigned int nr_segs)
{
+ bool schedule_bio_work = false;
+
/*
* Grab an extra reference on the BIO request queue usage counter.
* This reference will be reused to submit a request for the BIO for
@@ -764,6 +780,16 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
*/
bio_clear_polled(bio);
+ /*
+ * REQ_NOWAIT BIOs are always handled using the zone write plug BIO
+ * work, which can block. So clear the REQ_NOWAIT flag and schedule the
+ * work if this is the first BIO we are plugging.
+ */
+ if (bio->bi_opf & REQ_NOWAIT) {
+ schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
+ bio->bi_opf &= ~REQ_NOWAIT;
+ }
+
/*
* Reuse the poll cookie field to store the number of segments when
* split to the hardware limits.
@@ -777,6 +803,11 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
* at the tail of the list to preserve the sequential write order.
*/
bio_list_add(&zwplug->bio_list, bio);
+
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
+ if (schedule_bio_work)
+ disk_zone_wplug_schedule_bio_work(disk, zwplug);
}
/*
@@ -970,7 +1001,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
zwplug = disk_get_and_lock_zone_wplug(disk, sector, gfp_mask, &flags);
if (!zwplug) {
- bio_io_error(bio);
+ if (bio->bi_opf & REQ_NOWAIT)
+ bio_wouldblock_error(bio);
+ else
+ bio_io_error(bio);
return true;
}
@@ -979,9 +1013,11 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
/*
* If the zone is already plugged or has a pending error, add the BIO
- * to the plug BIO list. Otherwise, plug and let the BIO execute.
+ * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that
+ * we will not see a BLK_STS_AGAIN failure if we let the BIO execute.
+ * Otherwise, plug and let the BIO execute.
*/
- if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+ if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT))
goto plug;
/*
@@ -998,8 +1034,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return false;
plug:
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
- blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+ disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1083,19 +1118,6 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
}
EXPORT_SYMBOL_GPL(blk_zone_plug_bio);
-static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- /*
- * Take a reference on the zone write plug and schedule the submission
- * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
- * reference we take here.
- */
- WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
- refcount_inc(&zwplug->ref);
- queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
-}
-
static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v3 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
@ 2024-12-09 12:23 ` Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Damien Le Moal @ 2024-12-09 12:23 UTC (permalink / raw)
To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
Cc: Christoph Hellwig, Bart Van Assche
There are currently any issuer of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
block/blk-zoned.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7982b9494d63..ee9c67121c6c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -707,6 +707,15 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
return true;
}
+ /*
+ * No-wait reset or finish BIOs do not make much sense as the callers
+ * issue these as blocking operations in most cases. To avoid issues
+ * the BIO execution potentially failing with BLK_STS_AGAIN, warn about
+ * REQ_NOWAIT being set and ignore that flag.
+ */
+ if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
+ bio->bi_opf &= ~REQ_NOWAIT;
+
/*
* If we have a zone write plug, set its write pointer offset to 0
* (reset case) or to the zone size (finish case). This will abort all
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
@ 2024-12-09 12:23 ` Damien Le Moal
2024-12-09 12:41 ` Christoph Hellwig
2024-12-10 16:12 ` Mike Snitzer
2024-12-09 12:23 ` [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
` (2 subsequent siblings)
5 siblings, 2 replies; 10+ messages in thread
From: Damien Le Moal @ 2024-12-09 12:23 UTC (permalink / raw)
To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
Cc: Christoph Hellwig, Bart Van Assche
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed as the two functions are functionnally equivalent.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 141 ++++++++++++++++++++++++++++------
drivers/md/dm-zoned-reclaim.c | 6 +-
include/linux/blkdev.h | 3 +
3 files changed, 124 insertions(+), 26 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ee9c67121c6c..781caca0381e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -115,6 +115,30 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
}
EXPORT_SYMBOL_GPL(blk_zone_cond_str);
+struct disk_report_zones_cb_args {
+ struct gendisk *disk;
+ report_zones_cb user_cb;
+ void *user_data;
+};
+
+static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
+ struct blk_zone *zone);
+
+static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+ void *data)
+{
+ struct disk_report_zones_cb_args *args = data;
+ struct gendisk *disk = args->disk;
+
+ if (disk->zone_wplugs_hash)
+ disk_zone_wplug_sync_wp_offset(disk, zone);
+
+ if (!args->user_cb)
+ return 0;
+
+ return args->user_cb(zone, idx, args->user_data);
+}
+
/**
* blkdev_report_zones - Get zones information
* @bdev: Target block device
@@ -694,6 +718,58 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
spin_unlock_irqrestore(&zwplug->lock, flags);
}
+static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
+{
+ switch (zone->cond) {
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ case BLK_ZONE_COND_CLOSED:
+ return zone->wp - zone->start;
+ case BLK_ZONE_COND_FULL:
+ return zone->len;
+ case BLK_ZONE_COND_EMPTY:
+ return 0;
+ case BLK_ZONE_COND_NOT_WP:
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ default:
+ /*
+ * Conventional, offline and read-only zones do not have a valid
+ * write pointer.
+ */
+ return UINT_MAX;
+ }
+}
+
+static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
+ struct blk_zone *zone)
+{
+ struct blk_zone_wplug *zwplug;
+ unsigned long flags;
+
+ zwplug = disk_get_zone_wplug(disk, zone->start);
+ if (!zwplug)
+ return;
+
+ spin_lock_irqsave(&zwplug->lock, flags);
+ if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+ disk_zone_wplug_set_wp_offset(disk, zwplug,
+ blk_zone_wp_offset(zone));
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+
+ disk_put_zone_wplug(zwplug);
+}
+
+static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector)
+{
+ struct disk_report_zones_cb_args args = {
+ .disk = disk,
+ };
+
+ return disk->fops->report_zones(disk, sector, 1,
+ disk_report_zones_cb, &args);
+}
+
static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
unsigned int wp_offset)
{
@@ -1280,29 +1356,6 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
disk_put_zone_wplug(zwplug);
}
-static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
-{
- switch (zone->cond) {
- case BLK_ZONE_COND_IMP_OPEN:
- case BLK_ZONE_COND_EXP_OPEN:
- case BLK_ZONE_COND_CLOSED:
- return zone->wp - zone->start;
- case BLK_ZONE_COND_FULL:
- return zone->len;
- case BLK_ZONE_COND_EMPTY:
- return 0;
- case BLK_ZONE_COND_NOT_WP:
- case BLK_ZONE_COND_OFFLINE:
- case BLK_ZONE_COND_READONLY:
- default:
- /*
- * Conventional, offline and read-only zones do not have a valid
- * write pointer.
- */
- return UINT_MAX;
- }
-}
-
static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
unsigned int idx, void *data)
{
@@ -1866,6 +1919,48 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
}
EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
+/**
+ * blk_zone_issue_zeroout - zero-fill a block range in a zone
+ * @bdev: blockdev to write
+ * @sector: start sector
+ * @nr_sects: number of sectors to write
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Zero-fill a block range in a zone (@sector must be equal to the zone write
+ * pointer), handling potential errors due to the (initially unknown) lack of
+ * hardware offload (See blkdev_issue_zeroout()).
+ */
+int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask)
+{
+ int ret;
+
+ if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
+ return -EIO;
+
+ ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
+ BLKDEV_ZERO_NOFALLBACK);
+ if (ret != -EOPNOTSUPP)
+ return ret;
+
+ /*
+ * The failed call to blkdev_issue_zeroout() advanced the zone write
+ * pointer. Undo this using a report zone to update the zone write
+ * pointer to the correct current value.
+ */
+ ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
+ if (ret != 1)
+ return ret < 0 ? ret : -EIO;
+
+ /*
+ * Retry without BLKDEV_ZERO_NOFALLBACK to force the fallback to a
+ * regular write with zero-pages.
+ */
+ return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0);
+}
+EXPORT_SYMBOL_GPL(blk_zone_issue_zeroout);
+
#ifdef CONFIG_BLK_DEBUG_FS
int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
index d58db9a27e6c..76e2c6868548 100644
--- a/drivers/md/dm-zoned-reclaim.c
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -76,9 +76,9 @@ static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone,
* pointer and the requested position.
*/
nr_blocks = block - wp_block;
- ret = blkdev_issue_zeroout(dev->bdev,
- dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block),
- dmz_blk2sect(nr_blocks), GFP_NOIO, 0);
+ ret = blk_zone_issue_zeroout(dev->bdev,
+ dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block),
+ dmz_blk2sect(nr_blocks), GFP_NOIO);
if (ret) {
dmz_dev_err(dev,
"Align zone %u wp %llu to %llu (wp+%u) blocks failed %d",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08a727b40816..4dd698dad2d6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1421,6 +1421,9 @@ static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
return is_seq;
}
+int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask);
+
static inline unsigned int queue_dma_alignment(const struct request_queue *q)
{
return q->limits.dma_alignment;
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
` (2 preceding siblings ...)
2024-12-09 12:23 ` [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
@ 2024-12-09 12:23 ` Damien Le Moal
2024-12-09 12:42 ` Christoph Hellwig
2024-12-10 1:27 ` [PATCH v3 0/4] Zone write plugging fixes Martin K. Petersen
2024-12-10 16:17 ` Jens Axboe
5 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2024-12-09 12:23 UTC (permalink / raw)
To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
Cc: Christoph Hellwig, Bart Van Assche
Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonnable requirement as all
well-behaved applications, FSes and device mapper already use report
zones to recover from write errors whenever possible by comparing the
current position of a zone write pointer with what their assumption
about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write opration targetting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 308 ++++++++---------------------------------
include/linux/blkdev.h | 2 -
2 files changed, 61 insertions(+), 249 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 781caca0381e..77f12fc2086f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -41,7 +41,6 @@ static const char *const zone_cond_name[] = {
/*
* Per-zone write plug.
* @node: hlist_node structure for managing the plug using a hash table.
- * @link: To list the plug in the zone write plug error list of the disk.
* @ref: Zone write plug reference counter. A zone write plug reference is
* always at least 1 when the plug is hashed in the disk plug hash table.
* The reference is incremented whenever a new BIO needing plugging is
@@ -63,7 +62,6 @@ static const char *const zone_cond_name[] = {
*/
struct blk_zone_wplug {
struct hlist_node node;
- struct list_head link;
refcount_t ref;
spinlock_t lock;
unsigned int flags;
@@ -80,8 +78,8 @@ struct blk_zone_wplug {
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
* that is, that write BIOs are being throttled due to a write BIO already
* being executed or the zone write plug bio list is not empty.
- * - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened which will be
- * recovered with a report zone to update the zone write pointer offset.
+ * - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
+ * write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
* from the disk hash table and that the initial reference to the zone
* write plug set when the plug was first added to the hash table has been
@@ -91,11 +89,9 @@ struct blk_zone_wplug {
* freed once all remaining references from BIOs or functions are dropped.
*/
#define BLK_ZONE_WPLUG_PLUGGED (1U << 0)
-#define BLK_ZONE_WPLUG_ERROR (1U << 1)
+#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1)
#define BLK_ZONE_WPLUG_UNHASHED (1U << 2)
-#define BLK_ZONE_WPLUG_BUSY (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
-
/**
* blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
* @zone_cond: BLK_ZONE_COND_XXX.
@@ -163,6 +159,11 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
{
struct gendisk *disk = bdev->bd_disk;
sector_t capacity = get_capacity(disk);
+ struct disk_report_zones_cb_args args = {
+ .disk = disk,
+ .user_cb = cb,
+ .user_data = data,
+ };
if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
return -EOPNOTSUPP;
@@ -170,7 +171,8 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
if (!nr_zones || sector >= capacity)
return 0;
- return disk->fops->report_zones(disk, sector, nr_zones, cb, data);
+ return disk->fops->report_zones(disk, sector, nr_zones,
+ disk_report_zones_cb, &args);
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
@@ -451,7 +453,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
{
if (refcount_dec_and_test(&zwplug->ref)) {
WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
- WARN_ON_ONCE(!list_empty(&zwplug->link));
+ WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
@@ -465,8 +467,8 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
return false;
- /* If the zone write plug is still busy, it cannot be removed. */
- if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+ /* If the zone write plug is still plugged, it cannot be removed. */
+ if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
return false;
/*
@@ -549,7 +551,6 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
return NULL;
INIT_HLIST_NODE(&zwplug->node);
- INIT_LIST_HEAD(&zwplug->link);
refcount_set(&zwplug->ref, 2);
spin_lock_init(&zwplug->lock);
zwplug->flags = 0;
@@ -598,115 +599,22 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
}
/*
- * Abort (fail) all plugged BIOs of a zone write plug that are not aligned
- * with the assumed write pointer location of the zone when the BIO will
- * be unplugged.
- */
-static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- unsigned int wp_offset = zwplug->wp_offset;
- struct bio_list bl = BIO_EMPTY_LIST;
- struct bio *bio;
-
- while ((bio = bio_list_pop(&zwplug->bio_list))) {
- if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
- (bio_op(bio) != REQ_OP_ZONE_APPEND &&
- bio_offset_from_zone_start(bio) != wp_offset)) {
- blk_zone_wplug_bio_io_error(zwplug, bio);
- continue;
- }
-
- wp_offset += bio_sectors(bio);
- bio_list_add(&bl, bio);
- }
-
- bio_list_merge(&zwplug->bio_list, &bl);
-}
-
-static inline void disk_zone_wplug_set_error(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- unsigned long flags;
-
- if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
- return;
-
- /*
- * At this point, we already have a reference on the zone write plug.
- * However, since we are going to add the plug to the disk zone write
- * plugs work list, increase its reference count. This reference will
- * be dropped in disk_zone_wplugs_work() once the error state is
- * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
- * finished.
- */
- zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
- refcount_inc(&zwplug->ref);
-
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
-static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- unsigned long flags;
-
- if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
- return;
-
- /*
- * We are racing with the error handling work which drops the reference
- * on the zone write plug after handling the error state. So remove the
- * plug from the error list and drop its reference count only if the
- * error handling has not yet started, that is, if the zone write plug
- * is still listed.
- */
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- if (!list_empty(&zwplug->link)) {
- list_del_init(&zwplug->link);
- zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
- disk_put_zone_wplug(zwplug);
- }
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
-/*
- * Set a zone write plug write pointer offset to either 0 (zone reset case)
- * or to the zone size (zone finish case). This aborts all plugged BIOs, which
- * is fine to do as doing a zone reset or zone finish while writes are in-flight
- * is a mistake from the user which will most likely cause all plugged BIOs to
- * fail anyway.
+ * Set a zone write plug write pointer offset to the specified value.
+ * This aborts all plugged BIOs, which is fine as this function is called for
+ * a zone reset operation, a zone finish operation or if the zone needs a wp
+ * update from a report zone after a write error.
*/
static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
struct blk_zone_wplug *zwplug,
unsigned int wp_offset)
{
- unsigned long flags;
-
- spin_lock_irqsave(&zwplug->lock, flags);
-
- /*
- * Make sure that a BIO completion or another zone reset or finish
- * operation has not already removed the plug from the hash table.
- */
- if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
- spin_unlock_irqrestore(&zwplug->lock, flags);
- return;
- }
+ lockdep_assert_held(&zwplug->lock);
/* Update the zone write pointer and abort all plugged BIOs. */
+ zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE;
zwplug->wp_offset = wp_offset;
disk_zone_wplug_abort(zwplug);
- /*
- * Updating the write pointer offset puts back the zone
- * in a good state. So clear the error flag and decrement the
- * error count if we were in error state.
- */
- disk_zone_wplug_clear_error(disk, zwplug);
-
/*
* The zone write plug now has no BIO plugged: remove it from the
* hash table so that it cannot be seen. The plug will be freed
@@ -714,8 +622,6 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
*/
if (disk_should_remove_zone_wplug(disk, zwplug))
disk_remove_zone_wplug(disk, zwplug);
-
- spin_unlock_irqrestore(&zwplug->lock, flags);
}
static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
@@ -752,7 +658,7 @@ static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
return;
spin_lock_irqsave(&zwplug->lock, flags);
- if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+ if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
disk_zone_wplug_set_wp_offset(disk, zwplug,
blk_zone_wp_offset(zone));
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -776,6 +682,7 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
struct blk_zone_wplug *zwplug;
+ unsigned long flags;
/* Conventional zones cannot be reset nor finished. */
if (!bdev_zone_is_seq(bio->bi_bdev, sector)) {
@@ -801,7 +708,9 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
*/
zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
+ spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
+ spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
@@ -812,6 +721,7 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug;
+ unsigned long flags;
sector_t sector;
/*
@@ -823,7 +733,9 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
sector += disk->queue->limits.chunk_sectors) {
zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
+ spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
+ spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
}
@@ -1005,13 +917,23 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ /*
+ * If we lost track of the zone write pointer due to a write error,
+ * the user must either execute a report zones, reset the zone or finish
+ * the to recover a reliable write pointer position. Fail BIOs if the
+ * user did not do that as we cannot handle emulated zone append
+ * otherwise.
+ */
+ if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
+ return false;
+
/*
* Check that the user is not attempting to write to a full zone.
* We know such BIO will fail, and that would potentially overflow our
* write pointer offset beyond the end of the zone.
*/
if (disk_zone_wplug_is_full(disk, zwplug))
- goto err;
+ return false;
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
/*
@@ -1030,24 +952,18 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
} else {
/*
- * Check for non-sequential writes early because we avoid a
- * whole lot of error handling trouble if we don't send it off
- * to the driver.
+ * Check for non-sequential writes early as we know that BIOs
+ * with a start sector not unaligned to the zone write pointer
+ * will fail.
*/
if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
- goto err;
+ return false;
}
/* Advance the zone write pointer offset. */
zwplug->wp_offset += bio_sectors(bio);
return true;
-
-err:
- /* We detected an invalid write BIO: schedule error recovery. */
- disk_zone_wplug_set_error(disk, zwplug);
- kblockd_schedule_work(&disk->zone_wplugs_work);
- return false;
}
static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
@@ -1097,20 +1013,20 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
/*
- * If the zone is already plugged or has a pending error, add the BIO
- * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that
- * we will not see a BLK_STS_AGAIN failure if we let the BIO execute.
+ * If the zone is already plugged, add the BIO to the plug BIO list.
+ * Do the same for REQ_NOWAIT BIOs to ensure that we will not see a
+ * BLK_STS_AGAIN failure if we let the BIO execute.
* Otherwise, plug and let the BIO execute.
*/
- if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT))
+ if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
+ (bio->bi_opf & REQ_NOWAIT))
goto plug;
- /*
- * If an error is detected when preparing the BIO, add it to the BIO
- * list so that error recovery can deal with it.
- */
- if (!blk_zone_wplug_prepare_bio(zwplug, bio))
- goto plug;
+ if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+ bio_io_error(bio);
+ return true;
+ }
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
@@ -1210,16 +1126,6 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
spin_lock_irqsave(&zwplug->lock, flags);
- /*
- * If we had an error, schedule error recovery. The recovery work
- * will restart submission of plugged BIOs.
- */
- if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
- spin_unlock_irqrestore(&zwplug->lock, flags);
- kblockd_schedule_work(&disk->zone_wplugs_work);
- return;
- }
-
/* Schedule submission of the next plugged BIO if we have one. */
if (!bio_list_empty(&zwplug->bio_list)) {
disk_zone_wplug_schedule_bio_work(disk, zwplug);
@@ -1262,12 +1168,13 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
}
/*
- * If the BIO failed, mark the plug as having an error to trigger
- * recovery.
+ * If the BIO failed, abort all plugged BIOs and mark the plug as
+ * needing a write pointer update.
*/
if (bio->bi_status != BLK_STS_OK) {
spin_lock_irqsave(&zwplug->lock, flags);
- disk_zone_wplug_set_error(disk, zwplug);
+ disk_zone_wplug_abort(zwplug);
+ zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -1323,6 +1230,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
*/
spin_lock_irqsave(&zwplug->lock, flags);
+again:
bio = bio_list_pop(&zwplug->bio_list);
if (!bio) {
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
@@ -1331,10 +1239,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
}
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
- /* Error recovery will decide what to do with the BIO. */
- bio_list_add_head(&zwplug->bio_list, bio);
- spin_unlock_irqrestore(&zwplug->lock, flags);
- goto put_zwplug;
+ blk_zone_wplug_bio_io_error(zwplug, bio);
+ goto again;
}
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1356,97 +1262,6 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
disk_put_zone_wplug(zwplug);
}
-static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
- unsigned int idx, void *data)
-{
- struct blk_zone *zonep = data;
-
- *zonep = *zone;
- return 0;
-}
-
-static void disk_zone_wplug_handle_error(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- sector_t zone_start_sector =
- bdev_zone_sectors(disk->part0) * zwplug->zone_no;
- unsigned int noio_flag;
- struct blk_zone zone;
- unsigned long flags;
- int ret;
-
- /* Get the current zone information from the device. */
- noio_flag = memalloc_noio_save();
- ret = disk->fops->report_zones(disk, zone_start_sector, 1,
- blk_zone_wplug_report_zone_cb, &zone);
- memalloc_noio_restore(noio_flag);
-
- spin_lock_irqsave(&zwplug->lock, flags);
-
- /*
- * A zone reset or finish may have cleared the error already. In such
- * case, do nothing as the report zones may have seen the "old" write
- * pointer value before the reset/finish operation completed.
- */
- if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
- goto unlock;
-
- zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
-
- if (ret != 1) {
- /*
- * We failed to get the zone information, meaning that something
- * is likely really wrong with the device. Abort all remaining
- * plugged BIOs as otherwise we could endup waiting forever on
- * plugged BIOs to complete if there is a queue freeze on-going.
- */
- disk_zone_wplug_abort(zwplug);
- goto unplug;
- }
-
- /* Update the zone write pointer offset. */
- zwplug->wp_offset = blk_zone_wp_offset(&zone);
- disk_zone_wplug_abort_unaligned(disk, zwplug);
-
- /* Restart BIO submission if we still have any BIO left. */
- if (!bio_list_empty(&zwplug->bio_list)) {
- disk_zone_wplug_schedule_bio_work(disk, zwplug);
- goto unlock;
- }
-
-unplug:
- zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
- if (disk_should_remove_zone_wplug(disk, zwplug))
- disk_remove_zone_wplug(disk, zwplug);
-
-unlock:
- spin_unlock_irqrestore(&zwplug->lock, flags);
-}
-
-static void disk_zone_wplugs_work(struct work_struct *work)
-{
- struct gendisk *disk =
- container_of(work, struct gendisk, zone_wplugs_work);
- struct blk_zone_wplug *zwplug;
- unsigned long flags;
-
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-
- while (!list_empty(&disk->zone_wplugs_err_list)) {
- zwplug = list_first_entry(&disk->zone_wplugs_err_list,
- struct blk_zone_wplug, link);
- list_del_init(&zwplug->link);
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-
- disk_zone_wplug_handle_error(disk, zwplug);
- disk_put_zone_wplug(zwplug);
-
- spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- }
-
- spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
@@ -1455,8 +1270,6 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
void disk_init_zone_resources(struct gendisk *disk)
{
spin_lock_init(&disk->zone_wplugs_lock);
- INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
- INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
}
/*
@@ -1555,8 +1368,6 @@ void disk_free_zone_resources(struct gendisk *disk)
if (!disk->zone_wplugs_pool)
return;
- cancel_work_sync(&disk->zone_wplugs_work);
-
if (disk->zone_wplugs_wq) {
destroy_workqueue(disk->zone_wplugs_wq);
disk->zone_wplugs_wq = NULL;
@@ -1753,6 +1564,8 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
if (!disk->zone_wplugs_hash)
return 0;
+ disk_zone_wplug_sync_wp_offset(disk, zone);
+
wp_offset = blk_zone_wp_offset(zone);
if (!wp_offset || wp_offset >= zone->capacity)
return 0;
@@ -1883,6 +1696,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
memalloc_noio_restore(noio_flag);
return ret;
}
+
ret = disk->fops->report_zones(disk, 0, UINT_MAX,
blk_revalidate_zone_cb, &args);
if (!ret) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4dd698dad2d6..378d3a1a22fc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -200,8 +200,6 @@ struct gendisk {
spinlock_t zone_wplugs_lock;
struct mempool_s *zone_wplugs_pool;
struct hlist_head *zone_wplugs_hash;
- struct list_head zone_wplugs_err_list;
- struct work_struct zone_wplugs_work;
struct workqueue_struct *zone_wplugs_wq;
#endif /* CONFIG_BLK_DEV_ZONED */
--
2.47.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
2024-12-09 12:23 ` [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
@ 2024-12-09 12:41 ` Christoph Hellwig
2024-12-10 16:12 ` Mike Snitzer
1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2024-12-09 12:41 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
Christoph Hellwig, Bart Van Assche
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery
2024-12-09 12:23 ` [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
@ 2024-12-09 12:42 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2024-12-09 12:42 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
Christoph Hellwig, Bart Van Assche
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 0/4] Zone write plugging fixes
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
` (3 preceding siblings ...)
2024-12-09 12:23 ` [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
@ 2024-12-10 1:27 ` Martin K. Petersen
2024-12-10 16:17 ` Jens Axboe
5 siblings, 0 replies; 10+ messages in thread
From: Martin K. Petersen @ 2024-12-10 1:27 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
Christoph Hellwig, Bart Van Assche
Damien,
> These patches address potential issues with zone write plugging.
> The first 2 patches fix handling of REQ_NOWAIT BIOs as these can be
> "failed" after going through the zone write plugging and changing the
> target zone plug zone write pointer offset.
>
> Patch 3 is a bigger fix and address a potential deadlock issue due to
> the zone write plugging internally issuing zone report operations to
> recover from write errors. This zone report operation is removed by this
> patch and replaced with an automatic recovery when the BIO issuer
> execute a zone report. This change in behavior results in a problem with
> REQ_OP_WRITE_ZEROES handling and failures in the dm-zoned device mapper.
> That is fixed in patch 4.
Complicated set of fixes but they all look good to me.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
2024-12-09 12:23 ` [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
2024-12-09 12:41 ` Christoph Hellwig
@ 2024-12-10 16:12 ` Mike Snitzer
1 sibling, 0 replies; 10+ messages in thread
From: Mike Snitzer @ 2024-12-10 16:12 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, Mikulas Patocka, dm-devel,
Christoph Hellwig, Bart Van Assche
On Mon, Dec 09, 2024 at 09:23:56PM +0900, Damien Le Moal wrote:
> The zone reclaim processing of the dm-zoned device mapper uses
> blkdev_issue_zeroout() to align the write pointer of a zone being used
> for reclaiming another zone, to write the valid data blocks from the
> zone being reclaimed at the same position relative to the zone start in
> the reclaim target zone.
>
> The first call to blkdev_issue_zeroout() will try to use hardware
> offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
> non-zero max_write_zeroes_sectors queue limit. If this operation fails
> because of the lack of hardware support, blkdev_issue_zeroout() falls
> back to using a regular write operation with the zero-page as buffer.
> Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
> the block layer zone write plugging code which will execute a report
> zones operation to ensure that the write pointer of the target zone of
> the failed operation has not changed and to "rewind" the zone write
> pointer offset of the target zone as it was advanced when the write zero
> operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
> cause any issue and blkdev_issue_zeroout() works as expected.
>
> However, since the automatic recovery of zone write pointers by the zone
> write plugging code can potentially cause deadlocks with queue freeze
> operations, a different recovery must be implemented in preparation for
> the removal of zone write plugging report zones based recovery.
>
> Do this by introducing the new function blk_zone_issue_zeroout(). This
> function first calls blkdev_issue_zeroout() with the flag
> BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
> which attempt to use the device hardware offload with the
> REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
> operation is issued to restore the zone write pointer offset of the
> target zone to the correct position and blkdev_issue_zeroout() is called
> again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
> operation performing this recovery is implemented using the helper
> function disk_zone_sync_wp_offset() which calls the gendisk report_zones
> file operation with the callback disk_report_zones_cb(). This callback
> updates the target write pointer offset of the target zone using the new
> function disk_zone_wplug_sync_wp_offset().
>
> dmz_reclaim_align_wp() is modified to change its call to
> blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
> other change needed as the two functions are functionnally equivalent.
>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Mike Snitzer <snitzer@kernel.org>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 0/4] Zone write plugging fixes
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
` (4 preceding siblings ...)
2024-12-10 1:27 ` [PATCH v3 0/4] Zone write plugging fixes Martin K. Petersen
@ 2024-12-10 16:17 ` Jens Axboe
5 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2024-12-10 16:17 UTC (permalink / raw)
To: linux-block, Mike Snitzer, Mikulas Patocka, dm-devel,
Damien Le Moal
Cc: Christoph Hellwig, Bart Van Assche
On Mon, 09 Dec 2024 21:23:53 +0900, Damien Le Moal wrote:
> Jens,
>
> These patches address potential issues with zone write plugging.
> The first 2 patches fix handling of REQ_NOWAIT BIOs as these can be
> "failed" after going through the zone write plugging and changing the
> target zone plug zone write pointer offset.
>
> [...]
Applied, thanks!
[1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
commit: cae005670887cb07ceafc25bb32e221e56286488
[2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations
commit: 5eb3317aa5a2ffe4574ab1a12cf9bc9447ca26c0
[3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
commit: b76b840fd93374240b59825f1ab8e2f5c9907acb
[4/4] block: Prevent potential deadlocks in zone write plug error recovery
commit: fe0418eb9bd69a19a948b297c8de815e05f3cde1
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-12-10 16:17 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-09 12:23 [PATCH v3 0/4] Zone write plugging fixes Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
2024-12-09 12:23 ` [PATCH v3 3/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
2024-12-09 12:41 ` Christoph Hellwig
2024-12-10 16:12 ` Mike Snitzer
2024-12-09 12:23 ` [PATCH v3 4/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
2024-12-09 12:42 ` Christoph Hellwig
2024-12-10 1:27 ` [PATCH v3 0/4] Zone write plugging fixes Martin K. Petersen
2024-12-10 16:17 ` Jens Axboe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).