linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] Zone write plugging fixes
@ 2024-12-08 22:57 Damien Le Moal
  2024-12-08 22:57 ` [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Damien Le Moal @ 2024-12-08 22:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
  Cc: Christoph Hellwig, Bart Van Assche

Jens,

These patches address potential issues with zone write plugging.
The first 2 patches fix handling of REQ_NOWAIT BIOs as these can be
"failed" after going through the zone write plugging and changing the
target zone plug zone write pointer offset.

Patch 3 is a bigger fix and address a potential deadlock issue due to
the zone write plugging internally issuing zone report operations to
recover from write errors. This zone report operation is removed by this
patch and replaced with an automatic recovery when the BIO issuer
execute a zone report. This change in behavior results in a problem with
REQ_OP_WRITE_ZEROES handling and failures in the dm-zoned device mapper.
That is fixed in patch 4.

I will followup these fixes with a cleanup of the report zones API and
its callback function interface to clean it up as patch 4 introduces an
indirect user callback call that is not very pretty.

Changes from v1:
 - Fixed kdoc comment for blkdev_issue_zone_zeroout() in patch 4

Damien Le Moal (4):
  block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
  block: Ignore REQ_NOWAIT for zone reset and zone finish operations
  block: Prevent potential deadlocks in zone write plug error recovery
  dm: Fix dm-zoned-reclaim zone write pointer alignment

 block/blk-zoned.c             | 506 +++++++++++++++-------------------
 drivers/md/dm-zoned-reclaim.c |   4 +-
 include/linux/blkdev.h        |   5 +-
 3 files changed, 228 insertions(+), 287 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
  2024-12-08 22:57 [PATCH v2 0/4] Zone write plugging fixes Damien Le Moal
@ 2024-12-08 22:57 ` Damien Le Moal
  2024-12-09  7:44   ` Christoph Hellwig
  2024-12-08 22:57 ` [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-08 22:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
  Cc: Christoph Hellwig, Bart Van Assche

For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.

Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.

Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.

blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().

Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 62 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 263e28b72053..7982b9494d63 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -746,9 +746,25 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
 	return false;
 }
 
-static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
-					  struct bio *bio, unsigned int nr_segs)
+static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
+					      struct blk_zone_wplug *zwplug)
+{
+	/*
+	 * Take a reference on the zone write plug and schedule the submission
+	 * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
+	 * reference we take here.
+	 */
+	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+	refcount_inc(&zwplug->ref);
+	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+}
+
+static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
+				struct blk_zone_wplug *zwplug,
+				struct bio *bio, unsigned int nr_segs)
 {
+	bool schedule_bio_work = false;
+
 	/*
 	 * Grab an extra reference on the BIO request queue usage counter.
 	 * This reference will be reused to submit a request for the BIO for
@@ -764,6 +780,16 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 	 */
 	bio_clear_polled(bio);
 
+	/*
+	 * REQ_NOWAIT BIOs are always handled using the zone write plug BIO
+	 * work, which can block. So clear the REQ_NOWAIT flag and schedule the
+	 * work if this is the first BIO we are plugging.
+	 */
+	if (bio->bi_opf & REQ_NOWAIT) {
+		schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
+		bio->bi_opf &= ~REQ_NOWAIT;
+	}
+
 	/*
 	 * Reuse the poll cookie field to store the number of segments when
 	 * split to the hardware limits.
@@ -777,6 +803,11 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 	 * at the tail of the list to preserve the sequential write order.
 	 */
 	bio_list_add(&zwplug->bio_list, bio);
+
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
+	if (schedule_bio_work)
+		disk_zone_wplug_schedule_bio_work(disk, zwplug);
 }
 
 /*
@@ -970,7 +1001,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 
 	zwplug = disk_get_and_lock_zone_wplug(disk, sector, gfp_mask, &flags);
 	if (!zwplug) {
-		bio_io_error(bio);
+		if (bio->bi_opf & REQ_NOWAIT)
+			bio_wouldblock_error(bio);
+		else
+			bio_io_error(bio);
 		return true;
 	}
 
@@ -979,9 +1013,11 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 
 	/*
 	 * If the zone is already plugged or has a pending error, add the BIO
-	 * to the plug BIO list. Otherwise, plug and let the BIO execute.
+	 * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that
+	 * we will not see a BLK_STS_AGAIN failure if we let the BIO execute.
+	 * Otherwise, plug and let the BIO execute.
 	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT))
 		goto plug;
 
 	/*
@@ -998,8 +1034,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	return false;
 
 plug:
-	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-	blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+	disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
 
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
@@ -1083,19 +1118,6 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
 }
 EXPORT_SYMBOL_GPL(blk_zone_plug_bio);
 
-static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
-					      struct blk_zone_wplug *zwplug)
-{
-	/*
-	 * Take a reference on the zone write plug and schedule the submission
-	 * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
-	 * reference we take here.
-	 */
-	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
-	refcount_inc(&zwplug->ref);
-	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
-}
-
 static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 				       struct blk_zone_wplug *zwplug)
 {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations
  2024-12-08 22:57 [PATCH v2 0/4] Zone write plugging fixes Damien Le Moal
  2024-12-08 22:57 ` [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
@ 2024-12-08 22:57 ` Damien Le Moal
  2024-12-09  7:46   ` Christoph Hellwig
  2024-12-08 22:57 ` [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
  2024-12-08 22:57 ` [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
  3 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-08 22:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
  Cc: Christoph Hellwig, Bart Van Assche

There are currently any issuer of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7982b9494d63..ee9c67121c6c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -707,6 +707,15 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
 		return true;
 	}
 
+	/*
+	 * No-wait reset or finish BIOs do not make much sense as the callers
+	 * issue these as blocking operations in most cases. To avoid issues
+	 * the BIO execution potentially failing with BLK_STS_AGAIN, warn about
+	 * REQ_NOWAIT being set and ignore that flag.
+	 */
+	if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
+		bio->bi_opf &= ~REQ_NOWAIT;
+
 	/*
 	 * If we have a zone write plug, set its write pointer offset to 0
 	 * (reset case) or to the zone size (finish case). This will abort all
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery
  2024-12-08 22:57 [PATCH v2 0/4] Zone write plugging fixes Damien Le Moal
  2024-12-08 22:57 ` [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
  2024-12-08 22:57 ` [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
@ 2024-12-08 22:57 ` Damien Le Moal
  2024-12-09  7:57   ` Christoph Hellwig
  2024-12-08 22:57 ` [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
  3 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-08 22:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
  Cc: Christoph Hellwig, Bart Van Assche

Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that BIOs can be checked on
submission and that we can always correctly emulate zone append
operations using regular write BIOs. However, this error recovery scheme
introduces a potential deadlock if a device queue freeze is initiated
while BIOs are still plugged in a zone write plug and one of these write
operation fails. In such case, the disk zone write plug error recovery
work is scheduled and executes a report zone. This in turn can result in
a request allocation in the underlying driver to issue the report zones
command to the device. However, with the device queue freeze already
started, this allocation will block, preventing the report zone
execution and the continuation of the processing of the plugged BIOs. As
plugged BIOs hold a queue usage reference, the queue freeze itself will
never complete, resulting in a deadlock.

Avoid this problem by completely getting rid of the need for executing a
report zone from within the zone write plugging code, instead relying on
the user either executing a report zones, resetting the zone or
finishing the zone of a failed write. This is not an unresannable
requirement as all well-behaved applications, FSes and device mapper
already use report zones to recover from write errors whenever possible.

The changes to switch to this recovery method are as follows:
 - Completely remove the error recovery work and its associated
   resources (zone write plug list head, disk error list, and disk
   zone_wplugs_work work struct). This also removes the functions
   disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
 - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
   BLK_ZONE_WPLUG_NEED_WP_UPDATE.
 - Modify the function disk_zone_wplug_set_wp_offset() to clear this
   flag, thus implementing recovery of a correct write pointer for the
   reset zone and finish zone operations.
 - Introduce the function disk_zone_wplug_sync_wp_offset() which calls
   disk_zone_wplug_set_wp_offset() if the zone write plug is marked with
   the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
   disk_zone_wplug_sync_wp_offset() is called from the new, always used,
   disk_report_zones_cb() report zones callback function for
   blkdev_report_zones(). This implements recovery of a correct write
   pointer offset for zone write plugs marked with
   BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report
   zones operation executed by the user.
 - Modify blk_zone_write_plug_bio_endio() to set the
   BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
   write BIO.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c      | 388 +++++++++++++----------------------------
 include/linux/blkdev.h |   2 -
 2 files changed, 121 insertions(+), 269 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ee9c67121c6c..d709f12d5935 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -41,7 +41,6 @@ static const char *const zone_cond_name[] = {
 /*
  * Per-zone write plug.
  * @node: hlist_node structure for managing the plug using a hash table.
- * @link: To list the plug in the zone write plug error list of the disk.
  * @ref: Zone write plug reference counter. A zone write plug reference is
  *       always at least 1 when the plug is hashed in the disk plug hash table.
  *       The reference is incremented whenever a new BIO needing plugging is
@@ -63,7 +62,6 @@ static const char *const zone_cond_name[] = {
  */
 struct blk_zone_wplug {
 	struct hlist_node	node;
-	struct list_head	link;
 	refcount_t		ref;
 	spinlock_t		lock;
 	unsigned int		flags;
@@ -80,8 +78,8 @@ struct blk_zone_wplug {
  *  - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
  *    that is, that write BIOs are being throttled due to a write BIO already
  *    being executed or the zone write plug bio list is not empty.
- *  - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened which will be
- *    recovered with a report zone to update the zone write pointer offset.
+ *  - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
+ *    write pointer offset and need to update it.
  *  - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
  *    from the disk hash table and that the initial reference to the zone
  *    write plug set when the plug was first added to the hash table has been
@@ -91,11 +89,9 @@ struct blk_zone_wplug {
  *    freed once all remaining references from BIOs or functions are dropped.
  */
 #define BLK_ZONE_WPLUG_PLUGGED		(1U << 0)
-#define BLK_ZONE_WPLUG_ERROR		(1U << 1)
+#define BLK_ZONE_WPLUG_NEED_WP_UPDATE	(1U << 1)
 #define BLK_ZONE_WPLUG_UNHASHED		(1U << 2)
 
-#define BLK_ZONE_WPLUG_BUSY	(BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
-
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
  * @zone_cond: BLK_ZONE_COND_XXX.
@@ -115,6 +111,27 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
 }
 EXPORT_SYMBOL_GPL(blk_zone_cond_str);
 
+struct disk_report_zones_cb_args {
+	struct gendisk	*disk;
+	report_zones_cb	user_cb;
+	void		*user_data;
+};
+
+static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
+					   struct blk_zone *zone);
+
+static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				void *data)
+{
+	struct disk_report_zones_cb_args *args = data;
+	struct gendisk *disk = args->disk;
+
+	if (disk->zone_wplugs_hash)
+		disk_zone_wplug_sync_wp_offset(disk, zone);
+
+	return args->user_cb(zone, idx, args->user_data);
+}
+
 /**
  * blkdev_report_zones - Get zones information
  * @bdev:	Target block device
@@ -139,6 +156,11 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
 {
 	struct gendisk *disk = bdev->bd_disk;
 	sector_t capacity = get_capacity(disk);
+	struct disk_report_zones_cb_args args = {
+		.disk = disk,
+		.user_cb = cb,
+		.user_data = data,
+	};
 
 	if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
 		return -EOPNOTSUPP;
@@ -146,7 +168,8 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
 	if (!nr_zones || sector >= capacity)
 		return 0;
 
-	return disk->fops->report_zones(disk, sector, nr_zones, cb, data);
+	return disk->fops->report_zones(disk, sector, nr_zones,
+					disk_report_zones_cb, &args);
 }
 EXPORT_SYMBOL_GPL(blkdev_report_zones);
 
@@ -427,7 +450,7 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
 {
 	if (refcount_dec_and_test(&zwplug->ref)) {
 		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
-		WARN_ON_ONCE(!list_empty(&zwplug->link));
+		WARN_ON_ONCE(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
 		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
 
 		call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
@@ -441,8 +464,8 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
 	if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
 		return false;
 
-	/* If the zone write plug is still busy, it cannot be removed. */
-	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+	/* If the zone write plug is still plugged, it cannot be removed. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
 		return false;
 
 	/*
@@ -525,7 +548,6 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
 		return NULL;
 
 	INIT_HLIST_NODE(&zwplug->node);
-	INIT_LIST_HEAD(&zwplug->link);
 	refcount_set(&zwplug->ref, 2);
 	spin_lock_init(&zwplug->lock);
 	zwplug->flags = 0;
@@ -574,115 +596,22 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
 }
 
 /*
- * Abort (fail) all plugged BIOs of a zone write plug that are not aligned
- * with the assumed write pointer location of the zone when the BIO will
- * be unplugged.
- */
-static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
-					    struct blk_zone_wplug *zwplug)
-{
-	unsigned int wp_offset = zwplug->wp_offset;
-	struct bio_list bl = BIO_EMPTY_LIST;
-	struct bio *bio;
-
-	while ((bio = bio_list_pop(&zwplug->bio_list))) {
-		if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
-		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
-		     bio_offset_from_zone_start(bio) != wp_offset)) {
-			blk_zone_wplug_bio_io_error(zwplug, bio);
-			continue;
-		}
-
-		wp_offset += bio_sectors(bio);
-		bio_list_add(&bl, bio);
-	}
-
-	bio_list_merge(&zwplug->bio_list, &bl);
-}
-
-static inline void disk_zone_wplug_set_error(struct gendisk *disk,
-					     struct blk_zone_wplug *zwplug)
-{
-	unsigned long flags;
-
-	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
-		return;
-
-	/*
-	 * At this point, we already have a reference on the zone write plug.
-	 * However, since we are going to add the plug to the disk zone write
-	 * plugs work list, increase its reference count. This reference will
-	 * be dropped in disk_zone_wplugs_work() once the error state is
-	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
-	 * finished.
-	 */
-	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
-	refcount_inc(&zwplug->ref);
-
-	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-	list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
-	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
-static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
-					       struct blk_zone_wplug *zwplug)
-{
-	unsigned long flags;
-
-	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
-		return;
-
-	/*
-	 * We are racing with the error handling work which drops the reference
-	 * on the zone write plug after handling the error state. So remove the
-	 * plug from the error list and drop its reference count only if the
-	 * error handling has not yet started, that is, if the zone write plug
-	 * is still listed.
-	 */
-	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-	if (!list_empty(&zwplug->link)) {
-		list_del_init(&zwplug->link);
-		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
-		disk_put_zone_wplug(zwplug);
-	}
-	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
-/*
- * Set a zone write plug write pointer offset to either 0 (zone reset case)
- * or to the zone size (zone finish case). This aborts all plugged BIOs, which
- * is fine to do as doing a zone reset or zone finish while writes are in-flight
- * is a mistake from the user which will most likely cause all plugged BIOs to
- * fail anyway.
+ * Set a zone write plug write pointer offset to the specified value.
+ * This aborts all plugged BIOs, which is fine as this function is called for
+ * a zone reset operation, a zone finish operation or if the zone needs a wp
+ * update from a report zone after a write error.
  */
 static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
 					  struct blk_zone_wplug *zwplug,
 					  unsigned int wp_offset)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&zwplug->lock, flags);
-
-	/*
-	 * Make sure that a BIO completion or another zone reset or finish
-	 * operation has not already removed the plug from the hash table.
-	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
-		spin_unlock_irqrestore(&zwplug->lock, flags);
-		return;
-	}
+	lockdep_assert_held(&zwplug->lock);
 
 	/* Update the zone write pointer and abort all plugged BIOs. */
+	zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE;
 	zwplug->wp_offset = wp_offset;
 	disk_zone_wplug_abort(zwplug);
 
-	/*
-	 * Updating the write pointer offset puts back the zone
-	 * in a good state. So clear the error flag and decrement the
-	 * error count if we were in error state.
-	 */
-	disk_zone_wplug_clear_error(disk, zwplug);
-
 	/*
 	 * The zone write plug now has no BIO plugged: remove it from the
 	 * hash table so that it cannot be seen. The plug will be freed
@@ -690,8 +619,48 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
 	 */
 	if (disk_should_remove_zone_wplug(disk, zwplug))
 		disk_remove_zone_wplug(disk, zwplug);
+}
 
+static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
+{
+	switch (zone->cond) {
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		return zone->wp - zone->start;
+	case BLK_ZONE_COND_FULL:
+		return zone->len;
+	case BLK_ZONE_COND_EMPTY:
+		return 0;
+	case BLK_ZONE_COND_NOT_WP:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+	default:
+		/*
+		 * Conventional, offline and read-only zones do not have a valid
+		 * write pointer.
+		 */
+		return UINT_MAX;
+	}
+}
+
+static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
+					   struct blk_zone *zone)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	zwplug = disk_get_zone_wplug(disk, zone->start);
+	if (!zwplug)
+		return;
+
+	spin_lock_irqsave(&zwplug->lock, flags);
+	if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
+		disk_zone_wplug_set_wp_offset(disk, zwplug,
+					      blk_zone_wp_offset(zone));
 	spin_unlock_irqrestore(&zwplug->lock, flags);
+
+	disk_put_zone_wplug(zwplug);
 }
 
 static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
@@ -700,6 +669,7 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
 	sector_t sector = bio->bi_iter.bi_sector;
 	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
 
 	/* Conventional zones cannot be reset nor finished. */
 	if (!bdev_zone_is_seq(bio->bi_bdev, sector)) {
@@ -725,7 +695,9 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
 	 */
 	zwplug = disk_get_zone_wplug(disk, sector);
 	if (zwplug) {
+		spin_lock_irqsave(&zwplug->lock, flags);
 		disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
+		spin_unlock_irqrestore(&zwplug->lock, flags);
 		disk_put_zone_wplug(zwplug);
 	}
 
@@ -736,6 +708,7 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
 {
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
 	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
 	sector_t sector;
 
 	/*
@@ -747,7 +720,9 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
 	     sector += disk->queue->limits.chunk_sectors) {
 		zwplug = disk_get_zone_wplug(disk, sector);
 		if (zwplug) {
+			spin_lock_irqsave(&zwplug->lock, flags);
 			disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
+			spin_unlock_irqrestore(&zwplug->lock, flags);
 			disk_put_zone_wplug(zwplug);
 		}
 	}
@@ -929,13 +904,23 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 {
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
 
+	/*
+	 * If we lost track of the zone write pointer due to a write error,
+	 * the user must either execute a report zones, reset the zone or finish
+	 * the to recover a reliable write pointer position. Fail BIOs if the
+	 * user did not do that as we cannot handle emulated zone append
+	 * otherwise.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE)
+		return false;
+
 	/*
 	 * Check that the user is not attempting to write to a full zone.
 	 * We know such BIO will fail, and that would potentially overflow our
 	 * write pointer offset beyond the end of the zone.
 	 */
 	if (disk_zone_wplug_is_full(disk, zwplug))
-		goto err;
+		return false;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		/*
@@ -954,24 +939,18 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 		bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
 	} else {
 		/*
-		 * Check for non-sequential writes early because we avoid a
-		 * whole lot of error handling trouble if we don't send it off
-		 * to the driver.
+		 * Check for non-sequential writes early as we know that BIOs
+		 * with a start sector not unaligned to the zone write pointer
+		 * will fail.
 		 */
 		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
-			goto err;
+			return false;
 	}
 
 	/* Advance the zone write pointer offset. */
 	zwplug->wp_offset += bio_sectors(bio);
 
 	return true;
-
-err:
-	/* We detected an invalid write BIO: schedule error recovery. */
-	disk_zone_wplug_set_error(disk, zwplug);
-	kblockd_schedule_work(&disk->zone_wplugs_work);
-	return false;
 }
 
 static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
@@ -1021,20 +1000,20 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
 	/*
-	 * If the zone is already plugged or has a pending error, add the BIO
-	 * to the plug BIO list. Do the same for REQ_NOWAIT BIOs to ensure that
-	 * we will not see a BLK_STS_AGAIN failure if we let the BIO execute.
+	 * If the zone is already plugged, add the BIO to the plug BIO list.
+	 * Do the same for REQ_NOWAIT BIOs to ensure that we will not see a
+	 * BLK_STS_AGAIN failure if we let the BIO execute.
 	 * Otherwise, plug and let the BIO execute.
 	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY || (bio->bi_opf & REQ_NOWAIT))
+	if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
+	    (bio->bi_opf & REQ_NOWAIT))
 		goto plug;
 
-	/*
-	 * If an error is detected when preparing the BIO, add it to the BIO
-	 * list so that error recovery can deal with it.
-	 */
-	if (!blk_zone_wplug_prepare_bio(zwplug, bio))
-		goto plug;
+	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		bio_io_error(bio);
+		return true;
+	}
 
 	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
 
@@ -1134,16 +1113,6 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 
 	spin_lock_irqsave(&zwplug->lock, flags);
 
-	/*
-	 * If we had an error, schedule error recovery. The recovery work
-	 * will restart submission of plugged BIOs.
-	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
-		spin_unlock_irqrestore(&zwplug->lock, flags);
-		kblockd_schedule_work(&disk->zone_wplugs_work);
-		return;
-	}
-
 	/* Schedule submission of the next plugged BIO if we have one. */
 	if (!bio_list_empty(&zwplug->bio_list)) {
 		disk_zone_wplug_schedule_bio_work(disk, zwplug);
@@ -1186,12 +1155,13 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	}
 
 	/*
-	 * If the BIO failed, mark the plug as having an error to trigger
-	 * recovery.
+	 * If the BIO failed, abort all plugged BIOs and mark the plug as
+	 * needing a write pointer update.
 	 */
 	if (bio->bi_status != BLK_STS_OK) {
 		spin_lock_irqsave(&zwplug->lock, flags);
-		disk_zone_wplug_set_error(disk, zwplug);
+		disk_zone_wplug_abort(zwplug);
+		zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
 		spin_unlock_irqrestore(&zwplug->lock, flags);
 	}
 
@@ -1247,6 +1217,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	 */
 	spin_lock_irqsave(&zwplug->lock, flags);
 
+again:
 	bio = bio_list_pop(&zwplug->bio_list);
 	if (!bio) {
 		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
@@ -1255,10 +1226,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	}
 
 	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
-		/* Error recovery will decide what to do with the BIO. */
-		bio_list_add_head(&zwplug->bio_list, bio);
-		spin_unlock_irqrestore(&zwplug->lock, flags);
-		goto put_zwplug;
+		blk_zone_wplug_bio_io_error(zwplug, bio);
+		goto again;
 	}
 
 	spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1280,120 +1249,6 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	disk_put_zone_wplug(zwplug);
 }
 
-static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
-{
-	switch (zone->cond) {
-	case BLK_ZONE_COND_IMP_OPEN:
-	case BLK_ZONE_COND_EXP_OPEN:
-	case BLK_ZONE_COND_CLOSED:
-		return zone->wp - zone->start;
-	case BLK_ZONE_COND_FULL:
-		return zone->len;
-	case BLK_ZONE_COND_EMPTY:
-		return 0;
-	case BLK_ZONE_COND_NOT_WP:
-	case BLK_ZONE_COND_OFFLINE:
-	case BLK_ZONE_COND_READONLY:
-	default:
-		/*
-		 * Conventional, offline and read-only zones do not have a valid
-		 * write pointer.
-		 */
-		return UINT_MAX;
-	}
-}
-
-static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
-					 unsigned int idx, void *data)
-{
-	struct blk_zone *zonep = data;
-
-	*zonep = *zone;
-	return 0;
-}
-
-static void disk_zone_wplug_handle_error(struct gendisk *disk,
-					 struct blk_zone_wplug *zwplug)
-{
-	sector_t zone_start_sector =
-		bdev_zone_sectors(disk->part0) * zwplug->zone_no;
-	unsigned int noio_flag;
-	struct blk_zone zone;
-	unsigned long flags;
-	int ret;
-
-	/* Get the current zone information from the device. */
-	noio_flag = memalloc_noio_save();
-	ret = disk->fops->report_zones(disk, zone_start_sector, 1,
-				       blk_zone_wplug_report_zone_cb, &zone);
-	memalloc_noio_restore(noio_flag);
-
-	spin_lock_irqsave(&zwplug->lock, flags);
-
-	/*
-	 * A zone reset or finish may have cleared the error already. In such
-	 * case, do nothing as the report zones may have seen the "old" write
-	 * pointer value before the reset/finish operation completed.
-	 */
-	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
-		goto unlock;
-
-	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
-
-	if (ret != 1) {
-		/*
-		 * We failed to get the zone information, meaning that something
-		 * is likely really wrong with the device. Abort all remaining
-		 * plugged BIOs as otherwise we could endup waiting forever on
-		 * plugged BIOs to complete if there is a queue freeze on-going.
-		 */
-		disk_zone_wplug_abort(zwplug);
-		goto unplug;
-	}
-
-	/* Update the zone write pointer offset. */
-	zwplug->wp_offset = blk_zone_wp_offset(&zone);
-	disk_zone_wplug_abort_unaligned(disk, zwplug);
-
-	/* Restart BIO submission if we still have any BIO left. */
-	if (!bio_list_empty(&zwplug->bio_list)) {
-		disk_zone_wplug_schedule_bio_work(disk, zwplug);
-		goto unlock;
-	}
-
-unplug:
-	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
-	if (disk_should_remove_zone_wplug(disk, zwplug))
-		disk_remove_zone_wplug(disk, zwplug);
-
-unlock:
-	spin_unlock_irqrestore(&zwplug->lock, flags);
-}
-
-static void disk_zone_wplugs_work(struct work_struct *work)
-{
-	struct gendisk *disk =
-		container_of(work, struct gendisk, zone_wplugs_work);
-	struct blk_zone_wplug *zwplug;
-	unsigned long flags;
-
-	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-
-	while (!list_empty(&disk->zone_wplugs_err_list)) {
-		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
-					  struct blk_zone_wplug, link);
-		list_del_init(&zwplug->link);
-		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-
-		disk_zone_wplug_handle_error(disk, zwplug);
-		disk_put_zone_wplug(zwplug);
-
-		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-	}
-
-	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
 static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
 {
 	return 1U << disk->zone_wplugs_hash_bits;
@@ -1402,8 +1257,6 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
 void disk_init_zone_resources(struct gendisk *disk)
 {
 	spin_lock_init(&disk->zone_wplugs_lock);
-	INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
-	INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
 }
 
 /*
@@ -1502,8 +1355,6 @@ void disk_free_zone_resources(struct gendisk *disk)
 	if (!disk->zone_wplugs_pool)
 		return;
 
-	cancel_work_sync(&disk->zone_wplugs_work);
-
 	if (disk->zone_wplugs_wq) {
 		destroy_workqueue(disk->zone_wplugs_wq);
 		disk->zone_wplugs_wq = NULL;
@@ -1700,6 +1551,8 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
 	if (!disk->zone_wplugs_hash)
 		return 0;
 
+	disk_zone_wplug_sync_wp_offset(disk, zone);
+
 	wp_offset = blk_zone_wp_offset(zone);
 	if (!wp_offset || wp_offset >= zone->capacity)
 		return 0;
@@ -1830,6 +1683,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		memalloc_noio_restore(noio_flag);
 		return ret;
 	}
+
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
 				       blk_revalidate_zone_cb, &args);
 	if (!ret) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08a727b40816..15e7dfc013b7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -200,8 +200,6 @@ struct gendisk {
 	spinlock_t              zone_wplugs_lock;
 	struct mempool_s	*zone_wplugs_pool;
 	struct hlist_head       *zone_wplugs_hash;
-	struct list_head        zone_wplugs_err_list;
-	struct work_struct	zone_wplugs_work;
 	struct workqueue_struct *zone_wplugs_wq;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
  2024-12-08 22:57 [PATCH v2 0/4] Zone write plugging fixes Damien Le Moal
                   ` (2 preceding siblings ...)
  2024-12-08 22:57 ` [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
@ 2024-12-08 22:57 ` Damien Le Moal
  2024-12-09  7:44   ` Christoph Hellwig
  3 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-08 22:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel
  Cc: Christoph Hellwig, Bart Van Assche

The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming zones, to write valid data blocks at the correct zone
start relative offset. However, the first call to blkdev_issue_zeroout()
will try to use hardware offload through a REQ_OP_WRITE_ZEROES
operation, if the device reports a non-zero max zero sectors limit. If
this operation fails, blkdev_issue_zeroout() falls back to using a
regular write operation with zeror-pages.

With the removal of the zone write plug automatic write pointer recovery
after a write error, the first attempt using a REQ_OP_WRITE_ZEROES
leaves the target zone write pointer out of sync with the drive current
value as the zone writ eplugging code advanced the zone write plug write
pointer offset on submission of the REQ_OP_WRITE_ZEROES operation. The
target zone is marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE, which causes
the fallback regular write operation to also fail.

blkdev_issue_zeroout() callers such as dmz_reclaim_align_wp() can
recover from such situation by executing a report zones and retrying the
call to blkdev_issue_zeroout() to handle this recoverable error
situation. Given that such pattern will be common to all users of
blkdev_issue_zeroout(), introduce the function
blkdev_issue_zone_zeroout() to automatically handle such recovery. This
function calls blkdev_issue_zeroout() with the BLKDEV_ZERO_NOFALLBACK
flag to intercept failures on the first execution which attempt to use
the device hardware offload with a REQ_OP_WRITE_ZEROES. If this attempt
fails, blkdev_report_zones() is executed to recover the target zone to a
good state and execute again blkdev_issue_zeroout() without the
BLKDEV_ZERO_NOFALLBACK flag.

Replacing the call to blkdev_issue_zeroout() with a call to
blkdev_issue_zone_zeroout() in dmz_reclaim_align_wp() thus solves
irrecoverable write errors triggered by the removal of the zone write
plugging automatic recovery (commit "block: Prevent potential deadlocks
in zone write plug error recovery").

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c             | 55 +++++++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-reclaim.c |  4 +--
 include/linux/blkdev.h        |  3 ++
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d709f12d5935..c5400792de13 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -129,6 +129,9 @@ static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
 	if (disk->zone_wplugs_hash)
 		disk_zone_wplug_sync_wp_offset(disk, zone);
 
+	if (!args->user_cb)
+		return 0;
+
 	return args->user_cb(zone, idx, args->user_data);
 }
 
@@ -663,6 +666,16 @@ static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
 	disk_put_zone_wplug(zwplug);
 }
 
+static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector)
+{
+	struct disk_report_zones_cb_args args = {
+		.disk = disk,
+	};
+
+	return disk->fops->report_zones(disk, sector, 1,
+					disk_report_zones_cb, &args);
+}
+
 static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
 						  unsigned int wp_offset)
 {
@@ -1720,6 +1733,48 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
 
+/**
+ * blkdev_issue_zone_zeroout - zero-fill a block range in a zone
+ * @bdev:	blockdev to write
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to write
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Zero-fill a block range in a zone (@sector must be equal to the zone write
+ *  pointer), handling potential errors due to the (initially unknown) lack of
+ *  hardware offload (See blkdev_issue_zeroout()).
+ */
+int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
+		return -EIO;
+
+	ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
+				   BLKDEV_ZERO_NOFALLBACK);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
+	/*
+	 * The failed call to blkdev_issue_zeroout() advanced the zone write
+	 * pointer. Undo this using a report zone to update the zone write
+	 * pointer to the correct current value.
+	 */
+	ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
+	if (ret != 1)
+		return ret < 0 ? ret : -EIO;
+
+	/*
+	 * Retry without BLKDEV_ZERO_NOFALLBACK to force the fallback to a
+	 * regular write with zero-pages.
+	 */
+	return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0);
+}
+EXPORT_SYMBOL(blkdev_issue_zone_zeroout);
+
 #ifdef CONFIG_BLK_DEBUG_FS
 
 int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
index d58db9a27e6c..b085a929e64e 100644
--- a/drivers/md/dm-zoned-reclaim.c
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -76,9 +76,9 @@ static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone,
 	 * pointer and the requested position.
 	 */
 	nr_blocks = block - wp_block;
-	ret = blkdev_issue_zeroout(dev->bdev,
+	ret = blkdev_issue_zone_zeroout(dev->bdev,
 				   dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block),
-				   dmz_blk2sect(nr_blocks), GFP_NOIO, 0);
+				   dmz_blk2sect(nr_blocks), GFP_NOIO);
 	if (ret) {
 		dmz_dev_err(dev,
 			    "Align zone %u wp %llu to %llu (wp+%u) blocks failed %d",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 15e7dfc013b7..696fdafcfe91 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1419,6 +1419,9 @@ static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
 	return is_seq;
 }
 
+int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector,
+			      sector_t nr_sects, gfp_t gfp_mask);
+
 static inline unsigned int queue_dma_alignment(const struct request_queue *q)
 {
 	return q->limits.dma_alignment;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
  2024-12-08 22:57 ` [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
@ 2024-12-09  7:44   ` Christoph Hellwig
  2024-12-09  8:38     ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  7:44 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Christoph Hellwig, Bart Van Assche

On Mon, Dec 09, 2024 at 07:57:58AM +0900, Damien Le Moal wrote:
> +int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask)

Nit: Would blk_zone_issue_zeroout be a better name?

Also I think this needs to be re-ordered before the previous patch to
preserve bisectability.

> +EXPORT_SYMBOL(blkdev_issue_zone_zeroout);

EXPORT_SYMBOL_GPL is used for all other zoned code, I'd do the same
here for consitency.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
  2024-12-08 22:57 ` [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
@ 2024-12-09  7:44   ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  7:44 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Christoph Hellwig, Bart Van Assche

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations
  2024-12-08 22:57 ` [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
@ 2024-12-09  7:46   ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  7:46 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Christoph Hellwig, Bart Van Assche

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery
  2024-12-08 22:57 ` [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
@ 2024-12-09  7:57   ` Christoph Hellwig
  2024-12-09  8:18     ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  7:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Christoph Hellwig, Bart Van Assche

On Mon, Dec 09, 2024 at 07:57:57AM +0900, Damien Le Moal wrote:
> Avoid this problem by completely getting rid of the need for executing a
> report zone from within the zone write plugging code, instead relying on
> the user either executing a report zones, resetting the zone or
> finishing the zone of a failed write. This is not an unresannable

s/unresannable/unreasonable/ ?

> requirement as all well-behaved applications, FSes and device mapper
> already use report zones to recover from write errors whenever possible.

I think the real question here is what errors the file system (or other
submitter) can even recover from.  The next patch deals with the not
support case for a "special" operation, and that's of course a valid one.
The first patch already excludes EAGAIN from nowait, and the drivers
already retry anything that they think is retryable by just resubmitting
without bubbling it up to the submitter.  That mostly leaves fatal
media errors as all modern hardware that supports zones just remaps
on write media failures.  I.e. for those the most sane answer is to
simply shut down the file system for single-device file systems, or
treat the device as faulty for multi-device file systems.  This might
change when we support logical depop on a per-zone basis, but I don't
think anyone is there yet.  We also really should test this case.
I'll add a testcase with error injection for zoned xfs, and someone
should do the same for btrfs (including multi-device handling) and
f2fs.

Sorry for the long rant - not a comment on the code itself but maybe
the commit log could use a little update.

Also we probably need to recover this information somewhere in the
docs.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery
  2024-12-09  7:57   ` Christoph Hellwig
@ 2024-12-09  8:18     ` Damien Le Moal
  2024-12-09  8:21       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-09  8:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Bart Van Assche

On 12/9/24 16:57, Christoph Hellwig wrote:
> On Mon, Dec 09, 2024 at 07:57:57AM +0900, Damien Le Moal wrote:
>> Avoid this problem by completely getting rid of the need for executing a
>> report zone from within the zone write plugging code, instead relying on
>> the user either executing a report zones, resetting the zone or
>> finishing the zone of a failed write. This is not an unresannable
> 
> s/unresannable/unreasonable/ ?

yes.

> 
>> requirement as all well-behaved applications, FSes and device mapper
>> already use report zones to recover from write errors whenever possible.
> 
> I think the real question here is what errors the file system (or other
> submitter) can even recover from.  The next patch deals with the not
> support case for a "special" operation, and that's of course a valid one.

Yep. But even that one is actually coded in scsi to return a -EIO instead of
ENOTSUPP. We can patch that (return ENOTSUPP for an invalid opcode error), but I
am not sure if that is safe to do given that this has been like this for ages.

This is all to say that we cannot even reliably distinguish special/valid error
cases that can be recovered from actual medium/hard errors.

> The first patch already excludes EAGAIN from nowait, and the drivers
> already retry anything that they think is retryable by just resubmitting
> without bubbling it up to the submitter.  That mostly leaves fatal
> media errors as all modern hardware that supports zones just remaps
> on write media failures.  I.e. for those the most sane answer is to
> simply shut down the file system for single-device file systems, or
> treat the device as faulty for multi-device file systems.  This might
> change when we support logical depop on a per-zone basis, but I don't
> think anyone is there yet.  We also really should test this case.
> I'll add a testcase with error injection for zoned xfs, and someone
> should do the same for btrfs (including multi-device handling) and
> f2fs.

I have test cases for zonefs already. That is because zonefs has the
"recover-error" mount option which forces a recovery of a file size (== write
pointer position) if a write fails or is torn. The default even for zonefs is to
go read-only since there is indeed not much we can do about failed writes.

> Sorry for the long rant - not a comment on the code itself but maybe
> the commit log could use a little update.

OK. Will try to improve it.

> Also we probably need to recover this information somewhere in the
> docs.

Hmmm... not sure we have a good place for this. Let me figure out something.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery
  2024-12-09  8:18     ` Damien Le Moal
@ 2024-12-09  8:21       ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  8:21 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, Mike Snitzer,
	Mikulas Patocka, dm-devel, Bart Van Assche

On Mon, Dec 09, 2024 at 05:18:00PM +0900, Damien Le Moal wrote:
> 
> Yep. But even that one is actually coded in scsi to return a -EIO instead of
> ENOTSUPP. We can patch that (return ENOTSUPP for an invalid opcode error), but I
> am not sure if that is safe to do given that this has been like this for ages.
> 
> This is all to say that we cannot even reliably distinguish special/valid error
> cases that can be recovered from actual medium/hard errors.

I think we can't change the return value, as the whole thing is messy.
I just meant EOPNOTSUP-like.  The exact error should not matter for
the handling anyway, just reasoning about use cases.

> I have test cases for zonefs already. That is because zonefs has the
> "recover-error" mount option which forces a recovery of a file size (== write
> pointer position) if a write fails or is torn. The default even for zonefs is to
> go read-only since there is indeed not much we can do about failed writes.

Yes, that іs the sensible way to handle errors as far as I'm conerned.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
  2024-12-09  7:44   ` Christoph Hellwig
@ 2024-12-09  8:38     ` Damien Le Moal
  2024-12-09  8:39       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2024-12-09  8:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Bart Van Assche

On 12/9/24 16:44, Christoph Hellwig wrote:
> On Mon, Dec 09, 2024 at 07:57:58AM +0900, Damien Le Moal wrote:
>> +int blkdev_issue_zone_zeroout(struct block_device *bdev, sector_t sector,
>> +		sector_t nr_sects, gfp_t gfp_mask)
> 
> Nit: Would blk_zone_issue_zeroout be a better name?

Yes.

> Also I think this needs to be re-ordered before the previous patch to
> preserve bisectability.

The problem with doing that is that there is absolutely nothing to patch/fix
before the previous patch, since the "recovery/report zones" was done
automatically. So if anything, maybe I should just squash this patch together
with the previous one to be consistent against bisect ? That does make sense
since this patch is needed *because* of the previous patch change.

>> +EXPORT_SYMBOL(blkdev_issue_zone_zeroout);
> 
> EXPORT_SYMBOL_GPL is used for all other zoned code, I'd do the same
> here for consitency.

Indeed. Will do.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
  2024-12-09  8:38     ` Damien Le Moal
@ 2024-12-09  8:39       ` Christoph Hellwig
  2024-12-09  8:40         ` Damien Le Moal
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2024-12-09  8:39 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, Mike Snitzer,
	Mikulas Patocka, dm-devel, Bart Van Assche

On Mon, Dec 09, 2024 at 05:38:40PM +0900, Damien Le Moal wrote:
> 
> The problem with doing that is that there is absolutely nothing to patch/fix
> before the previous patch, since the "recovery/report zones" was done
> automatically. So if anything, maybe I should just squash this patch together
> with the previous one to be consistent against bisect ? That does make sense
> since this patch is needed *because* of the previous patch change.

But it's also harmless to do the extra zone report.  So just state that
in the commit log.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment
  2024-12-09  8:39       ` Christoph Hellwig
@ 2024-12-09  8:40         ` Damien Le Moal
  0 siblings, 0 replies; 14+ messages in thread
From: Damien Le Moal @ 2024-12-09  8:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, Mike Snitzer, Mikulas Patocka, dm-devel,
	Bart Van Assche

On 12/9/24 17:39, Christoph Hellwig wrote:
> On Mon, Dec 09, 2024 at 05:38:40PM +0900, Damien Le Moal wrote:
>>
>> The problem with doing that is that there is absolutely nothing to patch/fix
>> before the previous patch, since the "recovery/report zones" was done
>> automatically. So if anything, maybe I should just squash this patch together
>> with the previous one to be consistent against bisect ? That does make sense
>> since this patch is needed *because* of the previous patch change.
> 
> But it's also harmless to do the extra zone report.  So just state that
> in the commit log.

OK.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-12-09  8:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-08 22:57 [PATCH v2 0/4] Zone write plugging fixes Damien Le Moal
2024-12-08 22:57 ` [PATCH v2 1/4] block: Use a zone write plug BIO work for REQ_NOWAIT BIOs Damien Le Moal
2024-12-09  7:44   ` Christoph Hellwig
2024-12-08 22:57 ` [PATCH v2 2/4] block: Ignore REQ_NOWAIT for zone reset and zone finish operations Damien Le Moal
2024-12-09  7:46   ` Christoph Hellwig
2024-12-08 22:57 ` [PATCH v2 3/4] block: Prevent potential deadlocks in zone write plug error recovery Damien Le Moal
2024-12-09  7:57   ` Christoph Hellwig
2024-12-09  8:18     ` Damien Le Moal
2024-12-09  8:21       ` Christoph Hellwig
2024-12-08 22:57 ` [PATCH v2 4/4] dm: Fix dm-zoned-reclaim zone write pointer alignment Damien Le Moal
2024-12-09  7:44   ` Christoph Hellwig
2024-12-09  8:38     ` Damien Le Moal
2024-12-09  8:39       ` Christoph Hellwig
2024-12-09  8:40         ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).