Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
@ 2024-11-24 15:33 kernel test robot
  0 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2024-11-24 15:33 UTC (permalink / raw)
  To: oe-kbuild; +Cc: lkp, Dan Carpenter

BCC: lkp@intel.com
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <20241119002815.600608-5-bvanassche@acm.org>
References: <20241119002815.600608-5-bvanassche@acm.org>
TO: Bart Van Assche <bvanassche@acm.org>

Hi Bart,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on next-20241122]
[cannot apply to jejb-scsi/for-next mkp-scsi/for-next v6.12]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bart-Van-Assche/blk-zoned-Fix-a-reference-count-leak/20241121-112830
base:   linus/master
patch link:    https://lore.kernel.org/r/20241119002815.600608-5-bvanassche%40acm.org
patch subject: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
:::::: branch date: 4 days ago
:::::: commit date: 4 days ago
config: microblaze-randconfig-r072-20241124 (https://download.01.org/0day-ci/archive/20241124/202411242326.HSKyGmyJ-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 14.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>
| Closes: https://lore.kernel.org/r/202411242326.HSKyGmyJ-lkp@intel.com/

New smatch warnings:
block/blk-zoned.c:804 blk_zone_all_zwr_inserted() warn: iterator 'i' not incremented

Old smatch warnings:
arch/microblaze/include/asm/thread_info.h:85 current_thread_info() error: uninitialized symbol 'sp'.

vim +/i +804 block/blk-zoned.c

2238dfb90653aa Bart Van Assche 2024-11-18  779  
2238dfb90653aa Bart Van Assche 2024-11-18  780  /*
2238dfb90653aa Bart Van Assche 2024-11-18  781   * Report whether or not all zoned writes for a zone have been inserted into a
2238dfb90653aa Bart Van Assche 2024-11-18  782   * software queue, elevator queue or hardware queue.
2238dfb90653aa Bart Van Assche 2024-11-18  783   */
2238dfb90653aa Bart Van Assche 2024-11-18  784  static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
2238dfb90653aa Bart Van Assche 2024-11-18  785  {
2238dfb90653aa Bart Van Assche 2024-11-18  786  	struct gendisk *disk = zwplug->disk;
2238dfb90653aa Bart Van Assche 2024-11-18  787  	struct request_queue *q = disk->queue;
2238dfb90653aa Bart Van Assche 2024-11-18  788  	struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
2238dfb90653aa Bart Van Assche 2024-11-18  789  	struct blk_mq_hw_ctx *hctx;
2238dfb90653aa Bart Van Assche 2024-11-18  790  	long unsigned int i;
2238dfb90653aa Bart Van Assche 2024-11-18  791  	struct request *rq;
2238dfb90653aa Bart Van Assche 2024-11-18  792  
2238dfb90653aa Bart Van Assche 2024-11-18  793  	scoped_guard(spinlock_irqsave, &q->requeue_lock) {
2238dfb90653aa Bart Van Assche 2024-11-18  794  		list_for_each_entry(rq, &q->requeue_list, queuelist)
2238dfb90653aa Bart Van Assche 2024-11-18  795  			if (blk_rq_is_seq_zoned_write(rq) &&
2238dfb90653aa Bart Van Assche 2024-11-18  796  			    bio_zone_no(rq->bio) == zwplug->zone_no)
2238dfb90653aa Bart Van Assche 2024-11-18  797  				return false;
2238dfb90653aa Bart Van Assche 2024-11-18  798  		list_for_each_entry(rq, &q->flush_list, queuelist)
2238dfb90653aa Bart Van Assche 2024-11-18  799  			if (blk_rq_is_seq_zoned_write(rq) &&
2238dfb90653aa Bart Van Assche 2024-11-18  800  			    bio_zone_no(rq->bio) == zwplug->zone_no)
2238dfb90653aa Bart Van Assche 2024-11-18  801  				return false;
2238dfb90653aa Bart Van Assche 2024-11-18  802  	}
2238dfb90653aa Bart Van Assche 2024-11-18  803  
2238dfb90653aa Bart Van Assche 2024-11-18 @804  	queue_for_each_hw_ctx(q, hctx, i) {
2238dfb90653aa Bart Van Assche 2024-11-18  805  		struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
2238dfb90653aa Bart Van Assche 2024-11-18  806  
2238dfb90653aa Bart Van Assche 2024-11-18  807  		blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
2238dfb90653aa Bart Van Assche 2024-11-18  808  		if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
2238dfb90653aa Bart Van Assche 2024-11-18  809  			break;
2238dfb90653aa Bart Van Assche 2024-11-18  810  	}
2238dfb90653aa Bart Van Assche 2024-11-18  811  
2238dfb90653aa Bart Van Assche 2024-11-18  812  	return d.res;
2238dfb90653aa Bart Van Assche 2024-11-18  813  }
2238dfb90653aa Bart Van Assche 2024-11-18  814  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v16 00/26] Improve write performance for zoned UFS devices
@ 2024-11-19  0:27 Bart Van Assche
  2024-11-19  0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2024-11-19  0:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche

Hi Damien and Christoph,

This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
you are probably busy because the merge window is open, please take a look
at this patch series when you have the time. This patch series is organized
as follows:
 - Bug fixes for existing code at the start of the series.
 - The write pipelining support implementation comes after the bug fixes.

Thanks,

Bart.

Changes compared to v15:
 - Reworked this patch series on top of the zone write plugging approach.
 - Moved support for requeuing requests from the SCSI core into the block
   layer core.
 - In the UFS driver, instead of disabling write pipelining if
   auto-hibernation is enabled, rely on the requeuing mechanism to handle
   reordering caused by resuming from auto-hibernation.

Changes compared to v14:
 - Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
   files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
   directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
   directives that were added to make the unit tests link.
 - Fixed a double free in a unit test.

Changes compared to v13:
 - Reworked patch "block: Preserve the order of requeued zoned writes".
 - Addressed a performance concern by removing the eh_needs_prepare_resubmit
   SCSI driver callback and by introducing the SCSI host template flag
   .needs_prepare_resubmit instead.
 - Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
 - Made the code in unit tests less repetitive.

Changes compared to v12:
 - Added two new patches: "block: Preserve the order of requeued zoned writes"
   and "scsi: sd: Add a unit test for sd_cmp_sector()"
 - Restricted the number of zoned write retries. To my surprise I had to add
   "&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
   number of retries.
 - In patch "scsi: ufs: Inform the block layer about write ordering", only set
   ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.

Changes compared to v11:
 - Fixed a NULL pointer dereference that happened when booting from an ATA
   device by adding an scmd->device != NULL check in scsi_needs_preparation().
 - Updated Reviewed-by tags.

Changes compared to v10:
 - Dropped the UFS MediaTek and HiSilicon patches because these are not correct
   and because it is safe to drop these patches.
 - Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
 - Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
 - Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
   'preserves_write_order'.
 - Fixed an out-of-bounds access in the unit scsi_call_prepare_resubmit() unit
   test.
 - Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
   WARN_ON_ONCE() such that a kernel stack appears in case an error code is
   returned.
 - Elaborated a comment in the UFSHCI driver.

Changes compared to v8:
 - Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
   in blk_stack_limits().
 - Added a comment in disk_set_zoned().
 - Modified blk_req_needs_zone_write_lock() such that it returns false if
   q->limits.use_zone_write_lock is false.
 - Modified disk_clear_zone_settings() such that it clears
   q->limits.use_zone_write_lock.
 - Left out one change from the mq-deadline patch that became superfluous due to
   the blk_req_needs_zone_write_lock() change.
 - Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
   zoned writes have to be resubmitted for which zone write locking is disabled.
 - Added an additional unit test for scsi_call_prepare_resubmit().
 - Modified the sorting code in the sd driver such that only those SCSI commands
   are sorted for which write locking is disabled.
 - Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
   write order is not preserved.
 - Included three patches for UFS host drivers that rework code that wrote
   directly to the auto-hibernation controller register.
 - Modified the UFS driver such that enabling auto-hibernation is not allowed
   if a zoned logical unit is present and if the controller operates in legacy
   mode.
 - Also in the UFS driver, simplified ufshcd_auto_hibern8_update().

Changes compared to v7:
 - Split the queue_limits member variable `use_zone_write_lock' into two member
   variables: `use_zone_write_lock' (set by disk_set_zoned()) and
   `driver_preserves_write_order' (set by the block driver or SCSI LLD). This
   should clear up the confusion about the purpose of this variable.
 - Moved the code for sorting SCSI commands by LBA from the SCSI error handler
   into the SCSI disk (sd) driver as requested by Christoph.
   
Changes compared to v6:
 - Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
   the request queue limits data structure.

Changes compared to v5:
 - Renamed scsi_cmp_lba() into scsi_cmp_sector().
 - Improved several source code comments.

Changes compared to v4:
 - Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
 - Dropped the null_blk patch and added two scsi_debug patches instead.
 - Dropped the f2fs patch.
 - Split the patch for the UFS driver into two patches.
 - Modified several patch descriptions and source code comments.
 - Renamed dd_use_write_locking() into dd_use_zone_write_locking().
 - Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
   such that sorting happens just before reinserting.
 - Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
   sure that the retry counter is adjusted once per retry instead of twice.

Changes compared to v3:
 - Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
   had accidentally been left out from v2.
 - In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
   patch description and added the function blk_no_zone_write_lock().
 - In patch "block/mq-deadline: Only use zone locking if necessary", moved the
   blk_queue_is_zoned() call into dd_use_write_locking().
 - In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
   from inside __bio_alloc() instead of in f2fs_submit_write_bio().

Changes compared to v2:
 - Renamed the request queue flag for disabling zone write locking.
 - Introduced a new request flag for disabling zone write locking.
 - Modified the mq-deadline scheduler such that zone write locking is only
   disabled if both flags are set.
 - Added an F2FS patch that sets the request flag for disabling zone write
   locking.
 - Only disable zone write locking in the UFS driver if auto-hibernation is
   disabled.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (26):
  blk-zoned: Fix a reference count leak
  blk-zoned: Split disk_zone_wplugs_work()
  blk-zoned: Split queue_zone_wplugs_show()
  blk-zoned: Only handle errors after pending zoned writes have
    completed
  blk-zoned: Fix a deadlock triggered by unaligned writes
  blk-zoned: Fix requeuing of zoned writes
  block: Support block drivers that preserve the order of write requests
  dm-linear: Report to the block layer that the write order is preserved
  mq-deadline: Remove a local variable
  blk-mq: Clean up blk_mq_requeue_work()
  block: Optimize blk_mq_submit_bio() for the cache hit scenario
  block: Rework request allocation in blk_mq_submit_bio()
  block: Support allocating from a specific software queue
  blk-mq: Restore the zoned write order when requeuing
  blk-zoned: Document the locking order
  blk-zoned: Document locking assumptions
  blk-zoned: Uninline functions that are not in the hot path
  blk-zoned: Make disk_should_remove_zone_wplug() more robust
  blk-zoned: Add an argument to blk_zone_plug_bio()
  blk-zoned: Support pipelining of zoned writes
  scsi: core: Retry unaligned zoned writes
  scsi: sd: Increase retry count for zoned writes
  scsi: scsi_debug: Add the preserves_write_order module parameter
  scsi: scsi_debug: Support injecting unaligned write errors
  scsi: scsi_debug: Skip host/bus reset settle delay
  scsi: ufs: Inform the block layer about write ordering

 block/bfq-iosched.c       |   2 +
 block/blk-mq.c            |  97 ++++++----
 block/blk-mq.h            |   3 +
 block/blk-settings.c      |   2 +
 block/blk-zoned.c         | 376 +++++++++++++++++++++++++-------------
 block/blk.h               |  33 +++-
 block/kyber-iosched.c     |   2 +
 block/mq-deadline.c       |  10 +-
 drivers/md/dm-linear.c    |   6 +
 drivers/md/dm.c           |   2 +-
 drivers/scsi/scsi_debug.c |  22 ++-
 drivers/scsi/scsi_error.c |  16 ++
 drivers/scsi/sd.c         |   7 +
 drivers/ufs/core/ufshcd.c |   7 +
 include/linux/blk-mq.h    |  20 +-
 include/linux/blkdev.h    |  11 +-
 16 files changed, 444 insertions(+), 172 deletions(-)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
  2024-11-19  0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
@ 2024-11-19  0:27 ` Bart Van Assche
  2024-11-19  2:50   ` Damien Le Moal
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2024-11-19  0:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche

Instead of handling write errors immediately, only handle these after all
pending zoned write requests have completed or have been requeued. This
patch prepares for changing the zone write pointer tracking approach.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         |   9 +++
 block/blk-zoned.c      | 154 +++++++++++++++++++++++++++++++++++++++--
 block/blk.h            |  29 ++++++++
 include/linux/blk-mq.h |  18 +++++
 4 files changed, 203 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 270cfd9fc6b0..a45077e187b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -793,6 +793,9 @@ void blk_mq_free_request(struct request *rq)
 	rq_qos_done(q, rq);
 
 	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+	blk_zone_free_request(rq);
+
 	if (req_ref_put_and_test(rq))
 		__blk_mq_free_request(rq);
 }
@@ -1189,6 +1192,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 			continue;
 
 		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+		blk_zone_free_request(rq);
+
 		if (!req_ref_put_and_test(rq))
 			continue;
 
@@ -1507,6 +1513,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	if (blk_mq_request_started(rq)) {
 		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
 		rq->rq_flags &= ~RQF_TIMED_OUT;
+		blk_zone_requeue_work(q);
 	}
 }
 
@@ -1542,6 +1549,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
 	list_splice_init(&q->flush_list, &flush_list);
 	spin_unlock_irq(&q->requeue_lock);
 
+	blk_zone_requeue_work(q);
+
 	while (!list_empty(&rq_list)) {
 		rq = list_entry(rq_list.next, struct request, queuelist);
 		/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7e6e6ebb6235..b570b773e65f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
 	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
 		return;
 
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
 	/*
 	 * At this point, we already have a reference on the zone write plug.
 	 * However, since we are going to add the plug to the disk zone write
@@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
 	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
 	 * finished.
 	 */
-	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
 	refcount_inc(&zwplug->ref);
 
 	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
@@ -642,6 +643,7 @@ static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
 	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
 	if (!list_empty(&zwplug->link)) {
 		list_del_init(&zwplug->link);
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
 		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
 		disk_put_zone_wplug(zwplug);
 	}
@@ -746,6 +748,70 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
 	return false;
 }
 
+struct all_zwr_inserted_data {
+	struct blk_zone_wplug *zwplug;
+	bool res;
+};
+
+/*
+ * Changes @data->res to %false if and only if @rq is a zoned write for the
+ * given zone and if it is owned by the block driver.
+ *
+ * @rq members may change while this function is in progress. Hence, use
+ * READ_ONCE() to read @rq members.
+ */
+static bool blk_zwr_inserted(struct request *rq, void *data)
+{
+	struct all_zwr_inserted_data *d = data;
+	struct blk_zone_wplug *zwplug = d->zwplug;
+	struct request_queue *q = zwplug->disk->queue;
+	struct bio *bio = READ_ONCE(rq->bio);
+
+	if (rq->q == q && READ_ONCE(rq->state) != MQ_RQ_IDLE &&
+	    blk_rq_is_seq_zoned_write(rq) && bio &&
+	    bio_zone_no(bio) == zwplug->zone_no) {
+		d->res = false;
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Report whether or not all zoned writes for a zone have been inserted into a
+ * software queue, elevator queue or hardware queue.
+ */
+static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
+{
+	struct gendisk *disk = zwplug->disk;
+	struct request_queue *q = disk->queue;
+	struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
+	struct blk_mq_hw_ctx *hctx;
+	long unsigned int i;
+	struct request *rq;
+
+	scoped_guard(spinlock_irqsave, &q->requeue_lock) {
+		list_for_each_entry(rq, &q->requeue_list, queuelist)
+			if (blk_rq_is_seq_zoned_write(rq) &&
+			    bio_zone_no(rq->bio) == zwplug->zone_no)
+				return false;
+		list_for_each_entry(rq, &q->flush_list, queuelist)
+			if (blk_rq_is_seq_zoned_write(rq) &&
+			    bio_zone_no(rq->bio) == zwplug->zone_no)
+				return false;
+	}
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
+
+		blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
+		if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
+			break;
+	}
+
+	return d.res;
+}
+
 static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 					  struct bio *bio, unsigned int nr_segs)
 {
@@ -1096,6 +1162,29 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
 	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
 }
 
+/*
+ * Change the zone state to "error" if a request is requeued to postpone
+ * processing of requeued requests until all pending requests have either
+ * completed or have been requeued.
+ */
+void blk_zone_write_plug_requeue_request(struct request *rq)
+{
+	struct gendisk *disk = rq->q->disk;
+	struct blk_zone_wplug *zwplug;
+
+	if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
+		return;
+
+	zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+	if (WARN_ON_ONCE(!zwplug))
+		return;
+
+	scoped_guard(spinlock_irqsave, &zwplug->lock)
+		disk_zone_wplug_set_error(disk, zwplug);
+
+	disk_put_zone_wplug(zwplug);
+}
+
 static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 				       struct blk_zone_wplug *zwplug)
 {
@@ -1202,6 +1291,33 @@ void blk_zone_write_plug_finish_request(struct request *req)
 	disk_put_zone_wplug(zwplug);
 }
 
+/*
+ * Schedule zone_plugs_work if a zone is in the error state and if no requests
+ * are in flight. Called from blk_mq_free_request().
+ */
+void blk_zone_write_plug_free_request(struct request *rq)
+{
+	struct gendisk *disk = rq->q->disk;
+	struct blk_zone_wplug *zwplug;
+
+	/*
+	 * Do nothing if this function is called before the zone information
+	 * has been initialized.
+	 */
+	if (!disk->zone_wplugs_hash_bits)
+		return;
+
+	zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+
+	if (!zwplug)
+		return;
+
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+		kblockd_schedule_work(&disk->zone_wplugs_work);
+
+	disk_put_zone_wplug(zwplug);
+}
+
 static void blk_zone_wplug_bio_work(struct work_struct *work)
 {
 	struct blk_zone_wplug *zwplug =
@@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
 
 static void disk_zone_process_err_list(struct gendisk *disk)
 {
-	struct blk_zone_wplug *zwplug;
+	struct blk_zone_wplug *zwplug, *next;
 	unsigned long flags;
 
 	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
 
-	while (!list_empty(&disk->zone_wplugs_err_list)) {
-		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
-					  struct blk_zone_wplug, link);
+	list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
+				 link) {
+		if (!blk_zone_all_zwr_inserted(zwplug))
+			continue;
 		list_del_init(&zwplug->link);
 		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
 
@@ -1361,6 +1478,12 @@ static void disk_zone_process_err_list(struct gendisk *disk)
 	}
 
 	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+	/*
+	 * If one or more zones have been skipped, this work will be requeued
+	 * when a request is requeued (blk_zone_requeue_work()) or freed
+	 * (blk_zone_write_plug_free_request()).
+	 */
 }
 
 static void disk_zone_wplugs_work(struct work_struct *work)
@@ -1371,6 +1494,20 @@ static void disk_zone_wplugs_work(struct work_struct *work)
 	disk_zone_process_err_list(disk);
 }
 
+/* May be called from interrupt context and hence must not sleep. */
+void blk_zone_requeue_work(struct request_queue *q)
+{
+	struct gendisk *disk = q->disk;
+
+	if (!disk)
+		return;
+
+	if (in_interrupt())
+		kblockd_schedule_work(&disk->zone_wplugs_work);
+	else
+		disk_zone_process_err_list(disk);
+}
+
 static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
 {
 	return 1U << disk->zone_wplugs_hash_bits;
@@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
 	zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
-	seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
-		   zwp_wp_offset, zwp_bio_list_size);
+	bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
+
+	seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
+		   zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+		   zwp_bio_list_size, all_zwr_inserted);
 }
 
 int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/block/blk.h b/block/blk.h
index 2c26abf505b8..be945db6298d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -473,6 +473,18 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
 }
+
+void blk_zone_write_plug_requeue_request(struct request *rq);
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+	if (!blk_rq_is_seq_zoned_write(rq))
+		return;
+
+	blk_zone_write_plug_requeue_request(rq);
+}
+
+void blk_zone_requeue_work(struct request_queue *q);
+
 void blk_zone_write_plug_bio_endio(struct bio *bio);
 static inline void blk_zone_bio_endio(struct bio *bio)
 {
@@ -490,6 +502,14 @@ static inline void blk_zone_finish_request(struct request *rq)
 	if (rq->rq_flags & RQF_ZONE_WRITE_PLUGGING)
 		blk_zone_write_plug_finish_request(rq);
 }
+
+void blk_zone_write_plug_free_request(struct request *rq);
+static inline void blk_zone_free_request(struct request *rq)
+{
+	if (blk_queue_is_zoned(rq->q))
+		blk_zone_write_plug_free_request(rq);
+}
+
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
@@ -515,12 +535,21 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
 }
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+}
+static inline void blk_zone_requeue_work(struct request_queue *q)
+{
+}
 static inline void blk_zone_bio_endio(struct bio *bio)
 {
 }
 static inline void blk_zone_finish_request(struct request *rq)
 {
 }
+static inline void blk_zone_free_request(struct request *rq)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c596e0e4cb75..ac05974f08f9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1169,4 +1169,22 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+	default:
+		return false;
+	}
+}
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+	return false;
+}
+#endif /* CONFIG_BLK_DEV_ZONED */
+
 #endif /* BLK_MQ_H */

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
  2024-11-19  0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
@ 2024-11-19  2:50   ` Damien Le Moal
  2024-11-19 20:51     ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2024-11-19  2:50 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim

On 11/19/24 9:27 AM, Bart Van Assche wrote:
> Instead of handling write errors immediately, only handle these after all
> pending zoned write requests have completed or have been requeued. This
> patch prepares for changing the zone write pointer tracking approach.

A little more explanations about how this is achieved would be nice. I was
expecting a shorter change given the short commit message... Took some time to
understand the changes without details.

More comments below.

> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-mq.c         |   9 +++
>  block/blk-zoned.c      | 154 +++++++++++++++++++++++++++++++++++++++--
>  block/blk.h            |  29 ++++++++
>  include/linux/blk-mq.h |  18 +++++
>  4 files changed, 203 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 270cfd9fc6b0..a45077e187b5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -793,6 +793,9 @@ void blk_mq_free_request(struct request *rq)
>  	rq_qos_done(q, rq);
>  
>  	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
> +
> +	blk_zone_free_request(rq);
> +
>  	if (req_ref_put_and_test(rq))
>  		__blk_mq_free_request(rq);
>  }
> @@ -1189,6 +1192,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
>  			continue;
>  
>  		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
> +
> +		blk_zone_free_request(rq);
> +
>  		if (!req_ref_put_and_test(rq))
>  			continue;
>  
> @@ -1507,6 +1513,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>  	if (blk_mq_request_started(rq)) {
>  		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>  		rq->rq_flags &= ~RQF_TIMED_OUT;
> +		blk_zone_requeue_work(q);
>  	}
>  }
>  
> @@ -1542,6 +1549,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
>  	list_splice_init(&q->flush_list, &flush_list);
>  	spin_unlock_irq(&q->requeue_lock);
>  
> +	blk_zone_requeue_work(q);
> +
>  	while (!list_empty(&rq_list)) {
>  		rq = list_entry(rq_list.next, struct request, queuelist);
>  		/*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 7e6e6ebb6235..b570b773e65f 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>  	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
>  		return;
>  
> +	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> +	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;

Why move these before the comment ? Also, why set BLK_ZONE_WPLUG_PLUGGED ? It
should be set already since this is handling a failed write that was either
being prepared for submission or submitted (and completed) already. In both
cases, the wplug is plugged since we have a write in flight.

>  	/*
>  	 * At this point, we already have a reference on the zone write plug.
>  	 * However, since we are going to add the plug to the disk zone write
> @@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>  	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
>  	 * finished.
>  	 */
> -	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>  	refcount_inc(&zwplug->ref);
>  
>  	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> @@ -642,6 +643,7 @@ static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
>  	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>  	if (!list_empty(&zwplug->link)) {
>  		list_del_init(&zwplug->link);
> +		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
>  		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>  		disk_put_zone_wplug(zwplug);
>  	}
> @@ -746,6 +748,70 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
>  	return false;
>  }
>  
> +struct all_zwr_inserted_data {
> +	struct blk_zone_wplug *zwplug;
> +	bool res;
> +};
> +
> +/*
> + * Changes @data->res to %false if and only if @rq is a zoned write for the
> + * given zone and if it is owned by the block driver.

It would be nice to have a request flag or a state indicating that instead of
needing all this code... Can't that be done ?

> + *
> + * @rq members may change while this function is in progress. Hence, use
> + * READ_ONCE() to read @rq members.
> + */
> +static bool blk_zwr_inserted(struct request *rq, void *data)
> +{
> +	struct all_zwr_inserted_data *d = data;
> +	struct blk_zone_wplug *zwplug = d->zwplug;
> +	struct request_queue *q = zwplug->disk->queue;
> +	struct bio *bio = READ_ONCE(rq->bio);
> +
> +	if (rq->q == q && READ_ONCE(rq->state) != MQ_RQ_IDLE &&
> +	    blk_rq_is_seq_zoned_write(rq) && bio &&
> +	    bio_zone_no(bio) == zwplug->zone_no) {
> +		d->res = false;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * Report whether or not all zoned writes for a zone have been inserted into a
> + * software queue, elevator queue or hardware queue.
> + */
> +static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
> +{
> +	struct gendisk *disk = zwplug->disk;
> +	struct request_queue *q = disk->queue;
> +	struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
> +	struct blk_mq_hw_ctx *hctx;
> +	long unsigned int i;
> +	struct request *rq;
> +
> +	scoped_guard(spinlock_irqsave, &q->requeue_lock) {
> +		list_for_each_entry(rq, &q->requeue_list, queuelist)
> +			if (blk_rq_is_seq_zoned_write(rq) &&
> +			    bio_zone_no(rq->bio) == zwplug->zone_no)
> +				return false;
> +		list_for_each_entry(rq, &q->flush_list, queuelist)
> +			if (blk_rq_is_seq_zoned_write(rq) &&
> +			    bio_zone_no(rq->bio) == zwplug->zone_no)
> +				return false;
> +	}
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
> +
> +		blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
> +		if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
> +			break;
> +	}
> +
> +	return d.res;
> +}
> +
>  static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>  					  struct bio *bio, unsigned int nr_segs)
>  {
> @@ -1096,6 +1162,29 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
>  	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
>  }
>  
> +/*
> + * Change the zone state to "error" if a request is requeued to postpone
> + * processing of requeued requests until all pending requests have either
> + * completed or have been requeued.
> + */
> +void blk_zone_write_plug_requeue_request(struct request *rq)
> +{
> +	struct gendisk *disk = rq->q->disk;
> +	struct blk_zone_wplug *zwplug;
> +
> +	if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
> +		return;

I think the disk->zone_wplugs_hash_bits check needs to go inside
disk_get_zone_wplug() as that will avoid a similar check in
blk_zone_write_plug_free_request() too. That said, I am not even convinced it
is needed at all since these functions should be called only for a zoned drive
which should have its zone wplug hash setup.

> +
> +	zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
> +	if (WARN_ON_ONCE(!zwplug))
> +		return;
> +
> +	scoped_guard(spinlock_irqsave, &zwplug->lock)
> +		disk_zone_wplug_set_error(disk, zwplug);
> +
> +	disk_put_zone_wplug(zwplug);
> +}
> +
>  static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
>  				       struct blk_zone_wplug *zwplug)
>  {
> @@ -1202,6 +1291,33 @@ void blk_zone_write_plug_finish_request(struct request *req)
>  	disk_put_zone_wplug(zwplug);
>  }
>  
> +/*
> + * Schedule zone_plugs_work if a zone is in the error state and if no requests
> + * are in flight. Called from blk_mq_free_request().
> + */
> +void blk_zone_write_plug_free_request(struct request *rq)
> +{
> +	struct gendisk *disk = rq->q->disk;
> +	struct blk_zone_wplug *zwplug;
> +
> +	/*
> +	 * Do nothing if this function is called before the zone information
> +	 * has been initialized.
> +	 */
> +	if (!disk->zone_wplugs_hash_bits)
> +		return;
> +
> +	zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
> +

Useless blank line here.

> +	if (!zwplug)
> +		return;
> +
> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
> +		kblockd_schedule_work(&disk->zone_wplugs_work);

I think this needs to be done under the zone wplug spinlock ?

> +> +	disk_put_zone_wplug(zwplug);
> +}
> +
>  static void blk_zone_wplug_bio_work(struct work_struct *work)
>  {
>  	struct blk_zone_wplug *zwplug =
> @@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>  
>  static void disk_zone_process_err_list(struct gendisk *disk)
>  {
> -	struct blk_zone_wplug *zwplug;
> +	struct blk_zone_wplug *zwplug, *next;
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>  
> -	while (!list_empty(&disk->zone_wplugs_err_list)) {
> -		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
> -					  struct blk_zone_wplug, link);
> +	list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
> +				 link) {

You are holding the disk zwplug spinlock, so why use the _safe() loop ? Not
needed, right ?

> +		if (!blk_zone_all_zwr_inserted(zwplug))
> +			continue;
>  		list_del_init(&zwplug->link);
>  		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>  
> @@ -1361,6 +1478,12 @@ static void disk_zone_process_err_list(struct gendisk *disk)
>  	}
>  
>  	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +
> +	/*
> +	 * If one or more zones have been skipped, this work will be requeued
> +	 * when a request is requeued (blk_zone_requeue_work()) or freed
> +	 * (blk_zone_write_plug_free_request()).
> +	 */
>  }
>  
>  static void disk_zone_wplugs_work(struct work_struct *work)
> @@ -1371,6 +1494,20 @@ static void disk_zone_wplugs_work(struct work_struct *work)
>  	disk_zone_process_err_list(disk);
>  }
>  
> +/* May be called from interrupt context and hence must not sleep. */
> +void blk_zone_requeue_work(struct request_queue *q)
> +{
> +	struct gendisk *disk = q->disk;
> +
> +	if (!disk)
> +		return;

Can this happen ?

> +
> +	if (in_interrupt())
> +		kblockd_schedule_work(&disk->zone_wplugs_work);
> +	else
> +		disk_zone_process_err_list(disk);
> +}
> +
>  static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
>  {
>  	return 1U << disk->zone_wplugs_hash_bits;
> @@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
>  	zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
>  	spin_unlock_irqrestore(&zwplug->lock, flags);
>  
> -	seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
> -		   zwp_wp_offset, zwp_bio_list_size);
> +	bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);

Is this bool really needed ? If it is, shouldn't it be assigned while holding
the zwplug lock to have a "snapshot" of the plug with all printed values
consistent ?

> +
> +	seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
> +		   zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
> +		   zwp_bio_list_size, all_zwr_inserted);
>  }
>  
>  int queue_zone_wplugs_show(void *data, struct seq_file *m)
> diff --git a/block/blk.h b/block/blk.h
> index 2c26abf505b8..be945db6298d 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -473,6 +473,18 @@ static inline void blk_zone_update_request_bio(struct request *rq,
>  	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
>  		bio->bi_iter.bi_sector = rq->__sector;
>  }
> +
> +void blk_zone_write_plug_requeue_request(struct request *rq);
> +static inline void blk_zone_requeue_request(struct request *rq)
> +{
> +	if (!blk_rq_is_seq_zoned_write(rq))
> +		return;
> +
> +	blk_zone_write_plug_requeue_request(rq);

May be:

	if (blk_rq_is_seq_zoned_write(rq))
		blk_zone_write_plug_requeue_request(rq);

?

> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index c596e0e4cb75..ac05974f08f9 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1169,4 +1169,22 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
>  }
>  void blk_dump_rq_flags(struct request *, char *);
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline bool blk_rq_is_seq_zoned_write(struct request *rq)

bdev_zone_is_seq() is already stubbed for the !CONFIG_BLK_DEV_ZONED case, so I
do not think this function needs the ifdef. It will compile either way and will
never be called from anywhere in the !CONFIG_BLK_DEV_ZONED case.

> +{
> +	switch (req_op(rq)) {
> +	case REQ_OP_WRITE:
> +	case REQ_OP_WRITE_ZEROES:
> +		return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
> +	default:
> +		return false;
> +	}
> +}
> +#else /* CONFIG_BLK_DEV_ZONED */
> +static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_BLK_DEV_ZONED */
> +
>  #endif /* BLK_MQ_H */


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
  2024-11-19  2:50   ` Damien Le Moal
@ 2024-11-19 20:51     ` Bart Van Assche
  2024-11-21  3:23       ` Damien Le Moal
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2024-11-19 20:51 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim

On 11/18/24 6:50 PM, Damien Le Moal wrote:
> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>> Instead of handling write errors immediately, only handle these after all
>> pending zoned write requests have completed or have been requeued. This
>> patch prepares for changing the zone write pointer tracking approach.
> 
> A little more explanations about how this is achieved would be nice. I was
> expecting a shorter change given the short commit message... Took some time to
> understand the changes without details.

Hi Damien,

I will make the patch description more detailed.

>> @@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>>   	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
>>   		return;
>>   
>> +	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
>> +	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> 
> Why move these before the comment ? Also, why set BLK_ZONE_WPLUG_PLUGGED ? It
> should be set already since this is handling a failed write that was either
> being prepared for submission or submitted (and completed) already. In both
> cases, the wplug is plugged since we have a write in flight.
> 
>>   	/*
>>   	 * At this point, we already have a reference on the zone write plug.
>>   	 * However, since we are going to add the plug to the disk zone write
>> @@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>>   	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
>>   	 * finished.
>>   	 */
>> -	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>>   	refcount_inc(&zwplug->ref);

Does the comment block refer to the refcount_inc() call only or to both
the flags changes and the refcount_inc() call? I'm assuming the former.
That's why I moved the "zwplug->flags |= BLK_ZONE_WPLUG_ERROR;"
statement. Did I perhaps misunderstand something?

>> +/*
>> + * Changes @data->res to %false if and only if @rq is a zoned write for the
>> + * given zone and if it is owned by the block driver.
> 
> It would be nice to have a request flag or a state indicating that instead of
> needing all this code... Can't that be done ?

rq->state and rq->bio are modified independently and are independent of
whether or not blk_rq_is_seq_zoned_write() evaluates to true. I'm
concerned that introducing the proposed flag would result in numerous
changes in the block layer and hence would make the block layer harder
to maintain.

>> +/*
>> + * Change the zone state to "error" if a request is requeued to postpone
>> + * processing of requeued requests until all pending requests have either
>> + * completed or have been requeued.
>> + */
>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>> +{
>> +	struct gendisk *disk = rq->q->disk;
>> +	struct blk_zone_wplug *zwplug;
>> +
>> +	if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>> +		return;
> 
> I think the disk->zone_wplugs_hash_bits check needs to go inside
> disk_get_zone_wplug() as that will avoid a similar check in
> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
> is needed at all since these functions should be called only for a zoned drive
> which should have its zone wplug hash setup.

Moving the disk->zone_wplugs_hash_bits check sounds good to me.

I added this check after hitting an UBSAN report that indicates that
disk->zone_wplugs_hash_bits was used before it was changed into a non-
zero value. sd_revalidate_disk() submits a very substantial number of
SCSI commands before it calls blk_revalidate_disk_zones(), the function
that sets disk->zone_wplugs_hash_bits.

>> @@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>>   
>>   static void disk_zone_process_err_list(struct gendisk *disk)
>>   {
>> -	struct blk_zone_wplug *zwplug;
>> +	struct blk_zone_wplug *zwplug, *next;
>>   	unsigned long flags;
>>   
>>   	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>>   
>> -	while (!list_empty(&disk->zone_wplugs_err_list)) {
>> -		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
>> -					  struct blk_zone_wplug, link);
>> +	list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
>> +				 link) {
> 
> You are holding the disk zwplug spinlock, so why use the _safe() loop ? Not
> needed, right ?
> 
>> +		if (!blk_zone_all_zwr_inserted(zwplug))
>> +			continue;
>>   		list_del_init(&zwplug->link);
>>   		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);

The _safe suffix in list_for_each_entry_safe() is not related to locking
  - it means that it is allowed to delete the loop entry 'zwplug' from
the list.

I think the _safe variant is needed because of the list_del_init() call.

>> +/* May be called from interrupt context and hence must not sleep. */
>> +void blk_zone_requeue_work(struct request_queue *q)
>> +{
>> +	struct gendisk *disk = q->disk;
>> +
>> +	if (!disk)
>> +		return;
> 
> Can this happen ?

The code in scsi_scan.c executes SCSI commands like INQUIRY before
sd_probe() attaches a disk (device_add_disk()) so I think that any code
inserted into blk_mq_requeue_work() must support q->disk == NULL.

>> @@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
>>   	zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
>>   	spin_unlock_irqrestore(&zwplug->lock, flags);
>>   
>> -	seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
>> -		   zwp_wp_offset, zwp_bio_list_size);
>> +	bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
> 
> Is this bool really needed ? If it is, shouldn't it be assigned while holding
> the zwplug lock to have a "snapshot" of the plug with all printed values
> consistent ?

Since blk_zone_all_zwr_inserted() checks information that is not
related to the zwplug state (e.g. the requeue list), I don't think that
the zwplug lock should be held around this blk_zone_all_zwr_inserted()
call.

I can keep this change as a local debugging patch if you would prefer
this.

>> +void blk_zone_write_plug_requeue_request(struct request *rq);
>> +static inline void blk_zone_requeue_request(struct request *rq)
>> +{
>> +	if (!blk_rq_is_seq_zoned_write(rq))
>> +		return;
>> +
>> +	blk_zone_write_plug_requeue_request(rq);
> 
> May be:
> 
> 	if (blk_rq_is_seq_zoned_write(rq))
> 		blk_zone_write_plug_requeue_request(rq);
> 
> ?

I can make this change - I don't have a strong opinion about which style
to prefer.
  Thanks,

Bart.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
  2024-11-19 20:51     ` Bart Van Assche
@ 2024-11-21  3:23       ` Damien Le Moal
  2024-11-21 17:43         ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Damien Le Moal @ 2024-11-21  3:23 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim

On 11/20/24 05:51, Bart Van Assche wrote:
>>> +/*
>>> + * Change the zone state to "error" if a request is requeued to postpone
>>> + * processing of requeued requests until all pending requests have either
>>> + * completed or have been requeued.
>>> + */
>>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>>> +{
>>> +	struct gendisk *disk = rq->q->disk;
>>> +	struct blk_zone_wplug *zwplug;
>>> +
>>> +	if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>>> +		return;
>>
>> I think the disk->zone_wplugs_hash_bits check needs to go inside
>> disk_get_zone_wplug() as that will avoid a similar check in
>> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
>> is needed at all since these functions should be called only for a zoned drive
>> which should have its zone wplug hash setup.
> 
> Moving the disk->zone_wplugs_hash_bits check sounds good to me.
> 
> I added this check after hitting an UBSAN report that indicates that
> disk->zone_wplugs_hash_bits was used before it was changed into a non-
> zero value. sd_revalidate_disk() submits a very substantial number of
> SCSI commands before it calls blk_revalidate_disk_zones(), the function
> that sets disk->zone_wplugs_hash_bits.

But non of the commands are writes to sequential zones, so the hash bits check
should not even be necessary, no ?

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
  2024-11-21  3:23       ` Damien Le Moal
@ 2024-11-21 17:43         ` Bart Van Assche
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2024-11-21 17:43 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim

On 11/20/24 7:23 PM, Damien Le Moal wrote:
> On 11/20/24 05:51, Bart Van Assche wrote:
>>>> +/*
>>>> + * Change the zone state to "error" if a request is requeued to postpone
>>>> + * processing of requeued requests until all pending requests have either
>>>> + * completed or have been requeued.
>>>> + */
>>>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>>>> +{
>>>> +	struct gendisk *disk = rq->q->disk;
>>>> +	struct blk_zone_wplug *zwplug;
>>>> +
>>>> +	if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>>>> +		return;
>>>
>>> I think the disk->zone_wplugs_hash_bits check needs to go inside
>>> disk_get_zone_wplug() as that will avoid a similar check in
>>> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
>>> is needed at all since these functions should be called only for a zoned drive
>>> which should have its zone wplug hash setup.
>>
>> Moving the disk->zone_wplugs_hash_bits check sounds good to me.
>>
>> I added this check after hitting an UBSAN report that indicates that
>> disk->zone_wplugs_hash_bits was used before it was changed into a non-
>> zero value. sd_revalidate_disk() submits a very substantial number of
>> SCSI commands before it calls blk_revalidate_disk_zones(), the function
>> that sets disk->zone_wplugs_hash_bits.
> 
> But non of the commands are writes to sequential zones, so the hash bits check
> should not even be necessary, no ?

Hi Damien,

I will double check whether this test is really necessary and leave it
out if not.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-11-24 15:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-24 15:33 [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed kernel test robot
  -- strict thread matches above, loose matches on Subject: below --
2024-11-19  0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
2024-11-19  0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
2024-11-19  2:50   ` Damien Le Moal
2024-11-19 20:51     ` Bart Van Assche
2024-11-21  3:23       ` Damien Le Moal
2024-11-21 17:43         ` Bart Van Assche

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.