* [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
2023-07-26 8:37 ` Damien Le Moal
2023-07-26 0:57 ` [PATCH v3 2/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
` (4 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Ming Lei
Not all software that supports zoned storage allocates and submits zoned
writes in LBA order per zone. Introduce the REQ_NO_WRITE_LOCK flag such
that submitters of zoned writes can indicate that zoned writes are
allocated and submitted in LBA order per zone.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-flush.c | 3 ++-
include/linux/blk_types.h | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e73dc22d05c1..038350ed7cce 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -336,7 +336,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
flush_rq->internal_tag = first_rq->internal_tag;
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
- flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
+ flush_rq->cmd_flags |= flags &
+ (REQ_FAILFAST_MASK | REQ_NO_ZONE_WRITE_LOCK | REQ_DRV);
flush_rq->rq_flags |= RQF_FLUSH_SEQ;
flush_rq->end_io = flush_end_io;
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0bad62cca3d0..68f7934d8fa6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -416,6 +416,12 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD, /* read ahead, can fail anytime */
__REQ_BACKGROUND, /* background IO */
+ /*
+ * Do not use zone write locking. Setting this flag is only safe if
+ * the request submitter allocates and submit requests in LBA order per
+ * zone.
+ */
+ __REQ_NO_ZONE_WRITE_LOCK,
__REQ_NOWAIT, /* Don't wait if request will block */
__REQ_POLLED, /* caller polls for completion using bio_poll */
__REQ_ALLOC_CACHE, /* allocate IO from cache if available */
@@ -448,6 +454,8 @@ enum req_flag_bits {
#define REQ_PREFLUSH (__force blk_opf_t)(1ULL << __REQ_PREFLUSH)
#define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
+#define REQ_NO_ZONE_WRITE_LOCK \
+ (__force blk_opf_t)(1ULL << __REQ_NO_ZONE_WRITE_LOCK)
#define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT)
#define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED)
#define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK
2023-07-26 0:57 ` [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK Bart Van Assche
@ 2023-07-26 8:37 ` Damien Le Moal
2023-07-26 14:58 ` Bart Van Assche
0 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2023-07-26 8:37 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Ming Lei
On 7/26/23 09:57, Bart Van Assche wrote:
> Not all software that supports zoned storage allocates and submits zoned
> writes in LBA order per zone. Introduce the REQ_NO_WRITE_LOCK flag such
> that submitters of zoned writes can indicate that zoned writes are
> allocated and submitted in LBA order per zone.
That is absolutely NOT true. *ALL* in-kernel and application software supporting
zoned block devices issue writes sequentially. Otherwise, they are considered
buggy. The justification for zone write locking is to preserve sequential write
streams in the presence of storage adapters that do not preserve command orders,
or if the low level driver requeue write commands. If both of these conditions
are false (e.g. UFS), with the help of proper scsi requeue handling, zone write
locking can be disabled. REQ_NO_WRITE_LOCK indicates that a write request does
not need zone write locking when the underlying device signaled that it can
preserve command ordering. So we also need a queue flag for that...
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-flush.c | 3 ++-
> include/linux/blk_types.h | 8 ++++++++
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e73dc22d05c1..038350ed7cce 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -336,7 +336,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> flush_rq->internal_tag = first_rq->internal_tag;
>
> flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
> - flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
> + flush_rq->cmd_flags |= flags &
> + (REQ_FAILFAST_MASK | REQ_NO_ZONE_WRITE_LOCK | REQ_DRV);
Why ? I do not see why this change is necessary. If it is, the commit message
should mention it and this change should probably be its own patch.
> flush_rq->rq_flags |= RQF_FLUSH_SEQ;
> flush_rq->end_io = flush_end_io;
> /*
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 0bad62cca3d0..68f7934d8fa6 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -416,6 +416,12 @@ enum req_flag_bits {
> __REQ_PREFLUSH, /* request for cache flush */
> __REQ_RAHEAD, /* read ahead, can fail anytime */
> __REQ_BACKGROUND, /* background IO */
> + /*
> + * Do not use zone write locking. Setting this flag is only safe if
> + * the request submitter allocates and submit requests in LBA order per
> + * zone.
...if the request submitter submits write requests sequentially per zone.
> + */
> + __REQ_NO_ZONE_WRITE_LOCK,
> __REQ_NOWAIT, /* Don't wait if request will block */
> __REQ_POLLED, /* caller polls for completion using bio_poll */
> __REQ_ALLOC_CACHE, /* allocate IO from cache if available */
> @@ -448,6 +454,8 @@ enum req_flag_bits {
> #define REQ_PREFLUSH (__force blk_opf_t)(1ULL << __REQ_PREFLUSH)
> #define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD)
> #define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
> +#define REQ_NO_ZONE_WRITE_LOCK \
> + (__force blk_opf_t)(1ULL << __REQ_NO_ZONE_WRITE_LOCK)
> #define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT)
> #define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED)
> #define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK
2023-07-26 8:37 ` Damien Le Moal
@ 2023-07-26 14:58 ` Bart Van Assche
0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 14:58 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Ming Lei
On 7/26/23 01:37, Damien Le Moal wrote:
> REQ_NO_WRITE_LOCK indicates that a write request does
> not need zone write locking when the underlying device signaled that it can
> preserve command ordering. So we also need a queue flag for that...
My apologies: I accidentally left out the patch "block: Introduce the
flag QUEUE_FLAG_NO_ZONE_WRITE_LOCK" from this series. I will repost this
series.
Bart.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 2/6] block/mq-deadline: Only use zone locking if necessary
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
2023-07-26 0:57 ` [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
2023-07-26 8:41 ` Damien Le Moal
2023-07-26 0:57 ` [PATCH v3 3/6] block/null_blk: Support disabling zone write locking Bart Van Assche
` (3 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Ming Lei
Measurements have shown that limiting the queue depth to one for zoned
writes has a significant negative performance impact on zoned UFS devices.
Hence this patch that disables zone locking by the mq-deadline scheduler
if zoned writes are submitted in order and if the storage controller
preserves the command order. This patch is based on the following
assumptions:
- It happens infrequently that zoned write requests are reordered by the
block layer.
- The I/O priority of all write requests is the same per zone.
- Either no I/O scheduler is used or an I/O scheduler is used that
submits write requests per zone in LBA order.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 02a916ba62ee..ce5b5048935e 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -338,6 +338,18 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
return rq;
}
+/*
+ * Use write locking if either QUEUE_FLAG_NO_ZONE_WRITE_LOCK or
+ * REQ_NO_ZONE_WRITE_LOCK has not been set. Not using zone write locking is
+ * only safe if the submitter allocates and submit requests in LBA order per
+ * zone and if the block driver preserves the request order.
+ */
+static bool dd_use_write_locking(struct request *rq)
+{
+ return !blk_queue_no_zone_write_lock(rq->q) ||
+ !(rq->cmd_flags & REQ_NO_ZONE_WRITE_LOCK);
+}
+
/*
* For the specified data direction, return the next request to
* dispatch using arrival ordered lists.
@@ -353,7 +365,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
return NULL;
rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
- if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+ if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+ !dd_use_write_locking(rq))
return rq;
/*
@@ -398,7 +411,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
if (!rq)
return NULL;
- if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+ if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+ !dd_use_write_locking(rq))
return rq;
/*
@@ -526,8 +540,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
}
/*
- * For a zoned block device, if we only have writes queued and none of
- * them can be dispatched, rq will be NULL.
+ * For a zoned block device that requires write serialization, if we
+ * only have writes queued and none of them can be dispatched, rq will
+ * be NULL.
*/
if (!rq)
return NULL;
@@ -552,7 +567,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
/*
* If the request needs its target zone locked, do it.
*/
- blk_req_zone_write_lock(rq);
+ if (dd_use_write_locking(rq))
+ blk_req_zone_write_lock(rq);
rq->rq_flags |= RQF_STARTED;
return rq;
}
@@ -933,7 +949,7 @@ static void dd_finish_request(struct request *rq)
atomic_inc(&per_prio->stats.completed);
- if (blk_queue_is_zoned(q)) {
+ if (blk_queue_is_zoned(rq->q) && dd_use_write_locking(rq)) {
unsigned long flags;
spin_lock_irqsave(&dd->zone_lock, flags);
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 2/6] block/mq-deadline: Only use zone locking if necessary
2023-07-26 0:57 ` [PATCH v3 2/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-07-26 8:41 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2023-07-26 8:41 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Ming Lei
On 7/26/23 09:57, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one for zoned
> writes has a significant negative performance impact on zoned UFS devices.
> Hence this patch that disables zone locking by the mq-deadline scheduler
> if zoned writes are submitted in order and if the storage controller
> preserves the command order. This patch is based on the following
> assumptions:
if zoned writes are submitted in order and if the storage... -> if the storage
controller preserves...
> - It happens infrequently that zoned write requests are reordered by the
> block layer.
> - The I/O priority of all write requests is the same per zone.
> - Either no I/O scheduler is used or an I/O scheduler is used that
> submits write requests per zone in LBA order.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Please use dlemoal@kernel.org
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 28 ++++++++++++++++++++++------
> 1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 02a916ba62ee..ce5b5048935e 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -338,6 +338,18 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
> return rq;
> }
>
> +/*
> + * Use write locking if either QUEUE_FLAG_NO_ZONE_WRITE_LOCK or
> + * REQ_NO_ZONE_WRITE_LOCK has not been set. Not using zone write locking is
> + * only safe if the submitter allocates and submit requests in LBA order per
> + * zone and if the block driver preserves the request order.
> + */
> +static bool dd_use_write_locking(struct request *rq)
> +{
> + return !blk_queue_no_zone_write_lock(rq->q) ||
> + !(rq->cmd_flags & REQ_NO_ZONE_WRITE_LOCK);
> +}
> +
> /*
> * For the specified data direction, return the next request to
> * dispatch using arrival ordered lists.
> @@ -353,7 +365,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> return NULL;
>
> rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> - if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> + if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> + !dd_use_write_locking(rq))
The test for !blk_queue_is_zoned(rq->q) can be moved inside
dd_use_write_locking() I think. That will avoid repeating it again below.
> return rq;
>
> /*
> @@ -398,7 +411,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> if (!rq)
> return NULL;
>
> - if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> + if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> + !dd_use_write_locking(rq))
> return rq;
>
> /*
> @@ -526,8 +540,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> }
>
> /*
> - * For a zoned block device, if we only have writes queued and none of
> - * them can be dispatched, rq will be NULL.
> + * For a zoned block device that requires write serialization, if we
> + * only have writes queued and none of them can be dispatched, rq will
> + * be NULL.
> */
> if (!rq)
> return NULL;
> @@ -552,7 +567,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> /*
> * If the request needs its target zone locked, do it.
> */
> - blk_req_zone_write_lock(rq);
> + if (dd_use_write_locking(rq))
> + blk_req_zone_write_lock(rq);
> rq->rq_flags |= RQF_STARTED;
> return rq;
> }
> @@ -933,7 +949,7 @@ static void dd_finish_request(struct request *rq)
>
> atomic_inc(&per_prio->stats.completed);
>
> - if (blk_queue_is_zoned(q)) {
> + if (blk_queue_is_zoned(rq->q) && dd_use_write_locking(rq)) {
> unsigned long flags;
>
> spin_lock_irqsave(&dd->zone_lock, flags);
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 3/6] block/null_blk: Support disabling zone write locking
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
2023-07-26 0:57 ` [PATCH v3 1/6] block: Introduce the flag REQ_NO_WRITE_LOCK Bart Van Assche
2023-07-26 0:57 ` [PATCH v3 2/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
2023-07-26 8:43 ` Damien Le Moal
2023-07-26 0:57 ` [PATCH v3 4/6] scsi: Retry unaligned zoned writes Bart Van Assche
` (2 subsequent siblings)
5 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Ming Lei, Chaitanya Kulkarni, Damien Le Moal,
Nitesh Shetty, Vincent Fu, Dan Carpenter, Akinobu Mita,
Shin'ichiro Kawasaki, Martin K. Petersen, Johannes Thumshirn
Add a new configfs attribute for disabling zone write locking. The test
script below reports 250 K IOPS with no I/O scheduler, 6 K IOPS with
mq-deadline and write locking enabled and 123 K IOPS with mq-deadline
and write locking disabled. This shows that disabling write locking
results in about 20 times more IOPS for this particular test case.
#!/bin/bash
for mode in "none 0" "mq-deadline 0" "mq-deadline 1"; do
set +e
for d in /sys/kernel/config/nullb/*; do
[ -d "$d" ] && rmdir "$d"
done
modprobe -r null_blk
set -e
read -r iosched no_write_locking <<<"$mode"
modprobe null_blk nr_devices=0
(
cd /sys/kernel/config/nullb
mkdir nullb0
cd nullb0
params=(
completion_nsec=100000
hw_queue_depth=64
irqmode=2 # NULL_IRQ_TIMER
max_sectors=$((4096/512))
memory_backed=1
no_zone_write_lock="${no_write_locking}"
size=1
submit_queues=1
zone_size=1
zoned=1
power=1
)
for p in "${params[@]}"; do
echo "${p//*=}" > "${p//=*}"
done
)
udevadm settle
dev=/dev/nullb0
[ -b "${dev}" ]
params=(
--direct=1
--filename="${dev}"
--iodepth=64
--iodepth_batch=16
--ioengine=io_uring
--ioscheduler="${iosched}"
--gtod_reduce=1
--hipri=0
--name=nullb0
--runtime=30
--rw=write
--time_based=1
--zonemode=zbd
)
fio "${params[@]}"
done
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/block/null_blk/main.c | 2 ++
drivers/block/null_blk/null_blk.h | 1 +
drivers/block/null_blk/zoned.c | 3 +++
3 files changed, 6 insertions(+)
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..5c0578137f51 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -424,6 +424,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(no_zone_write_lock, bool, NULL);
NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
NULLB_DEVICE_ATTR(no_sched, bool, NULL);
NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
@@ -569,6 +570,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_zone_max_active,
&nullb_device_attr_zone_readonly,
&nullb_device_attr_zone_offline,
+ &nullb_device_attr_no_zone_write_lock,
&nullb_device_attr_virt_boundary,
&nullb_device_attr_no_sched,
&nullb_device_attr_shared_tag_bitmap,
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 929f659dd255..b521096bcc3f 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -117,6 +117,7 @@ struct nullb_device {
bool memory_backed; /* if data is stored in memory */
bool discard; /* if support discard */
bool zoned; /* if device is zoned */
+ bool no_zone_write_lock;
bool virt_boundary; /* virtual boundary on/off for the device */
bool no_sched; /* no IO scheduler for the device */
bool shared_tag_bitmap; /* use hostwide shared tags */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55c5b48bc276..31c8364a63e9 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -96,6 +96,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
spin_lock_init(&dev->zone_res_lock);
+ if (dev->no_zone_write_lock)
+ blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
+
if (dev->zone_nr_conv >= dev->nr_zones) {
dev->zone_nr_conv = dev->nr_zones - 1;
pr_info("changed the number of conventional zones to %u",
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 3/6] block/null_blk: Support disabling zone write locking
2023-07-26 0:57 ` [PATCH v3 3/6] block/null_blk: Support disabling zone write locking Bart Van Assche
@ 2023-07-26 8:43 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2023-07-26 8:43 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
Ming Lei, Chaitanya Kulkarni, Nitesh Shetty, Vincent Fu,
Dan Carpenter, Akinobu Mita, Shin'ichiro Kawasaki,
Martin K. Petersen, Johannes Thumshirn
On 7/26/23 09:57, Bart Van Assche wrote:
> Add a new configfs attribute for disabling zone write locking. The test
> script below reports 250 K IOPS with no I/O scheduler, 6 K IOPS with
> mq-deadline and write locking enabled and 123 K IOPS with mq-deadline
> and write locking disabled. This shows that disabling write locking
> results in about 20 times more IOPS for this particular test case.
>
> #!/bin/bash
>
> for mode in "none 0" "mq-deadline 0" "mq-deadline 1"; do
> set +e
> for d in /sys/kernel/config/nullb/*; do
> [ -d "$d" ] && rmdir "$d"
> done
> modprobe -r null_blk
> set -e
> read -r iosched no_write_locking <<<"$mode"
> modprobe null_blk nr_devices=0
> (
> cd /sys/kernel/config/nullb
> mkdir nullb0
> cd nullb0
> params=(
> completion_nsec=100000
> hw_queue_depth=64
> irqmode=2 # NULL_IRQ_TIMER
> max_sectors=$((4096/512))
> memory_backed=1
> no_zone_write_lock="${no_write_locking}"
> size=1
> submit_queues=1
> zone_size=1
> zoned=1
> power=1
> )
> for p in "${params[@]}"; do
> echo "${p//*=}" > "${p//=*}"
> done
> )
> udevadm settle
> dev=/dev/nullb0
> [ -b "${dev}" ]
> params=(
> --direct=1
> --filename="${dev}"
> --iodepth=64
> --iodepth_batch=16
> --ioengine=io_uring
> --ioscheduler="${iosched}"
> --gtod_reduce=1
> --hipri=0
> --name=nullb0
> --runtime=30
> --rw=write
> --time_based=1
> --zonemode=zbd
> )
> fio "${params[@]}"
> done
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/block/null_blk/main.c | 2 ++
> drivers/block/null_blk/null_blk.h | 1 +
> drivers/block/null_blk/zoned.c | 3 +++
> 3 files changed, 6 insertions(+)
>
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index 864013019d6b..5c0578137f51 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -424,6 +424,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
> NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
> NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
> NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
> +NULLB_DEVICE_ATTR(no_zone_write_lock, bool, NULL);
> NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
> NULLB_DEVICE_ATTR(no_sched, bool, NULL);
> NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
> @@ -569,6 +570,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
> &nullb_device_attr_zone_max_active,
> &nullb_device_attr_zone_readonly,
> &nullb_device_attr_zone_offline,
> + &nullb_device_attr_no_zone_write_lock,
> &nullb_device_attr_virt_boundary,
> &nullb_device_attr_no_sched,
> &nullb_device_attr_shared_tag_bitmap,
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 929f659dd255..b521096bcc3f 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -117,6 +117,7 @@ struct nullb_device {
> bool memory_backed; /* if data is stored in memory */
> bool discard; /* if support discard */
> bool zoned; /* if device is zoned */
> + bool no_zone_write_lock;
> bool virt_boundary; /* virtual boundary on/off for the device */
> bool no_sched; /* no IO scheduler for the device */
> bool shared_tag_bitmap; /* use hostwide shared tags */
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 55c5b48bc276..31c8364a63e9 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -96,6 +96,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>
> spin_lock_init(&dev->zone_res_lock);
>
> + if (dev->no_zone_write_lock)
> + blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
This patch look OK, but there is no patch introducing
QUEUE_FLAG_NO_ZONE_WRITE_LOCK... Patch 1 only introduced REQ_NO_ZONE_WRITE_LOCK.
So the series as is will not compile.
> +
> if (dev->zone_nr_conv >= dev->nr_zones) {
> dev->zone_nr_conv = dev->nr_zones - 1;
> pr_info("changed the number of conventional zones to %u",
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 4/6] scsi: Retry unaligned zoned writes
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
` (2 preceding siblings ...)
2023-07-26 0:57 ` [PATCH v3 3/6] block/null_blk: Support disabling zone write locking Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
2023-07-26 8:47 ` Damien Le Moal
2023-07-26 0:57 ` [PATCH v3 5/6] scsi: ufs: Disable zone write locking Bart Van Assche
2023-07-26 0:57 ` [PATCH v3 6/6] fs/f2fs: " Bart Van Assche
5 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Martin K . Petersen, Damien Le Moal, Ming Lei,
James E.J. Bottomley
From ZBC-2: "The device server terminates with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code set to
UNALIGNED WRITE COMMAND a write command, other than an entire medium write
same command, that specifies: a) the starting LBA in a sequential write
required zone set to a value that is not equal to the write pointer for
that sequential write required zone; or b) an ending LBA that is not equal
to the last logical block within a physical block (see SBC-5)."
Send commands that failed with an unaligned write error to the SCSI error
handler. Let the SCSI error handler sort SCSI commands per LBA before
resubmitting these.
Increase the number of retries for write commands sent to a sequential
zone to the maximum number of outstanding commands.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_lib.c | 1 +
drivers/scsi/sd.c | 3 +++
include/scsi/scsi.h | 1 +
4 files changed, 42 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..2b9aec05dc36 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -27,6 +27,7 @@
#include <linux/blkdev.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
+#include <linux/list_sort.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -698,6 +699,17 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This indicates that zoned writes
+ * have been received by the device in the wrong order. If zone
+ * write locking is disabled, retry after all pending commands
+ * have completed.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ blk_queue_no_zone_write_lock(sdev->request_queue) &&
+ !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd))
+ return NEEDS_DELAYED_RETRY;
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
@@ -2223,6 +2235,25 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
}
EXPORT_SYMBOL(scsi_eh_flush_done_q);
+/*
+ * Returns a negative value if @_a has a lower LBA than @_b, zero if
+ * both have the same LBA and a positive value otherwise.
+ */
+static int scsi_cmp_lba(void *priv, const struct list_head *_a,
+ const struct list_head *_b)
+{
+ struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+ struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+ const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a));
+ const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b));
+
+ if (pos_a < pos_b)
+ return -1;
+ if (pos_a > pos_b)
+ return 1;
+ return 0;
+}
+
/**
* scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
* @shost: Host to unjam.
@@ -2258,6 +2289,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
+ /*
+ * Sort pending SCSI commands in LBA order. This is important if zone
+ * write locking is disabled for a zoned SCSI device.
+ */
+ list_sort(NULL, &eh_work_q, scsi_cmp_lba);
+
if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 59176946ab56..69da8aee13df 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1443,6 +1443,7 @@ static void scsi_complete(struct request *rq)
case ADD_TO_MLQUEUE:
scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
break;
+ case NEEDS_DELAYED_RETRY:
default:
scsi_eh_scmd_add(cmd);
break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 68b12afa0721..27b9ebe05b90 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1235,6 +1235,9 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ if (blk_queue_no_zone_write_lock(rq->q) &&
+ blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index ec093594ba53..6600db046227 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status)
* Internal return values.
*/
enum scsi_disposition {
+ NEEDS_DELAYED_RETRY = 0x2000,
NEEDS_RETRY = 0x2001,
SUCCESS = 0x2002,
FAILED = 0x2003,
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 4/6] scsi: Retry unaligned zoned writes
2023-07-26 0:57 ` [PATCH v3 4/6] scsi: Retry unaligned zoned writes Bart Van Assche
@ 2023-07-26 8:47 ` Damien Le Moal
2023-07-26 15:02 ` Bart Van Assche
0 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2023-07-26 8:47 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Martin K . Petersen,
Damien Le Moal, Ming Lei, James E.J. Bottomley
On 7/26/23 09:57, Bart Van Assche wrote:
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for
> that sequential write required zone; or b) an ending LBA that is not equal
> to the last logical block within a physical block (see SBC-5)."
>
> Send commands that failed with an unaligned write error to the SCSI error
> handler. Let the SCSI error handler sort SCSI commands per LBA before
> resubmitting these.
>
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++
> drivers/scsi/scsi_lib.c | 1 +
> drivers/scsi/sd.c | 3 +++
> include/scsi/scsi.h | 1 +
> 4 files changed, 42 insertions(+)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index c67cdcdc3ba8..2b9aec05dc36 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -27,6 +27,7 @@
> #include <linux/blkdev.h>
> #include <linux/delay.h>
> #include <linux/jiffies.h>
> +#include <linux/list_sort.h>
>
> #include <scsi/scsi.h>
> #include <scsi/scsi_cmnd.h>
> @@ -698,6 +699,17 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
> fallthrough;
>
> case ILLEGAL_REQUEST:
> + /*
> + * Unaligned write command. This indicates that zoned writes
> + * have been received by the device in the wrong order. If zone
> + * write locking is disabled, retry after all pending commands
> + * have completed.
> + */
> + if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
> + blk_queue_no_zone_write_lock(sdev->request_queue) &&
> + !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd))
> + return NEEDS_DELAYED_RETRY;
> +
> if (sshdr.asc == 0x20 || /* Invalid command operation code */
> sshdr.asc == 0x21 || /* Logical block address out of range */
> sshdr.asc == 0x22 || /* Invalid function */
> @@ -2223,6 +2235,25 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
> }
> EXPORT_SYMBOL(scsi_eh_flush_done_q);
>
> +/*
> + * Returns a negative value if @_a has a lower LBA than @_b, zero if
> + * both have the same LBA and a positive value otherwise.
> + */
> +static int scsi_cmp_lba(void *priv, const struct list_head *_a,
> + const struct list_head *_b)
The argument priv is unused.
> +{
> + struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
> + struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
> + const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a));
> + const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b));
> +
> + if (pos_a < pos_b)
> + return -1;
> + if (pos_a > pos_b)
> + return 1;
> + return 0;
> +}
> +
> /**
> * scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
> * @shost: Host to unjam.
> @@ -2258,6 +2289,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
>
> SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
>
> + /*
> + * Sort pending SCSI commands in LBA order. This is important if zone
> + * write locking is disabled for a zoned SCSI device.
> + */
> + list_sort(NULL, &eh_work_q, scsi_cmp_lba);
Should we do this only for zoned devices ?
> +
> if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
> scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 59176946ab56..69da8aee13df 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1443,6 +1443,7 @@ static void scsi_complete(struct request *rq)
> case ADD_TO_MLQUEUE:
> scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> break;
> + case NEEDS_DELAYED_RETRY:
> default:
> scsi_eh_scmd_add(cmd);
> break;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 68b12afa0721..27b9ebe05b90 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1235,6 +1235,9 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
> cmd->transfersize = sdp->sector_size;
> cmd->underflow = nr_blocks << 9;
> cmd->allowed = sdkp->max_retries;
> + if (blk_queue_no_zone_write_lock(rq->q) &&
> + blk_rq_is_seq_zoned_write(rq))
> + cmd->allowed += rq->q->nr_requests;
Aouch... that could be a lot...
> cmd->sdb.length = nr_blocks * sdp->sector_size;
>
> SCSI_LOG_HLQUEUE(1,
> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
> index ec093594ba53..6600db046227 100644
> --- a/include/scsi/scsi.h
> +++ b/include/scsi/scsi.h
> @@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status)
> * Internal return values.
> */
> enum scsi_disposition {
> + NEEDS_DELAYED_RETRY = 0x2000,
> NEEDS_RETRY = 0x2001,
> SUCCESS = 0x2002,
> FAILED = 0x2003,
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v3 4/6] scsi: Retry unaligned zoned writes
2023-07-26 8:47 ` Damien Le Moal
@ 2023-07-26 15:02 ` Bart Van Assche
2023-07-26 23:46 ` Damien Le Moal
0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 15:02 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Martin K . Petersen,
Damien Le Moal, Ming Lei, James E.J. Bottomley
On 7/26/23 01:47, Damien Le Moal wrote:
> On 7/26/23 09:57, Bart Van Assche wrote:
>> +/*
>> + * Returns a negative value if @_a has a lower LBA than @_b, zero if
>> + * both have the same LBA and a positive value otherwise.
>> + */
>> +static int scsi_cmp_lba(void *priv, const struct list_head *_a,
>> + const struct list_head *_b)
>
> The argument priv is unused.
I cannot remove the 'priv' argument. From include/linux/list_sort.h:
typedef int __attribute__((nonnull(2,3))) (*list_cmp_func_t)(void *,
const struct list_head *, const struct list_head *);
__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp);
>> /**
>> * scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
>> * @shost: Host to unjam.
>> @@ -2258,6 +2289,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
>>
>> SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
>>
>> + /*
>> + * Sort pending SCSI commands in LBA order. This is important if zone
>> + * write locking is disabled for a zoned SCSI device.
>> + */
>> + list_sort(NULL, &eh_work_q, scsi_cmp_lba);
>
> Should we do this only for zoned devices ?
I'm not sure this is possible. Error handling happens per SCSI host.
Some of the logical units associated with a host may be zoned while
others may represent conventional storage or have no storage associated
with them (WLUN).
Thanks,
Bart.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 4/6] scsi: Retry unaligned zoned writes
2023-07-26 15:02 ` Bart Van Assche
@ 2023-07-26 23:46 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2023-07-26 23:46 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Martin K . Petersen,
Ming Lei, James E.J. Bottomley
On 7/27/23 00:02, Bart Van Assche wrote:
> On 7/26/23 01:47, Damien Le Moal wrote:
>> On 7/26/23 09:57, Bart Van Assche wrote:
>>> +/*
>>> + * Returns a negative value if @_a has a lower LBA than @_b, zero if
>>> + * both have the same LBA and a positive value otherwise.
>>> + */
>>> +static int scsi_cmp_lba(void *priv, const struct list_head *_a,
>>> + const struct list_head *_b)
>>
>> The argument priv is unused.
>
> I cannot remove the 'priv' argument. From include/linux/list_sort.h:
>
> typedef int __attribute__((nonnull(2,3))) (*list_cmp_func_t)(void *,
> const struct list_head *, const struct list_head *);
>
> __attribute__((nonnull(2,3)))
> void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp);
Yes. I missed that this is a callbalck. Sorry for the noise.
>
>>> /**
>>> * scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
>>> * @shost: Host to unjam.
>>> @@ -2258,6 +2289,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
>>> SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
>>> + /*
>>> + * Sort pending SCSI commands in LBA order. This is important if zone
>>> + * write locking is disabled for a zoned SCSI device.
>>> + */
>>> + list_sort(NULL, &eh_work_q, scsi_cmp_lba);
>>
>> Should we do this only for zoned devices ?
>
> I'm not sure this is possible. Error handling happens per SCSI host. Some of
> the logical units associated with a host may be zoned while others may
> represent conventional storage or have no storage associated with them (WLUN).
>
> Thanks,
>
> Bart.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 5/6] scsi: ufs: Disable zone write locking
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
` (3 preceding siblings ...)
2023-07-26 0:57 ` [PATCH v3 4/6] scsi: Retry unaligned zoned writes Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
2023-07-26 12:04 ` kernel test robot
2023-07-26 0:57 ` [PATCH v3 6/6] fs/f2fs: " Bart Van Assche
5 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Martin K . Petersen, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Can Guo, Asutosh Das,
Bao D. Nguyen, Bean Huo, Arthur Simchaev
From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
"The host controller always process transfer requests in-order according
to the order submitted to the list. In case of multiple commands with
single doorbell register ringing (batch mode), The dispatch order for
these transfer requests by host controller will base on their index in
the List. A transfer request with lower index value will be executed
before a transfer request with higher index value."
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, for both legacy and MCQ mode, UFS controllers are
required to forward commands to the UFS device in the order these
commands have been received from the host.
Notes:
- For legacy mode this is only correct if the host submits one
command at a time. The UFS driver does this.
- Also in legacy mode, the command order is not preserved if
auto-hibernation is enabled in the UFS controller.
This patch improves small write IOPS with a factor four on my test
setup.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Avri Altman <avri.altman@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 45 ++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 129446775796..0f7f91e2cda9 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -4337,29 +4337,67 @@ int ufshcd_uic_hibern8_exit(struct ufs_hba *hba)
}
EXPORT_SYMBOL_GPL(ufshcd_uic_hibern8_exit);
+void ufshcd_update_no_zone_write_lock(struct ufs_hba *hba,
+ bool set_no_zone_write_lock)
+{
+ struct scsi_device *sdev;
+
+ shost_for_each_device(sdev, hba->host)
+ blk_freeze_queue_start(sdev->request_queue);
+ shost_for_each_device(sdev, hba->host) {
+ struct request_queue *q = sdev->request_queue;
+
+ blk_mq_freeze_queue_wait(q);
+ if (set_no_zone_write_lock)
+ blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
+ blk_mq_unfreeze_queue(q);
+ }
+}
+
void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
{
unsigned long flags;
- bool update = false;
+ bool prev_state, new_state, update = false;
if (!ufshcd_is_auto_hibern8_supported(hba))
return;
spin_lock_irqsave(hba->host->host_lock, flags);
+ prev_state = ufshcd_is_auto_hibern8_enabled(hba);
if (hba->ahit != ahit) {
hba->ahit = ahit;
update = true;
}
+ new_state = ufshcd_is_auto_hibern8_enabled(hba);
spin_unlock_irqrestore(hba->host->host_lock, flags);
- if (update &&
- !pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
+ if (!update)
+ return;
+ if (!is_mcq_enabled(hba) && !prev_state && new_state) {
+ /*
+ * Auto-hibernation will be enabled. Enable write locking for
+ * zoned writes since auto-hibernation may cause reordering of
+ * zoned writes when using the legacy mode of the UFS host
+ * controller.
+ */
+ ufshcd_update_no_zone_write_lock(hba, false);
+ }
+ if (!pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
ufshcd_rpm_get_sync(hba);
ufshcd_hold(hba);
ufshcd_auto_hibern8_enable(hba);
ufshcd_release(hba);
ufshcd_rpm_put_sync(hba);
}
+ if (!is_mcq_enabled(hba) && prev_state && !new_state) {
+ /*
+ * Auto-hibernation has been disabled. Disable write locking
+ * for zoned writes.
+ */
+ ufshcd_update_no_zone_write_lock(hba, true);
+ }
}
EXPORT_SYMBOL_GPL(ufshcd_auto_hibern8_update);
@@ -5139,6 +5177,7 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
ufshcd_hpb_configure(hba, sdev);
+ blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
blk_queue_update_dma_alignment(q, SZ_4K - 1);
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 5/6] scsi: ufs: Disable zone write locking
2023-07-26 0:57 ` [PATCH v3 5/6] scsi: ufs: Disable zone write locking Bart Van Assche
@ 2023-07-26 12:04 ` kernel test robot
0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2023-07-26 12:04 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: llvm, oe-kbuild-all, linux-block, Christoph Hellwig, Jaegeuk Kim,
Bart Van Assche, Martin K . Petersen, Avri Altman, Damien Le Moal,
Ming Lei, James E.J. Bottomley, Stanley Chu, Can Guo, Asutosh Das,
Bao D. Nguyen, Bean Huo, Arthur Simchaev
Hi Bart,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on jejb-scsi/for-next linus/master v6.5-rc3]
[cannot apply to mkp-scsi/for-next next-20230726]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bart-Van-Assche/block-Introduce-the-flag-REQ_NO_WRITE_LOCK/20230726-085936
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230726005742.303865-6-bvanassche%40acm.org
patch subject: [PATCH v3 5/6] scsi: ufs: Disable zone write locking
config: s390-randconfig-r044-20230726 (https://download.01.org/0day-ci/archive/20230726/202307261940.mazVXKtc-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230726/202307261940.mazVXKtc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307261940.mazVXKtc-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from drivers/ufs/core/ufshcd.c:25:
In file included from include/scsi/scsi_cmnd.h:5:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
547 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:37:59: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
| ^
include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
| ^
In file included from drivers/ufs/core/ufshcd.c:25:
In file included from include/scsi/scsi_cmnd.h:5:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:35:59: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
| ^
include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
115 | #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
| ^
In file included from drivers/ufs/core/ufshcd.c:25:
In file included from include/scsi/scsi_cmnd.h:5:
In file included from include/linux/dma-mapping.h:11:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
584 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
692 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
700 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
708 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
717 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
726 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
735 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
>> drivers/ufs/core/ufshcd.c:4352:23: error: use of undeclared identifier 'QUEUE_FLAG_NO_ZONE_WRITE_LOCK'; did you mean '__REQ_NO_ZONE_WRITE_LOCK'?
4352 | blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| __REQ_NO_ZONE_WRITE_LOCK
include/linux/blk_types.h:424:2: note: '__REQ_NO_ZONE_WRITE_LOCK' declared here
424 | __REQ_NO_ZONE_WRITE_LOCK,
| ^
drivers/ufs/core/ufshcd.c:4354:25: error: use of undeclared identifier 'QUEUE_FLAG_NO_ZONE_WRITE_LOCK'; did you mean '__REQ_NO_ZONE_WRITE_LOCK'?
4354 | blk_queue_flag_clear(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| __REQ_NO_ZONE_WRITE_LOCK
include/linux/blk_types.h:424:2: note: '__REQ_NO_ZONE_WRITE_LOCK' declared here
424 | __REQ_NO_ZONE_WRITE_LOCK,
| ^
>> drivers/ufs/core/ufshcd.c:4340:6: warning: no previous prototype for function 'ufshcd_update_no_zone_write_lock' [-Wmissing-prototypes]
4340 | void ufshcd_update_no_zone_write_lock(struct ufs_hba *hba,
| ^
drivers/ufs/core/ufshcd.c:4340:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
4340 | void ufshcd_update_no_zone_write_lock(struct ufs_hba *hba,
| ^
| static
drivers/ufs/core/ufshcd.c:5180:21: error: use of undeclared identifier 'QUEUE_FLAG_NO_ZONE_WRITE_LOCK'; did you mean '__REQ_NO_ZONE_WRITE_LOCK'?
5180 | blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| __REQ_NO_ZONE_WRITE_LOCK
include/linux/blk_types.h:424:2: note: '__REQ_NO_ZONE_WRITE_LOCK' declared here
424 | __REQ_NO_ZONE_WRITE_LOCK,
| ^
drivers/ufs/core/ufshcd.c:10277:44: warning: shift count >= width of type [-Wshift-count-overflow]
10277 | if (!dma_set_mask_and_coherent(hba->dev, DMA_BIT_MASK(64)))
| ^~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
| ^ ~~~
14 warnings and 3 errors generated.
vim +4352 drivers/ufs/core/ufshcd.c
4339
> 4340 void ufshcd_update_no_zone_write_lock(struct ufs_hba *hba,
4341 bool set_no_zone_write_lock)
4342 {
4343 struct scsi_device *sdev;
4344
4345 shost_for_each_device(sdev, hba->host)
4346 blk_freeze_queue_start(sdev->request_queue);
4347 shost_for_each_device(sdev, hba->host) {
4348 struct request_queue *q = sdev->request_queue;
4349
4350 blk_mq_freeze_queue_wait(q);
4351 if (set_no_zone_write_lock)
> 4352 blk_queue_flag_set(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
4353 else
4354 blk_queue_flag_clear(QUEUE_FLAG_NO_ZONE_WRITE_LOCK, q);
4355 blk_mq_unfreeze_queue(q);
4356 }
4357 }
4358
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 6/6] fs/f2fs: Disable zone write locking
2023-07-26 0:57 [PATCH v3 0/6] Improve the performance of F2FS on zoned UFS Bart Van Assche
` (4 preceding siblings ...)
2023-07-26 0:57 ` [PATCH v3 5/6] scsi: ufs: Disable zone write locking Bart Van Assche
@ 2023-07-26 0:57 ` Bart Van Assche
5 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-07-26 0:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Ming Lei, Chao Yu
Set the REQ_NO_ZONE_WRITE_LOCK flag to inform the block layer that F2FS
allocates and submits zoned writes in LBA order per zone.
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
fs/f2fs/data.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5882afe71d82..6361553f4ab1 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -569,6 +569,7 @@ static void f2fs_submit_write_bio(struct f2fs_sb_info *sbi, struct bio *bio,
}
}
+ bio->bi_opf |= REQ_NO_ZONE_WRITE_LOCK;
trace_f2fs_submit_write_bio(sbi->sb, type, bio);
iostat_update_submit_ctx(bio, type);
submit_bio(bio);
^ permalink raw reply related [flat|nested] 15+ messages in thread