* [PATCH 00/12] Submit zoned writes in order
@ 2023-04-07 0:16 Bart Van Assche
2023-04-07 0:16 ` [PATCH 01/12] block: Send zoned writes to the I/O scheduler Bart Van Assche
` (11 more replies)
0 siblings, 12 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-04-07 0:16 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Hi Jens,
Tests with a zoned UFS prototype have shown that there are plenty of
opportunities for reordering in the block layer for zoned writes (REQ_OP_WRITE).
The UFS driver is more likely to trigger reordering than other SCSI drivers
because it reports BLK_STS_DEV_RESOURCE more often, e.g. during clock scaling.
This patch series makes sure that zoned writes are submitted in order without
affecting other workloads significantly.
Please consider this patch series for the next merge window.
Thanks,
Bart.
Bart Van Assche (12):
block: Send zoned writes to the I/O scheduler
block: Send flush requests to the I/O scheduler
block: Send requeued requests to the I/O scheduler
block: Requeue requests if a CPU is unplugged
block: One requeue list per hctx
block: Preserve the order of requeued requests
block: Make it easier to debug zoned write reordering
block: mq-deadline: Simplify deadline_skip_seq_writes()
block: mq-deadline: Disable head insertion for zoned writes
block: mq-deadline: Introduce a local variable
block: mq-deadline: Fix a race condition related to zoned writes
block: mq-deadline: Handle requeued requests correctly
block/blk-flush.c | 3 +-
block/blk-mq-debugfs.c | 66 +++++++++++------------
block/blk-mq.c | 115 +++++++++++++++++++++++------------------
block/blk.h | 19 +++++++
block/mq-deadline.c | 67 +++++++++++++++++++-----
include/linux/blk-mq.h | 35 ++++++-------
include/linux/blkdev.h | 4 --
7 files changed, 192 insertions(+), 117 deletions(-)
* [PATCH 01/12] block: Send zoned writes to the I/O scheduler
From: Bart Van Assche @ 2023-04-07 0:16 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Send zoned writes inserted by the device mapper to the I/O scheduler.
This prevents zoned writes from being reordered if a device mapper driver
has been stacked on top of a driver for a zoned block device.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
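Note for readers, not part of the commit: the routing decision added to
blk_insert_cloned_request() can be modeled in user space. The sketch below
is only an illustration under stated assumptions. Every model_* name is
hypothetical, and the targets_seq_zone field stands in for the result of
disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel request opcodes. */
enum model_op {
	MODEL_OP_READ,
	MODEL_OP_WRITE,
	MODEL_OP_WRITE_ZEROES,
	MODEL_OP_ZONE_APPEND,
};

struct model_request {
	enum model_op op;
	/* Assumption: models disk_zone_is_seq() for the request position. */
	bool targets_seq_zone;
};

/* Same switch structure as blk_rq_is_seq_zoned_write() in this patch. */
static bool model_is_seq_zoned_write(const struct model_request *rq)
{
	assert(rq != NULL);

	switch (rq->op) {
	case MODEL_OP_WRITE:
	case MODEL_OP_WRITE_ZEROES:
		return rq->targets_seq_zone;
	case MODEL_OP_ZONE_APPEND:
	default:
		return false;
	}
}

enum model_path {
	MODEL_PATH_SCHEDULER,      /* blk_mq_sched_insert_request() */
	MODEL_PATH_ISSUE_DIRECTLY, /* blk_mq_request_issue_directly() */
};

/* Models which branch a cloned request takes with this patch applied. */
static enum model_path model_insert_cloned(bool has_elevator,
					   const struct model_request *rq)
{
	if (has_elevator && model_is_seq_zoned_write(rq))
		return MODEL_PATH_SCHEDULER;
	return MODEL_PATH_ISSUE_DIRECTLY;
}
```

In this model only a write or write-zeroes request that targets a
sequential zone, on a queue with an I/O scheduler attached, takes the
scheduler path. Zone append requests and all requests on queues without a
scheduler are still issued directly.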
block/blk-mq.c | 16 +++++++++++++---
block/blk.h | 19 +++++++++++++++++++
2 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index db93b1a71157..fefc9a728e0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3008,9 +3008,19 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
blk_account_io_start(rq);
/*
- * Since we have a scheduler attached on the top device,
- * bypass a potential scheduler on the bottom device for
- * insert.
+ * Send zoned writes to the I/O scheduler if an I/O scheduler has been
+ * attached.
+ */
+ if (q->elevator && blk_rq_is_seq_zoned_write(rq)) {
+ blk_mq_sched_insert_request(rq, /*at_head=*/false,
+ /*run_queue=*/true,
+ /*async=*/false);
+ return BLK_STS_OK;
+ }
+
+ /*
+ * If no I/O scheduler has been attached or if the request is not a
+ * zoned write bypass the I/O scheduler attached to the bottom device.
*/
blk_mq_run_dispatch_ops(q,
ret = blk_mq_request_issue_directly(rq, true));
diff --git a/block/blk.h b/block/blk.h
index d65d96994a94..4b6f8d7a6b84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -118,6 +118,25 @@ static inline bool bvec_gap_to_prev(const struct queue_limits *lim,
return __bvec_gap_to_prev(lim, bprv, offset);
}
+/**
+ * blk_rq_is_seq_zoned_write() - Whether @rq is a write request for a sequential zone.
+ * @rq: Request to examine.
+ *
+ * In this context, a sequential zone is either a sequential write required
+ * zone or a sequential write preferred zone.
+ */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq));
+ case REQ_OP_ZONE_APPEND:
+ default:
+ return false;
+ }
+}
+
static inline bool rq_mergeable(struct request *rq)
{
if (blk_rq_is_passthrough(rq))
* [PATCH 02/12] block: Send flush requests to the I/O scheduler
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Prevent zoned writes with the FUA flag set from being reordered against each
other or against other zoned writes. Separate the I/O scheduler members
from the flush members in struct request since with this patch applied a
request may pass through both an I/O scheduler and the flush machinery.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
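Note for readers, not part of the commit: the reason the union has to go
can be shown with a small user-space layout check. This is a sketch under
stated assumptions. The model_* types are hypothetical simplifications of
the elv and flush members of struct request (void pointers replace
struct io_cq * and rq_end_io_fn *), so the sizes differ from the kernel's.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Simplified copy of the I/O scheduler members. */
struct model_elv {
	void *icq;
	void *priv[2];
};

/* Simplified copy of the flush machinery members. */
struct model_flush {
	unsigned int seq;
	struct model_list_head list;
	void (*saved_end_io)(void);
};

/* Layout before this patch: both sets of members share storage. */
struct rq_shared {
	union {
		struct model_elv elv;
		struct model_flush flush;
	};
};

/* Layout after this patch: each set of members has its own storage. */
struct rq_separate {
	struct model_elv elv;
	struct model_flush flush;
};

/* Returns 1 because both union members start at the same offset. */
static int shared_layout_overlaps(void)
{
	return offsetof(struct rq_shared, elv) ==
	       offsetof(struct rq_shared, flush);
}

/* Returns 0 because flush starts at or after the end of elv. */
static int separate_layout_overlaps(void)
{
	size_t elv_end = offsetof(struct rq_separate, elv) +
			 sizeof(struct model_elv);

	return offsetof(struct rq_separate, flush) < elv_end;
}
```

With the shared layout, a flush request that is also handed to the I/O
scheduler would have its flush state overwritten by scheduler data, or the
other way around. The separate layout avoids that at the cost of a larger
struct request.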
block/blk-flush.c | 3 ++-
block/blk-mq.c | 11 ++++-------
block/mq-deadline.c | 2 +-
include/linux/blk-mq.h | 27 +++++++++++----------------
4 files changed, 18 insertions(+), 25 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 53202eff545e..e0cf153388d8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -432,7 +432,8 @@ void blk_insert_flush(struct request *rq)
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
- blk_mq_request_bypass_insert(rq, false, true);
+ blk_mq_sched_insert_request(rq, /*at_head=*/false,
+ /*run_queue=*/true, /*async=*/true);
return;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fefc9a728e0e..250556546bbf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -390,8 +390,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
INIT_HLIST_NODE(&rq->hash);
RB_CLEAR_NODE(&rq->rb_node);
- if (!op_is_flush(data->cmd_flags) &&
- e->type->ops.prepare_request) {
+ if (e->type->ops.prepare_request) {
e->type->ops.prepare_request(rq);
rq->rq_flags |= RQF_ELVPRIV;
}
@@ -452,13 +451,11 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
data->rq_flags |= RQF_ELV;
/*
- * Flush/passthrough requests are special and go directly to the
- * dispatch list. Don't include reserved tags in the
- * limiting, as it isn't useful.
+ * Do not limit the depth for passthrough requests nor for
+ * requests with a reserved tag.
*/
- if (!op_is_flush(data->cmd_flags) &&
+ if (e->type->ops.limit_depth &&
!blk_op_is_passthrough(data->cmd_flags) &&
- e->type->ops.limit_depth &&
!(data->flags & BLK_MQ_REQ_RESERVED))
e->type->ops.limit_depth(data->cmd_flags, data);
}
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f10c2a0d18d4..d885ccf49170 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -789,7 +789,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
prio = ioprio_class_to_prio[ioprio_class];
per_prio = &dd->per_prio[prio];
- if (!rq->elv.priv[0]) {
+ if (!rq->elv.priv[0] && !(rq->rq_flags & RQF_FLUSH_SEQ)) {
per_prio->stats.inserted++;
rq->elv.priv[0] = (void *)(uintptr_t)1;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 06caacd77ed6..5e6c79ad83d2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -169,25 +169,20 @@ struct request {
void *completion_data;
};
-
/*
* Three pointers are available for the IO schedulers, if they need
- * more they have to dynamically allocate it. Flush requests are
- * never put on the IO scheduler. So let the flush fields share
- * space with the elevator data.
+ * more they have to dynamically allocate it.
*/
- union {
- struct {
- struct io_cq *icq;
- void *priv[2];
- } elv;
-
- struct {
- unsigned int seq;
- struct list_head list;
- rq_end_io_fn *saved_end_io;
- } flush;
- };
+ struct {
+ struct io_cq *icq;
+ void *priv[2];
+ } elv;
+
+ struct {
+ unsigned int seq;
+ struct list_head list;
+ rq_end_io_fn *saved_end_io;
+ } flush;
union {
struct __call_single_data csd;
* [PATCH 03/12] block: Send requeued requests to the I/O scheduler
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
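Send requeued requests, and requests that could not be dispatched, to the
I/O scheduler instead of to the hctx dispatch list if an I/O scheduler has
been attached. This lets the I/O scheduler control which requests are
dispatched. Since requests with RQF_DONTPREP set contain driver-specific
data, add RQF_DONTPREP to RQF_NOMERGE_FLAGS to keep these requests from
being merged.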
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 22 ++++++++++------------
include/linux/blk-mq.h | 5 +++--
2 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 250556546bbf..f6ffa76bc159 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1426,15 +1426,7 @@ static void blk_mq_requeue_work(struct work_struct *work)
rq->rq_flags &= ~RQF_SOFTBARRIER;
list_del_init(&rq->queuelist);
- /*
- * If RQF_DONTPREP, rq has contained some driver specific
- * data, so insert it to hctx dispatch list to avoid any
- * merge.
- */
- if (rq->rq_flags & RQF_DONTPREP)
- blk_mq_request_bypass_insert(rq, false, false);
- else
- blk_mq_sched_insert_request(rq, true, false, false);
+ blk_mq_sched_insert_request(rq, /*at_head=*/true, false, false);
}
while (!list_empty(&rq_list)) {
@@ -2065,9 +2057,15 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
if (nr_budgets)
blk_mq_release_budgets(q, list);
- spin_lock(&hctx->lock);
- list_splice_tail_init(list, &hctx->dispatch);
- spin_unlock(&hctx->lock);
+ if (!q->elevator) {
+ spin_lock(&hctx->lock);
+ list_splice_tail_init(list, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ } else {
+ q->elevator->type->ops.insert_requests(
+ hctx, list,
+ /*at_head=*/true);
+ }
/*
* Order adding requests to hctx->dispatch and checking
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5e6c79ad83d2..3a3bee9085e3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -64,8 +64,9 @@ typedef __u32 __bitwise req_flags_t;
#define RQF_RESV ((__force req_flags_t)(1 << 23))
/* flags that prevent us from merging requests: */
-#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+#define RQF_NOMERGE_FLAGS \
+ (RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_DONTPREP | \
+ RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
* [PATCH 04/12] block: Requeue requests if a CPU is unplugged
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
If a CPU is unplugged and an I/O scheduler has been attached, requeue the
pending requests of that CPU instead of sending them to the dispatch list.
This prevents reordering of zoned writes.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6ffa76bc159..8bb35deff5ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3496,9 +3496,17 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
if (list_empty(&tmp))
return 0;
- spin_lock(&hctx->lock);
- list_splice_tail_init(&tmp, &hctx->dispatch);
- spin_unlock(&hctx->lock);
+ if (hctx->queue->elevator) {
+ struct request *rq, *next;
+
+ list_for_each_entry_safe(rq, next, &tmp, queuelist)
+ blk_mq_requeue_request(rq, false);
+ blk_mq_kick_requeue_list(hctx->queue);
+ } else {
+ spin_lock(&hctx->lock);
+ list_splice_tail_init(&tmp, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ }
blk_mq_run_hw_queue(hctx, true);
return 0;
* [PATCH 05/12] block: One requeue list per hctx
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Prepare for processing the requeue list from inside __blk_mq_run_hw_queue().
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq-debugfs.c | 66 +++++++++++++++++++++---------------------
block/blk-mq.c | 55 ++++++++++++++++++++++-------------
include/linux/blk-mq.h | 4 +++
include/linux/blkdev.h | 4 ---
4 files changed, 72 insertions(+), 57 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 212a7f301e73..5eb930754347 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -20,37 +20,6 @@ static int queue_poll_stat_show(void *data, struct seq_file *m)
return 0;
}
-static void *queue_requeue_list_start(struct seq_file *m, loff_t *pos)
- __acquires(&q->requeue_lock)
-{
- struct request_queue *q = m->private;
-
- spin_lock_irq(&q->requeue_lock);
- return seq_list_start(&q->requeue_list, *pos);
-}
-
-static void *queue_requeue_list_next(struct seq_file *m, void *v, loff_t *pos)
-{
- struct request_queue *q = m->private;
-
- return seq_list_next(v, &q->requeue_list, pos);
-}
-
-static void queue_requeue_list_stop(struct seq_file *m, void *v)
- __releases(&q->requeue_lock)
-{
- struct request_queue *q = m->private;
-
- spin_unlock_irq(&q->requeue_lock);
-}
-
-static const struct seq_operations queue_requeue_list_seq_ops = {
- .start = queue_requeue_list_start,
- .next = queue_requeue_list_next,
- .stop = queue_requeue_list_stop,
- .show = blk_mq_debugfs_rq_show,
-};
-
static int blk_flags_show(struct seq_file *m, const unsigned long flags,
const char *const *flag_name, int flag_name_count)
{
@@ -156,11 +125,10 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{ "poll_stat", 0400, queue_poll_stat_show },
- { "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
{ "pm_only", 0600, queue_pm_only_show, NULL },
{ "state", 0600, queue_state_show, queue_state_write },
{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
- { },
+ {},
};
#define HCTX_STATE_NAME(name) [BLK_MQ_S_##name] = #name
@@ -513,6 +481,37 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
return 0;
}
+static void *hctx_requeue_list_start(struct seq_file *m, loff_t *pos)
+ __acquires(&hctx->requeue_lock)
+{
+ struct blk_mq_hw_ctx *hctx = m->private;
+
+ spin_lock_irq(&hctx->requeue_lock);
+ return seq_list_start(&hctx->requeue_list, *pos);
+}
+
+static void *hctx_requeue_list_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct blk_mq_hw_ctx *hctx = m->private;
+
+ return seq_list_next(v, &hctx->requeue_list, pos);
+}
+
+static void hctx_requeue_list_stop(struct seq_file *m, void *v)
+ __releases(&hctx->requeue_lock)
+{
+ struct blk_mq_hw_ctx *hctx = m->private;
+
+ spin_unlock_irq(&hctx->requeue_lock);
+}
+
+static const struct seq_operations hctx_requeue_list_seq_ops = {
+ .start = hctx_requeue_list_start,
+ .next = hctx_requeue_list_next,
+ .stop = hctx_requeue_list_stop,
+ .show = blk_mq_debugfs_rq_show,
+};
+
#define CTX_RQ_SEQ_OPS(name, type) \
static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
__acquires(&ctx->lock) \
@@ -628,6 +627,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"run", 0600, hctx_run_show, hctx_run_write},
{"active", 0400, hctx_active_show},
{"dispatch_busy", 0400, hctx_dispatch_busy_show},
+ {"requeue_list", 0400, .seq_ops = &hctx_requeue_list_seq_ops},
{"type", 0400, hctx_type_show},
{},
};
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8bb35deff5ec..1e285b0cfba3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1411,14 +1411,17 @@ EXPORT_SYMBOL(blk_mq_requeue_request);
static void blk_mq_requeue_work(struct work_struct *work)
{
- struct request_queue *q =
- container_of(work, struct request_queue, requeue_work.work);
+ struct blk_mq_hw_ctx *hctx =
+ container_of(work, struct blk_mq_hw_ctx, requeue_work.work);
LIST_HEAD(rq_list);
struct request *rq, *next;
- spin_lock_irq(&q->requeue_lock);
- list_splice_init(&q->requeue_list, &rq_list);
- spin_unlock_irq(&q->requeue_lock);
+ if (list_empty_careful(&hctx->requeue_list))
+ return;
+
+ spin_lock_irq(&hctx->requeue_lock);
+ list_splice_init(&hctx->requeue_list, &rq_list);
+ spin_unlock_irq(&hctx->requeue_lock);
list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
if (!(rq->rq_flags & (RQF_SOFTBARRIER | RQF_DONTPREP)))
@@ -1435,13 +1438,15 @@ static void blk_mq_requeue_work(struct work_struct *work)
blk_mq_sched_insert_request(rq, false, false, false);
}
- blk_mq_run_hw_queues(q, false);
+ blk_mq_run_hw_queue(hctx, false);
}
void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
bool kick_requeue_list)
{
struct request_queue *q = rq->q;
+ struct blk_mq_hw_ctx *hctx =
+ rq->mq_hctx ?: q->queue_ctx[0].hctxs[HCTX_TYPE_DEFAULT];
unsigned long flags;
/*
@@ -1450,14 +1455,14 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
*/
BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
- spin_lock_irqsave(&q->requeue_lock, flags);
+ spin_lock_irqsave(&hctx->requeue_lock, flags);
if (at_head) {
rq->rq_flags |= RQF_SOFTBARRIER;
- list_add(&rq->queuelist, &q->requeue_list);
+ list_add(&rq->queuelist, &hctx->requeue_list);
} else {
- list_add_tail(&rq->queuelist, &q->requeue_list);
+ list_add_tail(&rq->queuelist, &hctx->requeue_list);
}
- spin_unlock_irqrestore(&q->requeue_lock, flags);
+ spin_unlock_irqrestore(&hctx->requeue_lock, flags);
if (kick_requeue_list)
blk_mq_kick_requeue_list(q);
@@ -1465,15 +1470,25 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
void blk_mq_kick_requeue_list(struct request_queue *q)
{
- kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
+ &hctx->requeue_work, 0);
}
EXPORT_SYMBOL(blk_mq_kick_requeue_list);
void blk_mq_delay_kick_requeue_list(struct request_queue *q,
unsigned long msecs)
{
- kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
- msecs_to_jiffies(msecs));
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
+ &hctx->requeue_work,
+ msecs_to_jiffies(msecs));
}
EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
@@ -3595,6 +3610,10 @@ static int blk_mq_init_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
{
+ INIT_DELAYED_WORK(&hctx->requeue_work, blk_mq_requeue_work);
+ INIT_LIST_HEAD(&hctx->requeue_list);
+ spin_lock_init(&hctx->requeue_lock);
+
hctx->queue_num = hctx_idx;
if (!(hctx->flags & BLK_MQ_F_STACKING))
@@ -4210,10 +4229,6 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
blk_mq_update_poll_flag(q);
- INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
- INIT_LIST_HEAD(&q->requeue_list);
- spin_lock_init(&q->requeue_lock);
-
q->nr_requests = set->queue_depth;
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
@@ -4758,10 +4773,10 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- cancel_delayed_work_sync(&q->requeue_work);
-
- queue_for_each_hw_ctx(q, hctx, i)
+ queue_for_each_hw_ctx(q, hctx, i) {
+ cancel_delayed_work_sync(&hctx->requeue_work);
cancel_delayed_work_sync(&hctx->run_work);
+ }
}
static int __init blk_mq_init(void)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3a3bee9085e3..0157f1569980 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -311,6 +311,10 @@ struct blk_mq_hw_ctx {
unsigned long state;
} ____cacheline_aligned_in_smp;
+ struct list_head requeue_list;
+ spinlock_t requeue_lock;
+ struct delayed_work requeue_work;
+
/**
* @run_work: Used for scheduling a hardware queue run at a later time.
*/
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e3242e67a8e3..f5fa53cd13bd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,10 +491,6 @@ struct request_queue {
*/
struct blk_flush_queue *fq;
- struct list_head requeue_list;
- spinlock_t requeue_lock;
- struct delayed_work requeue_work;
-
struct mutex sysfs_lock;
struct mutex sysfs_dir_lock;
* [PATCH 06/12] block: Preserve the order of requeued requests
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
If a queue is run before all requeued requests have been sent to the I/O
scheduler, the I/O scheduler may dispatch the wrong request. Fix this by
making __blk_mq_run_hw_queue() process the requeue_list instead of
blk_mq_requeue_work().
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 35 ++++++++++-------------------------
include/linux/blk-mq.h | 1 -
2 files changed, 10 insertions(+), 26 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1e285b0cfba3..2cf317d49f56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -64,6 +64,7 @@ static inline blk_qc_t blk_rq_to_qc(struct request *rq)
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
{
return !list_empty_careful(&hctx->dispatch) ||
+ !list_empty_careful(&hctx->requeue_list) ||
sbitmap_any_bit_set(&hctx->ctx_map) ||
blk_mq_sched_has_work(hctx);
}
@@ -1409,10 +1410,8 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
}
EXPORT_SYMBOL(blk_mq_requeue_request);
-static void blk_mq_requeue_work(struct work_struct *work)
+static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
{
- struct blk_mq_hw_ctx *hctx =
- container_of(work, struct blk_mq_hw_ctx, requeue_work.work);
LIST_HEAD(rq_list);
struct request *rq, *next;
@@ -1437,8 +1436,6 @@ static void blk_mq_requeue_work(struct work_struct *work)
list_del_init(&rq->queuelist);
blk_mq_sched_insert_request(rq, false, false, false);
}
-
- blk_mq_run_hw_queue(hctx, false);
}
void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
@@ -1465,30 +1462,19 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
spin_unlock_irqrestore(&hctx->requeue_lock, flags);
if (kick_requeue_list)
- blk_mq_kick_requeue_list(q);
+ blk_mq_run_hw_queue(hctx, /*async=*/true);
}
void blk_mq_kick_requeue_list(struct request_queue *q)
{
- struct blk_mq_hw_ctx *hctx;
- unsigned long i;
-
- queue_for_each_hw_ctx(q, hctx, i)
- kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
- &hctx->requeue_work, 0);
+ blk_mq_run_hw_queues(q, true);
}
EXPORT_SYMBOL(blk_mq_kick_requeue_list);
void blk_mq_delay_kick_requeue_list(struct request_queue *q,
unsigned long msecs)
{
- struct blk_mq_hw_ctx *hctx;
- unsigned long i;
-
- queue_for_each_hw_ctx(q, hctx, i)
- kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
- &hctx->requeue_work,
- msecs_to_jiffies(msecs));
+ blk_mq_delay_run_hw_queues(q, msecs);
}
EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
@@ -2148,6 +2134,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
*/
WARN_ON_ONCE(in_interrupt());
+ blk_mq_process_requeue_list(hctx);
+
blk_mq_run_dispatch_ops(hctx->queue,
blk_mq_sched_dispatch_requests(hctx));
}
@@ -2319,7 +2307,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
* scheduler.
*/
if (!sq_hctx || sq_hctx == hctx ||
- !list_empty_careful(&hctx->dispatch))
+ blk_mq_hctx_has_pending(hctx))
blk_mq_run_hw_queue(hctx, async);
}
}
@@ -2355,7 +2343,7 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
* scheduler.
*/
if (!sq_hctx || sq_hctx == hctx ||
- !list_empty_careful(&hctx->dispatch))
+ blk_mq_hctx_has_pending(hctx))
blk_mq_delay_run_hw_queue(hctx, msecs);
}
}
@@ -3610,7 +3598,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
{
- INIT_DELAYED_WORK(&hctx->requeue_work, blk_mq_requeue_work);
INIT_LIST_HEAD(&hctx->requeue_list);
spin_lock_init(&hctx->requeue_lock);
@@ -4773,10 +4760,8 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- queue_for_each_hw_ctx(q, hctx, i) {
- cancel_delayed_work_sync(&hctx->requeue_work);
+ queue_for_each_hw_ctx(q, hctx, i)
cancel_delayed_work_sync(&hctx->run_work);
- }
}
static int __init blk_mq_init(void)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0157f1569980..e62feb17af96 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -313,7 +313,6 @@ struct blk_mq_hw_ctx {
struct list_head requeue_list;
spinlock_t requeue_lock;
- struct delayed_work requeue_work;
/**
* @run_work: Used for scheduling a hardware queue run at a later time.
* [PATCH 07/12] block: Make it easier to debug zoned write reordering
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Issue a kernel warning if a sequential zoned write bypasses an attached I/O
scheduler, since such a write could get reordered.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cf317d49f56..07426dbbe720 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2480,6 +2480,8 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
{
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+ WARN_ON_ONCE(rq->q->elevator && blk_rq_is_seq_zoned_write(rq));
+
spin_lock(&hctx->lock);
if (at_head)
list_add(&rq->queuelist, &hctx->dispatch);
@@ -2572,6 +2574,8 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
bool run_queue = true;
int budget_token;
+ WARN_ON_ONCE(q->elevator && blk_rq_is_seq_zoned_write(rq));
+
/*
* RCU or SRCU read lock is needed before checking quiesced flag.
*
* [PATCH 08/12] block: mq-deadline: Simplify deadline_skip_seq_writes()
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Make deadline_skip_seq_writes() shorter without changing its
functionality.
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
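Note for readers, not part of the commit: the claim that the shorter loop
keeps the same behavior can be checked in user space. The sketch below is
an assumption-laden model: the sorted request list is replaced by an array,
struct model_rq and both function names are hypothetical, and the returned
index plays the role of the returned request pointer (n means NULL).

```c
#include <stddef.h>

/* Hypothetical stand-in for the position and size of a request. */
struct model_rq {
	unsigned long long pos; /* models blk_rq_pos(rq) */
	unsigned int sectors;   /* models blk_rq_sectors(rq) */
};

/* Loop shape before this patch: start position plus a skipped counter. */
static size_t skip_seq_old(const struct model_rq *rqs, size_t n)
{
	unsigned long long pos, skipped_sectors = 0;
	size_t i = 0;

	if (n == 0)
		return 0;
	pos = rqs[0].pos;
	while (i < n) {
		if (rqs[i].pos != pos + skipped_sectors)
			break;
		skipped_sectors += rqs[i].sectors;
		i++; /* models rq = deadline_latter_request(rq) */
	}
	return i;
}

/* Loop shape after this patch: pos itself tracks the expected position. */
static size_t skip_seq_new(const struct model_rq *rqs, size_t n)
{
	unsigned long long pos;
	size_t i = 0;

	if (n == 0)
		return 0;
	pos = rqs[0].pos;
	while (i < n && rqs[i].pos == pos) {
		pos += rqs[i].sectors;
		i++; /* models rq = deadline_latter_request(rq) */
	}
	return i;
}
```

Both functions stop at the first request that does not start where the
previous one ended. In the new loop pos always equals the old
pos + skipped_sectors, which is why the extra variable can be dropped.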
block/mq-deadline.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index d885ccf49170..50a9d3b0a291 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -312,12 +312,9 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
struct request *rq)
{
sector_t pos = blk_rq_pos(rq);
- sector_t skipped_sectors = 0;
- while (rq) {
- if (blk_rq_pos(rq) != pos + skipped_sectors)
- break;
- skipped_sectors += blk_rq_sectors(rq);
+ while (rq && blk_rq_pos(rq) == pos) {
+ pos += blk_rq_sectors(rq);
rq = deadline_latter_request(rq);
}
* [PATCH 09/12] block: mq-deadline: Disable head insertion for zoned writes
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Make sure that zoned writes are submitted in LBA order by ignoring the
at_head argument of dd_insert_request() for sequential zoned writes.
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 50a9d3b0a291..891ee0da73ac 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -798,7 +798,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
trace_block_rq_insert(rq);
- if (at_head) {
+ if (at_head && !blk_rq_is_seq_zoned_write(rq)) {
list_add(&rq->queuelist, &per_prio->dispatch);
rq->fifo_time = jiffies;
} else {
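The decision made by the changed condition can be sketched in isolation as follows. This is a simplified model, not the scheduler code: `struct toy_rq`, `toy_is_seq_zoned_write()` and `toy_insert_target()` are hypothetical names, and the two boolean fields are an assumption about what blk_rq_is_seq_zoned_write() checks, since that helper is defined elsewhere in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a request ends up: head of the dispatch list or the sorted list. */
enum toy_target { TOY_DISPATCH_HEAD, TOY_SORTED };

/* Hypothetical stand-in for struct request. */
struct toy_rq {
	bool is_write;		/* the request is a write */
	bool zone_is_seq;	/* the target zone requires sequential writes */
};

/* Assumed meaning of blk_rq_is_seq_zoned_write() for this sketch. */
static bool toy_is_seq_zoned_write(const struct toy_rq *rq)
{
	return rq->is_write && rq->zone_is_seq;
}

/* Mirrors the patched condition: head insertion never applies to
 * sequential zoned writes, so they always go through LBA sorting. */
enum toy_target toy_insert_target(const struct toy_rq *rq, bool at_head)
{
	assert(rq != NULL);
	if (at_head && !toy_is_seq_zoned_write(rq))
		return TOY_DISPATCH_HEAD;
	return TOY_SORTED;
}
```

Requests placed at the head of the dispatch list bypass sorting, which is why the sketch routes sequential zoned writes to the sorted path regardless of `at_head`.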
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 10/12] block: mq-deadline: Introduce a local variable
2023-04-07 0:16 [PATCH 00/12] Submit zoned writes in order Bart Van Assche
` (8 preceding siblings ...)
2023-04-07 0:17 ` [PATCH 09/12] block: mq-deadline: Disable head insertion for zoned writes Bart Van Assche
@ 2023-04-07 0:17 ` Bart Van Assche
2023-04-07 0:17 ` [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
2023-04-07 0:17 ` [PATCH 12/12] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
11 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Prepare for adding more code that uses the request queue pointer.
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 891ee0da73ac..8c2bc9fdcf8c 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -368,6 +368,7 @@ static struct request *
deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
enum dd_data_dir data_dir)
{
+ struct request_queue *q;
struct request *rq;
unsigned long flags;
@@ -375,7 +376,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
if (!rq)
return NULL;
- if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+ q = rq->q;
+ if (data_dir == DD_READ || !blk_queue_is_zoned(q))
return rq;
/*
@@ -389,7 +391,7 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
while (rq) {
if (blk_req_can_dispatch_to_zone(rq))
break;
- if (blk_queue_nonrot(rq->q))
+ if (blk_queue_nonrot(q))
rq = deadline_latter_request(rq);
else
rq = deadline_skip_seq_writes(dd, rq);
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes
2023-04-07 0:16 [PATCH 00/12] Submit zoned writes in order Bart Van Assche
` (9 preceding siblings ...)
2023-04-07 0:17 ` [PATCH 10/12] block: mq-deadline: Introduce a local variable Bart Van Assche
@ 2023-04-07 0:17 ` Bart Van Assche
2023-04-07 17:08 ` kernel test robot
2023-04-08 3:17 ` kernel test robot
2023-04-07 0:17 ` [PATCH 12/12] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
11 siblings, 2 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
Let deadline_next_request() only consider the first zoned write per
zone. This patch fixes a race condition between deadline_next_request()
and completion of zoned writes.
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 8c2bc9fdcf8c..d49e20d3011d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -389,12 +389,30 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
*/
spin_lock_irqsave(&dd->zone_lock, flags);
while (rq) {
+ unsigned int zno = blk_rq_zone_no(rq);
+
if (blk_req_can_dispatch_to_zone(rq))
break;
- if (blk_queue_nonrot(q))
- rq = deadline_latter_request(rq);
- else
+
+ WARN_ON_ONCE(!blk_queue_is_zoned(q));
+
+ if (!blk_queue_nonrot(q)) {
rq = deadline_skip_seq_writes(dd, rq);
+ if (!rq)
+ break;
+ rq = deadline_earlier_request(rq);
+ if (WARN_ON_ONCE(!rq))
+ break;
+ }
+
+ /*
+ * Skip all other write requests for the zone with zone number
+ * 'zno'. This prevents this function from selecting a zoned
+ * write that is not the first write for a given zone.
+ */
+ while ((rq = deadline_latter_request(rq)) &&
+ blk_rq_zone_no(rq) == zno)
+ ;
}
spin_unlock_irqrestore(&dd->zone_lock, flags);
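The per-zone skipping in the new loop can be modeled in user space as shown below. This is a sketch under simplifying assumptions, not the kernel implementation: `struct toy_rq` and `toy_next_write()` are hypothetical, an array sorted by position replaces the tree walk, a `zone_locked` array replaces blk_req_can_dispatch_to_zone(), and the rotational-media branch that calls deadline_skip_seq_writes() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued zoned write. */
struct toy_rq {
	unsigned int zno;	/* zone number, like blk_rq_zone_no() */
	unsigned long long pos;	/* start sector, like blk_rq_pos() */
};

/*
 * Return the index of the first write whose zone is not locked, or n if
 * there is none. Once the first write of a zone turns out to be blocked,
 * every other write for that zone is skipped too, so a write that is not
 * the first one for its zone is never selected.
 * The caller must size zone_locked[] to cover every zno in rqs[].
 */
size_t toy_next_write(const struct toy_rq *rqs, size_t n,
		      const bool *zone_locked)
{
	size_t i = 0;

	assert(rqs != NULL && zone_locked != NULL);
	while (i < n) {
		unsigned int zno = rqs[i].zno;

		if (!zone_locked[zno])
			break;
		/* Skip the remaining writes for the locked zone 'zno'. */
		do {
			i++;
		} while (i < n && rqs[i].zno == zno);
	}
	return i;
}
```

The inner `do`/`while` corresponds to the empty-bodied `while` loop in the patch: it always advances at least one request and stops at the first request that belongs to a different zone.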
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 12/12] block: mq-deadline: Handle requeued requests correctly
2023-04-07 0:16 [PATCH 00/12] Submit zoned writes in order Bart Van Assche
` (10 preceding siblings ...)
2023-04-07 0:17 ` [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
@ 2023-04-07 0:17 ` Bart Van Assche
11 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-04-07 0:17 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Ming Lei,
Mike Snitzer, Jaegeuk Kim, Bart Van Assche
If a zoned write is requeued with an LBA that is lower than already
inserted zoned writes, make sure that it is submitted first.
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index d49e20d3011d..2e046ad8ca2c 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -162,8 +162,19 @@ static void
deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
{
struct rb_root *root = deadline_rb_root(per_prio, rq);
+ struct request **next_rq = &per_prio->next_rq[rq_data_dir(rq)];
elv_rb_add(root, rq);
+ if (*next_rq == NULL || !blk_queue_is_zoned(rq->q))
+ return;
+ /*
+ * If a request got requeued or requests have been submitted out of
+ * order, make sure that per zone the request with the lowest LBA is
+ * submitted first.
+ */
+ if (blk_rq_pos(rq) < blk_rq_pos(*next_rq) &&
+ blk_rq_zone_no(rq) == blk_rq_zone_no(*next_rq))
+ *next_rq = rq;
}
static inline void
@@ -822,6 +833,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
list_add(&rq->queuelist, &per_prio->dispatch);
rq->fifo_time = jiffies;
} else {
+ struct list_head *insert_before;
+
deadline_add_rq_rb(per_prio, rq);
if (rq_mergeable(rq)) {
@@ -834,7 +847,20 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ insert_before = &per_prio->fifo_list[data_dir];
+ if (blk_rq_is_seq_zoned_write(rq)) {
+ const unsigned int zno = blk_rq_zone_no(rq);
+ struct request *prev = rq;
+
+ while ((prev = deadline_earlier_request(prev))) {
+ if (blk_rq_zone_no(prev) != zno)
+ continue;
+ if (blk_rq_pos(rq) >= blk_rq_pos(prev))
+ break;
+ insert_before = &prev->queuelist;
+ }
+ }
+ list_add_tail(&rq->queuelist, insert_before);
}
}
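The next_rq update added to deadline_add_rq_rb() can be illustrated with a small stand-alone model. The names `struct toy_rq` and `toy_update_next()` are hypothetical, and the `zoned` flag stands in for blk_queue_is_zoned(); the sketch only shows the comparison, not the red-black tree handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued request. */
struct toy_rq {
	unsigned int zno;	/* zone number, like blk_rq_zone_no() */
	unsigned long long pos;	/* start sector, like blk_rq_pos() */
};

/*
 * After 'rq' has been added to the sorted set, move the "next request"
 * cursor back to 'rq' if it targets the same zone as the cursor and
 * starts at a lower sector. Otherwise leave the cursor alone.
 */
void toy_update_next(const struct toy_rq **next_rq, const struct toy_rq *rq,
		     bool zoned)
{
	assert(next_rq != NULL && rq != NULL);
	if (*next_rq == NULL || !zoned)
		return;
	if (rq->pos < (*next_rq)->pos && rq->zno == (*next_rq)->zno)
		*next_rq = rq;
}
```

For example, if the cursor points at a write at sector 116 of zone 1 and a write at sector 108 of the same zone is requeued, the cursor moves to the requeued write. A write for another zone leaves the cursor unchanged.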
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes
2023-04-07 0:17 ` [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
@ 2023-04-07 17:08 ` kernel test robot
2023-04-08 3:17 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2023-04-07 17:08 UTC (permalink / raw)
To: Bart Van Assche; +Cc: llvm, oe-kbuild-all
Hi Bart,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on linus/master v6.3-rc5 next-20230406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bart-Van-Assche/block-Send-zoned-writes-to-the-I-O-scheduler/20230407-081846
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230407001710.104169-12-bvanassche%40acm.org
patch subject: [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes
config: i386-randconfig-r031-20230403 (https://download.01.org/0day-ci/archive/20230408/202304080013.UtHGCZkg-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/10abd923538d71bd2d53f8d8ec2f7bbfe746e6cb
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Bart-Van-Assche/block-Send-zoned-writes-to-the-I-O-scheduler/20230407-081846
git checkout 10abd923538d71bd2d53f8d8ec2f7bbfe746e6cb
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304080013.UtHGCZkg-lkp@intel.com/
All errors (new ones prefixed by >>):
>> block/mq-deadline.c:392:22: error: implicit declaration of function 'blk_rq_zone_no' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
unsigned int zno = blk_rq_zone_no(rq);
^
1 error generated.
vim +/blk_rq_zone_no +392 block/mq-deadline.c
362
363 /*
364 * For the specified data direction, return the next request to
365 * dispatch using sector position sorted lists.
366 */
367 static struct request *
368 deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
369 enum dd_data_dir data_dir)
370 {
371 struct request_queue *q;
372 struct request *rq;
373 unsigned long flags;
374
375 rq = per_prio->next_rq[data_dir];
376 if (!rq)
377 return NULL;
378
379 q = rq->q;
380 if (data_dir == DD_READ || !blk_queue_is_zoned(q))
381 return rq;
382
383 /*
384 * Look for a write request that can be dispatched, that is one with
385 * an unlocked target zone. For some HDDs, breaking a sequential
386 * write stream can lead to lower throughput, so make sure to preserve
387 * sequential write streams, even if that stream crosses into the next
388 * zones and these zones are unlocked.
389 */
390 spin_lock_irqsave(&dd->zone_lock, flags);
391 while (rq) {
> 392 unsigned int zno = blk_rq_zone_no(rq);
393
394 if (blk_req_can_dispatch_to_zone(rq))
395 break;
396
397 WARN_ON_ONCE(!blk_queue_is_zoned(q));
398
399 if (!blk_queue_nonrot(q)) {
400 rq = deadline_skip_seq_writes(dd, rq);
401 if (!rq)
402 break;
403 rq = deadline_earlier_request(rq);
404 if (WARN_ON_ONCE(!rq))
405 break;
406 }
407
408 /*
409 * Skip all other write requests for the zone with zone number
410 * 'zno'. This prevents this function from selecting a zoned
411 * write that is not the first write for a given zone.
412 */
413 while ((rq = deadline_latter_request(rq)) &&
414 blk_rq_zone_no(rq) == zno)
415 ;
416 }
417 spin_unlock_irqrestore(&dd->zone_lock, flags);
418
419 return rq;
420 }
421
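Errors of this kind typically appear when a helper is only declared under a configuration option that the tested config disables. The report does not state the cause, so the following is an assumption: that blk_rq_zone_no() is only visible when zoned block device support is enabled. A common remedy is a stub for the disabled case, sketched here with hypothetical names (`TOY_CONFIG_ZONED`, `toy_rq_zone_no()`, `TOY_ZONE_SECTORS`) rather than the real kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request. */
struct toy_request {
	unsigned long long sector;
};

/* Hypothetical fixed zone size, in sectors. */
#define TOY_ZONE_SECTORS 1024ULL

#ifdef TOY_CONFIG_ZONED
/* Real helper: only meaningful when zoned support is compiled in. */
static inline unsigned int toy_rq_zone_no(const struct toy_request *rq)
{
	return (unsigned int)(rq->sector / TOY_ZONE_SECTORS);
}
#else
/* Stub: keeps callers compiling when zoned support is compiled out. */
static inline unsigned int toy_rq_zone_no(const struct toy_request *rq)
{
	(void)rq;
	return 0;
}
#endif

/* A caller that builds in both configurations thanks to the stub. */
unsigned int toy_zone_of(const struct toy_request *rq)
{
	assert(rq != NULL);
	return toy_rq_zone_no(rq);
}
```

Built without `-DTOY_CONFIG_ZONED`, `toy_zone_of()` returns 0 for every request; with it, the zone number is derived from the start sector. Whether a stub or an `#ifdef` around the caller is the better fix for the patch itself is a design decision for the series.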
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes
2023-04-07 0:17 ` [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
2023-04-07 17:08 ` kernel test robot
@ 2023-04-08 3:17 ` kernel test robot
1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2023-04-08 3:17 UTC (permalink / raw)
To: Bart Van Assche; +Cc: oe-kbuild-all
Hi Bart,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on linus/master v6.3-rc5 next-20230406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Bart-Van-Assche/block-Send-zoned-writes-to-the-I-O-scheduler/20230407-081846
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230407001710.104169-12-bvanassche%40acm.org
patch subject: [PATCH 11/12] block: mq-deadline: Fix a race condition related to zoned writes
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20230408/202304081128.OL6BHqTs-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/10abd923538d71bd2d53f8d8ec2f7bbfe746e6cb
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Bart-Van-Assche/block-Send-zoned-writes-to-the-I-O-scheduler/20230407-081846
git checkout 10abd923538d71bd2d53f8d8ec2f7bbfe746e6cb
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304081128.OL6BHqTs-lkp@intel.com/
All errors (new ones prefixed by >>):
block/mq-deadline.c: In function 'deadline_next_request':
>> block/mq-deadline.c:392:36: error: implicit declaration of function 'blk_rq_zone_no'; did you mean 'bdev_zone_no'? [-Werror=implicit-function-declaration]
392 | unsigned int zno = blk_rq_zone_no(rq);
| ^~~~~~~~~~~~~~
| bdev_zone_no
cc1: some warnings being treated as errors
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 15+ messages in thread