* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Christoph Hellwig @ 2017-04-04 6:42 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-5-bart.vanassche@sandisk.com>
> +static void scsi_restart_hctx(struct request_queue *q,
> + struct blk_mq_hw_ctx *hctx)
> +{
> + struct blk_mq_tags *tags = hctx->tags;
> + struct blk_mq_tag_set *set = q->tag_set;
> + int i;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)
> + queue_for_each_hw_ctx(q, hctx, i)
> + if (hctx->tags == tags)
> + blk_mq_sched_restart_hctx(hctx);
> + rcu_read_unlock();
> +}
This looks like generic block layer code, why is it in SCSI?
^ permalink raw reply
* Re: [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Christoph Hellwig @ 2017-04-04 6:42 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-4-bart.vanassche@sandisk.com>
On Mon, Apr 03, 2017 at 04:22:26PM -0700, Bart Van Assche wrote:
> If a tag set is shared among multiple hardware queues, leave
> it to the block driver to rerun hardware queues. Hence remove
> QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
> Remove blk_mq_sched_mark_restart_queue() because this
> function has no callers.
This looks fine, but I think it needs to also actually implement at
least a dummy restart_hctx method for SCSI and NVMe to keep the
existing functionality.
^ permalink raw reply
* Re: [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Christoph Hellwig @ 2017-04-04 6:40 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-3-bart.vanassche@sandisk.com>
On Mon, Apr 03, 2017 at 04:22:25PM -0700, Bart Van Assche wrote:
> A later patch in this series will namely use RCU to iterate over
> this list.
It also adds a couple lockdep_assert_held calls, which might be worth
mentionining.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH v2 1/5] blk-mq: Export blk_mq_sched_restart_hctx()
From: Christoph Hellwig @ 2017-04-04 6:40 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-2-bart.vanassche@sandisk.com>
> +void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
> + clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> + if (blk_mq_hctx_has_pending(hctx))
> + blk_mq_run_hw_queue(hctx, true);
> }
> +EXPORT_SYMBOL(blk_mq_sched_restart_hctx);
_GPL export like the other _hctx functions, please.
Otherwise looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
From: Christoph Hellwig @ 2017-04-04 6:34 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-7-git-send-email-sagi@grimberg.me>
On Sun, Apr 02, 2017 at 04:41:32PM +0300, Sagi Grimberg wrote:
> Use the geneic block layer affinity mapping helper. Also,
generic
> nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
> + nr_io_queues = min_t(unsigned int, nr_io_queues,
> + ibdev->num_comp_vectors);
> +
Add a comment here?
Otherwise looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Christoph Hellwig @ 2017-04-04 6:33 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-6-git-send-email-sagi@grimberg.me>
On Sun, Apr 02, 2017 at 04:41:31PM +0300, Sagi Grimberg wrote:
> Like pci and virtio, we add a rdma helper for affinity
> spreading. This achieves optimal mq affinity assignments
> according to the underlying rdma device affinity maps.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> block/Kconfig | 5 ++++
> block/Makefile | 1 +
> block/blk-mq-rdma.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
> include/linux/blk-mq-rdma.h | 10 ++++++++
> 4 files changed, 72 insertions(+)
> create mode 100644 block/blk-mq-rdma.c
> create mode 100644 include/linux/blk-mq-rdma.h
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 89cd28f8d051..3ab42bbb06d5 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
> depends on BLOCK && VIRTIO
> default y
>
> +config BLK_MQ_RDMA
> + bool
> + depends on BLOCK && INFINIBAND
> + default y
> +
> source block/Kconfig.iosched
> diff --git a/block/Makefile b/block/Makefile
> index 081bb680789b..4498603dbc83 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
> obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
> obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o
> obj-$(CONFIG_BLK_MQ_VIRTIO) += blk-mq-virtio.o
> +obj-$(CONFIG_BLK_MQ_RDMA) += blk-mq-rdma.o
> obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
> obj-$(CONFIG_BLK_WBT) += blk-wbt.o
> obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> new file mode 100644
> index 000000000000..d402f7c93528
> --- /dev/null
> +++ b/block/blk-mq-rdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2017 Sagi Grimberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#include <linux/blk-mq.h>
> +#include <linux/blk-mq-rdma.h>
> +#include <rdma/ib_verbs.h>
> +#include <linux/module.h>
> +#include "blk-mq.h"
> +
> +/**
> + * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
> + * @set: tagset to provide the mapping for
> + * @dev: rdma device associated with @set.
> + * @first_vec: first interrupt vectors to use for queues (usually 0)
> + *
> + * This function assumes the rdma device @dev has at least as many available
> + * interrupt vetors as @set has queues. It will then query it's affinity mask
> + * and built queue mapping that maps a queue to the CPUs that have irq affinity
> + * for the corresponding vector.
> + *
> + * In case either the driver passed a @dev with less vectors than
> + * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
> + * vector, we fallback to the naive mapping.
> + */
> +int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> + struct ib_device *dev, int first_vec)
> +{
> + const struct cpumask *mask;
> + unsigned int queue, cpu;
> +
> + if (set->nr_hw_queues > dev->num_comp_vectors)
> + goto fallback;
maybe print a warning here?
Otherwise looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 4/6] mlx5: support ->get_vector_affinity
From: Christoph Hellwig @ 2017-04-04 6:33 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-5-git-send-email-sagi@grimberg.me>
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector
From: Christoph Hellwig @ 2017-04-04 6:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-4-git-send-email-sagi@grimberg.me>
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
From: Christoph Hellwig @ 2017-04-04 6:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-3-git-send-email-sagi@grimberg.me>
> @@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
>
> static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
> {
> - return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
> + return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
> + MLX5_EQ_VEC_COMP_BASE + ix));
This looks ok for now, but if we look at the callers we'd probably
want to make direct use of pci_irq_get_node and pci_irq_get_affinity for
the uses directly in mlx5e_open_channel as well as the stored away
->cpu field. But maybe that should be left for another patch after
this one.
> + struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
I usually move assignments inside structures onto a separate line to make
it more readable, e.g.
struct irq_affinity irqdesc = {
.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
};
Otherwise this looks fine:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors
From: Christoph Hellwig @ 2017-04-04 6:27 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-2-git-send-email-sagi@grimberg.me>
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH v3] blk-mq: remap queues when adding/removing hardware queues
From: Christoph Hellwig @ 2017-04-04 6:24 UTC (permalink / raw)
To: Omar Sandoval
Cc: Jens Axboe, linux-block, Christoph Hellwig, Keith Busch,
Josef Bacik, kernel-team
In-Reply-To: <9753ebd0c51a9d49f110a6d0d00888170905d97a.1490993257.git.osandov@fb.com>
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* [PATCH v2 5/5] scsi: Ensure that scsi_run_queue() runs all hardware queues
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche, Christoph Hellwig
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>
commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. blk_mq_start_stopped_hw_queues()
only runs queues that had been stopped. Hence change the
blk_mq_start_stopped_hw_queues() call in scsi_run_queue() into
blk_mq_run_hw_queues(). Remove the blk_mq_start_stopped_hw_queues()
call from scsi_end_request() because __blk_mq_finish_request()
already runs all hardware queues if needed.
Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
---
drivers/scsi/scsi_lib.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0e240aebc150..4459b18f62a1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
scsi_starved_list_run(sdev->host);
if (q->mq_ops)
- blk_mq_start_stopped_hw_queues(q, false);
+ blk_mq_run_hw_queues(q, false);
else
blk_run_queue(q);
}
@@ -681,8 +681,6 @@ static bool scsi_end_request(struct request *req, int error,
if (scsi_target(sdev)->single_lun ||
!list_empty(&sdev->host->starved_list))
kblockd_schedule_work(&sdev->requeue_work);
- else
- blk_mq_start_stopped_hw_queues(q, true);
} else {
unsigned long flags;
--
2.12.0
^ permalink raw reply related
* [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>
This patch avoids that if multiple SCSI devices are associated with
a SCSI host that a queue can get stuck if scsi_queue_rq() returns
"busy".
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
drivers/scsi/scsi_lib.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c1519660824b..0e240aebc150 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -555,6 +555,21 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
scsi_run_queue(sdev->request_queue);
}
+static void scsi_restart_hctx(struct request_queue *q,
+ struct blk_mq_hw_ctx *hctx)
+{
+ struct blk_mq_tags *tags = hctx->tags;
+ struct blk_mq_tag_set *set = q->tag_set;
+ int i;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)
+ queue_for_each_hw_ctx(q, hctx, i)
+ if (hctx->tags == tags)
+ blk_mq_sched_restart_hctx(hctx);
+ rcu_read_unlock();
+}
+
static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
{
if (!blk_rq_is_passthrough(cmd->request)) {
@@ -2156,6 +2171,7 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
static const struct blk_mq_ops scsi_mq_ops = {
.queue_rq = scsi_queue_rq,
+ .restart_hctx = scsi_restart_hctx,
.complete = scsi_softirq_done,
.timeout = scsi_timeout,
.init_request = scsi_init_request,
--
2.12.0
^ permalink raw reply related
* [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>
If a tag set is shared among multiple hardware queues, leave
it to the block driver to rerun hardware queues. Hence remove
QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
Remove blk_mq_sched_mark_restart_queue() because this
function has no callers.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
block/blk-mq-sched.c | 15 ++++-----------
block/blk-mq-sched.h | 14 --------------
include/linux/blk-mq.h | 7 +++++++
include/linux/blkdev.h | 1 -
4 files changed, 11 insertions(+), 26 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 414ed4b3d266..f0c691a3ef9e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -328,18 +328,11 @@ EXPORT_SYMBOL(blk_mq_sched_restart_hctx);
void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
- unsigned int i;
-
- if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
- if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
- queue_for_each_hw_ctx(q, hctx, i)
- if (test_bit(BLK_MQ_S_SCHED_RESTART,
- &hctx->state))
- blk_mq_sched_restart_hctx(hctx);
- }
- } else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+
+ if (q->mq_ops->restart_hctx)
+ q->mq_ops->restart_hctx(q, hctx);
+ else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
blk_mq_sched_restart_hctx(hctx);
- }
}
/*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..fe62b1eccf4c 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -131,20 +131,6 @@ static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}
-/*
- * Mark a hardware queue and the request queue it belongs to as needing a
- * restart.
- */
-static inline void blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
-{
- struct request_queue *q = hctx->queue;
-
- if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
- set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
- if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
- set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-}
-
static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f62f3ce2dc65..ebeea36ff9cd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,6 +86,7 @@ struct blk_mq_queue_data {
};
typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
+typedef void (restart_fn)(struct request_queue *q, struct blk_mq_hw_ctx *hctx);
typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -109,6 +110,12 @@ struct blk_mq_ops {
queue_rq_fn *queue_rq;
/*
+ * Called upon request completion to rerun all hctxs that share a tag
+ * set with the hctx for which a request finished.
+ */
+ restart_fn *restart_hctx;
+
+ /*
* Called on request timeout
*/
timeout_fn *timeout;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a2dc6b390d48..a80543ec8be7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,7 +615,6 @@ struct request_queue {
#define QUEUE_FLAG_FLUSH_NQ 25 /* flush not queueuable */
#define QUEUE_FLAG_DAX 26 /* device supports DAX */
#define QUEUE_FLAG_STATS 27 /* track rq completion times */
-#define QUEUE_FLAG_RESTART 28 /* queue needs restart at completion */
#define QUEUE_FLAG_POLL_STATS 29 /* collecting stats for hybrid polling */
#define QUEUE_FLAG_REGISTERED 30 /* queue has been registered to a disk */
--
2.12.0
^ permalink raw reply related
* [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>
A later patch in this series will namely use RCU to iterate over
this list.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
block/blk-mq.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..8fb983e6e2e4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2093,6 +2093,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
{
struct request_queue *q;
+ lockdep_assert_held(&set->tag_list_lock);
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_freeze_queue(q);
queue_set_hctx_shared(q, shared);
@@ -2113,6 +2115,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
blk_mq_update_tag_set_depth(set, false);
}
mutex_unlock(&set->tag_list_lock);
+
+ synchronize_rcu();
}
static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
@@ -2618,6 +2622,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
{
struct request_queue *q;
+ lockdep_assert_held(&set->tag_list_lock);
+
if (nr_hw_queues > nr_cpu_ids)
nr_hw_queues = nr_cpu_ids;
if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
--
2.12.0
^ permalink raw reply related
* [PATCH v2 1/5] blk-mq: Export blk_mq_sched_restart_hctx()
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>
Since a later patch will add a call to this function from the
SCSI core, export this function. Move the BLK_MQ_S_SCHED_RESTART
bit test from blk_mq_sched_restart_hctx() into the callers of
this function. This leads to some code duplication but that will
be addressed in another patch in this series.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
block/blk-mq-sched.c | 17 +++++++++--------
include/linux/blk-mq.h | 1 +
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..414ed4b3d266 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -317,14 +317,13 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
return true;
}
-static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
- if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
- clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
- if (blk_mq_hctx_has_pending(hctx))
- blk_mq_run_hw_queue(hctx, true);
- }
+ clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+ if (blk_mq_hctx_has_pending(hctx))
+ blk_mq_run_hw_queue(hctx, true);
}
+EXPORT_SYMBOL(blk_mq_sched_restart_hctx);
void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
{
@@ -334,9 +333,11 @@ void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
queue_for_each_hw_ctx(q, hctx, i)
- blk_mq_sched_restart_hctx(hctx);
+ if (test_bit(BLK_MQ_S_SCHED_RESTART,
+ &hctx->state))
+ blk_mq_sched_restart_hctx(hctx);
}
- } else {
+ } else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
blk_mq_sched_restart_hctx(hctx);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ea2e9dcd3aef..f62f3ce2dc65 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -237,6 +237,7 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
void blk_mq_start_hw_queues(struct request_queue *q);
void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx);
void blk_mq_run_hw_queues(struct request_queue *q, bool async);
void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
--
2.12.0
^ permalink raw reply related
* [PATCH v2 0/5] Avoid that scsi-mq queue processing stalls
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Martin K . Petersen, James Bottomley,
Bart Van Assche
Hello Jens,
The five patches in this patch series fix the queue lockup I reported
last week on the linux-block mailing list. Please consider these patches
for kernel v4.11.
Thanks,
Bart.
Changes compared to v1 of this patch:
- Reworked scsi_restart_queues() such that it no longer takes the SCSI
host lock.
- Added two patches - one for exporting blk_mq_sched_restart_hctx() and
another one to make iterating with RCU over blk_mq_tag_set.tag_list safe.
Bart Van Assche (5):
block: Export blk_mq_sched_restart_hctx()
block: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
blk-mq: Introduce blk_mq_ops.restart_hctx
scsi: Add scsi_restart_queues()
scsi: Ensure that scsi_run_queue() runs all hardware queues
block/blk-mq-sched.c | 22 ++++++++--------------
block/blk-mq-sched.h | 14 --------------
block/blk-mq.c | 6 ++++++
drivers/scsi/scsi_lib.c | 20 +++++++++++++++++---
include/linux/blk-mq.h | 8 ++++++++
include/linux/blkdev.h | 1 -
6 files changed, 39 insertions(+), 32 deletions(-)
--
2.12.0
^ permalink raw reply
* [PATCH 5/5] blk-mq-sched: provide hooks for initializing hardware queue data
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq-sched.c | 81 +++++++++++++++++++++++++-----------------------
block/blk-mq-sched.h | 4 ---
include/linux/elevator.h | 2 ++
3 files changed, 45 insertions(+), 42 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 63281fe34090..435f91bc724f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
}
EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
- int (*init)(struct blk_mq_hw_ctx *),
- void (*exit)(struct blk_mq_hw_ctx *))
-{
- struct blk_mq_hw_ctx *hctx;
- int ret;
- int i;
-
- queue_for_each_hw_ctx(q, hctx, i) {
- hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
- if (!hctx->sched_data) {
- ret = -ENOMEM;
- goto error;
- }
-
- if (init) {
- ret = init(hctx);
- if (ret) {
- /*
- * We don't want to give exit() a partially
- * initialized sched_data. init() must clean up
- * if it fails.
- */
- kfree(hctx->sched_data);
- hctx->sched_data = NULL;
- goto error;
- }
- }
- }
-
- return 0;
-error:
- blk_mq_sched_free_hctx_data(q, exit);
- return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
static void __blk_mq_sched_assign_ioc(struct request_queue *q,
struct request *rq,
struct bio *bio,
@@ -464,11 +427,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
unsigned int hctx_idx)
{
struct elevator_queue *e = q->elevator;
+ int ret;
if (!e)
return 0;
- return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+ ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+ if (ret)
+ return ret;
+
+ if (e->type->ops.mq.init_hctx) {
+ ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+ if (ret) {
+ blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+ return ret;
+ }
+ }
+
+ return 0;
}
void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -479,12 +455,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
if (!e)
return;
+ if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+ e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+ hctx->sched_data = NULL;
+ }
+
blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
}
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
{
struct blk_mq_hw_ctx *hctx;
+ struct elevator_queue *eq;
unsigned int i;
int ret;
@@ -509,6 +491,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
if (ret)
goto err;
+ if (e->ops.mq.init_hctx) {
+ queue_for_each_hw_ctx(q, hctx, i) {
+ ret = e->ops.mq.init_hctx(hctx, i);
+ if (ret) {
+ eq = q->elevator;
+ blk_mq_exit_sched(q, eq);
+ kobject_put(&eq->kobj);
+ return ret;
+ }
+ }
+ }
+
return 0;
err:
@@ -519,6 +513,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned int i;
+
+ if (e->type->ops.mq.exit_hctx) {
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (hctx->sched_data) {
+ e->type->ops.mq.exit_hctx(hctx, i);
+ hctx->sched_data = NULL;
+ }
+ }
+ }
if (e->type->ops.mq.exit_sched)
e->type->ops.mq.exit_sched(e);
blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
#include "blk-mq.h"
#include "blk-mq-tag.h"
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
- int (*init)(struct blk_mq_hw_ctx *),
- void (*exit)(struct blk_mq_hw_ctx *));
-
void blk_mq_sched_free_hctx_data(struct request_queue *q,
void (*exit)(struct blk_mq_hw_ctx *));
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_type *);
void (*exit_sched)(struct elevator_queue *);
+ int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+ void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
--
2.12.2
^ permalink raw reply related
* [PATCH 3/5] blk-mq-sched: fix crash in switch error path
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq-sched.c | 13 +++++--
block/blk-mq-sched.h | 2 +-
block/blk-mq.c | 2 --
block/blk-sysfs.c | 2 +-
block/elevator.c | 94 +++++++++++++++++++++++++++---------------------
include/linux/elevator.h | 2 +-
6 files changed, 67 insertions(+), 48 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3fd918bb13a2..63281fe34090 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -450,7 +450,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
return ret;
}
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
{
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
@@ -512,10 +512,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
return 0;
err:
- blk_mq_sched_teardown(q);
+ blk_mq_sched_tags_teardown(q);
+ q->elevator = NULL;
return ret;
}
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+ if (e->type->ops.mq.exit_sched)
+ e->type->ops.mq.exit_sched(e);
+ blk_mq_sched_tags_teardown(q);
+ q->elevator = NULL;
+}
+
int blk_mq_sched_init(struct request_queue *q)
{
int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
struct request *(*get_rq)(struct blk_mq_hw_ctx *));
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac830cb488d7..477951d10cc9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2153,8 +2153,6 @@ void blk_mq_release(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
unsigned int i;
- blk_mq_sched_teardown(q);
-
/* hctx kobj stays in hctx */
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
if (q->elevator) {
ioc_clear_queue(q);
- elevator_exit(q->elevator);
+ elevator_exit(q, q->elevator);
}
blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
}
EXPORT_SYMBOL(elevator_init);
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
if (e->uses_mq && e->type->ops.mq.exit_sched)
- e->type->ops.mq.exit_sched(e);
+ blk_mq_exit_sched(q, e);
else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
e->type->ops.sq.elevator_exit_fn(e);
mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
}
EXPORT_SYMBOL_GPL(elv_unregister);
+static int elevator_switch_mq(struct request_queue *q,
+ struct elevator_type *new_e)
+{
+ int ret;
+
+ blk_mq_freeze_queue(q);
+ blk_mq_quiesce_queue(q);
+
+ if (q->elevator) {
+ if (q->elevator->registered)
+ elv_unregister_queue(q);
+ ioc_clear_queue(q);
+ elevator_exit(q, q->elevator);
+ }
+
+ ret = blk_mq_init_sched(q, new_e);
+ if (ret)
+ goto out;
+
+ if (new_e) {
+ ret = elv_register_queue(q);
+ if (ret) {
+ elevator_exit(q, q->elevator);
+ goto out;
+ }
+ }
+
+ if (new_e)
+ blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ else
+ blk_add_trace_msg(q, "elv switch: none");
+
+out:
+ blk_mq_unfreeze_queue(q);
+ blk_mq_start_stopped_hw_queues(q, true);
+ return ret;
+
+}
+
/*
* switch to new_e io scheduler. be careful not to introduce deadlocks -
* we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
bool old_registered = false;
int err;
- if (q->mq_ops) {
- blk_mq_freeze_queue(q);
- blk_mq_quiesce_queue(q);
- }
+ if (q->mq_ops)
+ return elevator_switch_mq(q, new_e);
/*
* Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (old) {
old_registered = old->registered;
- if (old->uses_mq)
- blk_mq_sched_teardown(q);
-
- if (!q->mq_ops)
- blk_queue_bypass_start(q);
+ blk_queue_bypass_start(q);
/* unregister and clear all auxiliary data of the old elevator */
if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
}
/* allocate, init and register new elevator */
- if (q->mq_ops)
- err = blk_mq_init_sched(q, new_e);
- else
- err = new_e->ops.sq.elevator_init_fn(q, new_e);
+ err = new_e->ops.sq.elevator_init_fn(q, new_e);
if (err)
goto fail_init;
- if (new_e) {
- err = elv_register_queue(q);
- if (err)
- goto fail_register;
- }
+ err = elv_register_queue(q);
+ if (err)
+ goto fail_register;
/* done, kill the old one and finish */
if (old) {
- elevator_exit(old);
- if (!q->mq_ops)
- blk_queue_bypass_end(q);
+ elevator_exit(q, old);
+ blk_queue_bypass_end(q);
}
- if (q->mq_ops) {
- blk_mq_unfreeze_queue(q);
- blk_mq_start_stopped_hw_queues(q, true);
- }
-
- if (new_e)
- blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
- else
- blk_add_trace_msg(q, "elv switch: none");
+ blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
return 0;
fail_register:
- if (q->mq_ops)
- blk_mq_sched_teardown(q);
- elevator_exit(q->elevator);
+ elevator_exit(q, q->elevator);
fail_init:
/* switch failed, restore and re-register old elevator */
if (old) {
q->elevator = old;
elv_register_queue(q);
- if (!q->mq_ops)
- blk_queue_bypass_end(q);
- }
- if (q->mq_ops) {
- blk_mq_unfreeze_queue(q);
- blk_mq_start_stopped_hw_queues(q, true);
+ blk_queue_bypass_end(q);
}
return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
extern int elevator_change(struct request_queue *, const char *);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
extern struct elevator_queue *elevator_alloc(struct request_queue *,
--
2.12.2
^ permalink raw reply related
* [PATCH 4/5] blk-mq: remap queues when adding/removing hardware queues
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.
Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 477951d10cc9..b49fdfde7e06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2483,6 +2483,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
return 0;
}
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+ if (set->ops->map_queues)
+ return set->ops->map_queues(set);
+ else
+ return blk_mq_map_queues(set);
+}
+
/*
* Alloc a tag set to be associated with one or more request queues.
* May fail with EINVAL for various error conditions. May adjust the
@@ -2537,10 +2545,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
if (!set->mq_map)
goto out_free_tags;
- if (set->ops->map_queues)
- ret = set->ops->map_queues(set);
- else
- ret = blk_mq_map_queues(set);
+ ret = blk_mq_update_queue_map(set);
if (ret)
goto out_free_mq_map;
@@ -2632,6 +2637,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
blk_mq_freeze_queue(q);
set->nr_hw_queues = nr_hw_queues;
+ blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
blk_mq_queue_reinit(q, cpu_online_mask);
--
2.12.2
^ permalink raw reply related
* [PATCH 2/5] blk-mq-sched: set up scheduler tags when bringing up new queues
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq-sched.c | 22 ++++++++++++++++++++++
block/blk-mq-sched.h | 5 +++++
block/blk-mq.c | 9 ++++++++-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e08ba915343e..3fd918bb13a2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -460,6 +460,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
blk_mq_sched_free_tags(set, hctx, i);
}
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (!e)
+ return 0;
+
+ return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (!e)
+ return;
+
+ blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
{
struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
void blk_mq_sched_teardown(struct request_queue *q);
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx);
+
int blk_mq_sched_init(struct request_queue *q);
static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..ac830cb488d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1839,6 +1839,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
hctx->fq->flush_rq, hctx_idx,
flush_start_tag + hctx_idx);
+ blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
@@ -1905,9 +1907,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
goto free_bitmap;
+ if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+ goto exit_hctx;
+
hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
if (!hctx->fq)
- goto exit_hctx;
+ goto sched_exit_hctx;
if (set->ops->init_request &&
set->ops->init_request(set->driver_data,
@@ -1922,6 +1927,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
free_fq:
kfree(hctx->fq);
+ sched_exit_hctx:
+ blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
exit_hctx:
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
--
2.12.2
^ permalink raw reply related
* [PATCH 1/5] blk-mq-sched: refactor scheduler initialization
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
block/blk-mq-sched.h | 2 +-
block/elevator.c | 32 ++++++++------------
3 files changed, 57 insertions(+), 59 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..e08ba915343e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -431,11 +431,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
}
}
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+ struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx)
+{
+ struct blk_mq_tag_set *set = q->tag_set;
+ int ret;
+
+ hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+ set->reserved_tags);
+ if (!hctx->sched_tags)
+ return -ENOMEM;
+
+ ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+ if (ret)
+ blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+ return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
{
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
- int ret, i;
+ int i;
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned int i;
+ int ret;
+
+ if (!e) {
+ q->elevator = NULL;
+ return 0;
+ }
/*
* Default to 256, since we don't split into sync/async like the
@@ -443,49 +477,21 @@ int blk_mq_sched_setup(struct request_queue *q)
*/
q->nr_requests = 2 * BLKDEV_MAX_RQ;
- /*
- * We're switching to using an IO scheduler, so setup the hctx
- * scheduler tags and switch the request map from the regular
- * tags to scheduler tags. First allocate what we need, so we
- * can safely fail and fallback, if needed.
- */
- ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
- hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
- q->nr_requests, set->reserved_tags);
- if (!hctx->sched_tags) {
- ret = -ENOMEM;
- break;
- }
- ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+ ret = blk_mq_sched_alloc_tags(q, hctx, i);
if (ret)
- break;
+ goto err;
}
- /*
- * If we failed, free what we did allocate
- */
- if (ret) {
- queue_for_each_hw_ctx(q, hctx, i) {
- if (!hctx->sched_tags)
- continue;
- blk_mq_sched_free_tags(set, hctx, i);
- }
-
- return ret;
- }
+ ret = e->ops.mq.init_sched(q, e);
+ if (ret)
+ goto err;
return 0;
-}
-void blk_mq_sched_teardown(struct request_queue *q)
-{
- struct blk_mq_tag_set *set = q->tag_set;
- struct blk_mq_hw_ctx *hctx;
- int i;
-
- queue_for_each_hw_ctx(q, hctx, i)
- blk_mq_sched_free_tags(set, hctx, i);
+err:
+ blk_mq_sched_teardown(q);
+ return ret;
}
int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
struct list_head *rq_list,
struct request *(*get_rq)(struct blk_mq_hw_ctx *));
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
void blk_mq_sched_teardown(struct request_queue *q);
int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
}
}
- if (e->uses_mq) {
- err = blk_mq_sched_setup(q);
- if (!err)
- err = e->ops.mq.init_sched(q, e);
- } else
+ if (e->uses_mq)
+ err = blk_mq_init_sched(q, e);
+ else
err = e->ops.sq.elevator_init_fn(q, e);
- if (err) {
- if (e->uses_mq)
- blk_mq_sched_teardown(q);
+ if (err)
elevator_put(e);
- }
return err;
}
EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
}
/* allocate, init and register new elevator */
- if (new_e) {
- if (new_e->uses_mq) {
- err = blk_mq_sched_setup(q);
- if (!err)
- err = new_e->ops.mq.init_sched(q, new_e);
- } else
- err = new_e->ops.sq.elevator_init_fn(q, new_e);
- if (err)
- goto fail_init;
+ if (q->mq_ops)
+ err = blk_mq_init_sched(q, new_e);
+ else
+ err = new_e->ops.sq.elevator_init_fn(q, new_e);
+ if (err)
+ goto fail_init;
+ if (new_e) {
err = elv_register_queue(q);
if (err)
goto fail_register;
- } else
- q->elevator = NULL;
+ }
/* done, kill the old one and finish */
if (old) {
--
2.12.2
^ permalink raw reply related
* [PATCH 0/5] blk-mq: scheduler and hw queue initialization fixes/enhancements
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
From: Omar Sandoval <osandov@fb.com>
Hi, Jens,
This series has some fixes and enhancements for blk-mq:
- Patch 1 is a cleanup in preparation for the rest of the series
- Patch 2 is a fix necessary for patch 4 when scheduling is enabled,
making sure we bring up new hardware queues with scheduler tags
- Patch 3 makes error handling in elevator_switch() more robust, making
us fall back to none like you recommended last time
- Patch 4 is the remap fix from last week
- Patch 5 is an extension of patch 2 for multiqueue schedulers that
allocate per-hctx data. Nothing in-tree needs it, but Kyber will.
Let me know if you'd prefer deferring this to 4.12 or want to apply 1-4
for 4.11. These are based on block/for-next, so the latter might require
a respin.
Thanks!
Omar Sandoval (5):
blk-mq-sched: refactor scheduler initialization
blk-mq-sched: set up scheduler tags when bringing up new queues
blk-mq-sched: fix crash in switch error path
blk-mq: remap queues when adding/removing hardware queues
blk-mq-sched: provide hooks for initializing hardware queue data
block/blk-mq-sched.c | 178 +++++++++++++++++++++++++++++------------------
block/blk-mq-sched.h | 13 ++--
block/blk-mq.c | 25 +++++--
block/blk-sysfs.c | 2 +-
block/elevator.c | 114 +++++++++++++++---------------
include/linux/elevator.h | 4 +-
6 files changed, 198 insertions(+), 138 deletions(-)
--
2.12.2
^ permalink raw reply
* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-03 21:25 UTC (permalink / raw)
To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
linux-raid
Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <9505ff12-7307-7dec-76b5-2a233a592634@profitbricks.com>
[-- Attachment #1: Type: text/plain, Size: 1927 bytes --]
On Mon, Apr 03 2017, Michael Wang wrote:
> blk_attempt_plug_merge() try to merge bio into request and chain them
> by 'bi_next', while after the bio is done inside request, we forgot to
> reset the 'bi_next'.
>
> This lead into BUG while removing all the underlying devices from md-raid1,
> the bio once go through:
>
> md_do_sync()
> sync_request()
> generic_make_request()
This is a read request from the "first" device.
> blk_queue_bio()
> blk_attempt_plug_merge()
> CHAINED HERE
>
> will keep chained and reused by:
>
> raid1d()
> sync_request_write()
> generic_make_request()
This is a write request to some other device, isn't it?
If sync_request_write() is using a bio that has already been used, it
should call bio_reset() and fill in the details again.
However I don't see how that would happen.
Can you give specific details on the situation that triggers the bug?
Thanks,
NeilBrown
> BUG_ON(bio->bi_next)
>
> After reset the 'bi_next' this can no longer happen.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
> block/blk-core.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43b7d06..91223b2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
> struct bio *bio = req->bio;
> unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>
> - if (bio_bytes == bio->bi_iter.bi_size)
> + if (bio_bytes == bio->bi_iter.bi_size) {
> req->bio = bio->bi_next;
> + bio->bi_next = NULL;
> + }
>
> req_bio_endio(req, bio, bio_bytes, error);
>
> --
> 2.5.0
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [PATCH 0/7] block: T10/DIF Fixes and cleanups v2
From: Martin K. Petersen @ 2017-04-03 21:12 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>
Dmitry Monakhov <dmonakhov@openvz.org> writes:
Dmitry,
> This patch set fix various problems spotted during T10/DIF integrity
> machinery testing.
>
> TOC:
> ## Fix various bugs in T10/DIF/DIX infrastructure
> 0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
> 0002-bio-integrity-save-original-iterator-for-verify-stage
> 0003-bio-integrity-bio_trim-should-truncate-integrity-vec
> 0004-bio-integrity-fix-interface-for-bio_integrity_trim
> ## Cleanup T10/DIF/DIX infrastructure
> 0005-bio-integrity-add-bio_integrity_setup-helper
> 0006-T10-Move-opencoded-contants-to-common-header
> ## General bulletproof protection for block layer
> 0007-Guard-bvec-iteration-logic-v2
Looks like a nice cleanup of some of the things that have rotted a bit
as a result of the immutable bvec efforts. No major objections from
here. I'll try your series on my qual setup tomorrow to make sure
everything is working correctly.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox