* Re: [Nbd] [PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers
From: Mel Gorman @ 2017-04-06 11:25 UTC (permalink / raw)
To: Wouter Verhelst
Cc: Michal Hocko, Vlastimil Babka, nbd-general, Chris Leech,
linux-scsi, Josef Bacik, netdev, linux-kernel, linux-block,
linux-mm, Eric Dumazet, Lee Duncan, Johannes Weiner,
Andrew Morton, open-iscsi, David S. Miller
In-Reply-To: <20170406063810.dmv4fg2irsqgdvyq@grep.be>
On Thu, Apr 06, 2017 at 08:38:10AM +0200, Wouter Verhelst wrote:
> On Wed, Apr 05, 2017 at 01:30:31PM +0200, Michal Hocko wrote:
> > On Wed 05-04-17 09:46:59, Vlastimil Babka wrote:
> > > We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
> > > clearing of PF_MEMALLOC. Let's convert the code which was using the generic
> > > tsk_restore_flags(). No functional change.
> >
> > It would be really great to revisit why those places outside of the mm
> > proper really need this flag. I know this is a painful exercise but I
> > wouldn't be surprised if there were abusers there.
> [...]
> > > ---
> > > drivers/block/nbd.c | 7 ++++---
> > > drivers/scsi/iscsi_tcp.c | 7 ++++---
> > > net/core/dev.c | 7 ++++---
> > > net/core/sock.c | 7 ++++---
> > > 4 files changed, 16 insertions(+), 12 deletions(-)
>
> These were all done to make swapping over network safe. The idea is that
> if a socket has SOCK_MEMALLOC set, incoming packets for that socket can
> access PFMEMALLOC reserves (whereas other sockets cannot); this all in
> the hope that one packe destined to that socket will contain the TCP ACK
> that confirms the swapout was successful and we can now release RAM
> pages for other processes.
>
> I don't know whether they need the PF_MEMALLOC flag specifically (not a
> kernel hacker), but they do need to interact with it at any rate.
>
At the time it was required to get access to emergency reserves so swapping
can continue. The flip side is that the memory is then protected so pages
allocated from emergency reserves are not used for network traffic that
is not involved with swap. This means that under heavy swap load, it was
perfectly possible for unrelated traffic to get dropped for quite some
time.
--
Mel Gorman
SUSE Labs
^ permalink raw reply
* [PATCH v2 6/6] nvme-rdma: use intelligent affinity based queue mappings
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
Use the generic block layer affinity mapping helper. Also,
limit nr_hw_queues to the rdma device number of irq vectors
as we don't really need more.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/rdma.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4aae363943e3..22334b6e8fc3 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -19,6 +19,7 @@
#include <linux/string.h>
#include <linux/atomic.h>
#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
#include <linux/types.h>
#include <linux/list.h>
#include <linux/mutex.h>
@@ -496,14 +497,10 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
queue->device = dev;
/*
- * The admin queue is barely used once the controller is live, so don't
- * bother to spread it out.
+ * Spread I/O queues completion vectors according their queue index.
+ * Admin queues can always go on completion vector 0.
*/
- if (idx == 0)
- comp_vector = 0;
- else
- comp_vector = idx % ibdev->num_comp_vectors;
-
+ comp_vector = idx == 0 ? idx : idx - 1;
/* +1 for ib_stop_cq */
queue->ib_cq = ib_alloc_cq(dev->dev, queue,
@@ -645,10 +642,20 @@ static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
{
struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
+ struct ib_device *ibdev = ctrl->device->dev;
unsigned int nr_io_queues;
int i, ret;
nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+
+ /*
+ * we map queues according to the device irq vectors for
+ * optimal locality so we don't need more queues than
+ * completion vectors.
+ */
+ nr_io_queues = min_t(unsigned int, nr_io_queues,
+ ibdev->num_comp_vectors);
+
ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
if (ret)
return ret;
@@ -1523,6 +1530,13 @@ static void nvme_rdma_complete_rq(struct request *rq)
nvme_complete_rq(rq);
}
+static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
+{
+ struct nvme_rdma_ctrl *ctrl = set->driver_data;
+
+ return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
+}
+
static const struct blk_mq_ops nvme_rdma_mq_ops = {
.queue_rq = nvme_rdma_queue_rq,
.complete = nvme_rdma_complete_rq,
@@ -1532,6 +1546,7 @@ static const struct blk_mq_ops nvme_rdma_mq_ops = {
.init_hctx = nvme_rdma_init_hctx,
.poll = nvme_rdma_poll,
.timeout = nvme_rdma_timeout,
+ .map_queues = nvme_rdma_map_queues,
};
static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
--
2.7.4
^ permalink raw reply related
* [PATCH v2 5/6] block: Add rdma affinity based queue mapping helper
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
Like pci and virtio, we add a rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.
Reviewed-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
block/Kconfig | 5 +++++
block/Makefile | 1 +
block/blk-mq-rdma.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/blk-mq-rdma.h | 10 +++++++++
4 files changed, 70 insertions(+)
create mode 100644 block/blk-mq-rdma.c
create mode 100644 include/linux/blk-mq-rdma.h
diff --git a/block/Kconfig b/block/Kconfig
index 89cd28f8d051..3ab42bbb06d5 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
depends on BLOCK && VIRTIO
default y
+config BLK_MQ_RDMA
+ bool
+ depends on BLOCK && INFINIBAND
+ default y
+
source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..4498603dbc83 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o
obj-$(CONFIG_BLK_MQ_VIRTIO) += blk-mq-virtio.o
+obj-$(CONFIG_BLK_MQ_RDMA) += blk-mq-rdma.o
obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
obj-$(CONFIG_BLK_WBT) += blk-wbt.o
obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
new file mode 100644
index 000000000000..7dc07b43858b
--- /dev/null
+++ b/block/blk-mq-rdma.c
@@ -0,0 +1,54 @@
+/*
+ * Copyright (c) 2017 Sagi Grimberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
+#include <rdma/ib_verbs.h>
+
+/**
+ * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
+ * @set: tagset to provide the mapping for
+ * @dev: rdma device associated with @set.
+ * @first_vec: first interrupt vectors to use for queues (usually 0)
+ *
+ * This function assumes the rdma device @dev has at least as many available
+ * interrupt vetors as @set has queues. It will then query it's affinity mask
+ * and built queue mapping that maps a queue to the CPUs that have irq affinity
+ * for the corresponding vector.
+ *
+ * In case either the driver passed a @dev with less vectors than
+ * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
+ * vector, we fallback to the naive mapping.
+ */
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+ struct ib_device *dev, int first_vec)
+{
+ const struct cpumask *mask;
+ unsigned int queue, cpu;
+
+ if (set->nr_hw_queues > dev->num_comp_vectors)
+ goto fallback;
+
+ for (queue = 0; queue < set->nr_hw_queues; queue++) {
+ mask = ib_get_vector_affinity(dev, first_vec + queue);
+ if (!mask)
+ goto fallback;
+
+ for_each_cpu(cpu, mask)
+ set->mq_map[cpu] = queue;
+ }
+
+ return 0;
+fallback:
+ return blk_mq_map_queues(set);
+}
+EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
new file mode 100644
index 000000000000..b4ade198007d
--- /dev/null
+++ b/include/linux/blk-mq-rdma.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_BLK_MQ_RDMA_H
+#define _LINUX_BLK_MQ_RDMA_H
+
+struct blk_mq_tag_set;
+struct ib_device;
+
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+ struct ib_device *dev, int first_vec);
+
+#endif /* _LINUX_BLK_MQ_RDMA_H */
--
2.7.4
^ permalink raw reply related
* [PATCH v2 4/6] mlx5: support ->get_vector_affinity
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
Simply refer to the generic affinity mask helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/infiniband/hw/mlx5/main.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 4dc0a8785fe0..b12bc2294895 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3319,6 +3319,15 @@ static int mlx5_ib_get_hw_stats(struct ib_device *ibdev,
return port->q_cnts.num_counters;
}
+const struct cpumask *mlx5_ib_get_vector_affinity(struct ib_device *ibdev,
+ int comp_vector)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibdev);
+
+ return pci_irq_get_affinity(dev->mdev->pdev,
+ MLX5_EQ_VEC_COMP_BASE + comp_vector);
+}
+
static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
{
struct mlx5_ib_dev *dev;
@@ -3449,6 +3458,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
dev->ib_dev.get_port_immutable = mlx5_port_immutable;
dev->ib_dev.get_dev_fw_str = get_dev_fw_str;
+ dev->ib_dev.get_vector_affinity = mlx5_ib_get_vector_affinity;
if (mlx5_core_is_pf(mdev)) {
dev->ib_dev.get_vf_config = mlx5_ib_get_vf_config;
dev->ib_dev.set_vf_link_state = mlx5_ib_set_vf_link_state;
--
2.7.4
^ permalink raw reply related
* [PATCH v2 3/6] RDMA/core: expose affinity mappings per completion vector
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
This will allow ULPs to intelligently locate threads based
on completion vector cpu affinity mappings. In case the
driver does not expose a get_vector_affinity callout, return
NULL so the caller can maintain a fallback logic.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
include/rdma/ib_verbs.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0f1813c13687..d44b62791c64 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2150,6 +2150,8 @@ struct ib_device {
*/
int (*get_port_immutable)(struct ib_device *, u8, struct ib_port_immutable *);
void (*get_dev_fw_str)(struct ib_device *, char *str, size_t str_len);
+ const struct cpumask *(*get_vector_affinity)(struct ib_device *ibdev,
+ int comp_vector);
};
struct ib_client {
@@ -3377,4 +3379,26 @@ void ib_drain_qp(struct ib_qp *qp);
int ib_resolve_eth_dmac(struct ib_device *device,
struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_get_vector_affinity - Get the affinity mappings of a given completion
+ * vector
+ * @device: the rdma device
+ * @comp_vector: index of completion vector
+ *
+ * Returns NULL on failure, otherwise a corresponding cpu map of the
+ * completion vector (returns all-cpus map if the device driver doesn't
+ * implement get_vector_affinity).
+ */
+static inline const struct cpumask *
+ib_get_vector_affinity(struct ib_device *device, int comp_vector)
+{
+ if (comp_vector > device->num_comp_vectors ||
+ !device->get_vector_affinity)
+ return NULL;
+
+ return device->get_vector_affinity(device, comp_vector);
+
+}
+
#endif /* IB_VERBS_H */
--
2.7.4
^ permalink raw reply related
* [PATCH v2 2/6] mlx5: move affinity hints assignments to generic code
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
generic api takes care of spreading affinity similar to
what mlx5 open coded (and even handles better asymmetric
configurations). Ask the generic API to spread affinity
for us, and feed him pre_vectors that do not participate
in affinity settings (which is an improvement to what we
had before).
The affinity assignments should match what mlx5 tried to
do earlier but now we do not set affinity to async, cmd
and pages dedicated vectors.
Also, remove mlx5e_get_cpu routine as we have generic helpers
to get cpumask and node given a irq vector, so use them
directly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 12 ++--
drivers/net/ethernet/mellanox/mlx5/core/main.c | 83 ++---------------------
include/linux/mlx5/driver.h | 1 -
3 files changed, 10 insertions(+), 86 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index eec0d172761e..a51399c54c62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1373,11 +1373,6 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
mlx5e_destroy_cq(cq);
}
-static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
-{
- return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
-}
-
static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
struct mlx5e_channel_param *cparam)
{
@@ -1535,19 +1530,20 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
struct mlx5e_cq_moder icosq_cq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_cq_moder rx_cq_profile;
- int cpu = mlx5e_get_cpu(priv, ix);
struct mlx5e_channel *c;
struct mlx5e_sq *sq;
int err;
int i;
- c = kzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
+ c = kzalloc_node(sizeof(*c), GFP_KERNEL,
+ pci_irq_get_node(priv->mdev->pdev, MLX5_EQ_VEC_COMP_BASE + ix));
if (!c)
return -ENOMEM;
c->priv = priv;
c->ix = ix;
- c->cpu = cpu;
+ c->cpu = cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
+ MLX5_EQ_VEC_COMP_BASE + ix));
c->pdev = &priv->mdev->pdev->dev;
c->netdev = priv->netdev;
c->mkey_be = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 7c8672cbb369..0c2d8e22d94a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -312,6 +312,9 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
{
struct mlx5_priv *priv = &dev->priv;
struct mlx5_eq_table *table = &priv->eq_table;
+ struct irq_affinity irqdesc = {
+ .pre_vectors = MLX5_EQ_VEC_COMP_BASE,
+ };
int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
int nvec;
@@ -325,9 +328,10 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
if (!priv->irq_info)
goto err_free_msix;
- nvec = pci_alloc_irq_vectors(dev->pdev,
+ nvec = pci_alloc_irq_vectors_affinity(dev->pdev,
MLX5_EQ_VEC_COMP_BASE + 1, nvec,
- PCI_IRQ_MSIX);
+ PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+ &irqdesc);
if (nvec < 0)
return nvec;
@@ -600,71 +604,6 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
return (u64)timer_l | (u64)timer_h1 << 32;
}
-static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
- struct mlx5_priv *priv = &mdev->priv;
- int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
- int err;
-
- if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
- mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
- return -ENOMEM;
- }
-
- cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
- priv->irq_info[i].mask);
-
- err = irq_set_affinity_hint(irq, priv->irq_info[i].mask);
- if (err) {
- mlx5_core_warn(mdev, "irq_set_affinity_hint failed,irq 0x%.4x",
- irq);
- goto err_clear_mask;
- }
-
- return 0;
-
-err_clear_mask:
- free_cpumask_var(priv->irq_info[i].mask);
- return err;
-}
-
-static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
- struct mlx5_priv *priv = &mdev->priv;
- int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-
- irq_set_affinity_hint(irq, NULL);
- free_cpumask_var(priv->irq_info[i].mask);
-}
-
-static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
-{
- int err;
- int i;
-
- for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
- err = mlx5_irq_set_affinity_hint(mdev, i);
- if (err)
- goto err_out;
- }
-
- return 0;
-
-err_out:
- for (i--; i >= 0; i--)
- mlx5_irq_clear_affinity_hint(mdev, i);
-
- return err;
-}
-
-static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
-{
- int i;
-
- for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
- mlx5_irq_clear_affinity_hint(mdev, i);
-}
-
int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
unsigned int *irqn)
{
@@ -1116,12 +1055,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
goto err_stop_eqs;
}
- err = mlx5_irq_set_affinity_hints(dev);
- if (err) {
- dev_err(&pdev->dev, "Failed to alloc affinity hint cpumask\n");
- goto err_affinity_hints;
- }
-
err = mlx5_init_fs(dev);
if (err) {
dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1165,9 +1098,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
mlx5_cleanup_fs(dev);
err_fs:
- mlx5_irq_clear_affinity_hints(dev);
-
-err_affinity_hints:
free_comp_eqs(dev);
err_stop_eqs:
@@ -1234,7 +1164,6 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
mlx5_eswitch_detach(dev->priv.eswitch);
#endif
mlx5_cleanup_fs(dev);
- mlx5_irq_clear_affinity_hints(dev);
free_comp_eqs(dev);
mlx5_stop_eqs(dev);
mlx5_put_uars_page(dev, priv->uar);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index a9891df94ce0..4c7cf4dfb024 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -527,7 +527,6 @@ struct mlx5_core_sriov {
};
struct mlx5_irq_info {
- cpumask_var_t mask;
char name[MLX5_MAX_IRQ_NAME];
};
--
2.7.4
^ permalink raw reply related
* [PATCH v2 1/6] mlx5: convert to generic pci_alloc_irq_vectors
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
In-Reply-To: <1491474998-16423-1-git-send-email-sagi@grimberg.me>
Now that we have a generic code to allocate an array
of irq vectors and even correctly spread their affinity,
correctly handle cpu hotplug events and more, were much
better off using it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 9 ++----
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 33 ++++++++--------------
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 -
include/linux/mlx5/driver.h | 1 -
7 files changed, 17 insertions(+), 33 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8ef64c4db2c2..eec0d172761e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -389,7 +389,7 @@ static void mlx5e_enable_async_events(struct mlx5e_priv *priv)
static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
{
clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLED, &priv->state);
- synchronize_irq(mlx5_get_msix_vec(priv->mdev, MLX5_EQ_VEC_ASYNC));
+ synchronize_irq(pci_irq_vector(priv->mdev->pdev, MLX5_EQ_VEC_ASYNC));
}
static inline int mlx5e_get_wqe_mtt_sz(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ea5d8d37a75c..e2c33c493b89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -575,7 +575,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
name, pci_name(dev->pdev));
eq->eqn = MLX5_GET(create_eq_out, out, eq_number);
- eq->irqn = priv->msix_arr[vecidx].vector;
+ eq->irqn = pci_irq_vector(dev->pdev, vecidx);
eq->dev = dev;
eq->doorbell = priv->uar->map + MLX5_EQ_DOORBEL_OFFSET;
err = request_irq(eq->irqn, handler, 0,
@@ -610,7 +610,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
return 0;
err_irq:
- free_irq(priv->msix_arr[vecidx].vector, eq);
+ free_irq(eq->irqn, eq);
err_eq:
mlx5_cmd_destroy_eq(dev, eq->eqn);
@@ -651,11 +651,6 @@ int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
}
EXPORT_SYMBOL_GPL(mlx5_destroy_unmap_eq);
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx)
-{
- return dev->priv.msix_arr[MLX5_EQ_VEC_ASYNC].vector;
-}
-
int mlx5_eq_init(struct mlx5_core_dev *dev)
{
int err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index fcd5bc7e31db..6bf5d70b4117 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1596,7 +1596,7 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, int vport_num)
/* Mark this vport as disabled to discard new events */
vport->enabled = false;
- synchronize_irq(mlx5_get_msix_vec(esw->dev, MLX5_EQ_VEC_ASYNC));
+ synchronize_irq(pci_irq_vector(esw->dev->pdev, MLX5_EQ_VEC_ASYNC));
/* Wait for current already scheduled events to complete */
flush_workqueue(esw->work_queue);
/* Disable events from this vport */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d0515391d33b..8b38d5cfd4c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -80,7 +80,7 @@ static void trigger_cmd_completions(struct mlx5_core_dev *dev)
u64 vector;
/* wait for pending handlers to complete */
- synchronize_irq(dev->priv.msix_arr[MLX5_EQ_VEC_CMD].vector);
+ synchronize_irq(pci_irq_vector(dev->pdev, MLX5_EQ_VEC_CMD));
spin_lock_irqsave(&dev->cmd.alloc_lock, flags);
vector = ~dev->cmd.bitmask & ((1ul << (1 << dev->cmd.log_sz)) - 1);
if (!vector)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index e2bd600d19de..7c8672cbb369 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -308,13 +308,12 @@ static void release_bar(struct pci_dev *pdev)
pci_release_regions(pdev);
}
-static int mlx5_enable_msix(struct mlx5_core_dev *dev)
+static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
{
struct mlx5_priv *priv = &dev->priv;
struct mlx5_eq_table *table = &priv->eq_table;
int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
int nvec;
- int i;
nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
MLX5_EQ_VEC_COMP_BASE;
@@ -322,17 +321,13 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
if (nvec <= MLX5_EQ_VEC_COMP_BASE)
return -ENOMEM;
- priv->msix_arr = kcalloc(nvec, sizeof(*priv->msix_arr), GFP_KERNEL);
-
priv->irq_info = kcalloc(nvec, sizeof(*priv->irq_info), GFP_KERNEL);
- if (!priv->msix_arr || !priv->irq_info)
+ if (!priv->irq_info)
goto err_free_msix;
- for (i = 0; i < nvec; i++)
- priv->msix_arr[i].entry = i;
-
- nvec = pci_enable_msix_range(dev->pdev, priv->msix_arr,
- MLX5_EQ_VEC_COMP_BASE + 1, nvec);
+ nvec = pci_alloc_irq_vectors(dev->pdev,
+ MLX5_EQ_VEC_COMP_BASE + 1, nvec,
+ PCI_IRQ_MSIX);
if (nvec < 0)
return nvec;
@@ -342,7 +337,6 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
err_free_msix:
kfree(priv->irq_info);
- kfree(priv->msix_arr);
return -ENOMEM;
}
@@ -350,9 +344,8 @@ static void mlx5_disable_msix(struct mlx5_core_dev *dev)
{
struct mlx5_priv *priv = &dev->priv;
- pci_disable_msix(dev->pdev);
+ pci_free_irq_vectors(dev->pdev);
kfree(priv->irq_info);
- kfree(priv->msix_arr);
}
struct mlx5_reg_host_endianess {
@@ -610,8 +603,7 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
{
struct mlx5_priv *priv = &mdev->priv;
- struct msix_entry *msix = priv->msix_arr;
- int irq = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+ int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
int err;
if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
@@ -639,8 +631,7 @@ static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
{
struct mlx5_priv *priv = &mdev->priv;
- struct msix_entry *msix = priv->msix_arr;
- int irq = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+ int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
irq_set_affinity_hint(irq, NULL);
free_cpumask_var(priv->irq_info[i].mask);
@@ -763,8 +754,8 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
}
#ifdef CONFIG_RFS_ACCEL
- irq_cpu_rmap_add(dev->rmap,
- dev->priv.msix_arr[i + MLX5_EQ_VEC_COMP_BASE].vector);
+ irq_cpu_rmap_add(dev->rmap, pci_irq_vector(dev->pdev,
+ MLX5_EQ_VEC_COMP_BASE + i));
#endif
snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", i);
err = mlx5_create_map_eq(dev, eq,
@@ -1101,9 +1092,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
goto err_stop_poll;
}
- err = mlx5_enable_msix(dev);
+ err = mlx5_alloc_irq_vectors(dev);
if (err) {
- dev_err(&pdev->dev, "enable msix failed\n");
+ dev_err(&pdev->dev, "alloc irq vectors failed\n");
goto err_cleanup_once;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index b3dabe6e8836..42bfcf20d875 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -109,7 +109,6 @@ int mlx5_destroy_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
u32 element_id);
int mlx5_wait_for_vf_pages(struct mlx5_core_dev *dev);
u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev);
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx);
struct mlx5_eq *mlx5_eqn2eq(struct mlx5_core_dev *dev, int eqn);
void mlx5_cq_tasklet_cb(unsigned long data);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..a9891df94ce0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -589,7 +589,6 @@ struct mlx5_port_module_event_stats {
struct mlx5_priv {
char name[MLX5_MAX_NAME_LEN];
struct mlx5_eq_table eq_table;
- struct msix_entry *msix_arr;
struct mlx5_irq_info *irq_info;
/* pages stuff */
--
2.7.4
^ permalink raw reply related
* [PATCH v2 0/6] Automatic affinity settings for nvme over rdma
From: Sagi Grimberg @ 2017-04-06 10:36 UTC (permalink / raw)
To: Jens Axboe, linux-nvme, linux-block, linux-rdma
Cc: Christoph Hellwig, Or Gerlitz, Saeed Mahameed, Leon Romanovsky
This patch set is aiming to automatically find the optimal
queue <-> irq multi-queue assignments in storage ULPs (demonstrated
on nvme-rdma) based on the underlying rdma device irq affinity
settings.
Changes from v1:
- Removed mlx5e_get_cpu as Christoph suggested
- Fixed up nvme-rdma queue comp_vector selection to get a better match
- Added a comment on why we limit on @dev->num_comp_vectors
- rebased to Jens's for-4.12/block
- Collected review tags
Sagi Grimberg (6):
mlx5: convert to generic pci_alloc_irq_vectors
mlx5: move affinity hints assignments to generic code
RDMA/core: expose affinity mappings per completion vector
mlx5: support ->get_vector_affinity
block: Add rdma affinity based queue mapping helper
nvme-rdma: use intelligent affinity based queue mappings
block/Kconfig | 5 +
block/Makefile | 1 +
block/blk-mq-rdma.c | 54 +++++++++++
drivers/infiniband/hw/mlx5/main.c | 10 ++
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 14 +--
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 9 +-
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 108 +++------------------
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 -
drivers/nvme/host/rdma.c | 29 ++++--
include/linux/blk-mq-rdma.h | 10 ++
include/linux/mlx5/driver.h | 2 -
include/rdma/ib_verbs.h | 24 +++++
14 files changed, 149 insertions(+), 122 deletions(-)
create mode 100644 block/blk-mq-rdma.c
create mode 100644 include/linux/blk-mq-rdma.h
--
2.7.4
^ permalink raw reply
* [PATCH] blk-mq-pci: Remove unneeded includes
From: Sagi Grimberg @ 2017-04-06 10:33 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Christoph Hellwig
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
block/blk-mq-pci.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 0c3354cf3552..70b8a6765cc7 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -10,12 +10,9 @@
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
-#include <linux/kobject.h>
-#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>
-#include <linux/module.h>
/**
* blk_mq_pci_map_queues - provide a default queue mapping for PCI device
--
2.7.4
^ permalink raw reply related
* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Sagi Grimberg @ 2017-04-06 9:23 UTC (permalink / raw)
To: Max Gurtovoy, linux-rdma, linux-nvme, linux-block
Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <becb84ac-7819-a207-56b1-70f16cb80e42@mellanox.com>
> shouldn't you include <linux/kobject.h> and <linux/blkdev.h> like in
> commit 8ec2ef2b66ea2f that fixes blk-mq-pci.c ?
Not really. We can lose these from blk-mq-pci.c as well.
>> +#include <linux/blk-mq.h>
>> +#include <linux/blk-mq-rdma.h>
>> +#include <rdma/ib_verbs.h>
>> +#include <linux/module.h>
>> +#include "blk-mq.h"
>
> Is this include needed ?
You're right, I can just keep:
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
+#include <rdma/ib_verbs.h>
^ permalink raw reply
* Re: [RFC PATCH] loop: set queue logical block size
From: Hannes Reinecke @ 2017-04-06 8:52 UTC (permalink / raw)
To: Omar Sandoval, linux-block; +Cc: Ming Lei, kernel-team
In-Reply-To: <07d6b9266284b98aa3a4706062bd135275220d38.1491466688.git.osandov@fb.com>
On 04/06/2017 10:19 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The request queue created when we create a loop device has the default
> logical block size of 512. When we associate the device with an fd, we
> set the block size on the block_device but don't update the logical
> block size of the request_queue. This makes it impossibe to use direct
> I/O with a backing file on a device with a block size >512, as the
> following check in __loop_update_dio() fails:
>
> sb_bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
> if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
> ...
>
> Fix it by updating the logical block size when we set the fd.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> drivers/block/loop.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cc981f34e017..1bb22903ad1a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -941,6 +941,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> /* let user-space know about the new size */
> kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
>
> + blk_queue_logical_block_size(lo->lo_queue, lo_blocksize);
> set_blocksize(bdev, lo_blocksize);
>
> lo->lo_state = Lo_bound;
>
Gnaa.
I've tried so change exactly this by my patchset titled 'loop: enable
different logical blocksizes'.
And I even got agreement from Jens to merge it once it got another
review. Which, of course, no-one did.
Care to review that instead so that it finally can go in?
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)
^ permalink raw reply
* Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
From: Johannes Thumshirn @ 2017-04-06 8:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, linux-block,
linux-scsi
In-Reply-To: <20170405171812.19911-2-hch@lst.de>
On Wed, Apr 05, 2017 at 07:18:08PM +0200, Christoph Hellwig wrote:
> ->retries is counting the number of times a command is resubmitted, and
> be cleared on the first time we see the command. We currently don't do
> that for non-PCIe command, which is easily fixed by moving the setup
> to common code.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
Thanks a lot.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: Felix Imend�rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N�rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* Re: [PATCH 1/2] block: Implement global tagset
From: Hannes Reinecke @ 2017-04-06 8:49 UTC (permalink / raw)
To: Arun Easi, Hannes Reinecke
Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
Christoph Hellwig, Bart van Assche, linux-block, linux-scsi
In-Reply-To: <alpine.LRH.2.00.1704052257580.727@mvluser05.qlc.com>
On 04/06/2017 08:27 AM, Arun Easi wrote:
> Hi Hannes,
>
> Thanks for taking a crack at the issue. My comments below..
>
> On Tue, 4 Apr 2017, 5:07am, Hannes Reinecke wrote:
>
>> Most legacy HBAs have a tagset per HBA, not per queue. To map
>> these devices onto block-mq this patch implements a new tagset
>> flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
>> to use just one tagset for all hardware queues.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>> ---
>> block/blk-mq-tag.c | 12 ++++++++----
>> block/blk-mq.c | 10 ++++++++--
>> include/linux/blk-mq.h | 1 +
>> 3 files changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index e48bc2c..a14e76c 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>> void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>> busy_tag_iter_fn *fn, void *priv)
>> {
>> - int i;
>> + int i, lim = tagset->nr_hw_queues;
>>
>> - for (i = 0; i < tagset->nr_hw_queues; i++) {
>> + if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
>> + lim = 1;
>> + for (i = 0; i < lim; i++) {
>> if (tagset->tags && tagset->tags[i])
>> blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>> }
>> @@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>
>> int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
>> {
>> - int i, j, ret = 0;
>> + int i, j, ret = 0, lim = set->nr_hw_queues;
>>
>> if (!set->ops->reinit_request)
>> goto out;
>>
>> - for (i = 0; i < set->nr_hw_queues; i++) {
>> + if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
>> + lim = 1;
>> + for (i = 0; i < lim; i++) {
>> struct blk_mq_tags *tags = set->tags[i];
>>
>> for (j = 0; j < tags->nr_tags; j++) {
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..db96ed0 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>> {
>> int ret = 0;
>>
>> + if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
>> + set->tags[hctx_idx] = set->tags[0];
>> + return true;
>> + }
>
> So, this effectively make all request allocations to the same NUMA node
> locality of the hctx_idx 0, correct? Is the performance hit you were
> talking about in the cover letter?
>
Yes. It does make the request allocations local to NUMA node 0, but then
this will only affect LLDDs which are actually _using_ NUMA locality
when allocating the request nodes.
However, SCSI doesn't set a NUMA node locality to begin with, so this
doesn't affect us.
No, what I meant is this:
the 'sbitmap' allocator already splits up the bitmap into several words,
which then should provide a better NUMA locality per map.
When we're using a shared global map it's unclear whether the individual
words of the sbitmap can and will be moved to the various NUMA nodes, or
whether we suffer from non-locality.
My tests so far have been inconclusive; but then I'm not happy with the
testcase anyway (using null_blk I only get 250k/250k r/w IOPs, which I
found rather disappointing).
> Do you have any other alternatives in mind? Dynamic growing/shrinking
> tags/request-pool in hctx with a fixed base as start?
>
> One alternative that comes to my mind is to move the divvy up logic to
> SCSI (instead of LLD doing it), IOW:
>
> 1. Have SCSI set tag_set.queue_depth to can_queue/nr_hw_queues
> 2. Have blk_mq_unique_tag() (or a new i/f) returning "hwq * nr_hw_queue +
> rq->tag"
>
> That would make the tags linear in the can_queue space, but could result
> in poor use of LLD resource if a given hctx has used up all it's tags.
>
Exactly. This is the method I used for implementing mq support for lpfc
and mpt3sas; however the complaint there indeed was that we might be
running into a tag starvation scenario with a large number of LUNs and
single-threaded I/O submission.
> On a related note, would not the current use of can_queue in SCSI lead to
> poor resource utilization in MQ cases? Like, block layer allocating
> nr_hw_queues * tags+request+driver_data.etc * can_queue, but SCSI limiting
> the number of requests to can_queue.
>
Yes, indeed. That's another problem which we should be looking at.
However, it's only ever relevant if we indeed implement some divvying
logic; if we move to the shared tags approach it should work as designed.
> BTW, if you would like me to try out this patch on my setup, please let me
> know.
>
Oh, yes. Please do.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.com +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)
^ permalink raw reply
* Re: [RFC PATCH] loop: set queue logical block size
From: Ming Lei @ 2017-04-06 8:43 UTC (permalink / raw)
To: Omar Sandoval; +Cc: linux-block, kernel-team
In-Reply-To: <07d6b9266284b98aa3a4706062bd135275220d38.1491466688.git.osandov@fb.com>
On Thu, Apr 06, 2017 at 01:19:45AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The request queue created when we create a loop device has the default
> logical block size of 512. When we associate the device with an fd, we
> set the block size on the block_device but don't update the logical
> block size of the request_queue. This makes it impossibe to use direct
> I/O with a backing file on a device with a block size >512, as the
> following check in __loop_update_dio() fails:
>
> sb_bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
> if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
> ...
>
> Fix it by updating the logical block size when we set the fd.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> drivers/block/loop.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cc981f34e017..1bb22903ad1a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -941,6 +941,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> /* let user-space know about the new size */
> kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
>
> + blk_queue_logical_block_size(lo->lo_queue, lo_blocksize);
> set_blocksize(bdev, lo_blocksize);
>
> lo->lo_state = Lo_bound;
Looks fine!
Reviewed-by: Ming Lei <minlei@redhat.com>
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
From: Christoph Hellwig @ 2017-04-06 8:38 UTC (permalink / raw)
To: 廖亨权
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
Sagi Grimberg, linux-nvme, linux-scsi
In-Reply-To: <5a5f1f1.8f7d.15b4267e21e.Coremail.liaohengquan1986@163.com>
On Thu, Apr 06, 2017 at 04:35:56PM +0800, 廖亨权 wrote:
> Hi, Guys,
> I want to ask if there is any plan to plant the NVMe driver to Vxworks OS?Thank you so much.---end quoted text---
The Linux NVMe team has no plans for a Vxworks NVMe driver at the moment.
^ permalink raw reply
* Re:Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
From: 廖亨权 @ 2017-04-06 8:35 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg,
linux-nvme, linux-scsi
In-Reply-To: <20170405151430.GS20181@localhost.localdomain>
[-- Attachment #1: Type: text/plain, Size: 117 bytes --]
Hi, Guys,
I want to ask if there is any plan to plant the NVMe driver to Vxworks OS?Thank you so much.
[-- Attachment #2: Type: text/html, Size: 316 bytes --]
^ permalink raw reply
* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
From: Sagi Grimberg @ 2017-04-06 8:34 UTC (permalink / raw)
To: Max Gurtovoy, linux-rdma, linux-nvme, linux-block
Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <86ed1762-a990-691f-e043-3d7dcac8fe85@mellanox.com>
> Hi Sagi,
Hey Max,
> the patchset looks good and of course we can add support for more
> drivers in the future.
> have you run some performance testing with the nvmf initiator ?
I'm limited by the target machine in terms of IOPs, but the host shows
~10% cpu usage decrease, and latency improves slightly as well
which is more apparent depending on which cpu I'm running my IO
thread (due to the mismatch in comp_vectors and queue mappings
some queues have irq vectors mapped to a core on a different numa
node.
^ permalink raw reply
* Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
From: Sagi Grimberg @ 2017-04-06 8:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz
In-Reply-To: <20170404063444.GF8978@lst.de>
>> Use the geneic block layer affinity mapping helper. Also,
>
> generic
>
>> nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
>> + nr_io_queues = min_t(unsigned int, nr_io_queues,
>> + ibdev->num_comp_vectors);
>> +
>
> Add a comment here?
Will do
^ permalink raw reply
* Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
From: Sagi Grimberg @ 2017-04-06 8:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
Or Gerlitz
In-Reply-To: <20170404063213.GB8978@lst.de>
>> static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
>> {
>> - return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
>> + return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
>> + MLX5_EQ_VEC_COMP_BASE + ix));
>
> This looks ok for now, but if we look at the callers we'd probably
> want to make direct use of pci_irq_get_node and pci_irq_get_affinity for
> the uses directly in mlx5e_open_channel as well as the stored away
> ->cpu field. But maybe that should be left for another patch after
> this one.
Its small enough to fold it in.
>> + struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
>
> I usually move assignments inside structures onto a separate line to make
> it more readable, e.g.
>
> struct irq_affinity irqdesc = {
> .pre_vectors = MLX5_EQ_VEC_COMP_BASE,
> };
Will do.
^ permalink raw reply
* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Ming Lei @ 2017-04-06 8:23 UTC (permalink / raw)
To: Omar Sandoval; +Cc: Jens Axboe, linux-block, kernel-team
In-Reply-To: <20170406075751.GA15461@vader>
On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> > On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > While dispatching requests, if we fail to get a driver tag, we mark the
> > > hardware queue as waiting for a tag and put the requests on a
> > > hctx->dispatch list to be run later when a driver tag is freed. However,
> > > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > > queues if using a single-queue scheduler with a multiqueue device. If
> >
> > It can't perform well by using a SQ scheduler on a MQ device, so just be
> > curious why someone wants to do that in this way,:-)
>
> I don't know why anyone would want to, but it has to work :) The only
> reason we noticed this is because when the NBD device is created, it
> only has a single queue, so we automatically assign mq-deadline to it.
> Later, we update the number of queues, but it's still using mq-deadline.
>
> > I guess you mean that ops.mq.dispatch_request() may dispatch requests
> > from other hardware queues in blk_mq_sched_dispatch_requests() instead
> > of current hctx.
>
> Yup, that's right. It's weird, and I talked to Jens about just forcing
> the MQ device into an SQ mode when using an SQ scheduler, but this way
> works fine more or less.
Or just switch the elevator to the MQ default one when the device becomes
MQ? Or let mq-deadline's .dispatch_request() just return reqs in current
hctx?
>
> > If that is true, it looks like a issue in usage of I/O scheduler, since
> > the mq-deadline scheduler just queues requests in one per-request_queue
> > linked list, for MQ device, the scheduler queue should have been per-hctx.
>
> That's an option, but that's a different scheduling policy. Again, I
> agree that it's strange, but it's reasonable behavior.
IMO, the current mq-deadline isn't good/ready for MQ device, and it
doesn't make sense to use it for MQ.
>
> > > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > > are processing. This means we end up using the hardware queue of the
> > > previous request, which may or may not be the same as that of the
> > > current request. If it isn't, the wrong hardware queue will end up
> > > waiting for a tag, and the requests will be on the wrong dispatch list,
> > > leading to a hang.
> > >
> > > The fix is twofold:
> > >
> > > 1. Make sure we save which hardware queue we were trying to get a
> > > request for in blk_mq_get_driver_tag() regardless of whether it
> > > succeeds or not.
> >
> > It shouldn't have be needed, once rq->mq_ctx is set, the hctx should been
> > fixed already, that means we can just pass the hctx to blk_mq_get_driver_tag().
>
> We still need to do the blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) step to
> get the hctx. We could do that at every callsite of
> blk_mq_get_driver_tag(), but this way it's just factored into
> blk_mq_get_driver_tag().
We can just use the hctx passed to blk_mq_dispatch_rq_list().
>
> Thanks!
>
> > > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> > > blk_mq_hw_queue to make it clear that it must handle multiple
> > > hardware queues, since I've already messed this up on a couple of
> > > occasions.
The 2nd part makes code less readable too, and it should have been
enough to pass 'hctx' into blk_mq_dispatch_rq_list() if we make
sure that ops.mq.dispatch_request() only returns requests from current hw
queue, since we have passed 'hctx' to elevator already.
Thanks,
Ming
^ permalink raw reply
* [RFC PATCH] loop: set queue logical block size
From: Omar Sandoval @ 2017-04-06 8:19 UTC (permalink / raw)
To: linux-block; +Cc: Ming Lei, kernel-team
From: Omar Sandoval <osandov@fb.com>
The request queue created when we create a loop device has the default
logical block size of 512. When we associate the device with an fd, we
set the block size on the block_device but don't update the logical
block size of the request_queue. This makes it impossibe to use direct
I/O with a backing file on a device with a block size >512, as the
following check in __loop_update_dio() fails:
sb_bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
...
Fix it by updating the logical block size when we set the fd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
drivers/block/loop.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f34e017..1bb22903ad1a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -941,6 +941,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
/* let user-space know about the new size */
kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
+ blk_queue_logical_block_size(lo->lo_queue, lo_blocksize);
set_blocksize(bdev, lo_blocksize);
lo->lo_state = Lo_bound;
--
2.12.2
^ permalink raw reply related
* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Omar Sandoval @ 2017-04-06 7:57 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, kernel-team
In-Reply-To: <20170406043108.GA29955@ming.t460p>
On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> >
> > While dispatching requests, if we fail to get a driver tag, we mark the
> > hardware queue as waiting for a tag and put the requests on a
> > hctx->dispatch list to be run later when a driver tag is freed. However,
> > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > queues if using a single-queue scheduler with a multiqueue device. If
>
> It can't perform well by using a SQ scheduler on a MQ device, so just be
> curious why someone wants to do that in this way,:-)
I don't know why anyone would want to, but it has to work :) The only
reason we noticed this is because when the NBD device is created, it
only has a single queue, so we automatically assign mq-deadline to it.
Later, we update the number of queues, but it's still using mq-deadline.
> I guess you mean that ops.mq.dispatch_request() may dispatch requests
> from other hardware queues in blk_mq_sched_dispatch_requests() instead
> of current hctx.
Yup, that's right. It's weird, and I talked to Jens about just forcing
the MQ device into an SQ mode when using an SQ scheduler, but this way
works fine more or less.
> If that is true, it looks like a issue in usage of I/O scheduler, since
> the mq-deadline scheduler just queues requests in one per-request_queue
> linked list, for MQ device, the scheduler queue should have been per-hctx.
That's an option, but that's a different scheduling policy. Again, I
agree that it's strange, but it's reasonable behavior.
> > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > are processing. This means we end up using the hardware queue of the
> > previous request, which may or may not be the same as that of the
> > current request. If it isn't, the wrong hardware queue will end up
> > waiting for a tag, and the requests will be on the wrong dispatch list,
> > leading to a hang.
> >
> > The fix is twofold:
> >
> > 1. Make sure we save which hardware queue we were trying to get a
> > request for in blk_mq_get_driver_tag() regardless of whether it
> > succeeds or not.
>
> It shouldn't have be needed, once rq->mq_ctx is set, the hctx should been
> fixed already, that means we can just pass the hctx to blk_mq_get_driver_tag().
We still need to do the blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) step to
get the hctx. We could do that at every callsite of
blk_mq_get_driver_tag(), but this way it's just factored into
blk_mq_get_driver_tag().
Thanks!
> > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> > blk_mq_hw_queue to make it clear that it must handle multiple
> > hardware queues, since I've already messed this up on a couple of
> > occasions.
> >
> > This didn't appear in testing with nvme and mq-deadline because nvme has
> > more driver tags than the default number of scheduler tags. However,
> > with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
> >
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
^ permalink raw reply
* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Michal Hocko @ 2017-04-06 7:27 UTC (permalink / raw)
To: Adrian Hunter
Cc: Vlastimil Babka, Richard Weinberger, Andrew Morton, linux-mm,
linux-kernel, Mel Gorman, Johannes Weiner, linux-block,
nbd-general, open-iscsi, linux-scsi, netdev, Boris Brezillon
In-Reply-To: <fe1c21a4-0bc6-529c-5446-382b01d4c99e@intel.com>
On Thu 06-04-17 09:33:44, Adrian Hunter wrote:
> On 05/04/17 14:39, Vlastimil Babka wrote:
> > On 04/05/2017 01:36 PM, Richard Weinberger wrote:
> >> Michal,
> >>
> >> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
> >>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
> >>>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
> >>>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
> >>>> No functional change.
> >>>
> >>> This one smells like an abuser. Why the hell should read/write path
> >>> touch memory reserves at all!
> >>
> >> Could be. Let's ask Adrian, AFAIK he wrote that code.
> >> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
> >
> > I was thinking about it and concluded that since the simulator can be
> > used as a block device where reclaimed pages go to, writing the data out
> > is a memalloc operation. Then reading can be called as part of r-m-w
> > cycle, so reading as well.
>
> IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
> and memory reclaim waiting on nandsim.
I've got lost in the indirection. Could you describe how would reclaim
get stuck waiting on these paths please?
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Michal Hocko @ 2017-04-06 6:53 UTC (permalink / raw)
To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <878tnegtoo.fsf@notabene.neil.brown.name>
On Thu 06-04-17 12:23:51, NeilBrown wrote:
[...]
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..95679d988725 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -847,10 +847,12 @@ static void loop_unprepare_queue(struct loop_device *lo)
> static int loop_prepare_queue(struct loop_device *lo)
> {
> kthread_init_worker(&lo->worker);
> - lo->worker_task = kthread_run(kthread_worker_fn,
> + lo->worker_task = kthread_create(kthread_worker_fn,
> &lo->worker, "loop%d", lo->lo_number);
> if (IS_ERR(lo->worker_task))
> return -ENOMEM;
> + lo->worker_task->flags |= PF_LESS_THROTTLE;
> + wake_up_process(lo->worker_task);
> set_user_nice(lo->worker_task, MIN_NICE);
> return 0;
This should work for the current implementation because kthread_create
will return only after the full initialization has been done. No idea
whether we can rely on that in future. I also think it would be cleaner
to set the flag on current and keep the current semantic that only
current changes its flags.
So while I do not have a strong opinion on this I think defining loop
specific thread function which set PF_LESS_THROTTLE as the first thing
is more elegant and less error prone longerm. A short comment explaining
why we use the flag there would be also preferred.
I will leave the decision to you.
Thanks.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
From: Adrian Hunter @ 2017-04-06 6:33 UTC (permalink / raw)
To: Vlastimil Babka, Richard Weinberger, Michal Hocko
Cc: Andrew Morton, linux-mm, linux-kernel, Mel Gorman,
Johannes Weiner, linux-block, nbd-general, open-iscsi, linux-scsi,
netdev, Boris Brezillon
In-Reply-To: <9b9d5bca-e125-e07b-b700-196cc800bbd7@suse.cz>
On 05/04/17 14:39, Vlastimil Babka wrote:
> On 04/05/2017 01:36 PM, Richard Weinberger wrote:
>> Michal,
>>
>> Am 05.04.2017 um 13:31 schrieb Michal Hocko:
>>> On Wed 05-04-17 09:47:00, Vlastimil Babka wrote:
>>>> Nandsim has own functions set_memalloc() and clear_memalloc() for robust
>>>> setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
>>>> No functional change.
>>>
>>> This one smells like an abuser. Why the hell should read/write path
>>> touch memory reserves at all!
>>
>> Could be. Let's ask Adrian, AFAIK he wrote that code.
>> Adrian, can you please clarify why nandsim needs to play with PF_MEMALLOC?
>
> I was thinking about it and concluded that since the simulator can be
> used as a block device where reclaimed pages go to, writing the data out
> is a memalloc operation. Then reading can be called as part of r-m-w
> cycle, so reading as well.
IIRC it was to avoid getting stuck with nandsim waiting on memory reclaim
and memory reclaim waiting on nandsim.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox