Linux block layer


* [PATCH 1/7] bio-integrity: Do not allocate integrity context for bio w/o data
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491204212-9952-1-git-send-email-dmonakhov@openvz.org>

If a bio has no data, such as those from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevents a BUG_ON like the following:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode: 0000 [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: ffff880137786440 task.stack: ffffc90000ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 block/bio-integrity.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
 	if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
 		return false;
 
+	if (!bio_sectors(bio))
+		return false;
+
 	/* Already protected? */
 	if (bio_integrity(bio))
 		return false;
-- 
2.9.3
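
For readers following the order of the checks, here is a minimal userspace
sketch of the guard this patch adds. The toy_* names and the struct are
hypothetical stand-ins, not kernel definitions, and only the first three
checks of bio_integrity_enabled() are modelled. The flush bio is represented
as a zero-sector write, which is an assumption about how
blkdev_issue_flush() tags its bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's request opcodes. */
enum toy_op { TOY_OP_READ, TOY_OP_WRITE, TOY_OP_DISCARD };

/* Toy bio: only the fields the guard looks at. */
struct toy_bio {
	enum toy_op op;
	unsigned int sectors;	/* payload size in sectors */
	bool has_integrity;	/* integrity payload already attached */
};

/* Same order of checks as the patched function, on the toy type. */
static bool toy_integrity_enabled(const struct toy_bio *bio)
{
	assert(bio != NULL);

	if (bio->op != TOY_OP_READ && bio->op != TOY_OP_WRITE)
		return false;

	if (bio->sectors == 0)		/* no data, nothing to protect */
		return false;

	if (bio->has_integrity)		/* already protected */
		return false;

	return true;
}
```

The zero-sector check sits after the opcode check because a data-less
write passes the opcode test; without the second check it would go on to
get an integrity context allocated for a payload that does not exist.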


* [PATCH 0/7] block: T10/DIF Fixes and cleanups v2
From: Dmitry Monakhov @ 2017-04-03  7:23 UTC
  To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov

This patch set fixes various problems spotted during testing of the T10/DIF integrity machinery.

TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for-fsync
0002-bio-integrity-save-original-iterator-for-verify-stage
0003-bio-integrity-bio_trim-should-truncate-integrity-vec
0004-bio-integrity-fix-interface-for-bio_integrity_trim
## Cleanup T10/DIF/DIX infrastructure
0005-bio-integrity-add-bio_integrity_setup-helper
0006-T10-Move-opencoded-contants-to-common-header
## General bulletproof protection for block layer
0007-Guard-bvec-iteration-logic-v2

Changes since V1
 - fix issues spotted by kbuild bot
 - Replace BUG_ON with error logic for the 7th patch

Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85


* Re: [PATCH 3/3] scsi: Ensure that scsi_run_queue() runs all hardware queues
From: Hannes Reinecke @ 2017-04-03  6:12 UTC
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Sagi Grimberg
In-Reply-To: <20170331231205.16640-4-bart.vanassche@sandisk.com>

On 04/01/2017 01:12 AM, Bart Van Assche wrote:
> commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
> queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
> for the BLK_MQ_RQ_QUEUE_BUSY case. blk_mq_start_stopped_hw_queues()
> only runs queues that had been stopped. Hence change the
> blk_mq_start_stopped_hw_queues() call in scsi_run_queue() into
> blk_mq_run_hw_queues(). Remove the blk_mq_start_stopped_hw_queues()
> call from scsi_end_request() because __blk_mq_finish_request()
> already runs all hardware queues if needed.
> 
> Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/scsi/scsi_lib.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
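
The distinction the quoted description relies on can be sketched in
userspace as follows. This is a toy model under stated assumptions: the
toy_* types and functions are hypothetical, and they only imitate the
selection rule of blk_mq_start_stopped_hw_queues() (touch only queues
marked stopped) versus blk_mq_run_hw_queues() (run every non-stopped queue
that has pending work).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-hardware-queue state. */
struct toy_hctx {
	bool stopped;		/* queue was explicitly stopped */
	bool has_pending;	/* requests are waiting to be dispatched */
	unsigned int runs;	/* how often the queue was run */
};

static void toy_run_one(struct toy_hctx *h)
{
	h->runs++;
	h->has_pending = false;
}

/* Only restarts queues that are marked stopped. */
static void toy_start_stopped_hw_queues(struct toy_hctx *h, size_t n)
{
	size_t i;

	assert(h != NULL);
	for (i = 0; i < n; i++) {
		if (!h[i].stopped)
			continue;
		h[i].stopped = false;
		toy_run_one(&h[i]);
	}
}

/* Runs every queue that is not stopped and has pending work. */
static void toy_run_hw_queues(struct toy_hctx *h, size_t n)
{
	size_t i;

	assert(h != NULL);
	for (i = 0; i < n; i++) {
		if (h[i].stopped || !h[i].has_pending)
			continue;
		toy_run_one(&h[i]);
	}
}
```

In this model a queue that was never stopped but still has pending work is
skipped by toy_start_stopped_hw_queues(), which is the stall the patch
description is concerned with once the stop call is gone from the busy
path.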


* Re: [PATCH 2/3] scsi: Add scsi_restart_queues()
From: Hannes Reinecke @ 2017-04-03  6:10 UTC
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170331231205.16640-3-bart.vanassche@sandisk.com>

On 04/01/2017 01:12 AM, Bart Van Assche wrote:
> This patch avoids that if multiple SCSI devices are associated with
> a SCSI host that a queue can get stuck if scsi_queue_rq() returns
> "busy".
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/scsi/scsi_lib.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 1/3] blk-mq: Introduce blk_mq_ops.restart_queues
From: Hannes Reinecke @ 2017-04-03  6:10 UTC
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170331231205.16640-2-bart.vanassche@sandisk.com>

On 04/01/2017 01:12 AM, Bart Van Assche wrote:
> If a tag set is shared among multiple request queues, leave
> it to the block driver to restart queues. Hence remove
> QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_queues.
> Remove blk_mq_sched_mark_restart_queue() because this
> function has no callers.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  block/blk-mq-sched.c   | 11 +++--------
>  block/blk-mq-sched.h   | 14 --------------
>  include/linux/blk-mq.h |  4 ++++
>  include/linux/blkdev.h |  1 -
>  4 files changed, 7 insertions(+), 23 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Hannes Reinecke @ 2017-04-03  6:06 UTC
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170331163313.31821-1-hch@lst.de>

On 03/31/2017 06:32 PM, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.
> 
> A git tree is also available at:
> 
>     git://git.infradead.org/users/hch/block.git discard-rework
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework
Thank you for doing this.

For this series:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-04-03  5:11 UTC
  To: Jens Axboe
  Cc: Johannes Thumshirn, Hannes Reinecke, Nitin Gupta,
	Christoph Hellwig, Sergey Senozhatsky, yizhan,
	Linux Block Layer Mailinglist, Linux Kernel Mailinglist,
	Andrew Morton
In-Reply-To: <17796bb7-6657-46d9-9731-d4c0656e6200@fb.com>

Hi Jens,

On Thu, Mar 30, 2017 at 07:38:26PM -0600, Jens Axboe wrote:
> On 03/30/2017 05:45 PM, Minchan Kim wrote:
> > On Thu, Mar 30, 2017 at 09:35:56AM -0600, Jens Axboe wrote:
> >> On 03/30/2017 09:08 AM, Minchan Kim wrote:
> >>> Hi Jens,
> >>>
> >>> It seems you missed this.
> >>> Could you handle this?
> >>
> >> I can, but I'm a little confused. The comment talks about replacing
> >> the one I merged with this one, I can't do that. I'm assuming you
> >> are talking about this commit:
> > 
> > Right.
> > 
> >>
> >> commit 0bc315381fe9ed9fb91db8b0e82171b645ac008f
> >> Author: Johannes Thumshirn <jthumshirn@suse.de>
> >> Date:   Mon Mar 6 11:23:35 2017 +0100
> >>
> >>     zram: set physical queue limits to avoid array out of bounds accesses
> >>
> >> which is in mainline. The patch still applies, though.
> > 
> > You mean it's already in mainline so you cannot replace but can revert.
> > Right?
> > If so, please revert it and merge this one.
> 
> Let's please fold it into the other patch. That's cleaner and it makes
> logical sense.

Understood.

> 
> >> Do we really REALLY need this for 4.11, or can we queue for 4.12 and
> >> mark it stable?
> > 
> > Not urgent because the one in mainline fixes the problem, so I'm okay
> > with 4.12, but I don't want to mark it as -stable.
> 
> OK good, please resend with the two-line revert in your current
> patch, and I'll get it queued up for 4.12.

Yep. If so, now that I think about it, it would be better to handle
it via Andrew's tree, because Andrew has been handling zram patches
and I have several pending patches based on it.
So I will send a new patchset with it to Andrew.

Thanks!


* [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-03  1:18 UTC
  To: Jens Axboe; +Cc: linux-block, linux-mm, LKML


When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file.  This double-throttling can trigger
positive feedback loops that create significant delays.  The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.

The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server.  It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.

To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
time taken.

When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable: 52-460 seconds, half below 100 seconds and
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.

There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/loop.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..a7e1dd215fc2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
 {
 	struct loop_cmd *cmd =
 		container_of(work, struct loop_cmd, work);
+	int oldflags = current->flags & PF_LESS_THROTTLE;

+	current->flags |= PF_LESS_THROTTLE;
 	loop_handle_cmd(cmd);
+	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
 }

 static int loop_init_request(void *data, struct request *rq,
-- 
2.12.0
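
The save/set/restore idiom used in the hunk above can be tried in
userspace with the sketch below. Everything named toy_* is hypothetical,
including the flag's bit value; the point is only to show that the flag is
set while the work runs and that the task's previous flag state, set or
clear, is what gets restored afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PF_LESS_THROTTLE 0x00100000u	/* hypothetical bit value */

/* Toy task: just a flags word. */
struct toy_task {
	unsigned int flags;
};

typedef void (*toy_work_fn)(struct toy_task *);

/* Records whether the flag was visible while the work ran. */
static int toy_seen_flag;

static void toy_work(struct toy_task *t)
{
	toy_seen_flag = (t->flags & TOY_PF_LESS_THROTTLE) != 0;
}

/* Run fn with the flag set, then put the flag back the way it was. */
static void toy_run_less_throttled(struct toy_task *cur, toy_work_fn fn)
{
	unsigned int oldflags;

	assert(cur != NULL && fn != NULL);

	oldflags = cur->flags & TOY_PF_LESS_THROTTLE;	/* remember old state */
	cur->flags |= TOY_PF_LESS_THROTTLE;
	fn(cur);
	/* clear the bit, then OR the remembered state back in */
	cur->flags = (cur->flags & ~TOY_PF_LESS_THROTTLE) | oldflags;
}
```

Restoring the saved bit rather than unconditionally clearing it keeps the
helper safe to call from a context that already had the flag set.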



* Re: [PATCH 0/9] convert genericirq.tmpl and kernel-api.tmpl to DocBook
From: Jonathan Corbet @ 2017-04-02 20:34 UTC
  To: Mauro Carvalho Chehab
  Cc: Linux Media Mailing List, Linux Doc Mailing List,
	Mauro Carvalho Chehab, Noam Camus, James Morris, zijun_hu,
	Markus Heiser, linux-clk, Jani Nikula, Andrew Morton, Jens Axboe,
	Nicholas Piggin, Russell King, linux-block, Kirill A. Shutemov,
	Mauro Carvalho Chehab, Joonsoo Kim, Ingo Molnar, Bjorn Helgaas,
	Serge E. Hallyn, Michal Hocko, Ross Zwisler, Chris Wilson,
	linux-mm, linux-security-module, Silvio Fricke, Takashi Iwai,
	Sebastian Andrzej Siewior, Jan Kara, Vlastimil Babka, linux-pci,
	Matt Fleming, Johannes Weiner, Andrey Ryabinin, Andy Lutomirski,
	Mel Gorman, Andy Shevchenko, Alexey Dobriyan, Hillf Danton
In-Reply-To: <cover.1490904090.git.mchehab@s-opensource.com>

On Thu, 30 Mar 2017 17:11:27 -0300
Mauro Carvalho Chehab <mchehab@s-opensource.com> wrote:

> This series converts just two documents, adding them to the
> core-api.rst book. It addresses the errors/warnings that pop up
> after the conversion.
> 
> I had to add two fixes to scripts/kernel-doc, in order to solve
> some of the issues.

I've applied the set, including the add-on to move some stuff to
driver-api - thanks.

For whatever reason, I had a hard time applying a few of these; "git am"
would tell me this:

> Applying: docs-rst: core_api: move driver-specific stuff to drivers_api
> fatal: sha1 information is lacking or useless (Documentation/driver-api/index.rst).
> Patch failed at 0001 docs-rst: core_api: move driver-specific stuff to drivers_api
> The copy of the patch that failed is found in: .git/rebase-apply/patch

I was able to get around this, but it took some hand work.  How are you
generating these?

Thanks,

jon


* [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

Like pci and virtio, we add an rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/Kconfig               |  5 ++++
 block/Makefile              |  1 +
 block/blk-mq-rdma.c         | 56 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq-rdma.h | 10 ++++++++
 4 files changed, 72 insertions(+)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

diff --git a/block/Kconfig b/block/Kconfig
index 89cd28f8d051..3ab42bbb06d5 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
 	depends on BLOCK && VIRTIO
 	default y
 
+config BLK_MQ_RDMA
+	bool
+	depends on BLOCK && INFINIBAND
+	default y
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..4498603dbc83 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
+obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
new file mode 100644
index 000000000000..d402f7c93528
--- /dev/null
+++ b/block/blk-mq-rdma.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (c) 2017 Sagi Grimberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
+#include <rdma/ib_verbs.h>
+#include <linux/module.h>
+#include "blk-mq.h"
+
+/**
+ * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
+ * @set:	tagset to provide the mapping for
+ * @dev:	rdma device associated with @set.
+ * @first_vec:	first interrupt vector to use for queues (usually 0)
+ *
+ * This function assumes the rdma device @dev has at least as many available
+ * interrupt vectors as @set has queues.  It will then query its affinity mask
+ * and build a queue mapping that maps a queue to the CPUs that have irq affinity
+ * for the corresponding vector.
+ *
+ * In case either the driver passed a @dev with fewer vectors than
+ * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
+ * vector, we fall back to the naive mapping.
+ */
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec)
+{
+	const struct cpumask *mask;
+	unsigned int queue, cpu;
+
+	if (set->nr_hw_queues > dev->num_comp_vectors)
+		goto fallback;
+
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		mask = ib_get_vector_affinity(dev, first_vec + queue);
+		if (!mask)
+			goto fallback;
+
+		for_each_cpu(cpu, mask)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+fallback:
+	return blk_mq_map_queues(set);
+}
+EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
new file mode 100644
index 000000000000..b4ade198007d
--- /dev/null
+++ b/include/linux/blk-mq-rdma.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_BLK_MQ_RDMA_H
+#define _LINUX_BLK_MQ_RDMA_H
+
+struct blk_mq_tag_set;
+struct ib_device;
+
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec);
+
+#endif /* _LINUX_BLK_MQ_RDMA_H */
-- 
2.7.4
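
The mapping logic of this helper can be exercised outside the kernel with
the toy model below. It is a sketch under assumptions: CPU masks are plain
unsigned int bitmasks, a zero mask stands in for a NULL cpumask, the CPU
count is fixed at 8, and the fallback is a simple modulo mapping rather
than what blk_mq_map_queues() really computes. All toy_* names are
hypothetical.

```c
#include <assert.h>

#define TOY_NR_CPUS 8u

/* Hypothetical device: one CPU bitmask per completion vector. */
struct toy_rdma_dev {
	unsigned int num_comp_vectors;
	const unsigned int *vec_mask;	/* bit c of vec_mask[v]: vector v is affine to CPU c */
};

struct toy_tag_set {
	unsigned int nr_hw_queues;
	unsigned int mq_map[TOY_NR_CPUS];	/* CPU -> hardware queue */
};

/* Naive fallback: spread CPUs over queues by modulo. */
static int toy_map_queues_naive(struct toy_tag_set *set)
{
	unsigned int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		set->mq_map[cpu] = cpu % set->nr_hw_queues;
	return 0;
}

static int toy_rdma_map_queues(struct toy_tag_set *set,
		const struct toy_rdma_dev *dev, unsigned int first_vec)
{
	unsigned int queue, cpu;

	assert(set->nr_hw_queues > 0);

	/* Not enough vectors for one per queue: use the naive mapping. */
	if (set->nr_hw_queues > dev->num_comp_vectors)
		return toy_map_queues_naive(set);

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		unsigned int vec = first_vec + queue;
		unsigned int mask;

		/* Out-of-range vector or missing mask: fall back as well. */
		if (vec >= dev->num_comp_vectors)
			return toy_map_queues_naive(set);
		mask = dev->vec_mask[vec];
		if (!mask)
			return toy_map_queues_naive(set);

		/* Every CPU the vector is affine to is served by this queue. */
		for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			if (mask & (1u << cpu))
				set->mq_map[cpu] = queue;
	}
	return 0;
}
```

With two vectors whose masks cover CPUs 0-3 and 4-7, a two-queue tag set
ends up with CPUs 0-3 on queue 0 and CPUs 4-7 on queue 1; asking for three
queues on the same device takes the fallback path.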


* [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

This will allow ULPs to intelligently locate threads based
on completion vector cpu affinity mappings. In case the
driver does not expose a get_vector_affinity callout, return
NULL so the caller can fall back to its own logic.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/rdma/ib_verbs.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0f1813c13687..d44b62791c64 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2150,6 +2150,8 @@ struct ib_device {
 	 */
 	int (*get_port_immutable)(struct ib_device *, u8, struct ib_port_immutable *);
 	void (*get_dev_fw_str)(struct ib_device *, char *str, size_t str_len);
+	const struct cpumask *(*get_vector_affinity)(struct ib_device *ibdev,
+						     int comp_vector);
 };
 
 struct ib_client {
@@ -3377,4 +3379,26 @@ void ib_drain_qp(struct ib_qp *qp);
 
 int ib_resolve_eth_dmac(struct ib_device *device,
 			struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_get_vector_affinity - Get the affinity mappings of a given completion
+ *   vector
+ * @device:         the rdma device
+ * @comp_vector:    index of completion vector
+ *
+ * Returns NULL on failure, which includes the case where the device
+ * driver doesn't implement get_vector_affinity; otherwise returns the
+ * corresponding cpu map of the completion vector.
+ */
+static inline const struct cpumask *
+ib_get_vector_affinity(struct ib_device *device, int comp_vector)
+{
+	if (comp_vector < 0 || comp_vector >= device->num_comp_vectors ||
+	    !device->get_vector_affinity)
+		return NULL;
+
+	return device->get_vector_affinity(device, comp_vector);
+
+}
+
 #endif /* IB_VERBS_H */
-- 
2.7.4
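
The optional-callback pattern in this patch can be sketched in userspace
as below. The toy_* names are hypothetical and masks are plain unsigned
long values rather than struct cpumask. The range check in the sketch
treats an index equal to num_comp_vectors, and any negative index, as out
of range; that is an assumption about the intended contract for a
zero-based vector index.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev;

/* Optional driver hook: returns the CPU mask for one completion vector. */
typedef const unsigned long *(*toy_affinity_fn)(const struct toy_dev *dev,
						int comp_vector);

struct toy_dev {
	int num_comp_vectors;
	toy_affinity_fn get_vector_affinity;	/* may be NULL */
	const unsigned long *masks;		/* per-vector table owned by the driver */
};

/* Core-side wrapper: NULL means "no affinity information, use a fallback". */
static const unsigned long *
toy_get_vector_affinity(const struct toy_dev *dev, int comp_vector)
{
	assert(dev != NULL);

	if (comp_vector < 0 || comp_vector >= dev->num_comp_vectors ||
	    !dev->get_vector_affinity)
		return NULL;

	return dev->get_vector_affinity(dev, comp_vector);
}

/* Example driver hook: hands back a pointer into its per-vector table. */
static const unsigned long *
toy_driver_affinity(const struct toy_dev *dev, int comp_vector)
{
	return &dev->masks[comp_vector];
}
```

A caller only needs one NULL test to cover both "driver has no hook" and
"vector index is invalid", which is what lets a queue-mapping helper use a
single fallback path.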


* [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

Use the generic block layer affinity mapping helper. Also,
limit nr_hw_queues to the number of irq vectors of the rdma
device, as we don't really need more.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4aae363943e3..81ee5b1207c8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -19,6 +19,7 @@
 #include <linux/string.h>
 #include <linux/atomic.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -645,10 +646,14 @@ static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
 static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
 {
 	struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
+	struct ib_device *ibdev = ctrl->device->dev;
 	unsigned int nr_io_queues;
 	int i, ret;
 
 	nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+	nr_io_queues = min_t(unsigned int, nr_io_queues,
+				ibdev->num_comp_vectors);
+
 	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret)
 		return ret;
@@ -1523,6 +1528,13 @@ static void nvme_rdma_complete_rq(struct request *rq)
 	nvme_complete_rq(rq);
 }
 
+static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
+{
+	struct nvme_rdma_ctrl *ctrl = set->driver_data;
+
+	return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
+}
+
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
@@ -1532,6 +1544,7 @@ static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.init_hctx	= nvme_rdma_init_hctx,
 	.poll		= nvme_rdma_poll,
 	.timeout	= nvme_rdma_timeout,
+	.map_queues	= nvme_rdma_map_queues,
 };
 
 static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
-- 
2.7.4
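
The queue-count limit in nvme_rdma_init_io_queues() boils down to a
three-way minimum. A small userspace sketch, with hypothetical toy_* names
and plain integers in place of the controller options, the online CPU
count and ibdev->num_comp_vectors:

```c
#include <assert.h>

static unsigned int toy_min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Number of I/O queues to ask the controller for: never more than the
 * user requested, the online CPUs, or the device's completion vectors.
 */
static unsigned int toy_nr_io_queues(unsigned int requested,
		unsigned int online_cpus, unsigned int comp_vectors)
{
	assert(online_cpus > 0);	/* at least one CPU is always online */

	return toy_min_u(toy_min_u(requested, online_cpus), comp_vectors);
}
```

For example, 16 requested queues on an 8-CPU host with a 4-vector device
yields 4, so each queue can be tied to its own completion vector.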


* [PATCH rfc 4/6] mlx5: support ->get_vector_affinity
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

Simply refer to the generic affinity mask helper.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/infiniband/hw/mlx5/main.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 4dc0a8785fe0..b12bc2294895 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3319,6 +3319,15 @@ static int mlx5_ib_get_hw_stats(struct ib_device *ibdev,
 	return port->q_cnts.num_counters;
 }
 
+const struct cpumask *mlx5_ib_get_vector_affinity(struct ib_device *ibdev,
+		int comp_vector)
+{
+	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+
+	return pci_irq_get_affinity(dev->mdev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + comp_vector);
+}
+
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
 	struct mlx5_ib_dev *dev;
@@ -3449,6 +3458,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	dev->ib_dev.check_mr_status	= mlx5_ib_check_mr_status;
 	dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
 	dev->ib_dev.get_dev_fw_str      = get_dev_fw_str;
+	dev->ib_dev.get_vector_affinity	= mlx5_ib_get_vector_affinity;
 	if (mlx5_core_is_pf(mdev)) {
 		dev->ib_dev.get_vf_config	= mlx5_ib_get_vf_config;
 		dev->ib_dev.set_vf_link_state	= mlx5_ib_set_vf_link_state;
-- 
2.7.4


* [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

The generic API takes care of spreading affinity similarly to
what mlx5 open coded (and even handles asymmetric
configurations better). Ask the generic API to spread affinity
for us, and feed it pre_vectors that do not participate
in affinity settings (which is an improvement over what we
had before).

The affinity assignments should match what mlx5 tried to
do earlier, but now we do not set affinity for the async, cmd
and pages dedicated vectors.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c    | 81 ++---------------------
 include/linux/mlx5/driver.h                       |  1 -
 3 files changed, 6 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index eec0d172761e..2bab0e1ceb94 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
 
 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
+	return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + ix));
 }
 
 static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 7c8672cbb369..8624a7451064 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -312,6 +312,7 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	struct mlx5_eq_table *table = &priv->eq_table;
+	struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
 	int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
 	int nvec;
 
@@ -325,9 +326,10 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 	if (!priv->irq_info)
 		goto err_free_msix;
 
-	nvec = pci_alloc_irq_vectors(dev->pdev,
+	nvec = pci_alloc_irq_vectors_affinity(dev->pdev,
 			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
-			PCI_IRQ_MSIX);
+			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+			&irqdesc);
 	if (nvec < 0)
 		return nvec;
 
@@ -600,71 +602,6 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
 	return (u64)timer_l | (u64)timer_h1 << 32;
 }
 
-static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-	struct mlx5_priv *priv  = &mdev->priv;
-	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-	int err;
-
-	if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
-		mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
-		return -ENOMEM;
-	}
-
-	cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
-			priv->irq_info[i].mask);
-
-	err = irq_set_affinity_hint(irq, priv->irq_info[i].mask);
-	if (err) {
-		mlx5_core_warn(mdev, "irq_set_affinity_hint failed,irq 0x%.4x",
-			       irq);
-		goto err_clear_mask;
-	}
-
-	return 0;
-
-err_clear_mask:
-	free_cpumask_var(priv->irq_info[i].mask);
-	return err;
-}
-
-static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-	struct mlx5_priv *priv  = &mdev->priv;
-	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-
-	irq_set_affinity_hint(irq, NULL);
-	free_cpumask_var(priv->irq_info[i].mask);
-}
-
-static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
-{
-	int err;
-	int i;
-
-	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
-		err = mlx5_irq_set_affinity_hint(mdev, i);
-		if (err)
-			goto err_out;
-	}
-
-	return 0;
-
-err_out:
-	for (i--; i >= 0; i--)
-		mlx5_irq_clear_affinity_hint(mdev, i);
-
-	return err;
-}
-
-static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
-{
-	int i;
-
-	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
-		mlx5_irq_clear_affinity_hint(mdev, i);
-}
-
 int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
 		    unsigned int *irqn)
 {
@@ -1116,12 +1053,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_stop_eqs;
 	}
 
-	err = mlx5_irq_set_affinity_hints(dev);
-	if (err) {
-		dev_err(&pdev->dev, "Failed to alloc affinity hint cpumask\n");
-		goto err_affinity_hints;
-	}
-
 	err = mlx5_init_fs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1165,9 +1096,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_cleanup_fs(dev);
 
 err_fs:
-	mlx5_irq_clear_affinity_hints(dev);
-
-err_affinity_hints:
 	free_comp_eqs(dev);
 
 err_stop_eqs:
@@ -1234,7 +1162,6 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_eswitch_detach(dev->priv.eswitch);
 #endif
 	mlx5_cleanup_fs(dev);
-	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
 	mlx5_stop_eqs(dev);
 	mlx5_put_uars_page(dev, priv->uar);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index a9891df94ce0..4c7cf4dfb024 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -527,7 +527,6 @@ struct mlx5_core_sriov {
 };
 
 struct mlx5_irq_info {
-	cpumask_var_t mask;
 	char name[MLX5_MAX_IRQ_NAME];
 };
 
-- 
2.7.4
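
To illustrate what pre_vectors means for the resulting layout, here is a
deliberately simplified userspace sketch. It is not the kernel's spreading
algorithm, which is NUMA-aware and produces CPU masks; this toy version
assigns one CPU per vector round-robin. The toy_* names are hypothetical.
What it does preserve is the property the commit message describes: the
first pre_vectors vectors get no affinity, and only the remaining
completion vectors are spread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_CPU (-1)	/* vector left without an affinity assignment */

/*
 * Fill vec_cpu[0..nvecs-1]: the first pre_vectors entries are skipped,
 * the rest are spread over ncpus CPUs round-robin.
 */
static void toy_spread_affinity(int *vec_cpu, unsigned int nvecs,
		unsigned int pre_vectors, unsigned int ncpus)
{
	unsigned int v;

	assert(vec_cpu != NULL);
	assert(ncpus > 0);

	for (v = 0; v < nvecs; v++) {
		if (v < pre_vectors)
			vec_cpu[v] = TOY_NO_CPU;
		else
			vec_cpu[v] = (int)((v - pre_vectors) % ncpus);
	}
}
```

With three pre_vectors standing in for the dedicated non-completion
vectors, the first completion vector lands on CPU 0, which matches the
indexing used by the MLX5_EQ_VEC_COMP_BASE + ix lookups in the diff.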


* [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors
From: Sagi Grimberg @ 2017-04-02 13:41 UTC
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

Now that we have generic code to allocate an array
of irq vectors that also correctly spreads their affinity,
correctly handles cpu hotplug events and more, we're much
better off using it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  9 ++----
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     | 33 ++++++++--------------
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  1 -
 include/linux/mlx5/driver.h                        |  1 -
 7 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8ef64c4db2c2..eec0d172761e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -389,7 +389,7 @@ static void mlx5e_enable_async_events(struct mlx5e_priv *priv)
 static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
 {
 	clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLED, &priv->state);
-	synchronize_irq(mlx5_get_msix_vec(priv->mdev, MLX5_EQ_VEC_ASYNC));
+	synchronize_irq(pci_irq_vector(priv->mdev->pdev, MLX5_EQ_VEC_ASYNC));
 }
 
 static inline int mlx5e_get_wqe_mtt_sz(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ea5d8d37a75c..e2c33c493b89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -575,7 +575,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 		 name, pci_name(dev->pdev));
 
 	eq->eqn = MLX5_GET(create_eq_out, out, eq_number);
-	eq->irqn = priv->msix_arr[vecidx].vector;
+	eq->irqn = pci_irq_vector(dev->pdev, vecidx);
 	eq->dev = dev;
 	eq->doorbell = priv->uar->map + MLX5_EQ_DOORBEL_OFFSET;
 	err = request_irq(eq->irqn, handler, 0,
@@ -610,7 +610,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 	return 0;
 
 err_irq:
-	free_irq(priv->msix_arr[vecidx].vector, eq);
+	free_irq(eq->irqn, eq);
 
 err_eq:
 	mlx5_cmd_destroy_eq(dev, eq->eqn);
@@ -651,11 +651,6 @@ int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
 }
 EXPORT_SYMBOL_GPL(mlx5_destroy_unmap_eq);
 
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx)
-{
-	return dev->priv.msix_arr[MLX5_EQ_VEC_ASYNC].vector;
-}
-
 int mlx5_eq_init(struct mlx5_core_dev *dev)
 {
 	int err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index fcd5bc7e31db..6bf5d70b4117 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1596,7 +1596,7 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, int vport_num)
 	/* Mark this vport as disabled to discard new events */
 	vport->enabled = false;
 
-	synchronize_irq(mlx5_get_msix_vec(esw->dev, MLX5_EQ_VEC_ASYNC));
+	synchronize_irq(pci_irq_vector(esw->dev->pdev, MLX5_EQ_VEC_ASYNC));
 	/* Wait for current already scheduled events to complete */
 	flush_workqueue(esw->work_queue);
 	/* Disable events from this vport */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d0515391d33b..8b38d5cfd4c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -80,7 +80,7 @@ static void trigger_cmd_completions(struct mlx5_core_dev *dev)
 	u64 vector;
 
 	/* wait for pending handlers to complete */
-	synchronize_irq(dev->priv.msix_arr[MLX5_EQ_VEC_CMD].vector);
+	synchronize_irq(pci_irq_vector(dev->pdev, MLX5_EQ_VEC_CMD));
 	spin_lock_irqsave(&dev->cmd.alloc_lock, flags);
 	vector = ~dev->cmd.bitmask & ((1ul << (1 << dev->cmd.log_sz)) - 1);
 	if (!vector)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index e2bd600d19de..7c8672cbb369 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -308,13 +308,12 @@ static void release_bar(struct pci_dev *pdev)
 	pci_release_regions(pdev);
 }
 
-static int mlx5_enable_msix(struct mlx5_core_dev *dev)
+static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	struct mlx5_eq_table *table = &priv->eq_table;
 	int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
 	int nvec;
-	int i;
 
 	nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
 	       MLX5_EQ_VEC_COMP_BASE;
@@ -322,17 +321,13 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
 	if (nvec <= MLX5_EQ_VEC_COMP_BASE)
 		return -ENOMEM;
 
-	priv->msix_arr = kcalloc(nvec, sizeof(*priv->msix_arr), GFP_KERNEL);
-
 	priv->irq_info = kcalloc(nvec, sizeof(*priv->irq_info), GFP_KERNEL);
-	if (!priv->msix_arr || !priv->irq_info)
+	if (!priv->irq_info)
 		goto err_free_msix;
 
-	for (i = 0; i < nvec; i++)
-		priv->msix_arr[i].entry = i;
-
-	nvec = pci_enable_msix_range(dev->pdev, priv->msix_arr,
-				     MLX5_EQ_VEC_COMP_BASE + 1, nvec);
+	nvec = pci_alloc_irq_vectors(dev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
+			PCI_IRQ_MSIX);
 	if (nvec < 0)
 		return nvec;
 
@@ -342,7 +337,6 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
 
 err_free_msix:
 	kfree(priv->irq_info);
-	kfree(priv->msix_arr);
 	return -ENOMEM;
 }
 
@@ -350,9 +344,8 @@ static void mlx5_disable_msix(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 
-	pci_disable_msix(dev->pdev);
+	pci_free_irq_vectors(dev->pdev);
 	kfree(priv->irq_info);
-	kfree(priv->msix_arr);
 }
 
 struct mlx5_reg_host_endianess {
@@ -610,8 +603,7 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
 static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
 	struct mlx5_priv *priv  = &mdev->priv;
-	struct msix_entry *msix = priv->msix_arr;
-	int irq                 = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
 	int err;
 
 	if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
@@ -639,8 +631,7 @@ static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
 static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
 	struct mlx5_priv *priv  = &mdev->priv;
-	struct msix_entry *msix = priv->msix_arr;
-	int irq                 = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
 
 	irq_set_affinity_hint(irq, NULL);
 	free_cpumask_var(priv->irq_info[i].mask);
@@ -763,8 +754,8 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
 		}
 
 #ifdef CONFIG_RFS_ACCEL
-		irq_cpu_rmap_add(dev->rmap,
-				 dev->priv.msix_arr[i + MLX5_EQ_VEC_COMP_BASE].vector);
+		irq_cpu_rmap_add(dev->rmap, pci_irq_vector(dev->pdev,
+				 MLX5_EQ_VEC_COMP_BASE + i));
 #endif
 		snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", i);
 		err = mlx5_create_map_eq(dev, eq,
@@ -1101,9 +1092,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_stop_poll;
 	}
 
-	err = mlx5_enable_msix(dev);
+	err = mlx5_alloc_irq_vectors(dev);
 	if (err) {
-		dev_err(&pdev->dev, "enable msix failed\n");
+		dev_err(&pdev->dev, "alloc irq vectors failed\n");
 		goto err_cleanup_once;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index b3dabe6e8836..42bfcf20d875 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -109,7 +109,6 @@ int mlx5_destroy_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
 					u32 element_id);
 int mlx5_wait_for_vf_pages(struct mlx5_core_dev *dev);
 u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev);
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx);
 struct mlx5_eq *mlx5_eqn2eq(struct mlx5_core_dev *dev, int eqn);
 void mlx5_cq_tasklet_cb(unsigned long data);
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..a9891df94ce0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -589,7 +589,6 @@ struct mlx5_port_module_event_stats {
 struct mlx5_priv {
 	char			name[MLX5_MAX_NAME_LEN];
 	struct mlx5_eq_table	eq_table;
-	struct msix_entry	*msix_arr;
 	struct mlx5_irq_info	*irq_info;
 
 	/* pages stuff */
-- 
2.7.4


* [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

This patch set aims to automatically find the optimal
queue <-> irq multi-queue assignments in storage ULPs (demonstrated
on nvme-rdma) based on the underlying rdma device irq affinity
settings.

The first two patches modify the mlx5 core driver to use the generic
API to allocate an array of irq vectors with automatic affinity
settings instead of open-coding exactly what it does (and doing it
slightly worse).

Then, in order to obtain an affinity map for a given completion
vector, we expose a new RDMA core API, and implement it in mlx5.

The third part is the addition of an rdma-based queue mapping helper
to blk-mq that maps the tagset hctx's according to the device
affinity mappings.
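
As a rough illustration of that third part, a userspace sketch of
affinity-based queue mapping might look like the following. The names
(map_queues_by_affinity, vec_mask, cpu_to_queue) and the 64-bit CPU
bitmask are assumptions made for this example; this is not the helper
added in block/blk-mq-rdma.c, which operates on the tagset and the
rdma device.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS 64

/*
 * Hypothetical userspace sketch: vec_mask[q] is a bitmask of the CPUs
 * that completion vector q is affine to. Each CPU is mapped to the
 * first queue whose vector covers it. Returns 0 on success, or -1 if
 * some CPU is not covered by any vector (a caller could then fall
 * back to some default mapping).
 */
static int map_queues_by_affinity(const uint64_t *vec_mask,
				  unsigned int nr_queues,
				  unsigned int nr_cpus,
				  unsigned int *cpu_to_queue)
{
	unsigned int cpu, q;

	assert(vec_mask != 0 && cpu_to_queue != 0);

	if (nr_cpus > SKETCH_MAX_CPUS)
		return -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		int found = 0;

		for (q = 0; q < nr_queues; q++) {
			if (vec_mask[q] & (UINT64_C(1) << cpu)) {
				cpu_to_queue[cpu] = q;
				found = 1;
				break;
			}
		}
		if (!found)
			return -1;
	}
	return 0;
}
```

With two vectors whose masks are 0x3 and 0xC on a four-CPU system,
CPUs 0-1 land on queue 0 and CPUs 2-3 on queue 1.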

I'd happily convert some more drivers, but I'll need volunteers
to test as I don't have access to any other devices.

I cc'd @netdev (and Saeed + Or) as this is where most of the
mlx5 core action takes place, so Saeed, I would love to hear your
feedback.

Any feedback is welcome.

Sagi Grimberg (6):
  mlx5: convert to generic pci_alloc_irq_vectors
  mlx5: move affinity hints assignments to generic code
  RDMA/core: expose affinity mappings per completion vector
  mlx5: support ->get_vector_affinity
  block: Add rdma affinity based queue mapping helper
  nvme-rdma: use intelligent affinity based queue mappings

 block/Kconfig                                      |   5 +
 block/Makefile                                     |   1 +
 block/blk-mq-rdma.c                                |  56 +++++++++++
 drivers/infiniband/hw/mlx5/main.c                  |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     | 106 +++------------------
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   1 -
 drivers/nvme/host/rdma.c                           |  13 +++
 include/linux/blk-mq-rdma.h                        |  10 ++
 include/linux/mlx5/driver.h                        |   2 -
 include/rdma/ib_verbs.h                            |  24 +++++
 14 files changed, 138 insertions(+), 108 deletions(-)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

-- 
2.7.4


* Re: [PATCH V2 16/16] block, bfq: split bfq-iosched.c into multiple source files
From: kbuild test robot @ 2017-04-02 10:02 UTC (permalink / raw)
  To: Paolo Valente
  Cc: kbuild-all, Jens Axboe, Tejun Heo, Fabio Checconi,
	Arianna Avanzini, linux-block, linux-kernel, ulf.hansson,
	linus.walleij, broonie, Paolo Valente
In-Reply-To: <20170331124743.3530-17-paolo.valente@linaro.org>

[-- Attachment #1: Type: text/plain, Size: 2131 bytes --]

Hi Paolo,

[auto build test ERROR on block/for-next]
[also build test ERROR on v4.11-rc4 next-20170331]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-introduce-the-BFQ-v0-I-O-scheduler-as-an-extra-scheduler/20170402-100622
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "bfq_mark_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfqg_stats_update_dequeue" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_clear_bfqq_busy" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_clear_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_bfqq_non_blocking_wait_rq" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_clear_bfqq_wait_request" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_timeout" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfqg_stats_set_start_empty_time" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_weights_tree_add" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_put_queue" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_bfqq_sync" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfqg_to_blkg" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfqq_group" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_weights_tree_remove" [block/bfq-wf2q.ko] undefined!
>> ERROR: "bfq_bic_update_cgroup" [block/bfq-iosched.ko] undefined!
>> ERROR: "bfqg_stats_set_start_idle_time" [block/bfq-iosched.ko] undefined!
>> ERROR: "bfqg_stats_update_completion" [block/bfq-iosched.ko] undefined!
>> ERROR: "bfq_bfqq_move" [block/bfq-iosched.ko] undefined!
>> ERROR: "bfqg_put" [block/bfq-iosched.ko] undefined!
>> ERROR: "next_queue_may_preempt" [block/bfq-iosched.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 59001 bytes --]


* Re: [PATCH 3/3] scsi: Ensure that scsi_run_queue() runs all hardware queues
From: Sagi Grimberg @ 2017-04-02  7:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170331231205.16640-4-bart.vanassche@sandisk.com>

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH] blk-mq: add random early detection I/O scheduler
From: Bart Van Assche @ 2017-04-01 23:29 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, osandov@osandov.com,
	linux-block@vger.kernel.org, axboe@fb.com
  Cc: kernel-team@fb.com
In-Reply-To: <cea23aca-e7b2-f35d-f064-d5e44a479977@fb.com>

On Sat, 2017-04-01 at 16:07 -0600, Jens Axboe wrote:
> On 04/01/2017 01:55 PM, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This patch introduces a new I/O scheduler based on the classic random
> > early detection active queue management algorithm [1]. Random early
> > detection is one of the simplest and most studied AQM algorithms for
> > networking, but until now, it hasn't been applied to disk I/O
> > scheduling.
> > 
> > When applied to network routers, RED probabilistically either marks
> > packets with ECN or drops them, depending on the configuration. When
> > dealing with disk I/O, POSIX does not have any mechanism with which to
> > notify the caller that the disk is congested, so we instead only provide
> > the latter strategy. Included in this patch is a minor change to the
> > blk-mq to support this.
> 
> This is great work. If we combine this with a thin provisioning target,
> we can even use this to save space on the backend. Better latencies,
> AND lower disk utilization.
> 
> I'm tempted to just queue this up for this cycle and make it the default.

Hello Jens,

Did you mean making this the default scheduler for SSDs only or for all types
of block devices? Our (Western Digital) experience is that any I/O scheduler
that limits the queue depth reduces throughput for at least data-center style
workloads when using hard disks. This is why Adam is working on improving I/O
priority support for the Linux block layer. That approach namely allows to
reduce latency of certain requests without significantly impacting average
latency and throughput.

Bart.


* Re: [PATCH] blk-mq: add random early detection I/O scheduler
From: Jens Axboe @ 2017-04-01 22:07 UTC (permalink / raw)
  To: Omar Sandoval, linux-block, linux-kernel; +Cc: kernel-team
In-Reply-To: <e9c15e2066177c3efdfe6d134cf9c80b5e8f8d1b.1491076459.git.osandov@fb.com>

On 04/01/2017 01:55 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This patch introduces a new I/O scheduler based on the classic random
> early detection active queue management algorithm [1]. Random early
> detection is one of the simplest and most studied AQM algorithms for
> networking, but until now, it hasn't been applied to disk I/O
> scheduling.
> 
> When applied to network routers, RED probabilistically either marks
> packets with ECN or drops them, depending on the configuration. When
> dealing with disk I/O, POSIX does not have any mechanism with which to
> notify the caller that the disk is congested, so we instead only provide
> the latter strategy. Included in this patch is a minor change to the
> blk-mq to support this.

This is great work. If we combine this with a thin provisioning target,
we can even use this to save space on the backend. Better latencies,
AND lower disk utilization.

I'm tempted to just queue this up for this cycle and make it the default.

-- 
Jens Axboe


* [PATCH] blk-mq: add random early detection I/O scheduler
From: Omar Sandoval @ 2017-04-01 19:55 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-kernel; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This patch introduces a new I/O scheduler based on the classic random
early detection active queue management algorithm [1]. Random early
detection is one of the simplest and most studied AQM algorithms for
networking, but until now, it hasn't been applied to disk I/O
scheduling.

When applied to network routers, RED probabilistically either marks
packets with ECN or drops them, depending on the configuration. When
dealing with disk I/O, POSIX does not have any mechanism with which to
notify the caller that the disk is congested, so we instead only provide
the latter strategy. Included in this patch is a minor change to
blk-mq to support this.
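
The drop decision can be sketched in userspace roughly as follows. It
mirrors the linear ramp between the two thresholds that
red_get_request() computes in the diff below; the function names
red_drop_prob and red_should_drop, and the random_value parameter
standing in for prandom_u32(), are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Drop probability scaled to [0, UINT32_MAX]: 0 at or below
 * min_thresh, UINT32_MAX at or above max_thresh, and a linear ramp
 * in between.
 */
static uint32_t red_drop_prob(unsigned int queue_length,
			      unsigned int min_thresh,
			      unsigned int max_thresh)
{
	assert(min_thresh < max_thresh);

	if (queue_length <= min_thresh)
		return 0;
	if (queue_length >= max_thresh)
		return UINT32_MAX;

	return UINT32_MAX / (max_thresh - min_thresh) *
	       (queue_length - min_thresh);
}

/*
 * Returns 1 if the request should be dropped, 0 if it should be
 * enqueued. random_value is a caller-supplied uniform 32-bit value.
 */
static int red_should_drop(unsigned int queue_length,
			   unsigned int min_thresh,
			   unsigned int max_thresh,
			   uint32_t random_value)
{
	if (queue_length <= min_thresh)
		return 0;
	if (queue_length >= max_thresh)
		return 1;

	return random_value <= red_drop_prob(queue_length, min_thresh,
					     max_thresh);
}
```

With the default thresholds of 16 and 256, a queue length of 136 sits
halfway up the ramp, so roughly half of the requests are dropped.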

Performance results are extremely promising. This scheduling technique
does not require any cross-hardware queue data sharing, as limits are
applied on a per-hardware queue basis, making RED highly scalable.
Additionally, with RED, I/O latencies on a heavily loaded device can be
better than even a completely idle device, as is demonstrated by this
fio job:

----
[global]
filename=/dev/sda
direct=1
runtime=10s
time_based
group_reporting

[idle_reader]
rate_iops=1000
ioengine=sync
rw=randread

[contended_reader]
stonewall
numjobs=4
ioengine=libaio
iodepth=1024
rw=randread
----

1: http://www.icir.org/floyd/papers/red/red.html

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/Kconfig.iosched |   6 ++
 block/Makefile        |   1 +
 block/blk-mq.c        |   2 +
 block/red-iosched.c   | 191 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 200 insertions(+)
 create mode 100644 block/red-iosched.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..e8bdd144ec9f 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,12 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_RED
+	tristate "Random early detection I/O scheduler"
+	default y
+	---help---
+	  Block I/O adaptation of the RED active queue management algorithm.
+
 endmenu
 
 endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..607ee6e27901 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_RED)	+= red-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..d7792ca0432c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1542,6 +1542,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
 	if (unlikely(!rq)) {
 		__wbt_done(q->rq_wb, wb_acct);
+		bio_advance(bio, bio->bi_iter.bi_size);
+		bio_endio(bio);
 		return BLK_QC_T_NONE;
 	}
 
diff --git a/block/red-iosched.c b/block/red-iosched.c
new file mode 100644
index 000000000000..862158a02e95
--- /dev/null
+++ b/block/red-iosched.c
@@ -0,0 +1,191 @@
+/*
+ * Random early detection I/O scheduler.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+
+enum {
+	RED_DEFAULT_MIN_THRESH = 16,
+	RED_DEFAULT_MAX_THRESH = 256,
+	RED_MAX_MAX_THRESH = 256,
+};
+
+struct red_queue_data {
+	struct request_queue *q;
+	unsigned int min_thresh, max_thresh;
+};
+
+static int red_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct red_queue_data *rqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	rqd = kmalloc_node(sizeof(*rqd), GFP_KERNEL, q->node);
+	if (!rqd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	rqd->min_thresh = RED_DEFAULT_MIN_THRESH;
+	rqd->max_thresh = RED_DEFAULT_MAX_THRESH;
+
+	eq->elevator_data = rqd;
+	q->elevator = eq;
+
+	return 0;
+}
+
+static void red_exit_sched(struct elevator_queue *e)
+{
+	struct red_queue_data *rqd = e->elevator_data;
+
+	kfree(rqd);
+}
+
+static struct request *red_get_request(struct request_queue *q,
+				       unsigned int op,
+				       struct blk_mq_alloc_data *data)
+{
+	struct red_queue_data *rqd = q->elevator->elevator_data;
+	unsigned int queue_length;
+	u32 drop_prob;
+
+	queue_length = sbitmap_weight(&data->hctx->sched_tags->bitmap_tags.sb);
+	if (queue_length <= rqd->min_thresh)
+		goto enqueue;
+	else if (queue_length >= rqd->max_thresh)
+		goto drop;
+
+	drop_prob = (U32_MAX / (rqd->max_thresh - rqd->min_thresh) *
+		     (queue_length - rqd->min_thresh));
+
+	if (prandom_u32() <= drop_prob)
+		goto drop;
+
+enqueue:
+	return __blk_mq_alloc_request(data, op);
+
+drop:
+	/*
+	 * Non-blocking callers will return EWOULDBLOCK; blocking callers should
+	 * check the return code and retry.
+	 */
+	return NULL;
+}
+
+static ssize_t red_min_thresh_show(struct elevator_queue *e, char *page)
+{
+	struct red_queue_data *rqd = e->elevator_data;
+
+	return sprintf(page, "%u\n", rqd->min_thresh);
+}
+
+static ssize_t red_min_thresh_store(struct elevator_queue *e, const char *page,
+				    size_t count)
+{
+	struct red_queue_data *rqd = e->elevator_data;
+	unsigned int thresh;
+	int ret;
+
+	ret = kstrtouint(page, 10, &thresh);
+	if (ret)
+		return ret;
+
+	if (thresh >= rqd->max_thresh)
+		return -EINVAL;
+
+	rqd->min_thresh = thresh;
+
+	return count;
+}
+
+static ssize_t red_max_thresh_show(struct elevator_queue *e, char *page)
+{
+	struct red_queue_data *rqd = e->elevator_data;
+
+	return sprintf(page, "%u\n", rqd->max_thresh);
+}
+
+static ssize_t red_max_thresh_store(struct elevator_queue *e, const char *page,
+				    size_t count)
+{
+	struct red_queue_data *rqd = e->elevator_data;
+	unsigned int thresh;
+	int ret;
+
+	ret = kstrtouint(page, 10, &thresh);
+	if (ret)
+		return ret;
+
+	if (thresh <= rqd->min_thresh || thresh > RED_MAX_MAX_THRESH)
+		return -EINVAL;
+
+	rqd->max_thresh = thresh;
+
+	return count;
+}
+
+#define RED_THRESH_ATTR(which) __ATTR(which##_thresh, 0644, red_##which##_thresh_show, red_##which##_thresh_store)
+static struct elv_fs_entry red_sched_attrs[] = {
+	RED_THRESH_ATTR(min),
+	RED_THRESH_ATTR(max),
+	__ATTR_NULL
+};
+#undef RED_THRESH_ATTR
+
+static struct elevator_type red_sched = {
+	.ops.mq = {
+		.init_sched = red_init_sched,
+		.exit_sched = red_exit_sched,
+		.get_request = red_get_request,
+	},
+	.uses_mq = true,
+	.elevator_attrs = red_sched_attrs,
+	.elevator_name = "red",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init red_init(void)
+{
+	return elv_register(&red_sched);
+}
+
+static void __exit red_exit(void)
+{
+	elv_unregister(&red_sched);
+}
+
+module_init(red_init);
+module_exit(red_exit);
+
+MODULE_AUTHOR("Omar Sandoval");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Random early detection I/O scheduler");
-- 
2.12.1


* Re: [PATCH v4 1/2] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Hannes Reinecke @ 2017-04-01  8:05 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block@vger.kernel.org, Omar Sandoval
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA1310F4@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>

On 04/01/2017 01:23 AM, Bart Van Assche wrote:
> Make it possible to check whether or not a block layer queue has
> been stopped. Make it possible to start and to run a blk-mq queue
> from user space.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
>
> ---
>
> Changes compared to v3:
> - Return -ENOENT for attempts to run or start a queue after it has reached the
>   state "dead". This is needed to avoid a use-after-free and potentially a kernel
>   crash.
>
> ---
>
>  block/blk-mq-debugfs.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 4b3f962a9c7a..bd3afa4e1602 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -43,6 +43,117 @@ static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file,
>  	return ret;
>  }
>
> +static int blk_flags_show(struct seq_file *m, const unsigned long flags,
> +			  const char *const *flag_name, int flag_name_count)
> +{
> +	bool sep = false;
> +	int i;
> +
> +	for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
> +		if (!(flags & BIT(i)))
> +			continue;
> +		if (sep)
> +			seq_puts(m, " ");
> +		sep = true;
> +		if (i < flag_name_count && flag_name[i])
> +			seq_puts(m, flag_name[i]);
> +		else
> +			seq_printf(m, "%d", i);
> +	}
> +	seq_puts(m, "\n");
> +	return 0;
> +}
> +
> +static const char *const blk_queue_flag_name[] = {
> +	[QUEUE_FLAG_QUEUED]	 = "QUEUED",
> +	[QUEUE_FLAG_STOPPED]	 = "STOPPED",
> +	[QUEUE_FLAG_SYNCFULL]	 = "SYNCFULL",
> +	[QUEUE_FLAG_ASYNCFULL]	 = "ASYNCFULL",
> +	[QUEUE_FLAG_DYING]	 = "DYING",
> +	[QUEUE_FLAG_BYPASS]	 = "BYPASS",
> +	[QUEUE_FLAG_BIDI]	 = "BIDI",
> +	[QUEUE_FLAG_NOMERGES]	 = "NOMERGES",
> +	[QUEUE_FLAG_SAME_COMP]	 = "SAME_COMP",
> +	[QUEUE_FLAG_FAIL_IO]	 = "FAIL_IO",
> +	[QUEUE_FLAG_STACKABLE]	 = "STACKABLE",
> +	[QUEUE_FLAG_NONROT]	 = "NONROT",
> +	[QUEUE_FLAG_IO_STAT]	 = "IO_STAT",
> +	[QUEUE_FLAG_DISCARD]	 = "DISCARD",
> +	[QUEUE_FLAG_NOXMERGES]	 = "NOXMERGES",
> +	[QUEUE_FLAG_ADD_RANDOM]	 = "ADD_RANDOM",
> +	[QUEUE_FLAG_SECERASE]	 = "SECERASE",
> +	[QUEUE_FLAG_SAME_FORCE]	 = "SAME_FORCE",
> +	[QUEUE_FLAG_DEAD]	 = "DEAD",
> +	[QUEUE_FLAG_INIT_DONE]	 = "INIT_DONE",
> +	[QUEUE_FLAG_NO_SG_MERGE] = "NO_SG_MERGE",
> +	[QUEUE_FLAG_POLL]	 = "POLL",
> +	[QUEUE_FLAG_WC]		 = "WC",
> +	[QUEUE_FLAG_FUA]	 = "FUA",
> +	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
> +	[QUEUE_FLAG_DAX]	 = "DAX",
> +	[QUEUE_FLAG_STATS]	 = "STATS",
> +	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
> +	[QUEUE_FLAG_REGISTERED]	 = "REGISTERED",
> +};
> +
> +static int blk_queue_flags_show(struct seq_file *m, void *v)
> +{
> +	struct request_queue *q = m->private;
> +
> +	blk_flags_show(m, q->queue_flags, blk_queue_flag_name,
> +		       ARRAY_SIZE(blk_queue_flag_name));
> +	return 0;
> +}
> +
> +static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
> +				     size_t len, loff_t *offp)
> +{
> +	struct request_queue *q = file_inode(file)->i_private;
> +	char op[16] = { }, *s;
> +
> +	/*
> +	 * The debugfs attributes are removed after blk_cleanup_queue() has
> +	 * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
> +	 * to avoid triggering a use-after-free.
> +	 */
> +	if (blk_queue_dead(q))
> +		return -ENOENT;
> +
> +	len = min(len, sizeof(op) - 1);
> +	if (copy_from_user(op, ubuf, len))
> +		return -EFAULT;
> +	s = op;
> +	strsep(&s, " \t\n"); /* strip trailing whitespace */
> +	if (strcmp(op, "run") == 0) {
> +		blk_mq_run_hw_queues(q, true);
> +	} else if (strcmp(op, "start") == 0) {
> +		blk_mq_start_stopped_hw_queues(q, true);
> +	} else {
> +		pr_err("%s: unsupported operation %s. Use either 'run' or 'start'\n",
> +		       __func__, op);
> +		return -EINVAL;
> +	}
> +	return len;
> +}
> +
I would have added 'stop' for completeness, but that's probably for very 
specific cases only.
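
For reference, the command parsing quoted above can be approximated in
userspace as in the sketch below. The enum and the parse_queue_op name
are hypothetical, and strcspn() is used in place of the kernel's
strsep() call; anything other than "run" or "start" (including a
"stop" command, which the patch does not implement) is rejected.

```c
#include <assert.h>
#include <string.h>

enum queue_op { QUEUE_OP_RUN, QUEUE_OP_START, QUEUE_OP_INVALID };

/* Hypothetical userspace analogue of the parsing in the store handler. */
static enum queue_op parse_queue_op(const char *buf, size_t len)
{
	char op[16] = { 0 };

	assert(buf != NULL);

	/* Truncate the input so it always fits with a trailing NUL. */
	if (len > sizeof(op) - 1)
		len = sizeof(op) - 1;
	memcpy(op, buf, len);

	/* Cut the string at the first whitespace character. */
	op[strcspn(op, " \t\n")] = '\0';

	if (strcmp(op, "run") == 0)
		return QUEUE_OP_RUN;
	if (strcmp(op, "start") == 0)
		return QUEUE_OP_START;
	return QUEUE_OP_INVALID;
}
```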

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes


* [PATCH v4 1/2] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Bart Van Assche @ 2017-03-31 23:23 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block@vger.kernel.org, Omar Sandoval, Hannes Reinecke
In-Reply-To: <20170330182127.24288-2-bart.vanassche@sandisk.com>

Make it possible to check whether or not a block layer queue has=0A=
been stopped. Make it possible to start and to run a blk-mq queue=0A=
from user space.=0A=
=0A=
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>=0A=
Cc: Omar Sandoval <osandov@fb.com>=0A=
Cc: Hannes Reinecke <hare@suse.com>=0A=
=0A=
---=0A=
=0A=
Changes compared to v3:=0A=
- Return -ENOENT for attempts to run or start a queue after it has reached =
the=0A=
  state "dead". This is needed to avoid a use-after-free and potentially a =
kernel=0A=
  crash.=0A=
=0A=
---=0A=
=0A=
 block/blk-mq-debugfs.c | 114 +++++++++++++++++++++++++++++++++++++++++++++=
++++=0A=
 1 file changed, 114 insertions(+)=0A=
=0A=
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c=0A=
index 4b3f962a9c7a..bd3afa4e1602 100644=0A=
--- a/block/blk-mq-debugfs.c=0A=
+++ b/block/blk-mq-debugfs.c=0A=
@@ -43,6 +43,117 @@ static int blk_mq_debugfs_seq_open(struct inode *inode,=
 struct file *file,=0A=
 	return ret;=0A=
 }=0A=
 =0A=
+static int blk_flags_show(struct seq_file *m, const unsigned long flags,=
=0A=
+			  const char *const *flag_name, int flag_name_count)=0A=
+{=0A=
+	bool sep =3D false;=0A=
+	int i;=0A=
+=0A=
+	for (i =3D 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {=0A=
+		if (!(flags & BIT(i)))=0A=
+			continue;=0A=
+		if (sep)=0A=
+			seq_puts(m, " ");=0A=
+		sep =3D true;=0A=
+		if (i < flag_name_count && flag_name[i])=0A=
+			seq_puts(m, flag_name[i]);=0A=
+		else=0A=
+			seq_printf(m, "%d", i);=0A=
+	}=0A=
+	seq_puts(m, "\n");=0A=
+	return 0;=0A=
+}=0A=
+=0A=
+static const char *const blk_queue_flag_name[] = {
+	[QUEUE_FLAG_QUEUED]	 = "QUEUED",
+	[QUEUE_FLAG_STOPPED]	 = "STOPPED",
+	[QUEUE_FLAG_SYNCFULL]	 = "SYNCFULL",
+	[QUEUE_FLAG_ASYNCFULL]	 = "ASYNCFULL",
+	[QUEUE_FLAG_DYING]	 = "DYING",
+	[QUEUE_FLAG_BYPASS]	 = "BYPASS",
+	[QUEUE_FLAG_BIDI]	 = "BIDI",
+	[QUEUE_FLAG_NOMERGES]	 = "NOMERGES",
+	[QUEUE_FLAG_SAME_COMP]	 = "SAME_COMP",
+	[QUEUE_FLAG_FAIL_IO]	 = "FAIL_IO",
+	[QUEUE_FLAG_STACKABLE]	 = "STACKABLE",
+	[QUEUE_FLAG_NONROT]	 = "NONROT",
+	[QUEUE_FLAG_IO_STAT]	 = "IO_STAT",
+	[QUEUE_FLAG_DISCARD]	 = "DISCARD",
+	[QUEUE_FLAG_NOXMERGES]	 = "NOXMERGES",
+	[QUEUE_FLAG_ADD_RANDOM]	 = "ADD_RANDOM",
+	[QUEUE_FLAG_SECERASE]	 = "SECERASE",
+	[QUEUE_FLAG_SAME_FORCE]	 = "SAME_FORCE",
+	[QUEUE_FLAG_DEAD]	 = "DEAD",
+	[QUEUE_FLAG_INIT_DONE]	 = "INIT_DONE",
+	[QUEUE_FLAG_NO_SG_MERGE] = "NO_SG_MERGE",
+	[QUEUE_FLAG_POLL]	 = "POLL",
+	[QUEUE_FLAG_WC]		 = "WC",
+	[QUEUE_FLAG_FUA]	 = "FUA",
+	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
+	[QUEUE_FLAG_DAX]	 = "DAX",
+	[QUEUE_FLAG_STATS]	 = "STATS",
+	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
+	[QUEUE_FLAG_REGISTERED]	 = "REGISTERED",
+};
+
+static int blk_queue_flags_show(struct seq_file *m, void *v)
+{
+	struct request_queue *q = m->private;
+
+	blk_flags_show(m, q->queue_flags, blk_queue_flag_name,
+		       ARRAY_SIZE(blk_queue_flag_name));
+	return 0;
+}
+
+static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
+				     size_t len, loff_t *offp)
+{
+	struct request_queue *q = file_inode(file)->i_private;
+	char op[16] = { }, *s;
+
+	/*
+	 * The debugfs attributes are removed after blk_cleanup_queue() has
+	 * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
+	 * to avoid triggering a use-after-free.
+	 */
+	if (blk_queue_dead(q))
+		return -ENOENT;
+
+	len = min(len, sizeof(op) - 1);
+	if (copy_from_user(op, ubuf, len))
+		return -EFAULT;
+	s = op;
+	strsep(&s, " \t\n"); /* strip trailing whitespace */
+	if (strcmp(op, "run") == 0) {
+		blk_mq_run_hw_queues(q, true);
+	} else if (strcmp(op, "start") == 0) {
+		blk_mq_start_stopped_hw_queues(q, true);
+	} else {
+		pr_err("%s: unsupported operation %s. Use either 'run' or 'start'\n",
+		       __func__, op);
+		return -EINVAL;
+	}
+	return len;
+}
+
+static int blk_queue_flags_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_queue_flags_show, inode->i_private);
+}
+
+static const struct file_operations blk_queue_flags_fops = {
+	.open		= blk_queue_flags_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= blk_queue_flags_store,
+};
+
+static const struct blk_mq_debugfs_attr blk_queue_attrs[] = {
+	{"state", 0600, &blk_queue_flags_fops},
+	{},
+};
+
 static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 {
 	if (stat->nr_samples) {
@@ -735,6 +846,9 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q)
 	if (!q->debugfs_dir)
 		return -ENOENT;
 
+	if (!debugfs_create_files(q->debugfs_dir, q, blk_queue_attrs))
+		goto err;
+
 	q->mq_debugfs_dir = debugfs_create_dir("mq", q->debugfs_dir);
 	if (!q->mq_debugfs_dir)
 		goto err;
-- 
2.12.0

^ permalink raw reply related
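
The bit-to-name decoding done by blk_flags_show() in the patch above can
be tried outside the kernel. What follows is a minimal userspace sketch,
not the kernel code: format_flags() and demo_flag_name[] are hypothetical
names, the output goes to a caller-supplied buffer instead of a seq_file,
and the bit positions in the demo table are invented for illustration
(they are not the kernel's QUEUE_FLAG_* values).

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical demo table; index 2 is deliberately left unnamed. */
static const char *const demo_flag_name[] = {
	[0] = "QUEUED",
	[1] = "STOPPED",
	[3] = "DYING",
};

/*
 * Write the name of every set bit in 'flags' into 'buf', separated by
 * single spaces. A set bit without a name is written as its decimal bit
 * number, mirroring the fallback in blk_flags_show(). Returns the number
 * of characters written, excluding the terminating NUL. If the buffer is
 * too small, the output stops before the first token that does not fit.
 */
static size_t format_flags(char *buf, size_t size, unsigned long flags,
			   const char *const *names, size_t name_count)
{
	size_t pos = 0;
	unsigned int i;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(flags) * CHAR_BIT; i++) {
		const char *sep = pos ? " " : "";
		int n;

		if (!(flags & (1UL << i)))
			continue;
		if (i < name_count && names[i])
			n = snprintf(buf + pos, size - pos, "%s%s", sep, names[i]);
		else
			n = snprintf(buf + pos, size - pos, "%s%u", sep, i);
		if (n < 0 || (size_t)n >= size - pos) {
			buf[pos] = '\0';	/* drop the partial token */
			break;
		}
		pos += (size_t)n;
	}
	return pos;
}
```

Calling format_flags() with bits 1, 2 and 3 set and the demo table yields
"STOPPED 2 DYING": the two named bits print by name and the unnamed bit
prints as its number.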

* [PATCH 3/3] scsi: Ensure that scsi_run_queue() runs all hardware queues
From: Bart Van Assche @ 2017-03-31 23:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke,
	Sagi Grimberg
In-Reply-To: <20170331231205.16640-1-bart.vanassche@sandisk.com>

Commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. blk_mq_start_stopped_hw_queues()
only runs queues that had been stopped, so it no longer reruns a queue
that returned "busy" without being stopped. Hence change the
blk_mq_start_stopped_hw_queues() call in scsi_run_queue() into
blk_mq_run_hw_queues(). Remove the blk_mq_start_stopped_hw_queues()
call from scsi_end_request() because __blk_mq_finish_request()
already runs all hardware queues if needed.

Fixes: 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/scsi/scsi_lib.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 1d804e33971a..3323878423ac 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
 		scsi_starved_list_run(sdev->host);
 
 	if (q->mq_ops)
-		blk_mq_start_stopped_hw_queues(q, false);
+		blk_mq_run_hw_queues(q, false);
 	else
 		blk_run_queue(q);
 }
@@ -681,8 +681,6 @@ static bool scsi_end_request(struct request *req, int error,
 		if (scsi_target(sdev)->single_lun ||
 		    !list_empty(&sdev->host->starved_list))
 			kblockd_schedule_work(&sdev->requeue_work);
-		else
-			blk_mq_start_stopped_hw_queues(q, true);
 	} else {
 		unsigned long flags;
 
-- 
2.12.0

^ permalink raw reply related
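
The difference between "start stopped queues" and "run queues" that the
patch above relies on can be illustrated with a toy model. This is only
a sketch under stated assumptions: struct toy_hwq and the toy_*()
functions are hypothetical and are not kernel types or APIs. The model
assumes that starting only touches queues whose stopped bit is set, as
the commit message states, and that running skips stopped queues and
queues without pending work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blk-mq hardware queue; not a kernel type. */
struct toy_hwq {
	bool stopped;	/* models the "stopped" state bit */
	int pending;	/* requests waiting to be dispatched */
	int dispatched;	/* requests handed to the driver so far */
};

/* Dispatch everything that is pending on one queue. */
static void toy_dispatch(struct toy_hwq *hq)
{
	assert(hq != NULL);
	hq->dispatched += hq->pending;
	hq->pending = 0;
}

/* Models "start stopped queues": only queues that are stopped are touched. */
static void toy_start_stopped(struct toy_hwq *hqs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!hqs[i].stopped)
			continue;
		hqs[i].stopped = false;
		toy_dispatch(&hqs[i]);
	}
}

/* Models "run queues": every non-stopped queue with pending work is run. */
static void toy_run(struct toy_hwq *hqs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (hqs[i].stopped || hqs[i].pending == 0)
			continue;
		toy_dispatch(&hqs[i]);
	}
}
```

In this model a queue that has pending requests but was never stopped,
which is the situation after a "busy" return once the stop call has been
removed, is left untouched by toy_start_stopped() and is only drained by
toy_run().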

* [PATCH 2/3] scsi: Add scsi_restart_queues()
From: Bart Van Assche @ 2017-03-31 23:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170331231205.16640-1-bart.vanassche@sandisk.com>

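If multiple SCSI devices are associated with a SCSI host, a queue can
get stuck when scsi_queue_rq() returns "busy". Avoid this by adding
scsi_restart_queues(), which runs the hardware queues of every blk-mq
device queue on the host that is not dying.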

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_lib.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c1519660824b..1d804e33971a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -555,6 +555,21 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
 		scsi_run_queue(sdev->request_queue);
 }
 
+static void scsi_restart_queues(struct request_queue *q)
+{
+	struct scsi_device *sdev = q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned long flags;
+
+	spin_lock_irqsave(shost->host_lock, flags);
+	__shost_for_each_device(sdev, shost) {
+		q = sdev->request_queue;
+		if (q->mq_ops && !blk_queue_dying(q))
+			blk_mq_run_hw_queues(q, true);
+	}
+	spin_unlock_irqrestore(shost->host_lock, flags);
+}
+
 static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 {
 	if (!blk_rq_is_passthrough(cmd->request)) {
@@ -2156,6 +2171,7 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 
 static const struct blk_mq_ops scsi_mq_ops = {
 	.queue_rq	= scsi_queue_rq,
+	.restart_queues	= scsi_restart_queues,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
 	.init_request	= scsi_init_request,
-- 
2.12.0

^ permalink raw reply related
