Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()
From: Christoph Hellwig @ 2017-04-10  7:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig,
	Hannes Reinecke, Long Li, K . Y . Srinivasan
In-Reply-To: <20170407181654.27836-5-bart.vanassche@sandisk.com>

> +	if (msecs == 0)
> +		kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
> +					 &hctx->run_work);
> +	else
> +		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +						 &hctx->delayed_run_work,
> +						 msecs_to_jiffies(msecs));
> +}

I'd rather make run_work a delayed_work (again) and use
kblockd_schedule_delayed_work_on with a timeout of zero for the immediate
run case instead of having two competing work structs.

^ permalink raw reply

* Re: [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
From: Christoph Hellwig @ 2017-04-10  7:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Omar Sandoval,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170407181654.27836-4-bart.vanassche@sandisk.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared
From: Christoph Hellwig @ 2017-04-10  7:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig,
	Hannes Reinecke
In-Reply-To: <20170407181654.27836-3-bart.vanassche@sandisk.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Christoph Hellwig @ 2017-04-10  7:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig,
	Hannes Reinecke
In-Reply-To: <20170407181654.27836-2-bart.vanassche@sandisk.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* [PATCH v3 5/5] blk-mq: introduce Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491806249.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.

The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.

A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.

Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.

These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.

Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched                 |   9 +
 block/Makefile                        |   1 +
 block/kyber-iosched.c                 | 717 ++++++++++++++++++++++++++++++++++
 4 files changed, 741 insertions(+)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

diff --git a/Documentation/block/kyber-iosched.txt b/Documentation/block/kyber-iosched.txt
new file mode 100644
index 000000000000..e94feacd7edc
--- /dev/null
+++ b/Documentation/block/kyber-iosched.txt
@@ -0,0 +1,14 @@
+Kyber I/O scheduler tunables
+===========================
+
+The only two tunables for the Kyber scheduler are the target latencies for
+reads and synchronous writes. Kyber will throttle requests in order to meet
+these target latencies.
+
+read_lat_nsec
+-------------
+Target latency for reads (in nanoseconds).
+
+write_lat_nsec
+--------------
+Target latency for synchronous writes (in nanoseconds).
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..916e69c68fa4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,15 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_KYBER
+	tristate "Kyber I/O scheduler"
+	default y
+	---help---
+	  The Kyber I/O scheduler is a low-overhead scheduler suitable for
+	  multiqueue and other fast devices. Given target latencies for reads and
+	  synchronous writes, it will self-tune queue depths to achieve that
+	  goal.
+
 endmenu
 
 endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..6146d2eaaeaa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
new file mode 100644
index 000000000000..c94246c38c77
--- /dev/null
+++ b/block/kyber-iosched.c
@@ -0,0 +1,717 @@
+/*
+ * The Kyber I/O scheduler. Controls latency by throttling queue depths using
+ * scalable techniques.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-stat.h"
+
+/* Scheduling domains. */
+enum {
+	KYBER_READ,
+	KYBER_SYNC_WRITE,
+	KYBER_OTHER, /* Async writes, discard, etc. */
+	KYBER_NUM_DOMAINS,
+};
+
+enum {
+	KYBER_MIN_DEPTH = 256,
+
+	/*
+	 * In order to prevent starvation of synchronous requests by a flood of
+	 * asynchronous requests, we reserve 25% of requests for synchronous
+	 * operations.
+	 */
+	KYBER_ASYNC_PERCENT = 75,
+};
+
+/*
+ * Initial device-wide depths for each scheduling domain.
+ *
+ * Even for fast devices with lots of tags like NVMe, you can saturate
+ * the device with only a fraction of the maximum possible queue depth.
+ * So, we cap these to a reasonable value.
+ */
+static const unsigned int kyber_depth[] = {
+	[KYBER_READ] = 256,
+	[KYBER_SYNC_WRITE] = 128,
+	[KYBER_OTHER] = 64,
+};
+
+/*
+ * Scheduling domain batch sizes. We favor reads.
+ */
+static const unsigned int kyber_batch_size[] = {
+	[KYBER_READ] = 16,
+	[KYBER_SYNC_WRITE] = 8,
+	[KYBER_OTHER] = 8,
+};
+
+struct kyber_queue_data {
+	struct request_queue *q;
+
+	struct blk_stat_callback *cb;
+
+	/*
+	 * The device is divided into multiple scheduling domains based on the
+	 * request type. Each domain has a fixed number of in-flight requests of
+	 * that type device-wide, limited by these tokens.
+	 */
+	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
+
+	/*
+	 * Async request percentage, converted to per-word depth for
+	 * sbitmap_get_shallow().
+	 */
+	unsigned int async_depth;
+
+	/* Target latencies in nanoseconds. */
+	u64 read_lat_nsec, write_lat_nsec;
+};
+
+struct kyber_hctx_data {
+	spinlock_t lock;
+	struct list_head rqs[KYBER_NUM_DOMAINS];
+	int cur_domain;
+	unsigned int batching;
+	wait_queue_t domain_wait[KYBER_NUM_DOMAINS];
+	atomic_t wait_index[KYBER_NUM_DOMAINS];
+};
+
+static unsigned int rq_sched_domain(const struct request *rq)
+{
+	unsigned int op = rq->cmd_flags;
+
+	if ((op & REQ_OP_MASK) == REQ_OP_READ)
+		return KYBER_READ;
+	else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
+		return KYBER_SYNC_WRITE;
+	else
+		return KYBER_OTHER;
+}
+
+enum {
+	NONE = 0,
+	GOOD = 1,
+	GREAT = 2,
+	BAD = -1,
+	AWFUL = -2,
+};
+
+#define IS_GOOD(status) ((status) > 0)
+#define IS_BAD(status) ((status) < 0)
+
+static int kyber_lat_status(struct blk_stat_callback *cb,
+			    unsigned int sched_domain, u64 target)
+{
+	u64 latency;
+
+	if (!cb->stat[sched_domain].nr_samples)
+		return NONE;
+
+	latency = cb->stat[sched_domain].mean;
+	if (latency >= 2 * target)
+		return AWFUL;
+	else if (latency > target)
+		return BAD;
+	else if (latency <= target / 2)
+		return GREAT;
+	else /* (latency <= target) */
+		return GOOD;
+}
+
+/*
+ * Adjust the read or synchronous write depth given the status of reads and
+ * writes. The goal is that the latencies of the two domains are fair (i.e., if
+ * one is good, then the other is good).
+ */
+static void kyber_adjust_rw_depth(struct kyber_queue_data *kqd,
+				  unsigned int sched_domain, int this_status,
+				  int other_status)
+{
+	unsigned int orig_depth, depth;
+
+	/*
+	 * If this domain had no samples, or reads and writes are both good or
+	 * both bad, don't adjust the depth.
+	 */
+	if (this_status == NONE ||
+	    (IS_GOOD(this_status) && IS_GOOD(other_status)) ||
+	    (IS_BAD(this_status) && IS_BAD(other_status)))
+		return;
+
+	orig_depth = depth = kqd->domain_tokens[sched_domain].sb.depth;
+
+	if (other_status == NONE) {
+		depth++;
+	} else {
+		switch (this_status) {
+		case GOOD:
+			if (other_status == AWFUL)
+				depth -= max(depth / 4, 1U);
+			else
+				depth -= max(depth / 8, 1U);
+			break;
+		case GREAT:
+			if (other_status == AWFUL)
+				depth /= 2;
+			else
+				depth -= max(depth / 4, 1U);
+			break;
+		case BAD:
+			depth++;
+			break;
+		case AWFUL:
+			if (other_status == GREAT)
+				depth += 2;
+			else
+				depth++;
+			break;
+		}
+	}
+
+	depth = clamp(depth, 1U, kyber_depth[sched_domain]);
+	if (depth != orig_depth)
+		sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+}
+
+/*
+ * Adjust the depth of other requests given the status of reads and synchronous
+ * writes. As long as either domain is doing fine, we don't throttle, but if
+ * both domains are doing badly, we throttle heavily.
+ */
+static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
+				     int read_status, int write_status,
+				     bool have_samples)
+{
+	unsigned int orig_depth, depth;
+	int status;
+
+	orig_depth = depth = kqd->domain_tokens[KYBER_OTHER].sb.depth;
+
+	if (read_status == NONE && write_status == NONE) {
+		depth += 2;
+	} else if (have_samples) {
+		if (read_status == NONE)
+			status = write_status;
+		else if (write_status == NONE)
+			status = read_status;
+		else
+			status = max(read_status, write_status);
+		switch (status) {
+		case GREAT:
+			depth += 2;
+			break;
+		case GOOD:
+			depth++;
+			break;
+		case BAD:
+			depth -= max(depth / 4, 1U);
+			break;
+		case AWFUL:
+			depth /= 2;
+			break;
+		}
+	}
+
+	depth = clamp(depth, 1U, kyber_depth[KYBER_OTHER]);
+	if (depth != orig_depth)
+		sbitmap_queue_resize(&kqd->domain_tokens[KYBER_OTHER], depth);
+}
+
+/*
+ * Apply heuristics for limiting queue depths based on gathered latency
+ * statistics.
+ */
+static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
+{
+	struct kyber_queue_data *kqd = cb->data;
+	int read_status, write_status;
+
+	read_status = kyber_lat_status(cb, KYBER_READ, kqd->read_lat_nsec);
+	write_status = kyber_lat_status(cb, KYBER_SYNC_WRITE, kqd->write_lat_nsec);
+
+	kyber_adjust_rw_depth(kqd, KYBER_READ, read_status, write_status);
+	kyber_adjust_rw_depth(kqd, KYBER_SYNC_WRITE, write_status, read_status);
+	kyber_adjust_other_depth(kqd, read_status, write_status,
+				 cb->stat[KYBER_OTHER].nr_samples != 0);
+
+	/*
+	 * Continue monitoring latencies if we aren't hitting the targets or
+	 * we're still throttling other requests.
+	 */
+	if (!blk_stat_is_active(kqd->cb) &&
+	    ((IS_BAD(read_status) || IS_BAD(write_status) ||
+	      kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
+		blk_stat_activate_msecs(kqd->cb, 100);
+}
+
+static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
+{
+	/*
+	 * All of the hardware queues have the same depth, so we can just grab
+	 * the shift of the first one.
+	 */
+	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+}
+
+static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
+{
+	struct kyber_queue_data *kqd;
+	unsigned int max_tokens;
+	unsigned int shift;
+	int ret = -ENOMEM;
+	int i;
+
+	kqd = kmalloc_node(sizeof(*kqd), GFP_KERNEL, q->node);
+	if (!kqd)
+		goto err;
+	kqd->q = q;
+
+	kqd->cb = blk_stat_alloc_callback(kyber_stat_timer_fn, rq_sched_domain,
+					  KYBER_NUM_DOMAINS, kqd);
+	if (!kqd->cb)
+		goto err_kqd;
+
+	/*
+	 * The maximum number of tokens for any scheduling domain is at least
+	 * the queue depth of a single hardware queue. If the hardware doesn't
+	 * have many tags, still provide a reasonable number.
+	 */
+	max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
+			   KYBER_MIN_DEPTH);
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		WARN_ON(!kyber_depth[i]);
+		WARN_ON(!kyber_batch_size[i]);
+		ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
+					      max_tokens, -1, false, GFP_KERNEL,
+					      q->node);
+		if (ret) {
+			while (--i >= 0)
+				sbitmap_queue_free(&kqd->domain_tokens[i]);
+			goto err_cb;
+		}
+		sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
+	}
+
+	shift = kyber_sched_tags_shift(kqd);
+	kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
+
+	kqd->read_lat_nsec = 2000000ULL;
+	kqd->write_lat_nsec = 10000000ULL;
+
+	return kqd;
+
+err_cb:
+	blk_stat_free_callback(kqd->cb);
+err_kqd:
+	kfree(kqd);
+err:
+	return ERR_PTR(ret);
+}
+
+static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct kyber_queue_data *kqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	kqd = kyber_queue_data_alloc(q);
+	if (IS_ERR(kqd)) {
+		kobject_put(&eq->kobj);
+		return PTR_ERR(kqd);
+	}
+
+	eq->elevator_data = kqd;
+	q->elevator = eq;
+
+	blk_stat_add_callback(q, kqd->cb);
+
+	return 0;
+}
+
+static void kyber_exit_sched(struct elevator_queue *e)
+{
+	struct kyber_queue_data *kqd = e->elevator_data;
+	struct request_queue *q = kqd->q;
+	int i;
+
+	blk_stat_remove_callback(q, kqd->cb);
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+		sbitmap_queue_free(&kqd->domain_tokens[i]);
+	blk_stat_free_callback(kqd->cb);
+	kfree(kqd);
+}
+
+static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	struct kyber_hctx_data *khd;
+	int i;
+
+	khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
+	if (!khd)
+		return -ENOMEM;
+
+	spin_lock_init(&khd->lock);
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		INIT_LIST_HEAD(&khd->rqs[i]);
+		INIT_LIST_HEAD(&khd->domain_wait[i].task_list);
+		atomic_set(&khd->wait_index[i], 0);
+	}
+
+	khd->cur_domain = 0;
+	khd->batching = 0;
+
+	hctx->sched_data = khd;
+
+	return 0;
+}
+
+static void kyber_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	kfree(hctx->sched_data);
+}
+
+static int rq_get_domain_token(struct request *rq)
+{
+	return (long)rq->elv.priv[0];
+}
+
+static void rq_set_domain_token(struct request *rq, int token)
+{
+	rq->elv.priv[0] = (void *)(long)token;
+}
+
+static void rq_clear_domain_token(struct kyber_queue_data *kqd,
+				  struct request *rq)
+{
+	unsigned int sched_domain;
+	int nr;
+
+	nr = rq_get_domain_token(rq);
+	if (nr != -1) {
+		sched_domain = rq_sched_domain(rq);
+		sbitmap_queue_clear(&kqd->domain_tokens[sched_domain], nr,
+				    rq->mq_ctx->cpu);
+	}
+}
+
+static struct request *kyber_get_request(struct request_queue *q,
+					 unsigned int op,
+					 struct blk_mq_alloc_data *data)
+{
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+	struct request *rq;
+
+	/*
+	 * We use the scheduler tags as per-hardware queue queueing tokens.
+	 * Async requests can be limited at this stage.
+	 */
+	if (!op_is_sync(op))
+		data->shallow_depth = kqd->async_depth;
+
+	rq = __blk_mq_alloc_request(data, op);
+	if (rq)
+		rq_set_domain_token(rq, -1);
+	return rq;
+}
+
+static void kyber_put_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+
+	rq_clear_domain_token(kqd, rq);
+	blk_mq_finish_request(rq);
+}
+
+static void kyber_completed_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+	unsigned int sched_domain;
+	u64 now, latency, target;
+
+	/*
+	 * Check if this request met our latency goal. If not, quickly gather
+	 * some statistics and start throttling.
+	 */
+	sched_domain = rq_sched_domain(rq);
+	switch (sched_domain) {
+	case KYBER_READ:
+		target = kqd->read_lat_nsec;
+		break;
+	case KYBER_SYNC_WRITE:
+		target = kqd->write_lat_nsec;
+		break;
+	default:
+		return;
+	}
+
+	/* If we are already monitoring latencies, don't check again. */
+	if (blk_stat_is_active(kqd->cb))
+		return;
+
+	now = __blk_stat_time(ktime_to_ns(ktime_get()));
+	if (now < blk_stat_time(&rq->issue_stat))
+		return;
+
+	latency = now - blk_stat_time(&rq->issue_stat);
+
+	if (latency > target)
+		blk_stat_activate_msecs(kqd->cb, 10);
+}
+
+static void kyber_flush_busy_ctxs(struct kyber_hctx_data *khd,
+				  struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(rq_list);
+	struct request *rq, *next;
+
+	blk_mq_flush_busy_ctxs(hctx, &rq_list);
+	list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+		unsigned int sched_domain;
+
+		sched_domain = rq_sched_domain(rq);
+		list_move_tail(&rq->queuelist, &khd->rqs[sched_domain]);
+	}
+}
+
+static int kyber_domain_wake(wait_queue_t *wait, unsigned mode, int flags,
+			     void *key)
+{
+	struct blk_mq_hw_ctx *hctx = READ_ONCE(wait->private);
+
+	list_del_init(&wait->task_list);
+	blk_mq_run_hw_queue(hctx, true);
+	return 1;
+}
+
+static int kyber_get_domain_token(struct kyber_queue_data *kqd,
+				  struct kyber_hctx_data *khd,
+				  struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int sched_domain = khd->cur_domain;
+	struct sbitmap_queue *domain_tokens = &kqd->domain_tokens[sched_domain];
+	wait_queue_t *wait = &khd->domain_wait[sched_domain];
+	struct sbq_wait_state *ws;
+	int nr;
+
+	nr = __sbitmap_queue_get(domain_tokens);
+	if (nr >= 0)
+		return nr;
+
+	/*
+	 * If we failed to get a domain token, make sure the hardware queue is
+	 * run when one becomes available. Note that this is serialized on
+	 * khd->lock, but we still need to be careful about the waker.
+	 */
+	if (list_empty_careful(&wait->task_list)) {
+		init_waitqueue_func_entry(wait, kyber_domain_wake);
+		wait->private = hctx;
+		ws = sbq_wait_ptr(domain_tokens,
+				  &khd->wait_index[sched_domain]);
+		add_wait_queue(&ws->wait, wait);
+
+		/*
+		 * Try again in case a token was freed before we got on the wait
+		 * queue.
+		 */
+		nr = __sbitmap_queue_get(domain_tokens);
+	}
+	return nr;
+}
+
+static struct request *
+kyber_dispatch_cur_domain(struct kyber_queue_data *kqd,
+			  struct kyber_hctx_data *khd,
+			  struct blk_mq_hw_ctx *hctx,
+			  bool *flushed)
+{
+	struct list_head *rqs;
+	struct request *rq;
+	int nr;
+
+	rqs = &khd->rqs[khd->cur_domain];
+	rq = list_first_entry_or_null(rqs, struct request, queuelist);
+
+	/*
+	 * If there wasn't already a pending request and we haven't flushed the
+	 * software queues yet, flush the software queues and check again.
+	 */
+	if (!rq && !*flushed) {
+		kyber_flush_busy_ctxs(khd, hctx);
+		*flushed = true;
+		rq = list_first_entry_or_null(rqs, struct request, queuelist);
+	}
+
+	if (rq) {
+		nr = kyber_get_domain_token(kqd, khd, hctx);
+		if (nr >= 0) {
+			khd->batching++;
+			rq_set_domain_token(rq, nr);
+			list_del_init(&rq->queuelist);
+			return rq;
+		}
+	}
+
+	/* There were either no pending requests or no tokens. */
+	return NULL;
+}
+
+static struct request *kyber_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
+	struct kyber_hctx_data *khd = hctx->sched_data;
+	bool flushed = false;
+	struct request *rq;
+	int i;
+
+	spin_lock(&khd->lock);
+
+	/*
+	 * First, if we are still entitled to batch, try to dispatch a request
+	 * from the batch.
+	 */
+	if (khd->batching < kyber_batch_size[khd->cur_domain]) {
+		rq = kyber_dispatch_cur_domain(kqd, khd, hctx, &flushed);
+		if (rq)
+			goto out;
+	}
+
+	/*
+	 * Either,
+	 * 1. We were no longer entitled to a batch.
+	 * 2. The domain we were batching didn't have any requests.
+	 * 3. The domain we were batching was out of tokens.
+	 *
+	 * Start another batch. Note that this wraps back around to the original
+	 * domain if no other domains have requests or tokens.
+	 */
+	khd->batching = 0;
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		if (++khd->cur_domain >= KYBER_NUM_DOMAINS)
+			khd->cur_domain = 0;
+
+		rq = kyber_dispatch_cur_domain(kqd, khd, hctx, &flushed);
+		if (rq)
+			goto out;
+	}
+
+	rq = NULL;
+out:
+	spin_unlock(&khd->lock);
+	return rq;
+}
+
+static bool kyber_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct kyber_hctx_data *khd = hctx->sched_data;
+	int i;
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		if (!list_empty_careful(&khd->rqs[i]))
+			return true;
+	}
+	return false;
+}
+
+#define KYBER_LAT_SHOW_STORE(op)					\
+static ssize_t kyber_##op##_lat_show(struct elevator_queue *e,		\
+				     char *page)			\
+{									\
+	struct kyber_queue_data *kqd = e->elevator_data;		\
+									\
+	return sprintf(page, "%llu\n", kqd->op##_lat_nsec);		\
+}									\
+									\
+static ssize_t kyber_##op##_lat_store(struct elevator_queue *e,		\
+				      const char *page, size_t count)	\
+{									\
+	struct kyber_queue_data *kqd = e->elevator_data;		\
+	unsigned long long nsec;					\
+	int ret;							\
+									\
+	ret = kstrtoull(page, 10, &nsec);				\
+	if (ret)							\
+		return ret;						\
+									\
+	kqd->op##_lat_nsec = nsec;					\
+									\
+	return count;							\
+}
+KYBER_LAT_SHOW_STORE(read);
+KYBER_LAT_SHOW_STORE(write);
+#undef KYBER_LAT_SHOW_STORE
+
+#define KYBER_LAT_ATTR(op) __ATTR(op##_lat_nsec, 0644, kyber_##op##_lat_show, kyber_##op##_lat_store)
+static struct elv_fs_entry kyber_sched_attrs[] = {
+	KYBER_LAT_ATTR(read),
+	KYBER_LAT_ATTR(write),
+	__ATTR_NULL
+};
+#undef KYBER_LAT_ATTR
+
+static struct elevator_type kyber_sched = {
+	.ops.mq = {
+		.init_sched = kyber_init_sched,
+		.exit_sched = kyber_exit_sched,
+		.init_hctx = kyber_init_hctx,
+		.exit_hctx = kyber_exit_hctx,
+		.get_request = kyber_get_request,
+		.put_request = kyber_put_request,
+		.completed_request = kyber_completed_request,
+		.dispatch_request = kyber_dispatch_request,
+		.has_work = kyber_has_work,
+	},
+	.uses_mq = true,
+	.elevator_attrs = kyber_sched_attrs,
+	.elevator_name = "kyber",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init kyber_init(void)
+{
+	return elv_register(&kyber_sched);
+}
+
+static void __exit kyber_exit(void)
+{
+	elv_unregister(&kyber_sched);
+}
+
+module_init(kyber_init);
+module_exit(kyber_exit);
+
+MODULE_AUTHOR("Omar Sandoval");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Kyber I/O scheduler");
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 4/5] blk-mq-sched: make completed_request() callback more useful
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491806249.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.h     | 11 +++--------
 block/blk-mq.c           |  5 ++++-
 include/linux/elevator.h |  2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index f4bc186c3440..120c6abc37cc 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -82,17 +82,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 	return true;
 }
 
-static inline void
-blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct elevator_queue *e = rq->q->elevator;
 
 	if (e && e->type->ops.mq.completed_request)
-		e->type->ops.mq.completed_request(hctx, rq);
-
-	BUG_ON(rq->internal_tag == -1);
-
-	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
+		e->type->ops.mq.completed_request(rq);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7138cd98146e..e2ef7b460924 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -350,7 +350,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
-		blk_mq_sched_completed_request(hctx, rq);
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -444,6 +444,9 @@ static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	if (rq->internal_tag != -1)
+		blk_mq_sched_completed_request(rq);
+
 	blk_mq_stat_add(rq);
 
 	if (!q->softirq_done_fn)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b7ec315ee7e7..3a216318ae73 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -106,7 +106,7 @@ struct elevator_mq_ops {
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
 	bool (*has_work)(struct blk_mq_hw_ctx *);
-	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*completed_request)(struct request *);
 	void (*started_request)(struct request *);
 	void (*requeue_request)(struct request *);
 	struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 3/5] blk-mq: export helpers
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491806249.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c         | 2 ++
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e536dacfae4c..7138cd98146e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -368,6 +368,7 @@ void blk_mq_finish_request(struct request *rq)
 {
 	blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
 }
+EXPORT_SYMBOL_GPL(blk_mq_finish_request);
 
 void blk_mq_free_request(struct request *rq)
 {
@@ -1183,6 +1184,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	__blk_mq_delay_run_hw_queue(hctx, async, 0);
 }
+EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b90c3d5766cd..d75de612845d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -238,6 +238,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
+void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 2/5] blk-mq: add shallow depth option for blk_mq_get_tag()
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491806249.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-tag.c | 5 ++++-
 block/blk-mq.h     | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..d0be72ccb091 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
 	    !hctx_may_queue(data->hctx, bt))
 		return -1;
-	return __sbitmap_queue_get(bt);
+	if (data->shallow_depth)
+		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
+	else
+		return __sbitmap_queue_get(bt);
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7e6f2e467696..524f44742816 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -141,6 +141,7 @@ struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
 	unsigned int flags;
+	unsigned int shallow_depth;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 1/5] sbitmap: add sbitmap_get_shallow() operation
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491806249.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/sbitmap.h | 55 ++++++++++++++++++++++++++++++++++++
 lib/sbitmap.c           | 75 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 123 insertions(+), 7 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d4e0a204c118..a1904aadbc45 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -176,6 +176,25 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth);
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin);
 
 /**
+ * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
+ * limiting the depth used from each word.
+ * @sb: Bitmap to allocate from.
+ * @alloc_hint: Hint for where to start searching for a free bit.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ *
+ * This rather specific operation allows for having multiple users with
+ * different allocation limits. E.g., there can be a high-priority class that
+ * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
+ * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
+ * class can only allocate half of the total bits in the bitmap, preventing it
+ * from starving out the high-priority class.
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth);
+
+/**
  * sbitmap_any_bit_set() - Check for a set bit in a &struct sbitmap.
  * @sb: Bitmap to check.
  *
@@ -326,6 +345,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth);
 int __sbitmap_queue_get(struct sbitmap_queue *sbq);
 
 /**
+ * __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word, with preemption
+ * already disabled.
+ * @sbq: Bitmap queue to allocate from.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth);
+
+/**
  * sbitmap_queue_get() - Try to allocate a free bit from a &struct
  * sbitmap_queue.
  * @sbq: Bitmap queue to allocate from.
@@ -346,6 +378,29 @@ static inline int sbitmap_queue_get(struct sbitmap_queue *sbq,
 }
 
 /**
+ * sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word.
+ * @sbq: Bitmap queue to allocate from.
+ * @cpu: Output parameter; will contain the CPU we ran on (e.g., to be passed to
+ *       sbitmap_queue_clear()).
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+static inline int sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+					    unsigned int *cpu,
+					    unsigned int shallow_depth)
+{
+	int nr;
+
+	*cpu = get_cpu();
+	nr = __sbitmap_queue_get_shallow(sbq, shallow_depth);
+	put_cpu();
+	return nr;
+}
+
+/**
  * sbitmap_queue_clear() - Free an allocated bit and wake up waiters on a
  * &struct sbitmap_queue.
  * @sbq: Bitmap to free from.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 60e800e0b5a0..80aa8d5463fa 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -79,15 +79,15 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
 
-static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
-			      bool wrap)
+static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
+			      unsigned int hint, bool wrap)
 {
 	unsigned int orig_hint = hint;
 	int nr;
 
 	while (1) {
-		nr = find_next_zero_bit(&word->word, word->depth, hint);
-		if (unlikely(nr >= word->depth)) {
+		nr = find_next_zero_bit(word, depth, hint);
+		if (unlikely(nr >= depth)) {
 			/*
 			 * We started with an offset, and we didn't reset the
 			 * offset to 0 in a failure case, so start from 0 to
@@ -100,11 +100,11 @@ static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
 			return -1;
 		}
 
-		if (!test_and_set_bit(nr, &word->word))
+		if (!test_and_set_bit(nr, word))
 			break;
 
 		hint = nr + 1;
-		if (hint >= word->depth - 1)
+		if (hint >= depth - 1)
 			hint = 0;
 	}
 
@@ -119,7 +119,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 	index = SB_NR_TO_INDEX(sb, alloc_hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index],
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth,
 					SB_NR_TO_BIT(sb, alloc_hint),
 					!round_robin);
 		if (nr != -1) {
@@ -141,6 +142,37 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 }
 EXPORT_SYMBOL_GPL(sbitmap_get);
 
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth)
+{
+	unsigned int i, index;
+	int nr = -1;
+
+	index = SB_NR_TO_INDEX(sb, alloc_hint);
+
+	for (i = 0; i < sb->map_nr; i++) {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					min(sb->map[index].depth, shallow_depth),
+					SB_NR_TO_BIT(sb, alloc_hint), true);
+		if (nr != -1) {
+			nr += index << sb->shift;
+			break;
+		}
+
+		/* Jump to next index. */
+		index++;
+		alloc_hint = index << sb->shift;
+
+		if (index >= sb->map_nr) {
+			index = 0;
+			alloc_hint = 0;
+		}
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(sbitmap_get_shallow);
+
 bool sbitmap_any_bit_set(const struct sbitmap *sb)
 {
 	unsigned int i;
@@ -342,6 +374,35 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq)
 }
 EXPORT_SYMBOL_GPL(__sbitmap_queue_get);
 
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth)
+{
+	unsigned int hint, depth;
+	int nr;
+
+	hint = this_cpu_read(*sbq->alloc_hint);
+	depth = READ_ONCE(sbq->sb.depth);
+	if (unlikely(hint >= depth)) {
+		hint = depth ? prandom_u32() % depth : 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+	nr = sbitmap_get_shallow(&sbq->sb, hint, shallow_depth);
+
+	if (nr == -1) {
+		/* If the map is full, a hint won't do us much good. */
+		this_cpu_write(*sbq->alloc_hint, 0);
+	} else if (nr == hint || unlikely(sbq->round_robin)) {
+		/* Only update the hint if we used it. */
+		hint = nr + 1;
+		if (hint >= depth - 1)
+			hint = 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(__sbitmap_queue_get_shallow);
+
 static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 {
 	int i, wake_index;
-- 
2.12.2

^ permalink raw reply related

* [PATCH v3 0/5] blk-mq: Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-10  6:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is v3 of Kyber, an I/O scheduler for multiqueue devices combining
several techniques: the scalable bitmap library, the new blk-stats API,
and queue depth throttling similar to blk-wbt. v1 was here [1], v2 was
here [2].

v3 reworks how hardware queues are restarted when domain tokens are
exhausted. Instead of reintroducing the QUEUE_FLAG_RESTART bit, which
was removed in 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets
are shared"), the new approach hooks into the sbitmap queues like we do
for driver tags since da55f2cc7841 ("blk-mq: use sbq wait queues instead
of restart for driver tags").

This series is based on block/for-next + the initialization fix series I
sent out yesterday [2]. Patches 1 and 2 implement a new sbitmap
operation. Patch 3 exports a couple of helpers. Patch 4 moves a
scheduler callback to somewhere more useful. Patch 5 implements the new
scheduler.

Thanks!

1: http://marc.info/?l=linux-block&m=148978871820916&w=2
2: http://marc.info/?l=linux-block&m=149132467510945&w=2

Omar Sandoval (5):
  sbitmap: add sbitmap_get_shallow() operation
  blk-mq: add shallow depth option for blk_mq_get_tag()
  blk-mq: export helpers
  blk-mq-sched: make completed_request() callback more useful
  blk-mq: introduce Kyber multiqueue I/O scheduler

 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched                 |   9 +
 block/Makefile                        |   1 +
 block/blk-mq-sched.h                  |  11 +-
 block/blk-mq-tag.c                    |   5 +-
 block/blk-mq.c                        |   7 +-
 block/blk-mq.h                        |   1 +
 block/kyber-iosched.c                 | 717 ++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h                |   1 +
 include/linux/elevator.h              |   2 +-
 include/linux/sbitmap.h               |  55 +++
 lib/sbitmap.c                         |  75 +++-
 12 files changed, 880 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

-- 
2.12.2

^ permalink raw reply

* [PATCH] lightnvm: don't check for failure from mempool_alloc()
From: NeilBrown @ 2017-04-10  2:07 UTC (permalink / raw)
  To: Matias Bjorling; +Cc: linux-block, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1226 bytes --]


mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and both GFP_KERNEL and GFP_NOIO allows for sleeping.

So rrpc_move_valid_pages() and rrpc_make_rq() don't need to
test the return value.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/lightnvm/rrpc.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..34f5f1cc9452 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -318,10 +318,6 @@ static int rrpc_move_valid_pages(struct rrpc *rrpc, struct rrpc_block *rblk)
 	}
 
 	page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
-	if (!page) {
-		bio_put(bio);
-		return -ENOMEM;
-	}
 
 	while ((slot = find_first_zero_bit(rblk->invalid_pages,
 					    nr_sec_per_blk)) < nr_sec_per_blk) {
@@ -1007,11 +1003,6 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	}
 
 	rqd = mempool_alloc(rrpc->rq_pool, GFP_KERNEL);
-	if (!rqd) {
-		pr_err_ratelimited("rrpc: not able to queue bio.");
-		bio_io_error(bio);
-		return BLK_QC_T_NONE;
-	}
 	memset(rqd, 0, sizeof(struct nvm_rq));
 
 	err = rrpc_submit_io(rrpc, bio, rqd, NVM_IOTYPE_NONE);
-- 
2.12.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH v3] lightnvm: physical block device (pblk) target
From: Javier González @ 2017-04-09  9:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Matias Bjørling, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <CY1PR0401MB1536AD7F70C960D62DCA6983810F0@CY1PR0401MB1536.namprd04.prod.outlook.com>

[-- Attachment #1: Type: text/plain, Size: 7415 bytes --]

Hi Bart,

Thanks for reviewing the code.

> On 8 Apr 2017, at 22.56, Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> 
> On 04/07/17 11:50, Javier González wrote:
>> Documentation/lightnvm/pblk.txt  |   21 +
>> drivers/lightnvm/Kconfig         |   19 +
>> drivers/lightnvm/Makefile        |    5 +
>> drivers/lightnvm/pblk-cache.c    |  112 +++
>> drivers/lightnvm/pblk-core.c     | 1641 ++++++++++++++++++++++++++++++++++++++
>> drivers/lightnvm/pblk-gc.c       |  542 +++++++++++++
>> drivers/lightnvm/pblk-init.c     |  942 ++++++++++++++++++++++
>> drivers/lightnvm/pblk-map.c      |  135 ++++
>> drivers/lightnvm/pblk-rb.c       |  859 ++++++++++++++++++++
>> drivers/lightnvm/pblk-read.c     |  513 ++++++++++++
>> drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++
>> drivers/lightnvm/pblk-rl.c       |  182 +++++
>> drivers/lightnvm/pblk-sysfs.c    |  500 ++++++++++++
>> drivers/lightnvm/pblk-write.c    |  408 ++++++++++
>> drivers/lightnvm/pblk.h          | 1127 ++++++++++++++++++++++++++
>> include/linux/lightnvm.h         |   57 +-
>> pblk-sysfs.c                     |  581 ++++++++++++++
> 
> This patch introduces two slightly different versions of pblk-sysfs.c -
> one at the top level and one in drivers/lightnvm. Please remove the file
> at the top level.

The top level file is a mistake; I'll remove it.

> 
>> +config NVM_PBLK_L2P_OPT
>> +	bool "PBLK with optimized L2P table for devices up to 8TB"
>> +	depends on NVM_PBLK
>> +	---help---
>> +	Physical device addresses are typical 64-bit integers. Since we store
>> +	the logical to physical (L2P) table on the host, this takes 1/500 of
>> +	host memory (e.g., 2GB per 1TB of storage media). On drives under 8TB,
>> +	it is possible to reduce this to 1/1000 (e.g., 1GB per 1TB). This
>> +	option allows to do this optimization on the host L2P table.
> 
> Why is NVM_PBLK_L2P_OPT a compile-time option instead of a run-time
> option? Since this define does not affect the definition of the ppa_addr
> I don't see why this has to be a compile-time option. For e.g. Linux
> distributors the only choice would be to disable NVM_PBLK_L2P_OPT. I
> think it would be unfortunate if no Linux distribution ever would be
> able to benefit from this optimization.

struct ppa_addr, which is the physical address format is not affected,
but pblk's internal L2P address representation (pblk_addr) is. You can
see that the type either represents struct ppa_addr or ppa_addr_32. How
would you define a type that can either be u64 or u32 with different bit
offsets at run-time? Note that address conversions to this type is in
the fast path and this format allows us to only use bit shifts.

> 
>> +#ifdef CONFIG_NVM_DEBUG
>> +	atomic_add(nr_entries, &pblk->inflight_writes);
>> +	atomic_add(nr_entries, &pblk->req_writes);
>> +#endif
> 
> Has it been considered to use the "static key" feature such that
> consistency checks can be enabled at run-time instead of having to
> rebuild the kernel to enable CONFIG_NVM_DEBUG?

I haven't considered it. I'll look into it. I would like to have this
counters and the corresponding sysfs entry only available on debug mode
since it allows us to have a good picture of the FTL state.

> 
>> +#ifdef CONFIG_NVM_DEBUG
>> +	BUG_ON(nr_rec_entries != valid_entries);
>> +	atomic_add(valid_entries, &pblk->inflight_writes);
>> +	atomic_add(valid_entries, &pblk->recov_gc_writes);
>> +#endif
> 
> Are you aware that Linus is strongly opposed against using BUG_ON()?
> 

Yes, I am aware of the discussions around BUG_ON. The rationale on the
cases we have it is that they represent a pblk internal state error.
This will most probably result on a wild memory access or an
out-of-bound, which will eventually cause the kernel to crash either
way. You can see that most of them are under CONFIG_NVM_DEBUG. Would it
make sense to maintain CONFIG_NVM_DEBUG so that all BUG_ON checks are
contained within them? As far as I can read, this is not possible with
"static key"

>> +#ifdef CONFIG_NVM_DEBUG
>> +	lockdep_assert_held(&l_mg->free_lock);
>> +#endif
> 
> Why is lockdep_assert_held() surrounded with #ifdef CONFIG_NVM_DEBUG /
> #endif? Are you aware that lockdep_assert_held() do not generate any
> code with CONFIG_PROVE_LOCKING=n?

I did not know about CONFIG_PROVE_LOCKING, thanks for pointing it out.

> 
>> +static const struct block_device_operations pblk_fops = {
>> +	.owner		= THIS_MODULE,
>> +};
> 
> Is this data structure useful? If so, where is pblk_fops used?

It is not useful anymore. I'll remove it.

> 
>> +static void pblk_l2p_free(struct pblk *pblk)
>> +{
>> +	vfree(pblk->trans_map);
>> +}
>> +
>> +static int pblk_l2p_init(struct pblk *pblk)
>> +{
>> +	sector_t i;
>> +
>> +	pblk->trans_map = vmalloc(sizeof(pblk_addr) * pblk->rl.nr_secs);
>> +	if (!pblk->trans_map)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < pblk->rl.nr_secs; i++)
>> +		pblk_ppa_set_empty(&pblk->trans_map[i]);
>> +
>> +	return 0;
>> +}
> 
> Has it been considered to add support for keeping a subset of the L2P
> translation table in memory instead of keeping it in memory in its entirety?

Yes. L2P caching is on our roadmap and will be included in the future.

> 
>> +	sprintf(cache_name, "pblk_line_m_%s", pblk->disk->disk_name);
> 
> Please use snprintf() or kasprintf() instead of printf(). That makes it
> easier for humans to verify that no buffer overflow is triggered.
> 

Ok.

>> +/* physical block device target */
>> +static struct nvm_tgt_type tt_pblk = {
>> +	.name		= "pblk",
>> +	.version	= {1, 0, 0},
> 
> Are version numbers useful inside the kernel tree?

It allows us to relate in to the disk format version, but most probably
the version on the actual disk format structures is enough.

> 
>> +void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
>> +		 unsigned long *lun_bitmap, unsigned int valid_secs,
>> +		 unsigned int off)
>> +{
>> +	struct pblk_sec_meta *meta_list = rqd->meta_list;
>> +	unsigned int map_secs;
>> +	int min = pblk->min_write_pgs;
>> +	int i;
>> +
>> +	for (i = off; i < rqd->nr_ppas; i += min) {
>> +		map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
>> +		pblk_map_page_data(pblk, sentry + i, &rqd->ppa_list[i],
>> +					lun_bitmap, &meta_list[i], map_secs);
>> +	}
>> +}
> 
> Has it been considered to implement the above code such that no modulo
> (%) computation is needed, which is a relatively expensive operation? I
> think for this loop that can be done easily.

I'll look into it.

> 
>> +static DECLARE_RWSEM(pblk_rb_lock);
>> +
>> +void pblk_rb_data_free(struct pblk_rb *rb)
>> +{
>> +	struct pblk_rb_pages *p, *t;
>> +
>> +	down_write(&pblk_rb_lock);
>> +	list_for_each_entry_safe(p, t, &rb->pages, list) {
>> +		free_pages((unsigned long)page_address(p->pages), p->order);
>> +		list_del(&p->list);
>> +		kfree(p);
>> +	}
>> +	up_write(&pblk_rb_lock);
>> +}
> 
> Global locks like pblk_rb_lock are a performance bottleneck on
> multi-socket systems. Why is that lock global?

This global lock is only used for the write buffer init and exit; it
does not touch the fast path. On tear down it guarantees that we flush
all the data present on the buffer before freeing the buffer itself.

> 
> Bart.

Javier

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH v3] lightnvm: physical block device (pblk) target
From: Bart Van Assche @ 2017-04-08 20:56 UTC (permalink / raw)
  To: Javier González, mb@lightnvm.io
  Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Javier González
In-Reply-To: <1491591015-7554-2-git-send-email-javier@cnexlabs.com>

On 04/07/17 11:50, Javier Gonz=E1lez wrote:=0A=
>  Documentation/lightnvm/pblk.txt  |   21 +=0A=
>  drivers/lightnvm/Kconfig         |   19 +=0A=
>  drivers/lightnvm/Makefile        |    5 +=0A=
>  drivers/lightnvm/pblk-cache.c    |  112 +++=0A=
>  drivers/lightnvm/pblk-core.c     | 1641 ++++++++++++++++++++++++++++++++=
++++++=0A=
>  drivers/lightnvm/pblk-gc.c       |  542 +++++++++++++=0A=
>  drivers/lightnvm/pblk-init.c     |  942 ++++++++++++++++++++++=0A=
>  drivers/lightnvm/pblk-map.c      |  135 ++++=0A=
>  drivers/lightnvm/pblk-rb.c       |  859 ++++++++++++++++++++=0A=
>  drivers/lightnvm/pblk-read.c     |  513 ++++++++++++=0A=
>  drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++=0A=
>  drivers/lightnvm/pblk-rl.c       |  182 +++++=0A=
>  drivers/lightnvm/pblk-sysfs.c    |  500 ++++++++++++=0A=
>  drivers/lightnvm/pblk-write.c    |  408 ++++++++++=0A=
>  drivers/lightnvm/pblk.h          | 1127 ++++++++++++++++++++++++++=0A=
>  include/linux/lightnvm.h         |   57 +-=0A=
>  pblk-sysfs.c                     |  581 ++++++++++++++=0A=
=0A=
This patch introduces two slightly different versions of pblk-sysfs.c - =0A=
one at the top level and one in drivers/lightnvm. Please remove the file=0A=
at the top level.=0A=
=0A=
> +config NVM_PBLK_L2P_OPT=0A=
> +	bool "PBLK with optimized L2P table for devices up to 8TB"=0A=
> +	depends on NVM_PBLK=0A=
> +	---help---=0A=
> +	Physical device addresses are typical 64-bit integers. Since we store=
=0A=
> +	the logical to physical (L2P) table on the host, this takes 1/500 of=0A=
> +	host memory (e.g., 2GB per 1TB of storage media). On drives under 8TB,=
=0A=
> +	it is possible to reduce this to 1/1000 (e.g., 1GB per 1TB). This=0A=
> +	option allows to do this optimization on the host L2P table.=0A=
=0A=
Why is NVM_PBLK_L2P_OPT a compile-time option instead of a run-time =0A=
option? Since this define does not affect the definition of the ppa_addr =
=0A=
I don't see why this has to be a compile-time option. For e.g. Linux =0A=
distributors the only choice would be to disable NVM_PBLK_L2P_OPT. I =0A=
think it would be unfortunate if no Linux distribution ever would be =0A=
able to benefit from this optimization.=0A=
=0A=
> +#ifdef CONFIG_NVM_DEBUG=0A=
> +	atomic_add(nr_entries, &pblk->inflight_writes);=0A=
> +	atomic_add(nr_entries, &pblk->req_writes);=0A=
> +#endif=0A=
=0A=
Has it been considered to use the "static key" feature such that =0A=
consistency checks can be enabled at run-time instead of having to =0A=
rebuild the kernel to enable CONFIG_NVM_DEBUG?=0A=
=0A=
> +#ifdef CONFIG_NVM_DEBUG=0A=
> +	BUG_ON(nr_rec_entries !=3D valid_entries);=0A=
> +	atomic_add(valid_entries, &pblk->inflight_writes);=0A=
> +	atomic_add(valid_entries, &pblk->recov_gc_writes);=0A=
> +#endif=0A=
=0A=
Are you aware that Linus is strongly opposed against using BUG_ON()?=0A=
=0A=
> +#ifdef CONFIG_NVM_DEBUG=0A=
> +	lockdep_assert_held(&l_mg->free_lock);=0A=
> +#endif=0A=
=0A=
Why is lockdep_assert_held() surrounded with #ifdef CONFIG_NVM_DEBUG / =0A=
#endif? Are you aware that lockdep_assert_held() do not generate any =0A=
code with CONFIG_PROVE_LOCKING=3Dn?=0A=
=0A=
> +static const struct block_device_operations pblk_fops =3D {=0A=
> +	.owner		=3D THIS_MODULE,=0A=
> +};=0A=
=0A=
Is this data structure useful? If so, where is pblk_fops used?=0A=
=0A=
> +static void pblk_l2p_free(struct pblk *pblk)=0A=
> +{=0A=
> +	vfree(pblk->trans_map);=0A=
> +}=0A=
> +=0A=
> +static int pblk_l2p_init(struct pblk *pblk)=0A=
> +{=0A=
> +	sector_t i;=0A=
> +=0A=
> +	pblk->trans_map =3D vmalloc(sizeof(pblk_addr) * pblk->rl.nr_secs);=0A=
> +	if (!pblk->trans_map)=0A=
> +		return -ENOMEM;=0A=
> +=0A=
> +	for (i =3D 0; i < pblk->rl.nr_secs; i++)=0A=
> +		pblk_ppa_set_empty(&pblk->trans_map[i]);=0A=
> +=0A=
> +	return 0;=0A=
> +}=0A=
=0A=
Has it been considered to add support for keeping a subset of the L2P =0A=
translation table in memory instead of keeping it in memory in its entirety=
?=0A=
=0A=
> +	sprintf(cache_name, "pblk_line_m_%s", pblk->disk->disk_name);=0A=
=0A=
Please use snprintf() or kasprintf() instead of printf(). That makes it =0A=
easier for humans to verify that no buffer overflow is triggered.=0A=
=0A=
> +/* physical block device target */=0A=
> +static struct nvm_tgt_type tt_pblk =3D {=0A=
> +	.name		=3D "pblk",=0A=
> +	.version	=3D {1, 0, 0},=0A=
=0A=
Are version numbers useful inside the kernel tree?=0A=
=0A=
> +void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sen=
try,=0A=
> +		 unsigned long *lun_bitmap, unsigned int valid_secs,=0A=
> +		 unsigned int off)=0A=
> +{=0A=
> +	struct pblk_sec_meta *meta_list =3D rqd->meta_list;=0A=
> +	unsigned int map_secs;=0A=
> +	int min =3D pblk->min_write_pgs;=0A=
> +	int i;=0A=
> +=0A=
> +	for (i =3D off; i < rqd->nr_ppas; i +=3D min) {=0A=
> +		map_secs =3D (i + min > valid_secs) ? (valid_secs % min) : min;=0A=
> +		pblk_map_page_data(pblk, sentry + i, &rqd->ppa_list[i],=0A=
> +					lun_bitmap, &meta_list[i], map_secs);=0A=
> +	}=0A=
> +}=0A=
=0A=
Has it been considered to implement the above code such that no modulo =0A=
(%) computation is needed, which is a relatively expensive operation? I =0A=
think for this loop that can be done easily.=0A=
=0A=
> +static DECLARE_RWSEM(pblk_rb_lock);=0A=
> +=0A=
> +void pblk_rb_data_free(struct pblk_rb *rb)=0A=
> +{=0A=
> +	struct pblk_rb_pages *p, *t;=0A=
> +=0A=
> +	down_write(&pblk_rb_lock);=0A=
> +	list_for_each_entry_safe(p, t, &rb->pages, list) {=0A=
> +		free_pages((unsigned long)page_address(p->pages), p->order);=0A=
> +		list_del(&p->list);=0A=
> +		kfree(p);=0A=
> +	}=0A=
> +	up_write(&pblk_rb_lock);=0A=
> +}=0A=
=0A=
Global locks like pblk_rb_lock are a performance bottleneck on =0A=
multi-socket systems. Why is that lock global?=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* [GIT PULL] Block fixes for 4.11-rc6
From: Jens Axboe @ 2017-04-08 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Omar Sandoval

Hi Linus,

Here's a pull request for 4.11-rc, fixing a set of issues mostly
centered around the new scheduling framework. These have been brewing
for a while, but split up into what we absolutely need in 4.11, and what
we can defer until 4.12. These are well tested, on both single queue and
multiqueue setups, and with and without shared tags. They fix several
hangs that have happened in testing. This is obviously larger than I
would have preferred at this point in time, but I don't think we can
shave much off this and still get the desired results.

In detail, this pull request contains:

- Set of 5 fixes for NVMe, mostly from Christoph and one from Roland.

- A series from Bart, fixing issues with dm-mq and SCSI shared tags and
  scheduling. Note that one of those patches commit messages may read
  like an optimization, but it is in fact an important fix for queue
  restarts in particular.

- A series from Omar, most importantly fixing a hang with multiple
  hardware queues when we fail to get a driver tag. Another important
  fix in there is for resizing hardware queues, which nbd does when
  handling multiple sockets for one connection.

- Fixing an imbalance in putting the ctx for hctx request allocations
  from Minchan.

Please pull for -rc6! Note that I'll be sporadically available the next
10 days. Omar is standing by as the point of contact, until I return.


  git://git.kernel.dk/linux-block.git for-linus


----------------------------------------------------------------
Bart Van Assche (4):
      blk-mq: Introduce blk_mq_delay_run_hw_queue()
      scsi: Avoid that SCSI queues get stuck
      dm rq: Avoid that request processing stalls sporadically
      blk-mq: Restart a single queue if tag sets are shared

Christoph Hellwig (4):
      nvme: add missing byte swap in nvme_setup_discard
      nvmet: add missing byte swap in nvmet_get_smart_log
      nvmet: fix byte swap in nvmet_execute_write_zeroes
      nvmet: fix byte swap in nvmet_parse_io_cmd

Jens Axboe (1):
      Merge branch 'nvme-4.11-rc' of git://git.infradead.org/nvme into for-linus

Minchan Kim (1):
      block: do not put mq context in blk_mq_alloc_request_hctx

Omar Sandoval (5):
      blk-mq: use the right hctx when getting a driver tag fails
      blk-mq-sched: refactor scheduler initialization
      blk-mq-sched: set up scheduler tags when bringing up new queues
      blk-mq-sched: fix crash in switch error path
      blk-mq: remap queues when adding/removing hardware queues

Roland Dreier (1):
      nvme: Correct NVMF enum values to match NVMe-oF rev 1.0

 block/blk-mq-sched.c            | 181 +++++++++++++++++++++++++++++-----------
 block/blk-mq-sched.h            |  25 ++----
 block/blk-mq.c                  |  85 ++++++++++++++-----
 block/blk-mq.h                  |   2 +-
 block/blk-sysfs.c               |   2 +-
 block/elevator.c                | 114 +++++++++++++------------
 drivers/md/dm-rq.c              |   1 +
 drivers/nvme/host/core.c        |   2 +-
 drivers/nvme/target/admin-cmd.c |   2 +-
 drivers/nvme/target/io-cmd.c    |   4 +-
 drivers/scsi/scsi_lib.c         |   6 +-
 include/linux/blk-mq.h          |   2 +
 include/linux/blkdev.h          |   1 -
 include/linux/elevator.h        |   2 +-
 include/linux/nvme.h            |  16 ++--
 15 files changed, 281 insertions(+), 164 deletions(-)

-- 
Jens Axboe

^ permalink raw reply

* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload V2
From: Jens Axboe @ 2017-04-08 17:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170405172125.22600-1-hch@lst.de>

On Wed, Apr 05 2017, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.

Added for 4.12, thanks Christoph.

-- 
Jens Axboe

^ permalink raw reply

* Re: [RESEND PATCH] block: sed-opal: Tone down all the pr_* to debugs
From: Jens Axboe @ 2017-04-07 20:24 UTC (permalink / raw)
  To: Scott Bauer, linux-block; +Cc: jonathan.derrick, hch
In-Reply-To: <1491595130-5837-1-git-send-email-scott.bauer@intel.com>

On 04/07/2017 01:58 PM, Scott Bauer wrote:
> Lets not flood the kernel log with messages unless
> the user requests so.

Thanks Scott, applied for 4.12.

-- 
Jens Axboe

^ permalink raw reply

* [RESEND PATCH] block: sed-opal: Tone down all the pr_* to debugs
From: Scott Bauer @ 2017-04-07 19:58 UTC (permalink / raw)
  To: linux-block; +Cc: axboe, jonathan.derrick, hch, Scott Bauer

Lets not flood the kernel log with messages unless
the user requests so.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
---
 block/sed-opal.c | 153 +++++++++++++++++++++++++++----------------------------
 1 file changed, 74 insertions(+), 79 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 6736c78..9b30ae5 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -275,8 +275,8 @@ static bool check_tper(const void *data)
 	u8 flags = tper->supported_features;
 
 	if (!(flags & TPER_SYNC_SUPPORTED)) {
-		pr_err("TPer sync not supported. flags = %d\n",
-		       tper->supported_features);
+		pr_debug("TPer sync not supported. flags = %d\n",
+			 tper->supported_features);
 		return false;
 	}
 
@@ -289,7 +289,7 @@ static bool check_sum(const void *data)
 	u32 nlo = be32_to_cpu(sum->num_locking_objects);
 
 	if (nlo == 0) {
-		pr_err("Need at least one locking object.\n");
+		pr_debug("Need at least one locking object.\n");
 		return false;
 	}
 
@@ -385,9 +385,9 @@ static int next(struct opal_dev *dev)
 
 		error = step->fn(dev, step->data);
 		if (error) {
-			pr_err("Error on step function: %d with error %d: %s\n",
-			       state, error,
-			       opal_error_to_human(error));
+			pr_debug("Error on step function: %d with error %d: %s\n",
+				 state, error,
+				 opal_error_to_human(error));
 
 			/* For each OPAL command we do a discovery0 then we
 			 * start some sort of session.
@@ -419,8 +419,8 @@ static int opal_discovery0_end(struct opal_dev *dev)
 	print_buffer(dev->resp, hlen);
 
 	if (hlen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-		pr_warn("Discovery length overflows buffer (%zu+%u)/%u\n",
-			sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
+		pr_debug("Discovery length overflows buffer (%zu+%u)/%u\n",
+			 sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
 		return -EFAULT;
 	}
 
@@ -503,7 +503,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok)
 	if (*err)
 		return;
 	if (cmd->pos >= IO_BUFFER_LENGTH - 1) {
-		pr_err("Error adding u8: end of buffer.\n");
+		pr_debug("Error adding u8: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -553,7 +553,7 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number)
 	len = DIV_ROUND_UP(msb, 4);
 
 	if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) {
-		pr_err("Error adding u64: end of buffer.\n");
+		pr_debug("Error adding u64: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -579,7 +579,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 	}
 
 	if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) {
-		pr_err("Error adding bytestring: end of buffer.\n");
+		pr_debug("Error adding bytestring: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -597,7 +597,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 {
 	if (length > OPAL_UID_LENGTH) {
-		pr_err("Can't build locking range. Length OOB\n");
+		pr_debug("Can't build locking range. Length OOB\n");
 		return -ERANGE;
 	}
 
@@ -614,7 +614,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 static int build_locking_user(u8 *buffer, size_t length, u8 lr)
 {
 	if (length > OPAL_UID_LENGTH) {
-		pr_err("Can't build locking range user, Length OOB\n");
+		pr_debug("Can't build locking range user, Length OOB\n");
 		return -ERANGE;
 	}
 
@@ -648,7 +648,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	add_token_u8(&err, cmd, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error finalizing command.\n");
+		pr_debug("Error finalizing command.\n");
 		return -EFAULT;
 	}
 
@@ -660,7 +660,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	hdr->subpkt.length = cpu_to_be32(cmd->pos - sizeof(*hdr));
 	while (cmd->pos % 4) {
 		if (cmd->pos >= IO_BUFFER_LENGTH) {
-			pr_err("Error: Buffer overrun\n");
+			pr_debug("Error: Buffer overrun\n");
 			return -ERANGE;
 		}
 		cmd->cmd[cmd->pos++] = 0;
@@ -679,14 +679,14 @@ static const struct opal_resp_tok *response_get_token(
 	const struct opal_resp_tok *tok;
 
 	if (n >= resp->num) {
-		pr_err("Token number doesn't exist: %d, resp: %d\n",
-		       n, resp->num);
+		pr_debug("Token number doesn't exist: %d, resp: %d\n",
+			 n, resp->num);
 		return ERR_PTR(-EINVAL);
 	}
 
 	tok = &resp->toks[n];
 	if (tok->len == 0) {
-		pr_err("Token length must be non-zero\n");
+		pr_debug("Token length must be non-zero\n");
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -727,7 +727,7 @@ static ssize_t response_parse_short(struct opal_resp_tok *tok,
 
 		tok->type = OPAL_DTA_TOKENID_UINT;
 		if (tok->len > 9) {
-			pr_warn("uint64 with more than 8 bytes\n");
+			pr_debug("uint64 with more than 8 bytes\n");
 			return -EINVAL;
 		}
 		for (i = tok->len - 1; i > 0; i--) {
@@ -814,8 +814,8 @@ static int response_parse(const u8 *buf, size_t length,
 
 	if (clen == 0 || plen == 0 || slen == 0 ||
 	    slen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-		pr_err("Bad header length. cp: %u, pkt: %u, subpkt: %u\n",
-		       clen, plen, slen);
+		pr_debug("Bad header length. cp: %u, pkt: %u, subpkt: %u\n",
+			 clen, plen, slen);
 		print_buffer(pos, sizeof(*hdr));
 		return -EINVAL;
 	}
@@ -848,7 +848,7 @@ static int response_parse(const u8 *buf, size_t length,
 	}
 
 	if (num_entries == 0) {
-		pr_err("Couldn't parse response.\n");
+		pr_debug("Couldn't parse response.\n");
 		return -EINVAL;
 	}
 	resp->num = num_entries;
@@ -861,18 +861,18 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 {
 	*store = NULL;
 	if (!resp) {
-		pr_err("Response is NULL\n");
+		pr_debug("Response is NULL\n");
 		return 0;
 	}
 
 	if (n > resp->num) {
-		pr_err("Response has %d tokens. Can't access %d\n",
-		       resp->num, n);
+		pr_debug("Response has %d tokens. Can't access %d\n",
+			 resp->num, n);
 		return 0;
 	}
 
 	if (resp->toks[n].type != OPAL_DTA_TOKENID_BYTESTRING) {
-		pr_err("Token is not a byte string!\n");
+		pr_debug("Token is not a byte string!\n");
 		return 0;
 	}
 
@@ -883,26 +883,26 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 static u64 response_get_u64(const struct parsed_resp *resp, int n)
 {
 	if (!resp) {
-		pr_err("Response is NULL\n");
+		pr_debug("Response is NULL\n");
 		return 0;
 	}
 
 	if (n > resp->num) {
-		pr_err("Response has %d tokens. Can't access %d\n",
-		       resp->num, n);
+		pr_debug("Response has %d tokens. Can't access %d\n",
+			 resp->num, n);
 		return 0;
 	}
 
 	if (resp->toks[n].type != OPAL_DTA_TOKENID_UINT) {
-		pr_err("Token is not unsigned it: %d\n",
-		       resp->toks[n].type);
+		pr_debug("Token is not unsigned it: %d\n",
+			 resp->toks[n].type);
 		return 0;
 	}
 
 	if (!(resp->toks[n].width == OPAL_WIDTH_TINY ||
 	      resp->toks[n].width == OPAL_WIDTH_SHORT)) {
-		pr_err("Atom is not short or tiny: %d\n",
-		       resp->toks[n].width);
+		pr_debug("Atom is not short or tiny: %d\n",
+			 resp->toks[n].width);
 		return 0;
 	}
 
@@ -949,7 +949,7 @@ static int parse_and_check_status(struct opal_dev *dev)
 
 	error = response_parse(dev->resp, IO_BUFFER_LENGTH, &dev->parsed);
 	if (error) {
-		pr_err("Couldn't parse response.\n");
+		pr_debug("Couldn't parse response.\n");
 		return error;
 	}
 
@@ -975,7 +975,7 @@ static int start_opal_session_cont(struct opal_dev *dev)
 	tsn = response_get_u64(&dev->parsed, 5);
 
 	if (hsn == 0 && tsn == 0) {
-		pr_err("Couldn't authenticate session\n");
+		pr_debug("Couldn't authenticate session\n");
 		return -EPERM;
 	}
 
@@ -1012,7 +1012,7 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont)
 
 	ret = cmd_finalize(dev, dev->hsn, dev->tsn);
 	if (ret) {
-		pr_err("Error finalizing command buffer: %d\n", ret);
+		pr_debug("Error finalizing command buffer: %d\n", ret);
 		return ret;
 	}
 
@@ -1041,7 +1041,7 @@ static int gen_key(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building gen key command\n");
+		pr_debug("Error building gen key command\n");
 		return err;
 
 	}
@@ -1059,8 +1059,8 @@ static int get_active_key_cont(struct opal_dev *dev)
 		return error;
 	keylen = response_get_string(&dev->parsed, 4, &activekey);
 	if (!activekey) {
-		pr_err("%s: Couldn't extract the Activekey from the response\n",
-		       __func__);
+		pr_debug("%s: Couldn't extract the Activekey from the response\n",
+			 __func__);
 		return OPAL_INVAL_PARAM;
 	}
 	dev->prev_data = kmemdup(activekey, keylen, GFP_KERNEL);
@@ -1103,7 +1103,7 @@ static int get_active_key(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
-		pr_err("Error building get active key command\n");
+		pr_debug("Error building get active key command\n");
 		return err;
 	}
 
@@ -1159,7 +1159,7 @@ static inline int enable_global_lr(struct opal_dev *dev, u8 *uid,
 	err = generic_lr_enable_disable(dev, uid, !!setup->RLE, !!setup->WLE,
 					0, 0);
 	if (err)
-		pr_err("Failed to create enable global lr command\n");
+		pr_debug("Failed to create enable global lr command\n");
 	return err;
 }
 
@@ -1217,7 +1217,7 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 
 	}
 	if (err) {
-		pr_err("Error building Setup Locking range command.\n");
+		pr_debug("Error building Setup Locking range command.\n");
 		return err;
 
 	}
@@ -1234,11 +1234,8 @@ static int start_generic_opal_session(struct opal_dev *dev,
 	u32 hsn;
 	int err = 0;
 
-	if (key == NULL && auth != OPAL_ANYBODY_UID) {
-		pr_err("%s: Attempted to open ADMIN_SP Session without a Host" \
-		       "Challenge, and not as the Anybody UID\n", __func__);
+	if (key == NULL && auth != OPAL_ANYBODY_UID)
 		return OPAL_INVAL_PARAM;
-	}
 
 	clear_opal_cmd(dev);
 
@@ -1273,12 +1270,12 @@ static int start_generic_opal_session(struct opal_dev *dev,
 		add_token_u8(&err, dev, OPAL_ENDLIST);
 		break;
 	default:
-		pr_err("Cannot start Admin SP session with auth %d\n", auth);
+		pr_debug("Cannot start Admin SP session with auth %d\n", auth);
 		return OPAL_INVAL_PARAM;
 	}
 
 	if (err) {
-		pr_err("Error building start adminsp session command.\n");
+		pr_debug("Error building start adminsp session command.\n");
 		return err;
 	}
 
@@ -1369,7 +1366,7 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building STARTSESSION command.\n");
+		pr_debug("Error building STARTSESSION command.\n");
 		return err;
 	}
 
@@ -1391,7 +1388,7 @@ static int revert_tper(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
-		pr_err("Error building REVERT TPER command.\n");
+		pr_debug("Error building REVERT TPER command.\n");
 		return err;
 	}
 
@@ -1426,7 +1423,7 @@ static int internal_activate_user(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Activate UserN command.\n");
+		pr_debug("Error building Activate UserN command.\n");
 		return err;
 	}
 
@@ -1453,7 +1450,7 @@ static int erase_locking_range(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Erase Locking Range Command.\n");
+		pr_debug("Error building Erase Locking Range Command.\n");
 		return err;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1484,7 +1481,7 @@ static int set_mbr_done(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building set MBR Done command\n");
+		pr_debug("Error Building set MBR Done command\n");
 		return err;
 	}
 
@@ -1516,7 +1513,7 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building set MBR done command\n");
+		pr_debug("Error Building set MBR done command\n");
 		return err;
 	}
 
@@ -1567,7 +1564,7 @@ static int set_new_pw(struct opal_dev *dev, void *data)
 
 	if (generic_pw_cmd(usr->opal_key.key, usr->opal_key.key_len,
 			   cpin_uid, dev)) {
-		pr_err("Error building set password command.\n");
+		pr_debug("Error building set password command.\n");
 		return -ERANGE;
 	}
 
@@ -1582,7 +1579,7 @@ static int set_sid_cpin_pin(struct opal_dev *dev, void *data)
 	memcpy(cpin_uid, opaluid[OPAL_C_PIN_SID], OPAL_UID_LENGTH);
 
 	if (generic_pw_cmd(key->key, key->key_len, cpin_uid, dev)) {
-		pr_err("Error building Set SID cpin\n");
+		pr_debug("Error building Set SID cpin\n");
 		return -ERANGE;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1657,7 +1654,7 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building add user to locking range command.\n");
+		pr_debug("Error building add user to locking range command.\n");
 		return err;
 	}
 
@@ -1691,7 +1688,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 		/* vars are initalized to locked */
 		break;
 	default:
-		pr_err("Tried to set an invalid locking state... returning to uland\n");
+		pr_debug("Tried to set an invalid locking state... returning to uland\n");
 		return OPAL_INVAL_PARAM;
 	}
 
@@ -1718,7 +1715,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building SET command.\n");
+		pr_debug("Error building SET command.\n");
 		return err;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1752,14 +1749,14 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data)
 		/* vars are initalized to locked */
 		break;
 	default:
-		pr_err("Tried to set an invalid locking state.\n");
+		pr_debug("Tried to set an invalid locking state.\n");
 		return OPAL_INVAL_PARAM;
 	}
 	ret = generic_lr_enable_disable(dev, lr_buffer, 1, 1,
 					read_locked, write_locked);
 
 	if (ret < 0) {
-		pr_err("Error building SET command.\n");
+		pr_debug("Error building SET command.\n");
 		return ret;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1811,7 +1808,7 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 	}
 
 	if (err) {
-		pr_err("Error building Activate LockingSP command.\n");
+		pr_debug("Error building Activate LockingSP command.\n");
 		return err;
 	}
 
@@ -1831,7 +1828,7 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev)
 	/* 0x08 is Manufacured Inactive */
 	/* 0x09 is Manufactured */
 	if (lc_status != OPAL_MANUFACTURED_INACTIVE) {
-		pr_err("Couldn't determine the status of the Lifecycle state\n");
+		pr_debug("Couldn't determine the status of the Lifecycle state\n");
 		return -ENODEV;
 	}
 
@@ -1868,7 +1865,7 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building GET Lifecycle Status command\n");
+		pr_debug("Error Building GET Lifecycle Status command\n");
 		return err;
 	}
 
@@ -1887,7 +1884,7 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev)
 
 	strlen = response_get_string(&dev->parsed, 4, &msid_pin);
 	if (!msid_pin) {
-		pr_err("%s: Couldn't extract PIN from response\n", __func__);
+		pr_debug("%s: Couldn't extract PIN from response\n", __func__);
 		return OPAL_INVAL_PARAM;
 	}
 
@@ -1929,7 +1926,7 @@ static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Get MSID CPIN PIN command.\n");
+		pr_debug("Error building Get MSID CPIN PIN command.\n");
 		return err;
 	}
 
@@ -2124,18 +2121,18 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 
 	if (lk_unlk->l_state != OPAL_RO &&
 	    lk_unlk->l_state != OPAL_RW) {
-		pr_err("Locking state was not RO or RW\n");
+		pr_debug("Locking state was not RO or RW\n");
 		return -EINVAL;
 	}
 	if (lk_unlk->session.who < OPAL_USER1 ||
 	    lk_unlk->session.who > OPAL_USER9) {
-		pr_err("Authority was not within the range of users: %d\n",
-		       lk_unlk->session.who);
+		pr_debug("Authority was not within the range of users: %d\n",
+			 lk_unlk->session.who);
 		return -EINVAL;
 	}
 	if (lk_unlk->session.sum) {
-		pr_err("%s not supported in sum. Use setup locking range\n",
-		       __func__);
+		pr_debug("%s not supported in sum. Use setup locking range\n",
+			 __func__);
 		return -EINVAL;
 	}
 
@@ -2312,7 +2309,7 @@ static int opal_activate_user(struct opal_dev *dev,
 	/* We can't activate Admin1 it's active as manufactured */
 	if (opal_session->who < OPAL_USER1 ||
 	    opal_session->who > OPAL_USER9) {
-		pr_err("Who was not a valid user: %d\n", opal_session->who);
+		pr_debug("Who was not a valid user: %d\n", opal_session->who);
 		return -EINVAL;
 	}
 
@@ -2343,9 +2340,9 @@ bool opal_unlock_from_suspend(struct opal_dev *dev)
 
 		ret = __opal_lock_unlock(dev, &suspend->unlk);
 		if (ret) {
-			pr_warn("Failed to unlock LR %hhu with sum %d\n",
-				suspend->unlk.session.opal_key.lr,
-				suspend->unlk.session.sum);
+			pr_debug("Failed to unlock LR %hhu with sum %d\n",
+				 suspend->unlk.session.opal_key.lr,
+				 suspend->unlk.session.sum);
 			was_failure = true;
 		}
 	}
@@ -2363,10 +2360,8 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
 		return -EACCES;
 	if (!dev)
 		return -ENOTSUPP;
-	if (!dev->supported) {
-		pr_err("Not supported\n");
+	if (!dev->supported)
 		return -ENOTSUPP;
-	}
 
 	p = memdup_user(arg, _IOC_SIZE(cmd));
 	if (IS_ERR(p))
@@ -2410,7 +2405,7 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
 		ret = opal_secure_erase_locking_range(dev, p);
 		break;
 	default:
-		pr_warn("No such Opal Ioctl %u\n", cmd);
+		break;
 	}
 
 	kfree(p);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Jens Axboe @ 2017-04-07 18:51 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
In-Reply-To: <1491590377.2559.22.camel@sandisk.com>

On 04/07/2017 12:39 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
>> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
>>> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
>>>> Hello Jens,
>>>>
>>>> The six patches in this patch series fix the queue lockup I reported
>>>> recently on the linux-block mailing list. Please consider these patches
>>>> for inclusion in the upstream kernel.
>>>
>>> Some of this we need in 4.11, but not all of it. I can't be applying patches
>>> that "improve scalability" at this point.
>>>
>>> 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
>>> we can put 1-3 on top in 4.12, with the others pulled in first.
>>
>> Hello Jens,
>>
>> Please note that patch 2/6 is a bug fix. The current implementation of
>> blk_mq_sched_restart_queues() only considers hardware queues associated with
>> the same request queue as the hardware queue that has been passed as an
>> argument. If a tag set is shared across request queues - as is the case for
>> SCSI - then all request queues that share a tag set with the hctx argument
>> must be considered.
> 
> (replying to my own e-mail)
> 
> Hello Jens,
> 
> If you want I can split that patch into two patches - one that runs all hardware
> queues with which the tag set is shared and one that switches from rerunning
> all hardware queues to one hardware queue.

I already put it in, but this is getting very awkward. We're at -rc5 time, patches
going into mainline should be TINY. And now I'm sitting on this, that I have to
justify:

 15 files changed, 281 insertions(+), 164 deletions(-)

and where one of the patches reads like it's a performance improvement, when
in reality it's fixing a hang. So yes, the patch should have been split in
two, and the series should have been ordered so that the first patches could
go into 4.11, and the rest on top of that in 4.12. Did we really need a
patch clarifying comments in that series? Probably not.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH v3] lightnvm: pblk
From: Javier González @ 2017-04-07 18:50 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González

This patch introduces pblk, a new target for LightNVM implementing a
full host-based FTL. Details on the commit message.

Changes since v2:
* Rebase on top of Matias' for-4.12/core
* Implement L2P scan recovery to recover L2P table in case of power
  failure.
* Re-design disk format to be more flexible in future versions (from
  Matias Bjørling)
* Implement per-instance uuid to allow correct recovery without
  forcing line erases (from Matias Bjørling)
* Re-design GC threading to have several GC readers and a single
  writer that places data on the write buffer. This allows to maximize
  the GC write buffer budget without having unnecessary GC writers
  competing for the write buffer lock.
* Simplify sysfs interface.
* Refactoring and several code improvements (together with Matias
  Bjørling)

Changes since v1:
* Rebase on top of Matias' for-4.12/core
* Move from per-LUN block allocation to a line model. This means that a
  whole lines across all LUNs is allocated at a time. Data is still
  stripped in a round-robin fashion at a page granurality.
* Implement new disk format scheme, where metadata is stored per line
  instead of per LUN. This allows for space optimizations.
* Improvements on GC workqueue management and victim selection.
* Implement sysfs interface to query pblk's operation and statistics.
* Implement a user - GC I/O rate-limiter
* Various bug fixes

Javier González (1):
  lightnvm: physical block device (pblk) target

 Documentation/lightnvm/pblk.txt  |   21 +
 drivers/lightnvm/Kconfig         |   19 +
 drivers/lightnvm/Makefile        |    5 +
 drivers/lightnvm/pblk-cache.c    |  112 +++
 drivers/lightnvm/pblk-core.c     | 1641 ++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/pblk-gc.c       |  542 +++++++++++++
 drivers/lightnvm/pblk-init.c     |  942 ++++++++++++++++++++++
 drivers/lightnvm/pblk-map.c      |  135 ++++
 drivers/lightnvm/pblk-rb.c       |  859 ++++++++++++++++++++
 drivers/lightnvm/pblk-read.c     |  513 ++++++++++++
 drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++
 drivers/lightnvm/pblk-rl.c       |  182 +++++
 drivers/lightnvm/pblk-sysfs.c    |  500 ++++++++++++
 drivers/lightnvm/pblk-write.c    |  408 ++++++++++
 drivers/lightnvm/pblk.h          | 1127 ++++++++++++++++++++++++++
 include/linux/lightnvm.h         |   57 +-
 pblk-sysfs.c                     |  581 ++++++++++++++
 17 files changed, 8638 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/lightnvm/pblk.txt
 create mode 100644 drivers/lightnvm/pblk-cache.c
 create mode 100644 drivers/lightnvm/pblk-core.c
 create mode 100644 drivers/lightnvm/pblk-gc.c
 create mode 100644 drivers/lightnvm/pblk-init.c
 create mode 100644 drivers/lightnvm/pblk-map.c
 create mode 100644 drivers/lightnvm/pblk-rb.c
 create mode 100644 drivers/lightnvm/pblk-read.c
 create mode 100644 drivers/lightnvm/pblk-recovery.c
 create mode 100644 drivers/lightnvm/pblk-rl.c
 create mode 100644 drivers/lightnvm/pblk-sysfs.c
 create mode 100644 drivers/lightnvm/pblk-write.c
 create mode 100644 drivers/lightnvm/pblk.h
 create mode 100644 pblk-sysfs.c

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Bart Van Assche @ 2017-04-07 18:39 UTC (permalink / raw)
  To: axboe@kernel.dk; +Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
In-Reply-To: <1491590023.2559.18.camel@sandisk.com>

On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> > On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > > Hello Jens,
> > >=20
> > > The six patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patch=
es
> > > for inclusion in the upstream kernel.
> >=20
> > Some of this we need in 4.11, but not all of it. I can't be applying pa=
tches
> > that "improve scalability" at this point.
> >=20
> > 4-6 looks like what we want for 4.11, I'll see if those apply directly.=
 Then
> > we can put 1-3 on top in 4.12, with the others pulled in first.
>=20
> Hello Jens,
>=20
> Please note that patch 2/6 is a bug fix. The current implementation of
> blk_mq_sched_restart_queues() only considers hardware queues associated w=
ith
> the same request queue as the hardware queue that has been passed as an
> argument. If a tag set is shared across request queues - as is the case f=
or
> SCSI - then all request queues that share a tag set with the hctx argumen=
t
> must be considered.

(replying to my own e-mail)

Hello Jens,

If you want I can split that patch into two patches - one that runs all har=
dware
queues with which the tag set is shared and one that switches from rerunnin=
g
all hardware queues to one hardware queue.

Bart.=

^ permalink raw reply

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Bart Van Assche @ 2017-04-07 18:33 UTC (permalink / raw)
  To: axboe@kernel.dk; +Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
In-Reply-To: <78b80b12-a764-a793-f8a1-7a941dc84ee5@kernel.dk>

On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > Hello Jens,
> >=20
> > The six patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
>=20
> Some of this we need in 4.11, but not all of it. I can't be applying patc=
hes
> that "improve scalability" at this point.
>=20
> 4-6 looks like what we want for 4.11, I'll see if those apply directly. T=
hen
> we can put 1-3 on top in 4.12, with the others pulled in first.

Hello Jens,

Please note that patch 2/6 is a bug fix. The current implementation of
blk_mq_sched_restart_queues() only considers hardware queues associated wit=
h
the same request queue as the hardware queue that has been passed as an
argument. If a tag set is shared across request queues - as is the case for
SCSI - then all request queues that share a tag set with the hctx argument
must be considered.

Bart.=

^ permalink raw reply

* [PATCH 4/4] lightnvm: allow to init targets on factory mode
From: Javier González @ 2017-04-07 18:31 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González
In-Reply-To: <1491589874-26818-1-git-send-email-javier@cnexlabs.com>

Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch enables to create a
factory partition (e.g., do not trigger recovery on the given target).
This is useful for target development and for being able to restore the
device state at any moment in time without requiring a full-device
erase.

Signed-off-by: Javier González <javier@cnexlabs.com>
---
 drivers/lightnvm/core.c       | 14 +++++++++++---
 drivers/lightnvm/rrpc.c       |  3 ++-
 include/linux/lightnvm.h      |  3 ++-
 include/uapi/linux/lightnvm.h |  4 ++++
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 28e69a7..e4530c2 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -279,7 +279,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	tdisk->fops = &nvm_fops;
 	tdisk->queue = tqueue;
 
-	targetdata = tt->init(tgt_dev, tdisk);
+	targetdata = tt->init(tgt_dev, tdisk, create->flags);
 	if (IS_ERR(targetdata))
 		goto err_init;
 
@@ -1243,8 +1243,16 @@ static long nvm_ioctl_dev_create(struct file *file, void __user *arg)
 	create.tgtname[DISK_NAME_LEN - 1] = '\0';
 
 	if (create.flags != 0) {
-		pr_err("nvm: no flags supported\n");
-		return -EINVAL;
+		__u32 flags = create.flags;
+
+		/* Check for valid flags */
+		if (flags & NVM_TARGET_FACTORY)
+			flags &= ~NVM_TARGET_FACTORY;
+
+		if (flags) {
+			pr_err("nvm: flag not supported\n");
+			return -EINVAL;
+		}
 	}
 
 	return __nvm_configure_create(&create);
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 4e4c299..2c04ff3 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1515,7 +1515,8 @@ static int rrpc_luns_configure(struct rrpc *rrpc)
 
 static struct nvm_tgt_type tt_rrpc;
 
-static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk)
+static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
+		       int flags)
 {
 	struct request_queue *bqueue = dev->q;
 	struct request_queue *tqueue = tdisk->queue;
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index bebea80..e88f7ef 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -436,7 +436,8 @@ static inline int ppa_cmp_blk(struct ppa_addr ppa1, struct ppa_addr ppa2)
 
 typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);
 typedef sector_t (nvm_tgt_capacity_fn)(void *);
-typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *);
+typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *,
+				int flags);
 typedef void (nvm_tgt_exit_fn)(void *);
 typedef int (nvm_tgt_sysfs_init_fn)(struct gendisk *);
 typedef void (nvm_tgt_sysfs_exit_fn)(struct gendisk *);
diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
index fd19f36..c8aec4b 100644
--- a/include/uapi/linux/lightnvm.h
+++ b/include/uapi/linux/lightnvm.h
@@ -85,6 +85,10 @@ struct nvm_ioctl_create_conf {
 	};
 };
 
+enum {
+	NVM_TARGET_FACTORY = 1 << 0,	/* Init target in factory mode */
+};
+
 struct nvm_ioctl_create {
 	char dev[DISK_NAME_LEN];		/* open-channel SSD device */
 	char tgttype[NVM_TTYPE_NAME_MAX];	/* target type name */
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/4] lightnvm: bad type conversion for nvme control bits
From: Javier González @ 2017-04-07 18:31 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González
In-Reply-To: <1491589874-26818-1-git-send-email-javier@cnexlabs.com>

The NVMe I/O command control bits are 16 bytes, but is interpreted as
32 bytes in the lightnvm user I/O data path.

Signed-off-by: Javier González <javier@cnexlabs.com>
---
 drivers/nvme/host/lightnvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 2c8f933..83e7ea2 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -754,7 +754,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
 	c.ph_rw.length = cpu_to_le16(vcmd.nppas);
-	c.ph_rw.control  = cpu_to_le32(vcmd.control);
+	c.ph_rw.control  = cpu_to_le16(vcmd.control);
 	c.common.cdw10[3] = cpu_to_le32(vcmd.cdw13);
 	c.common.cdw10[4] = cpu_to_le32(vcmd.cdw14);
 	c.common.cdw10[5] = cpu_to_le32(vcmd.cdw15);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/4] lightnvm: fix cleanup order of disk on init error
From: Javier González @ 2017-04-07 18:31 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González
In-Reply-To: <1491589874-26818-1-git-send-email-javier@cnexlabs.com>

Reorder disk allocation such that the disk structure can be put
safely.

Signed-off-by: Javier González <javier@cnexlabs.com>
---
 drivers/lightnvm/core.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index da4c082..28e69a7 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -263,15 +263,15 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 		goto err_t;
 	}
 
+	tdisk = alloc_disk(0);
+	if (!tdisk)
+		goto err_dev;
+
 	tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
 	if (!tqueue)
-		goto err_dev;
+		goto err_disk;
 	blk_queue_make_request(tqueue, tt->make_rq);
 
-	tdisk = alloc_disk(0);
-	if (!tdisk)
-		goto err_queue;
-
 	sprintf(tdisk->disk_name, "%s", create->tgtname);
 	tdisk->flags = GENHD_FL_EXT_DEVT;
 	tdisk->major = 0;
@@ -307,9 +307,9 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	if (tt->exit)
 		tt->exit(targetdata);
 err_init:
-	put_disk(tdisk);
-err_queue:
 	blk_cleanup_queue(tqueue);
+err_disk:
+	put_disk(tdisk);
 err_dev:
 	nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
-- 
2.7.4

^ permalink raw reply related

* [PATCH 1/4] lightnvm: double-clear of dev->lun_map on target init error
From: Javier González @ 2017-04-07 18:31 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González,
	Matias Bjørling

The dev->lun_map bits are cleared twice if an target init error occurs.
First in the target clean routine, and then next in the nvm_tgt_create
error function. Make sure that it is only cleared once by extending
nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can
ignored when cleaning up a successful initialized target.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 2122922..da4c082 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -89,7 +89,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int lun_begin,
 		WARN_ON(!test_and_clear_bit(i, dev->lun_map));
 }
 
-static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
+static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
 	struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -100,11 +100,13 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
 		int *lun_offs = ch_map->lun_offs;
 		int ch = i + ch_map->ch_off;
 
-		for (j = 0; j < ch_map->nr_luns; j++) {
-			int lun = j + lun_offs[j];
-			int lunid = (ch * dev->geo.luns_per_chnl) + lun;
+		if (clear) {
+			for (j = 0; j < ch_map->nr_luns; j++) {
+				int lun = j + lun_offs[j];
+				int lunid = (ch * dev->geo.luns_per_chnl) + lun;
 
-			WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+				WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+			}
 		}
 
 		kfree(ch_map->lun_offs);
@@ -309,7 +311,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 err_queue:
 	blk_cleanup_queue(tqueue);
 err_dev:
-	nvm_remove_tgt_dev(tgt_dev);
+	nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
 	kfree(t);
 err_reserve:
@@ -332,7 +334,7 @@ static void __nvm_remove_target(struct nvm_target *t)
 	if (tt->exit)
 		tt->exit(tdisk->private_data);
 
-	nvm_remove_tgt_dev(t->dev);
+	nvm_remove_tgt_dev(t->dev, 1);
 	put_disk(tdisk);
 
 	list_del(&t->list);
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox