Linux block layer
 help / color / mirror / Atom feed
* [PATCH v2 5/5] blk-mq: introduce Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.

The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.

A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.

Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.

These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.

Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched                 |   9 +
 block/Makefile                        |   1 +
 block/kyber-iosched.c                 | 706 ++++++++++++++++++++++++++++++++++
 4 files changed, 730 insertions(+)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

diff --git a/Documentation/block/kyber-iosched.txt b/Documentation/block/kyber-iosched.txt
new file mode 100644
index 000000000000..e94feacd7edc
--- /dev/null
+++ b/Documentation/block/kyber-iosched.txt
@@ -0,0 +1,14 @@
+Kyber I/O scheduler tunables
+===========================
+
+The only two tunables for the Kyber scheduler are the target latencies for
+reads and synchronous writes. Kyber will throttle requests in order to meet
+these target latencies.
+
+read_lat_nsec
+-------------
+Target latency for reads (in nanoseconds).
+
+write_lat_nsec
+--------------
+Target latency for synchronous writes (in nanoseconds).
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..916e69c68fa4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,15 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_KYBER
+	tristate "Kyber I/O scheduler"
+	default y
+	---help---
+	  The Kyber I/O scheduler is a low-overhead scheduler suitable for
+	  multiqueue and other fast devices. Given target latencies for reads and
+	  synchronous writes, it will self-tune queue depths to achieve that
+	  goal.
+
 endmenu
 
 endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..6146d2eaaeaa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
new file mode 100644
index 000000000000..30253d0758aa
--- /dev/null
+++ b/block/kyber-iosched.c
@@ -0,0 +1,706 @@
+/*
+ * The Kyber I/O scheduler. Controls latency by throttling queue depths using
+ * scalable techniques.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-stat.h"
+
+/* Scheduling domains. */
+enum {
+	KYBER_READ,
+	KYBER_SYNC_WRITE,
+	KYBER_OTHER, /* Async writes, discard, etc. */
+	KYBER_NUM_DOMAINS,
+};
+
+enum {
+	KYBER_MIN_DEPTH = 256,
+
+	/*
+	 * In order to prevent starvation of synchronous requests by a flood of
+	 * asynchronous requests, we reserve 25% of requests for synchronous
+	 * operations.
+	 */
+	KYBER_ASYNC_PERCENT = 75,
+};
+
+/*
+ * Initial device-wide depths for each scheduling domain.
+ *
+ * Even for fast devices with lots of tags like NVMe, you can saturate
+ * the device with only a fraction of the maximum possible queue depth.
+ * So, we cap these to a reasonable value.
+ */
+static const unsigned int kyber_depth[] = {
+	[KYBER_READ] = 256,
+	[KYBER_SYNC_WRITE] = 128,
+	[KYBER_OTHER] = 64,
+};
+
+/*
+ * Scheduling domain batch sizes. We favor reads.
+ */
+static const unsigned int kyber_batch_size[] = {
+	[KYBER_READ] = 16,
+	[KYBER_SYNC_WRITE] = 8,
+	[KYBER_OTHER] = 8,
+};
+
+struct kyber_queue_data {
+	struct request_queue *q;
+
+	struct blk_stat_callback *cb;
+
+	/*
+	 * The device is divided into multiple scheduling domains based on the
+	 * request type. Each domain has a fixed number of in-flight requests of
+	 * that type device-wide, limited by these tokens.
+	 */
+	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
+
+	/*
+	 * Async request percentage, converted to per-word depth for
+	 * sbitmap_get_shallow().
+	 */
+	unsigned int async_depth;
+
+	/* Target latencies in nanoseconds. */
+	u64 read_lat_nsec, write_lat_nsec;
+};
+
+struct kyber_hctx_data {
+	spinlock_t lock;
+	struct list_head rqs[KYBER_NUM_DOMAINS];
+	int cur_domain;
+	unsigned int batching;
+};
+
+static unsigned int rq_sched_domain(const struct request *rq)
+{
+	unsigned int op = rq->cmd_flags;
+
+	if ((op & REQ_OP_MASK) == REQ_OP_READ)
+		return KYBER_READ;
+	else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
+		return KYBER_SYNC_WRITE;
+	else
+		return KYBER_OTHER;
+}
+
+enum {
+	NONE = 0,
+	GOOD = 1,
+	GREAT = 2,
+	BAD = -1,
+	AWFUL = -2,
+};
+
+#define IS_GOOD(status) ((status) > 0)
+#define IS_BAD(status) ((status) < 0)
+
+static int kyber_lat_status(struct blk_stat_callback *cb,
+			    unsigned int sched_domain, u64 target)
+{
+	u64 latency;
+
+	if (!cb->stat[sched_domain].nr_samples)
+		return NONE;
+
+	latency = cb->stat[sched_domain].mean;
+	if (latency >= 2 * target)
+		return AWFUL;
+	else if (latency > target)
+		return BAD;
+	else if (latency <= target / 2)
+		return GREAT;
+	else /* (latency <= target) */
+		return GOOD;
+}
+
+/*
+ * Adjust the read or synchronous write depth given the status of reads and
+ * writes. The goal is that the latencies of the two domains are fair (i.e., if
+ * one is good, then the other is good).
+ */
+static void kyber_adjust_rw_depth(struct kyber_queue_data *kqd,
+				  unsigned int sched_domain, int this_status,
+				  int other_status)
+{
+	unsigned int orig_depth, depth;
+
+	/*
+	 * If this domain had no samples, or reads and writes are both good or
+	 * both bad, don't adjust the depth.
+	 */
+	if (this_status == NONE ||
+	    (IS_GOOD(this_status) && IS_GOOD(other_status)) ||
+	    (IS_BAD(this_status) && IS_BAD(other_status)))
+		return;
+
+	orig_depth = depth = kqd->domain_tokens[sched_domain].sb.depth;
+
+	if (other_status == NONE) {
+		depth++;
+	} else {
+		switch (this_status) {
+		case GOOD:
+			if (other_status == AWFUL)
+				depth -= max(depth / 4, 1U);
+			else
+				depth -= max(depth / 8, 1U);
+			break;
+		case GREAT:
+			if (other_status == AWFUL)
+				depth /= 2;
+			else
+				depth -= max(depth / 4, 1U);
+			break;
+		case BAD:
+			depth++;
+			break;
+		case AWFUL:
+			if (other_status == GREAT)
+				depth += 2;
+			else
+				depth++;
+			break;
+		}
+	}
+
+	depth = clamp(depth, 1U, kyber_depth[sched_domain]);
+	if (depth != orig_depth)
+		sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+}
+
+/*
+ * Adjust the depth of other requests given the status of reads and synchronous
+ * writes. As long as either domain is doing fine, we don't throttle, but if
+ * both domains are doing badly, we throttle heavily.
+ */
+static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
+				     int read_status, int write_status,
+				     bool have_samples)
+{
+	unsigned int orig_depth, depth;
+	int status;
+
+	orig_depth = depth = kqd->domain_tokens[KYBER_OTHER].sb.depth;
+
+	if (read_status == NONE && write_status == NONE) {
+		depth += 2;
+	} else if (have_samples) {
+		if (read_status == NONE)
+			status = write_status;
+		else if (write_status == NONE)
+			status = read_status;
+		else
+			status = max(read_status, write_status);
+		switch (status) {
+		case GREAT:
+			depth += 2;
+			break;
+		case GOOD:
+			depth++;
+			break;
+		case BAD:
+			depth -= max(depth / 4, 1U);
+			break;
+		case AWFUL:
+			depth /= 2;
+			break;
+		}
+	}
+
+	depth = clamp(depth, 1U, kyber_depth[KYBER_OTHER]);
+	if (depth != orig_depth)
+		sbitmap_queue_resize(&kqd->domain_tokens[KYBER_OTHER], depth);
+}
+
+/*
+ * Apply heuristics for limiting queue depths based on gathered latency
+ * statistics.
+ */
+static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
+{
+	struct kyber_queue_data *kqd = cb->data;
+	int read_status, write_status;
+
+	read_status = kyber_lat_status(cb, KYBER_READ, kqd->read_lat_nsec);
+	write_status = kyber_lat_status(cb, KYBER_SYNC_WRITE, kqd->write_lat_nsec);
+
+	kyber_adjust_rw_depth(kqd, KYBER_READ, read_status, write_status);
+	kyber_adjust_rw_depth(kqd, KYBER_SYNC_WRITE, write_status, read_status);
+	kyber_adjust_other_depth(kqd, read_status, write_status,
+				 cb->stat[KYBER_OTHER].nr_samples != 0);
+
+	/*
+	 * Continue monitoring latencies if we aren't hitting the targets or
+	 * we're still throttling other requests.
+	 */
+	if (!blk_stat_is_active(kqd->cb) &&
+	    ((IS_BAD(read_status) || IS_BAD(write_status) ||
+	      kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
+		blk_stat_activate_msecs(kqd->cb, 100);
+}
+
+static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
+{
+	/*
+	 * All of the hardware queues have the same depth, so we can just grab
+	 * the shift of the first one.
+	 */
+	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+}
+
+static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
+{
+	struct kyber_queue_data *kqd;
+	unsigned int max_tokens;
+	unsigned int shift;
+	int ret = -ENOMEM;
+	int i;
+
+	kqd = kmalloc_node(sizeof(*kqd), GFP_KERNEL, q->node);
+	if (!kqd)
+		goto err;
+	kqd->q = q;
+
+	kqd->cb = blk_stat_alloc_callback(kyber_stat_timer_fn, rq_sched_domain,
+					  KYBER_NUM_DOMAINS, kqd);
+	if (!kqd->cb)
+		goto err_kqd;
+
+	/*
+	 * The maximum number of tokens for any scheduling domain is at least
+	 * the queue depth of a single hardware queue. If the hardware doesn't
+	 * have many tags, still provide a reasonable number.
+	 */
+	max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
+			   KYBER_MIN_DEPTH);
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		WARN_ON(!kyber_depth[i]);
+		WARN_ON(!kyber_batch_size[i]);
+		ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
+					      max_tokens, -1, false, GFP_KERNEL,
+					      q->node);
+		if (ret) {
+			while (--i >= 0)
+				sbitmap_queue_free(&kqd->domain_tokens[i]);
+			goto err_cb;
+		}
+		sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
+	}
+
+	shift = kyber_sched_tags_shift(kqd);
+	kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
+
+	kqd->read_lat_nsec = 2000000ULL;
+	kqd->write_lat_nsec = 10000000ULL;
+
+	return kqd;
+
+err_cb:
+	blk_stat_free_callback(kqd->cb);
+err_kqd:
+	kfree(kqd);
+err:
+	return ERR_PTR(ret);
+}
+
+static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct kyber_queue_data *kqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	kqd = kyber_queue_data_alloc(q);
+	if (IS_ERR(kqd)) {
+		kobject_put(&eq->kobj);
+		return PTR_ERR(kqd);
+	}
+
+	eq->elevator_data = kqd;
+	q->elevator = eq;
+
+	blk_stat_add_callback(q, kqd->cb);
+
+	return 0;
+}
+
+static void kyber_exit_sched(struct elevator_queue *e)
+{
+	struct kyber_queue_data *kqd = e->elevator_data;
+	struct request_queue *q = kqd->q;
+	int i;
+
+	blk_stat_remove_callback(q, kqd->cb);
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+		sbitmap_queue_free(&kqd->domain_tokens[i]);
+	blk_stat_free_callback(kqd->cb);
+	kfree(kqd);
+}
+
+static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	struct kyber_hctx_data *khd;
+	int i;
+
+	khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
+	if (!khd)
+		return -ENOMEM;
+
+	spin_lock_init(&khd->lock);
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+		INIT_LIST_HEAD(&khd->rqs[i]);
+
+	khd->cur_domain = 0;
+	khd->batching = 0;
+
+	hctx->sched_data = khd;
+
+	return 0;
+}
+
+static void kyber_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	kfree(hctx->sched_data);
+}
+
+static int kyber_get_domain_token(struct kyber_queue_data *kqd,
+				  unsigned int sched_domain)
+{
+	struct sbitmap_queue *domain_tokens;
+
+	domain_tokens = &kqd->domain_tokens[sched_domain];
+	return __sbitmap_queue_get(domain_tokens);
+}
+
+static int rq_get_domain_token(struct request *rq)
+{
+	return (long)rq->elv.priv[0];
+}
+
+static void rq_set_domain_token(struct request *rq, int token)
+{
+	rq->elv.priv[0] = (void *)(long)token;
+}
+
+static void rq_clear_domain_token(struct kyber_queue_data *kqd,
+				  struct request *rq)
+{
+	unsigned int sched_domain;
+	int nr;
+
+	nr = rq_get_domain_token(rq);
+	if (nr != -1) {
+		sched_domain = rq_sched_domain(rq);
+		sbitmap_queue_clear(&kqd->domain_tokens[sched_domain], nr,
+				    rq->mq_ctx->cpu);
+	}
+}
+
+static struct request *kyber_get_request(struct request_queue *q,
+					 unsigned int op,
+					 struct blk_mq_alloc_data *data)
+{
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+	struct request *rq;
+
+	/*
+	 * We use the scheduler tags as per-hardware queue queueing tokens.
+	 * Async requests can be limited at this stage.
+	 */
+	if (!op_is_sync(op))
+		data->shallow_depth = kqd->async_depth;
+
+	rq = __blk_mq_alloc_request(data, op);
+	if (rq)
+		rq_set_domain_token(rq, -1);
+	return rq;
+}
+
+static void kyber_put_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+
+	rq_clear_domain_token(kqd, rq);
+	blk_mq_finish_request(rq);
+}
+
+static void kyber_completed_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
+	unsigned int sched_domain;
+	u64 now, latency, target;
+
+	/*
+	 * Check if this request met our latency goal. If not, quickly gather
+	 * some statistics and start throttling.
+	 */
+	sched_domain = rq_sched_domain(rq);
+	switch (sched_domain) {
+	case KYBER_READ:
+		target = kqd->read_lat_nsec;
+		break;
+	case KYBER_SYNC_WRITE:
+		target = kqd->write_lat_nsec;
+		break;
+	default:
+		return;
+	}
+
+	/* If we are already monitoring latencies, don't check again. */
+	if (blk_stat_is_active(kqd->cb))
+		return;
+
+	now = __blk_stat_time(ktime_to_ns(ktime_get()));
+	if (now < blk_stat_time(&rq->issue_stat))
+		return;
+
+	latency = now - blk_stat_time(&rq->issue_stat);
+
+	if (latency > target)
+		blk_stat_activate_msecs(kqd->cb, 10);
+}
+
+static void kyber_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+				  struct kyber_hctx_data *khd)
+{
+	LIST_HEAD(rq_list);
+	struct request *rq, *next;
+
+	blk_mq_flush_busy_ctxs(hctx, &rq_list);
+	list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+		unsigned int sched_domain;
+
+		sched_domain = rq_sched_domain(rq);
+		list_move_tail(&rq->queuelist, &khd->rqs[sched_domain]);
+	}
+}
+
+static struct request *
+kyber_dispatch_cur_domain(struct blk_mq_hw_ctx *hctx,
+			  struct kyber_queue_data *kqd,
+			  struct kyber_hctx_data *khd,
+			  bool *flushed, bool *no_tokens)
+{
+	struct list_head *rqs;
+	struct request *rq;
+	int nr;
+
+	rqs = &khd->rqs[khd->cur_domain];
+	rq = list_first_entry_or_null(rqs, struct request, queuelist);
+
+	/*
+	 * If there wasn't already a pending request and we haven't flushed the
+	 * software queues yet, flush the software queues and check again.
+	 */
+	if (!rq && !*flushed) {
+		kyber_flush_busy_ctxs(hctx, khd);
+		*flushed = true;
+		rq = list_first_entry_or_null(rqs, struct request, queuelist);
+	}
+
+	if (rq) {
+		nr = kyber_get_domain_token(kqd, khd->cur_domain);
+		if (nr == -1) {
+			*no_tokens = true;
+		} else {
+			khd->batching++;
+			rq_set_domain_token(rq, nr);
+			list_del_init(&rq->queuelist);
+			return rq;
+		}
+	}
+
+	/* There were either no pending requests or no tokens. */
+	return NULL;
+}
+
+/*
+ * Returns a request on success, NULL if there were no requests to dispatch, and
+ * ERR_PTR(-EBUSY) if there were requests to dispatch but no domain tokens for
+ * them.
+ */
+static struct request *__kyber_dispatch_request(struct kyber_queue_data *kqd,
+						struct kyber_hctx_data *khd,
+						struct blk_mq_hw_ctx *hctx)
+{
+	bool flushed = false, no_tokens = false;
+	struct request *rq;
+	int i;
+
+	/*
+	 * First, if we are still entitled to batch, try to dispatch a request
+	 * from the batch.
+	 */
+	if (khd->batching < kyber_batch_size[khd->cur_domain]) {
+		rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+					       &no_tokens);
+		if (rq)
+			return rq;
+	}
+
+	/*
+	 * Either,
+	 * 1. We were no longer entitled to a batch.
+	 * 2. The domain we were batching didn't have any requests.
+	 * 3. The domain we were batching was out of tokens.
+	 *
+	 * Start another batch. Note that this wraps back around to the original
+	 * domain if no other domains have requests or tokens.
+	 */
+	khd->batching = 0;
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		if (++khd->cur_domain >= KYBER_NUM_DOMAINS)
+			khd->cur_domain = 0;
+
+		rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+					       &no_tokens);
+		if (rq)
+			return rq;
+	}
+
+	return no_tokens ? ERR_PTR(-EBUSY) : NULL;
+}
+
+static struct request *kyber_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
+	struct kyber_hctx_data *khd = hctx->sched_data;
+	struct request *rq;
+
+	spin_lock(&khd->lock);
+
+	rq = __kyber_dispatch_request(kqd, khd, hctx);
+	if (IS_ERR(rq)) {
+		/*
+		 * We failed to get a domain token. Mark the queue as needing a
+		 * restart and try again in case a token was freed before we set
+		 * the restart bit.
+		 */
+		blk_mq_sched_mark_restart_queue(hctx);
+		rq = __kyber_dispatch_request(kqd, khd, hctx);
+		if (IS_ERR(rq))
+			rq = NULL;
+	}
+
+	spin_unlock(&khd->lock);
+
+	return rq;
+}
+
+static bool kyber_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct kyber_hctx_data *khd = hctx->sched_data;
+	int i;
+
+	for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+		if (!list_empty_careful(&khd->rqs[i]))
+			return true;
+	}
+	return false;
+}
+
+#define KYBER_LAT_SHOW_STORE(op)					\
+static ssize_t kyber_##op##_lat_show(struct elevator_queue *e,		\
+				     char *page)			\
+{									\
+	struct kyber_queue_data *kqd = e->elevator_data;		\
+									\
+	return sprintf(page, "%llu\n", kqd->op##_lat_nsec);		\
+}									\
+									\
+static ssize_t kyber_##op##_lat_store(struct elevator_queue *e,		\
+				      const char *page, size_t count)	\
+{									\
+	struct kyber_queue_data *kqd = e->elevator_data;		\
+	unsigned long long nsec;					\
+	int ret;							\
+									\
+	ret = kstrtoull(page, 10, &nsec);				\
+	if (ret)							\
+		return ret;						\
+									\
+	kqd->op##_lat_nsec = nsec;					\
+									\
+	return count;							\
+}
+KYBER_LAT_SHOW_STORE(read);
+KYBER_LAT_SHOW_STORE(write);
+#undef KYBER_LAT_SHOW_STORE
+
+#define KYBER_LAT_ATTR(op) __ATTR(op##_lat_nsec, 0644, kyber_##op##_lat_show, kyber_##op##_lat_store)
+static struct elv_fs_entry kyber_sched_attrs[] = {
+	KYBER_LAT_ATTR(read),
+	KYBER_LAT_ATTR(write),
+	__ATTR_NULL
+};
+#undef KYBER_LAT_ATTR
+
+static struct elevator_type kyber_sched = {
+	.ops.mq = {
+		.init_sched = kyber_init_sched,
+		.exit_sched = kyber_exit_sched,
+		.init_hctx = kyber_init_hctx,
+		.exit_hctx = kyber_exit_hctx,
+		.get_request = kyber_get_request,
+		.put_request = kyber_put_request,
+		.completed_request = kyber_completed_request,
+		.dispatch_request = kyber_dispatch_request,
+		.has_work = kyber_has_work,
+	},
+	.uses_mq = true,
+	.elevator_attrs = kyber_sched_attrs,
+	.elevator_name = "kyber",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init kyber_init(void)
+{
+	return elv_register(&kyber_sched);
+}
+
+static void __exit kyber_exit(void)
+{
+	elv_unregister(&kyber_sched);
+}
+
+module_init(kyber_init);
+module_exit(kyber_exit);
+
+MODULE_AUTHOR("Omar Sandoval");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Kyber I/O scheduler");
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 4/5] blk-mq: export blk_mq_finish_request()
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This is required for schedulers that define their own put_request().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3dbfed6d0e5b..e75fa7a1a653 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -368,6 +368,7 @@ void blk_mq_finish_request(struct request *rq)
 {
 	blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
 }
+EXPORT_SYMBOL_GPL(blk_mq_finish_request);
 
 void blk_mq_free_request(struct request *rq)
 {
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 3/5] blk-mq-sched: make completed_request() callback more useful
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.h     | 11 +++--------
 block/blk-mq.c           |  5 ++++-
 include/linux/elevator.h |  2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index c6e760df0fb4..d846e07bb564 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -82,17 +82,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 	return true;
 }
 
-static inline void
-blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct elevator_queue *e = rq->q->elevator;
 
 	if (e && e->type->ops.mq.completed_request)
-		e->type->ops.mq.completed_request(hctx, rq);
-
-	BUG_ON(rq->internal_tag == -1);
-
-	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
+		e->type->ops.mq.completed_request(rq);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b49fdfde7e06..3dbfed6d0e5b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -350,7 +350,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
-		blk_mq_sched_completed_request(hctx, rq);
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
@@ -443,6 +443,9 @@ static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	if (rq->internal_tag != -1)
+		blk_mq_sched_completed_request(rq);
+
 	blk_mq_stat_add(rq);
 
 	if (!q->softirq_done_fn)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b7ec315ee7e7..3a216318ae73 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -106,7 +106,7 @@ struct elevator_mq_ops {
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
 	bool (*has_work)(struct blk_mq_hw_ctx *);
-	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*completed_request)(struct request *);
 	void (*started_request)(struct request *);
 	void (*requeue_request)(struct request *);
 	struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 0/5] blk-mq: Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is v2 of Kyber, an I/O scheduler for multiqueue devices combining
several techniques: the scalable bitmap library, the new blk-stats API,
and queue depth throttling similar to blk-wbt. v1 was here [1].

v2 adds a tunable target synchronous write latency. The heuristics are
more fleshed out and balance read and synchronous write latencies based
on the latency targets. The queueing and dispatch mechanism is the same
as before.

This series is based on block/for-next + the initialization fix series I
sent out yesterday [2]. Patches 1 and 2 implement a new sbitmap
operation. Patch 3 moves a scheduler callback to somewhere more useful.
Patch 4 exports a required function. Patch 5 implements the new
scheduler.

Thanks!

1: http://marc.info/?l=linux-block&m=148978871820916&w=2
2: http://marc.info/?l=linux-block&m=149125578724683&w=2

Omar Sandoval (5):
  sbitmap: add sbitmap_get_shallow() operation
  blk-mq: add shallow depth option for blk_mq_get_tag()
  blk-mq-sched: make completed_request() callback more useful
  blk-mq: export blk_mq_finish_request()
  blk-mq: introduce Kyber multiqueue I/O scheduler

 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched                 |   9 +
 block/Makefile                        |   1 +
 block/blk-mq-sched.h                  |  11 +-
 block/blk-mq-tag.c                    |   5 +-
 block/blk-mq.c                        |   6 +-
 block/blk-mq.h                        |   1 +
 block/kyber-iosched.c                 | 706 ++++++++++++++++++++++++++++++++++
 include/linux/elevator.h              |   2 +-
 include/linux/sbitmap.h               |  55 +++
 lib/sbitmap.c                         |  75 +++-
 11 files changed, 867 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

-- 
2.12.2

^ permalink raw reply

* [PATCH v2 2/5] blk-mq: add shallow depth option for blk_mq_get_tag()
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-tag.c | 5 ++++-
 block/blk-mq.h     | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..d0be72ccb091 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
 	    !hctx_may_queue(data->hctx, bt))
 		return -1;
-	return __sbitmap_queue_get(bt);
+	if (data->shallow_depth)
+		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
+	else
+		return __sbitmap_queue_get(bt);
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..77ec66369f21 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -141,6 +141,7 @@ struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
 	unsigned int flags;
+	unsigned int shallow_depth;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 1/5] sbitmap: add sbitmap_get_shallow() operation
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/sbitmap.h | 55 ++++++++++++++++++++++++++++++++++++
 lib/sbitmap.c           | 75 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 123 insertions(+), 7 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d4e0a204c118..a1904aadbc45 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -176,6 +176,25 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth);
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin);
 
 /**
+ * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
+ * limiting the depth used from each word.
+ * @sb: Bitmap to allocate from.
+ * @alloc_hint: Hint for where to start searching for a free bit.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ *
+ * This rather specific operation allows for having multiple users with
+ * different allocation limits. E.g., there can be a high-priority class that
+ * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
+ * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
+ * class can only allocate half of the total bits in the bitmap, preventing it
+ * from starving out the high-priority class.
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth);
+
+/**
  * sbitmap_any_bit_set() - Check for a set bit in a &struct sbitmap.
  * @sb: Bitmap to check.
  *
@@ -326,6 +345,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth);
 int __sbitmap_queue_get(struct sbitmap_queue *sbq);
 
 /**
+ * __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word, with preemption
+ * already disabled.
+ * @sbq: Bitmap queue to allocate from.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth);
+
+/**
  * sbitmap_queue_get() - Try to allocate a free bit from a &struct
  * sbitmap_queue.
  * @sbq: Bitmap queue to allocate from.
@@ -346,6 +378,29 @@ static inline int sbitmap_queue_get(struct sbitmap_queue *sbq,
 }
 
 /**
+ * sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word.
+ * @sbq: Bitmap queue to allocate from.
+ * @cpu: Output parameter; will contain the CPU we ran on (e.g., to be passed to
+ *       sbitmap_queue_clear()).
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+static inline int sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+					    unsigned int *cpu,
+					    unsigned int shallow_depth)
+{
+	int nr;
+
+	*cpu = get_cpu();
+	nr = __sbitmap_queue_get_shallow(sbq, shallow_depth);
+	put_cpu();
+	return nr;
+}
+
+/**
  * sbitmap_queue_clear() - Free an allocated bit and wake up waiters on a
  * &struct sbitmap_queue.
  * @sbq: Bitmap to free from.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 60e800e0b5a0..80aa8d5463fa 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -79,15 +79,15 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
 
-static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
-			      bool wrap)
+static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
+			      unsigned int hint, bool wrap)
 {
 	unsigned int orig_hint = hint;
 	int nr;
 
 	while (1) {
-		nr = find_next_zero_bit(&word->word, word->depth, hint);
-		if (unlikely(nr >= word->depth)) {
+		nr = find_next_zero_bit(word, depth, hint);
+		if (unlikely(nr >= depth)) {
 			/*
 			 * We started with an offset, and we didn't reset the
 			 * offset to 0 in a failure case, so start from 0 to
@@ -100,11 +100,11 @@ static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
 			return -1;
 		}
 
-		if (!test_and_set_bit(nr, &word->word))
+		if (!test_and_set_bit(nr, word))
 			break;
 
 		hint = nr + 1;
-		if (hint >= word->depth - 1)
+		if (hint >= depth - 1)
 			hint = 0;
 	}
 
@@ -119,7 +119,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 	index = SB_NR_TO_INDEX(sb, alloc_hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index],
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth,
 					SB_NR_TO_BIT(sb, alloc_hint),
 					!round_robin);
 		if (nr != -1) {
@@ -141,6 +142,37 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 }
 EXPORT_SYMBOL_GPL(sbitmap_get);
 
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth)
+{
+	unsigned int i, index;
+	int nr = -1;
+
+	index = SB_NR_TO_INDEX(sb, alloc_hint);
+
+	for (i = 0; i < sb->map_nr; i++) {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					min(sb->map[index].depth, shallow_depth),
+					SB_NR_TO_BIT(sb, alloc_hint), true);
+		if (nr != -1) {
+			nr += index << sb->shift;
+			break;
+		}
+
+		/* Jump to next index. */
+		index++;
+		alloc_hint = index << sb->shift;
+
+		if (index >= sb->map_nr) {
+			index = 0;
+			alloc_hint = 0;
+		}
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(sbitmap_get_shallow);
+
 bool sbitmap_any_bit_set(const struct sbitmap *sb)
 {
 	unsigned int i;
@@ -342,6 +374,35 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq)
 }
 EXPORT_SYMBOL_GPL(__sbitmap_queue_get);
 
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth)
+{
+	unsigned int hint, depth;
+	int nr;
+
+	hint = this_cpu_read(*sbq->alloc_hint);
+	depth = READ_ONCE(sbq->sb.depth);
+	if (unlikely(hint >= depth)) {
+		hint = depth ? prandom_u32() % depth : 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+	nr = sbitmap_get_shallow(&sbq->sb, hint, shallow_depth);
+
+	if (nr == -1) {
+		/* If the map is full, a hint won't do us much good. */
+		this_cpu_write(*sbq->alloc_hint, 0);
+	} else if (nr == hint || unlikely(sbq->round_robin)) {
+		/* Only update the hint if we used it. */
+		hint = nr + 1;
+		if (hint >= depth - 1)
+			hint = 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(__sbitmap_queue_get_shallow);
+
 static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 {
 	int i, wake_index;
-- 
2.12.2

^ permalink raw reply related

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Bart Van Assche @ 2017-04-04 16:25 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, linux-block, Linux SCSI List
In-Reply-To: <CACVXFVPyWL8XyFKC9uT7r4q+QT1+_RvnvJnOskCO3aSnUEVY1w@mail.gmail.com>

On 04/04/2017 09:00 AM, Ming Lei wrote:=0A=
> Just be curious, why is multi hard queue used for this case? Are there=0A=
> some real cases in SCSI?=0A=
=0A=
Hello Ming,=0A=
=0A=
Yes, there is a real need for this. Background information is available=0A=
in the following e-mail thread: Arun Easi, "scsi-mq - tag# and=0A=
can_queue, performance", April 2, 2017, linux-scsi=0A=
(http://www.spinics.net/lists/linux-scsi/msg106853.html).=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Ming Lei @ 2017-04-04 15:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, Linux SCSI List
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

On Tue, Apr 4, 2017 at 8:07 PM, Hannes Reinecke <hare@suse.de> wrote:
> Hi all,
>
> as discussed recently most existing HBAs have a host-wide tagset which
> does not map easily onto the per-queue tagset model of block mq.
> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
> enables the use of a shared tagset for all hardware queues.
> The second patch adds a flag 'host_tagset' to the SCSI host template,
> which allows drivers to enable the use of the global tagset.
>
> This patchset probably has some performance implications as
> there is a quite high probability of cache-bouncing when allocating
> tags. Also I'm not quite sure if the implemented tagset sharing
> is the correct way to handle things.
> So this can be considered an RFC.
>
> As usual, comments and reviews are welcome.

Just be curious, why is multi hard queue used for this case? Are there
some real cases in SCSI?


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Bart Van Assche @ 2017-04-04 15:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Martin K . Petersen,
	James Bottomley, Hannes Reinecke
In-Reply-To: <20170404064252.GD9168@lst.de>

On 04/03/2017 11:42 PM, Christoph Hellwig wrote:=0A=
>> +static void scsi_restart_hctx(struct request_queue *q,=0A=
>> +			      struct blk_mq_hw_ctx *hctx)=0A=
>> +{=0A=
>> +	struct blk_mq_tags *tags =3D hctx->tags;=0A=
>> +	struct blk_mq_tag_set *set =3D q->tag_set;=0A=
>> +	int i;=0A=
>> +=0A=
>> +	rcu_read_lock();=0A=
>> +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)=0A=
>> +		queue_for_each_hw_ctx(q, hctx, i)=0A=
>> +			if (hctx->tags =3D=3D tags)=0A=
>> +				blk_mq_sched_restart_hctx(hctx);=0A=
>> +	rcu_read_unlock();=0A=
>> +}=0A=
> =0A=
> This looks like generic block layer code, why is it in SCSI?=0A=
=0A=
Hello Christoph,=0A=
=0A=
That's an excellent question. I assume that you are fine with moving=0A=
this code to the block layer?=0A=
=0A=
Bart.=0A=
=0A=

^ permalink raw reply

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Ming Lei @ 2017-04-04 15:56 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <871st8kxnk.fsf@dmlp.sw.ru>

On Tue, Apr 4, 2017 at 11:19 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> Ming Lei <tom.leiming@gmail.com> writes:
>
>> On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>>> Currently if some one try to advance bvec beyond it's size we simply
>>> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
>>> This simply means that we endup dereferencing/corrupting random memory
>>> region.
>>>
>>> Sane reaction would be to propagate error back to calling context
>>> But bvec_iter_advance's calling context is not always good for error
>>> handling. For safity reason let truncate iterator size to zero which
>>
>> IMO, we can avoid continuing to iterate by checking the return value,
>> and looks it is rude to just set iterator size as 0.
> But situation itself is horrible already. IMHO this is BUG_ON situation,

Not sure it is a real BUG_ON() since corrupt bvec array shouldn't happen
because we usually don't modify bvec array directly and just copy to local
variable for further access, but dereferencing random memory might happen.

> but since Linus hate bugons, I try to replace with something loud, but
> very safe. Since there is no guarantee that caller will ignore an error
> and try to dereference bvec the only safe thing we can to is to clamp
> iterator to zero to prevent any possible usage in future.

Since you prevent the case from happening, no dereference invalid bvec
can happen any more, but may cause dead loop. Setting iterator size as
zero can break the dead loop, but the driver still may not know this error
and continue to do its following work.

Don't have a better idea now, and looks it is fine to set iter.bi_size as
zero at the beginning of guarding bvec iteration.



Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-04 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Martin K . Petersen,
	James Bottomley, Hannes Reinecke
In-Reply-To: <20170404064038.GB9168@lst.de>

On 04/03/2017 11:40 PM, Christoph Hellwig wrote:=0A=
> On Mon, Apr 03, 2017 at 04:22:25PM -0700, Bart Van Assche wrote:=0A=
>> A later patch in this series will namely use RCU to iterate over=0A=
>> this list.=0A=
> =0A=
> It also adds a couple lockdep_assert_held calls, which might be worth=0A=
> mentioning.=0A=
> =0A=
> Otherwise looks good:=0A=
> =0A=
> Reviewed-by: Christoph Hellwig <hch@lst.de>=0A=
=0A=
Thanks for the review. I will update the patch description.=0A=
=0A=
Bart.=0A=
=0A=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Bart Van Assche @ 2017-04-04 15:46 UTC (permalink / raw)
  To: osandov@osandov.com, hare@suse.de
  Cc: hch@lst.de, james.bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org, osandov@fb.com,
	linux-block@vger.kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
In-Reply-To: <20170404153234.GA3234@vader>

On Tue, 2017-04-04 at 08:32 -0700, Omar Sandoval wrote:
> blk-mq already supports a shared tagset, and scsi-mq already uses that.
> When we initialize a request queue, we add it to a tagset with
> blk_mq_add_queue_set(), where we automatically mark the tagset as shared
> if there is more than one queue using it. What does this do that
> BLK_MQ_F_TAG_SHARED doesn't cover?

Hello Omar,

Today blk-mq creates one tag set per hardware queue. The sharing by
scsi-mq is between request queues for hardware queues that have the same
index but not between hardware queues of a single request queue. My
understanding is that the goal of this patch series is to make it possible
to use a single tag set for all hardware queues and all request queues that
share the same SCSI host.

Bart.=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Omar Sandoval @ 2017-04-04 15:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

On Tue, Apr 04, 2017 at 02:07:43PM +0200, Hannes Reinecke wrote:
> Hi all,
> 
> as discussed recently most existing HBAs have a host-wide tagset which
> does not map easily onto the per-queue tagset model of block mq.
> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
> enables the use of a shared tagset for all hardware queues.
> The second patch adds a flag 'host_tagset' to the SCSI host template,
> which allows drivers to enable the use of the global tagset.
> 
> This patchset probably has some performance implications as
> there is a quite high probability of cache-bouncing when allocating
> tags. Also I'm not quite sure if the implemented tagset sharing
> is the correct way to handle things.
> So this can be considered an RFC.
> 
> As usual, comments and reviews are welcome.

Hi, Hannes,

blk-mq already supports a shared tagset, and scsi-mq already uses that.
When we initialize a request queue, we add it to a tagset with
blk_mq_add_queue_set(), where we automatically mark the tagset as shared
if there is more than one queue using it. What does this do that
BLK_MQ_F_TAG_SHARED doesn't cover?

^ permalink raw reply

* Re: [PATCH] Block SQ/MQ Request Priority
From: Bart Van Assche @ 2017-04-04 15:31 UTC (permalink / raw)
  To: Adam Manzanares, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, axboe@kernel.dk
  Cc: hch@infradead.org
In-Reply-To: <1491319514-6671-1-git-send-email-adam.manzanares@wdc.com>

On Tue, 2017-04-04 at 08:25 -0700, adam.manzanares@wdc.com wrote:
>=20

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator
From: Bart Van Assche @ 2017-04-04 15:28 UTC (permalink / raw)
  To: paolo.valente@linaro.org
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
	avanzini.arianna@gmail.com, broonie@kernel.org, tj@kernel.org,
	ulf.hansson@linaro.org
In-Reply-To: <867BBC1E-080B-4A4B-88CE-BD103610BA6C@linaro.org>

On Tue, 2017-04-04 at 12:42 +0200, Paolo Valente wrote:
> > Il giorno 31 mar 2017, alle ore 17:31, Bart Van Assche <bart.vanassche@=
sandisk.com> ha scritto:
> >=20
> > On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
> > > +               delta_ktime =3D ktime_get();
> > > +       delta_ktime =3D ktime_sub(delta_ktime, bfqd->last_budget_star=
t);
> > > +       delta_usecs =3D ktime_to_us(delta_ktime);
> >=20
> > This patch changes the type of the variable in which the result of ktim=
e_to_us()
> > is stored from u64 into u32 and next compares that result with LONG_MAX=
. Since
> > ktime_to_us() returns a signed 64-bit number, are you sure you want to =
store that
> > result in a 32-bit variable? If ktime_to_us() would e.g. return 0xfffff=
fff00000100
> > or 0x100000100 then the assignment will truncate these numbers to 0x100=
.
>=20
> The instruction above the assignment you highlight stores in
> delta_ktime the difference between 'now' and the last budget start.
> The latter may have happened at most about 100 ms before 'now'.  So
> there should be no overflow issue.

Hello Paolo,

Please double check the following code: if (delta_usecs < 1000 || delta_use=
cs >=3D LONG_MAX)
Since delta_usecs is a 32-bit variable and LONG_MAX a 64-bit constant on 64=
-bit systems
I'm not sure that code will do what it is intended to do.

Thanks,

Bart.=

^ permalink raw reply

* [PATCH] Block SQ/MQ Request Priority
From: adam.manzanares @ 2017-04-04 15:25 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: hch, Bart.VanAssche, Adam Manzanares

From: Adam Manzanares <adam.manzanares@wdc.com>

In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.

I have tested this patch with the null block device driver with the following
parameters.

null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1

I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.

I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.

Thanks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..316a539 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1149,7 +1149,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 
 	blk_rq_init(q, rq);
 	blk_rq_set_rl(rq, rl);
-	blk_rq_set_prio(rq, ioc);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;
 
@@ -1636,6 +1635,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
 
 	req->errors = 0;
 	req->__sector = bio->bi_iter.bi_sector;
+	blk_rq_set_prio(req, rq_ioc(bio));
 	if (ioprio_valid(bio_prio(bio)))
 		req->ioprio = bio_prio(bio);
 	blk_rq_bio_prep(req->q, req, bio);
-- 
2.7.4

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

^ permalink raw reply related

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Dmitry Monakhov @ 2017-04-04 15:19 UTC (permalink / raw)
  To: Ming Lei; +Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <CACVXFVPWe0ABh5CsXryqXi7Umtvvzcq=jp2n+t=JDRWSZJsbFA@mail.gmail.com>

Ming Lei <tom.leiming@gmail.com> writes:

> On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>> Currently if some one try to advance bvec beyond it's size we simply
>> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
>> This simply means that we endup dereferencing/corrupting random memory
>> region.
>>
>> Sane reaction would be to propagate error back to calling context
>> But bvec_iter_advance's calling context is not always good for error
>> handling. For safity reason let truncate iterator size to zero which
>
> IMO, we can avoid continuing to iterate by checking the return value,
> and looks it is rude to just set iterator size as 0.
But situation itself is horrible already. IMHO this is BUG_ON situation,
but since Linus hate bugons, I try to replace with something loud, but
very safe. Since there is no guarantee that caller will ignore an error
and try to dereference bvec the only safe thing we can to is to clamp
iterator to zero to prevent any possible usage in future.
>
>> will break external iteration loop which prevent us from unpredictable
>> memory range corruption. And even it caller ignores an error, it will
>> corrupt it's own bvecs, not others.
>>
>> This patch does:
>> - Return error back to caller with hope that it will react on this
>> - Truncate iterator size
>>
>> Code was added long time ago here 4550dd6c, luckily no one hit it
>> in real life :)
>>
>> changes since V1:
>>  - Replace  BUG_ON with error logic.
>>
>> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
>> ---
>>  drivers/nvdimm/blk.c |  4 +++-
>>  drivers/nvdimm/btt.c |  4 +++-
>>  include/linux/bio.h  |  8 ++++++--
>>  include/linux/bvec.h | 11 ++++++++---
>>  4 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
>> index 1edb3f3..04c3075 100644
>> --- a/drivers/nvdimm/blk.c
>> +++ b/drivers/nvdimm/blk.c
>> @@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
>>
>>                 len -= cur_len;
>>                 dev_offset += cur_len;
>> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               if (err)
>> +                       return err;
>>         }
>>
>>         return err;
>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>> index 03ded8d..3f3aa7b 100644
>> --- a/drivers/nvdimm/btt.c
>> +++ b/drivers/nvdimm/btt.c
>> @@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
>>
>>                 len -= cur_len;
>>                 meta_nsoff += cur_len;
>> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               if (ret)
>> +                       return ret;
>>         }
>>
>>         return ret;
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index 0c1c95c..8bf1564 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
>>
>>         if (bio_no_advance_iter(bio))
>>                 iter->bi_size -= bytes;
>> -       else
>> -               bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +       else {
>> +               int err;
>> +               err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +               if (unlikely(err))
>> +                       bio->bi_error = err;
>> +       }
>>  }
>>
>>  #define __bio_for_each_segment(bvl, bio, iter, start)                  \
>> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
>> index 89b65b8..c117f1a 100644
>> --- a/include/linux/bvec.h
>> +++ b/include/linux/bvec.h
>> @@ -22,6 +22,7 @@
>>
>>  #include <linux/kernel.h>
>>  #include <linux/bug.h>
>> +#include <linux/errno.h>
>>
>>  /*
>>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
>> @@ -66,12 +67,15 @@ struct bvec_iter {
>>         .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
>>  })
>>
>> -static inline void bvec_iter_advance(const struct bio_vec *bv,
>> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>>                                      struct bvec_iter *iter,
>>                                      unsigned bytes)
>>  {
>> -       WARN_ONCE(bytes > iter->bi_size,
>> -                 "Attempted to advance past end of bvec iter\n");
>> +       if(unlikely(bytes > iter->bi_size)) {
>> +               WARN(1, "Attempted to advance past end of bvec iter\n");
>> +               iter->bi_size = 0;
>> +               return -EINVAL;
>> +       }
>>
>>         while (bytes) {
>>                 unsigned iter_len = bvec_iter_len(bv, *iter);
>> @@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
>>                         iter->bi_idx++;
>>                 }
>>         }
>> +       return 0;
>>  }
>>
>>  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
>> --
>> 2.9.3
>>
>
>
>
> -- 
> Ming Lei

^ permalink raw reply

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Ming Lei @ 2017-04-04 15:03 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <1491204212-9952-8-git-send-email-dmonakhov@openvz.org>

On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> Currently if some one try to advance bvec beyond it's size we simply
> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
> This simply means that we endup dereferencing/corrupting random memory
> region.
>
> Sane reaction would be to propagate error back to calling context
> But bvec_iter_advance's calling context is not always good for error
> handling. For safity reason let truncate iterator size to zero which

IMO, we can avoid continuing to iterate by checking the return value,
and looks it is rude to just set iterator size as 0.

> will break external iteration loop which prevent us from unpredictable
> memory range corruption. And even it caller ignores an error, it will
> corrupt it's own bvecs, not others.
>
> This patch does:
> - Return error back to caller with hope that it will react on this
> - Truncate iterator size
>
> Code was added long time ago here 4550dd6c, luckily no one hit it
> in real life :)
>
> changes since V1:
>  - Replace  BUG_ON with error logic.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  drivers/nvdimm/blk.c |  4 +++-
>  drivers/nvdimm/btt.c |  4 +++-
>  include/linux/bio.h  |  8 ++++++--
>  include/linux/bvec.h | 11 ++++++++---
>  4 files changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
> index 1edb3f3..04c3075 100644
> --- a/drivers/nvdimm/blk.c
> +++ b/drivers/nvdimm/blk.c
> @@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
>
>                 len -= cur_len;
>                 dev_offset += cur_len;
> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               if (err)
> +                       return err;
>         }
>
>         return err;
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 03ded8d..3f3aa7b 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
>
>                 len -= cur_len;
>                 meta_nsoff += cur_len;
> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               if (ret)
> +                       return ret;
>         }
>
>         return ret;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 0c1c95c..8bf1564 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
>
>         if (bio_no_advance_iter(bio))
>                 iter->bi_size -= bytes;
> -       else
> -               bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> +       else {
> +               int err;
> +               err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> +               if (unlikely(err))
> +                       bio->bi_error = err;
> +       }
>  }
>
>  #define __bio_for_each_segment(bvl, bio, iter, start)                  \
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 89b65b8..c117f1a 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -22,6 +22,7 @@
>
>  #include <linux/kernel.h>
>  #include <linux/bug.h>
> +#include <linux/errno.h>
>
>  /*
>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
> @@ -66,12 +67,15 @@ struct bvec_iter {
>         .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
>  })
>
> -static inline void bvec_iter_advance(const struct bio_vec *bv,
> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>                                      struct bvec_iter *iter,
>                                      unsigned bytes)
>  {
> -       WARN_ONCE(bytes > iter->bi_size,
> -                 "Attempted to advance past end of bvec iter\n");
> +       if(unlikely(bytes > iter->bi_size)) {
> +               WARN(1, "Attempted to advance past end of bvec iter\n");
> +               iter->bi_size = 0;
> +               return -EINVAL;
> +       }
>
>         while (bytes) {
>                 unsigned iter_len = bvec_iter_len(bv, *iter);
> @@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
>                         iter->bi_idx++;
>                 }
>         }
> +       return 0;
>  }
>
>  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
> --
> 2.9.3
>



-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Ming Lei @ 2017-04-04 14:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <871staffus.fsf@notabene.neil.brown.name>

On Mon, Apr 3, 2017 at 9:18 AM, NeilBrown <neilb@suse.com> wrote:
>
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
>
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
>
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
>
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
>
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>         struct loop_cmd *cmd =
>                 container_of(work, struct loop_cmd, work);
> +       int oldflags = current->flags & PF_LESS_THROTTLE;
>
> +       current->flags |= PF_LESS_THROTTLE;
>         loop_handle_cmd(cmd);
> +       current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
>  }

You can do it against 'lo->worker_task' instead of doing it in each
loop_queue_work(),
and this flag needn't to be restored because the kernel thread is loop
specialized.


thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Christoph Hellwig @ 2017-04-04 13:09 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, linux-rdma, linux-nvme, linux-block, netdev,
	Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <becb84ac-7819-a207-56b1-70f16cb80e42@mellanox.com>

On Tue, Apr 04, 2017 at 10:46:54AM +0300, Max Gurtovoy wrote:
>> +	if (set->nr_hw_queues > dev->num_comp_vectors)
>> +		goto fallback;
>> +
>> +	for (queue = 0; queue < set->nr_hw_queues; queue++) {
>> +		mask = ib_get_vector_affinity(dev, first_vec + queue);
>> +		if (!mask)
>> +			goto fallback;
>
> Christoph,
> we can use fallback also in the blk-mq-pci.c in case pci_irq_get_affinity 
> fails, right ?

For PCI it shouldn't fail as the driver calling pci_irq_get_affinity
knows how it set up the interrupts.  So I don't think it's necessary there.

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:48 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <d84a1dcf-6f60-d089-f81d-85df5a504c19@profitbricks.com>



On 04/04/2017 02:24 PM, Michael Wang wrote:
> On 04/04/2017 12:23 PM, Michael Wang wrote:
> [snip]
>>> add something like
>>>   if (wbio->bi_next)
>>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>>
>>> that might help narrow down what is happening.
>>
>> Just triggered again in 4.4, dmesg like:
>>
>> [  399.240230] md: super_written gets error=-5
>> [  399.240286] md: super_written gets error=-5
>> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
> 
> Is it possible that the fail fast who changed the 'bi_end_io' inside
> fix_sync_read_error() help the used bio pass the check?

Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md
RFC patch so we can continue the discussion there.

Regards,
Michael Wang

> 
> I'm not sure but if the read bio was supposed to be reused as write
> for fail fast, maybe we should reset it like this?
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>                 /* Don't try recovering from here - just fail it
>                  * ... unless it is the last working device of course */
>                 md_error(mddev, rdev);
> -               if (test_bit(Faulty, &rdev->flags))
> +               if (test_bit(Faulty, &rdev->flags)) {
>                         /* Don't try to read from here, but make sure
>                          * put_buf does it's thing
>                          */
>                         bio->bi_end_io = end_sync_write;
> +                       bio->bi_next = NULL;
> +               }
>         }
>  
>         while(sectors) {
> 
> Regards,
> Michael Wang
> 
> 
>> [  399.240363] ------------[ cut here ]------------
>> [  399.240364] kernel BUG at block/blk-core.c:2147!
>> [  399.240365] invalid opcode: 0000 [#1] SMP 
>> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
>> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
>> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
>> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
>> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
>> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
>> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
>> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
>> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
>> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
>> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
>> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
>> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  399.240392] Stack:
>> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
>> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
>> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
>> [  399.240396] Call Trace:
>> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
>> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
>> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
>> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
>> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
>> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
>> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
>> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
>> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
>> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240435]  RSP <ffff8800d72b7d10>
>>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> NeilBrown
>>>

^ permalink raw reply

* Re: [PATCH 0/9] convert genericirq.tmpl and kernel-api.tmpl to DocBook
From: Mauro Carvalho Chehab @ 2017-04-04 12:34 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Linux Media Mailing List, Linux Doc Mailing List,
	Mauro Carvalho Chehab, Noam Camus, James Morris, zijun_hu,
	Markus Heiser, linux-clk, Jani Nikula, Andrew Morton, Jens Axboe,
	Nicholas Piggin, Russell King, linux-block, Kirill A. Shutemov,
	Mauro Carvalho Chehab, Joonsoo Kim, Ingo Molnar, Bjorn Helgaas,
	Serge E. Hallyn, Michal Hocko, Ross Zwisler, Chris Wilson,
	linux-mm, linux-security-module, Silvio Fricke, Takashi Iwai,
	Sebastian Andrzej Siewior, Jan Kara, Vlastimil Babka, linux-pci,
	Matt Fleming, Johannes Weiner, Andrey Ryabinin, Andy Lutomirski,
	Mel Gorman, Andy Shevchenko, Alexey Dobriyan, Hillf Danton
In-Reply-To: <20170402143418.3de75239@lwn.net>

Em Sun, 2 Apr 2017 14:34:18 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> On Thu, 30 Mar 2017 17:11:27 -0300
> Mauro Carvalho Chehab <mchehab@s-opensource.com> wrote:
> 
> > This series converts just two documents, adding them to the
> > core-api.rst book. It addresses the errors/warnings that popup
> > after the conversion.
> > 
> > I had to add two fixes to scripts/kernel-doc, in order to solve
> > some of the issues.  
> 
> I've applied the set, including the add-on to move some stuff to
> driver-api - thanks.

Thanks!

> For whatever reason, I had a hard time applying a few of these; "git am"
> would tell me this:
> 
> > Applying: docs-rst: core_api: move driver-specific stuff to drivers_api
> > fatal: sha1 information is lacking or useless (Documentation/driver-api/index.rst).
> > Patch failed at 0001 docs-rst: core_api: move driver-specific stuff to drivers_api
> > The copy of the patch that failed is found in: .git/rebase-apply/patch  
> 
> I was able to get around this, but it took some hand work.  How are you
> generating these?

That's weird. I'm using this to generate the patches:

	git format-patch -o $tmp_dir --stat --summary --patience --signoff --thread=shallow

plus some scripting that runs scripts/get_maintainers.

After that, I run:

	git send-email $tmp_dir

Then, exim sends the patches to a smart SMTP server (currently,
infradead.org, but I'm switching to s-opensource.org, as I'm getting
some troubles because the IP doesn't match the From: line).

Thanks,
Mauro

^ permalink raw reply

* [PATCH] cfq: Disable writeback throttling by default
From: Jan Kara @ 2017-04-04 12:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jan Kara

Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
postgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:

  8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
  8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
  8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
  8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
  8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
  8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
  8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
  8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
  8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
  8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
  8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
...
  8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
  8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
  8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
  8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
  8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
  8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
...
  8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
  8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
  8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
  8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
  8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
  8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
  8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
  8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
  8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
...
  8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
  8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
  8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
  8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
  8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
  8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
  8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1

So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.

Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
a read throughput which roughly matches readers/writers ratio and
overall the system stalls are much shorter.

Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/cfq-iosched.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 440b95ee593c..da69b079725f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3761,16 +3761,14 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
 	uint64_t serial_nr;
-	bool nonroot_cg;
 
 	rcu_read_lock();
 	serial_nr = bio_blkcg(bio)->css.serial_nr;
-	nonroot_cg = bio_blkcg(bio) != &blkcg_root;
 	rcu_read_unlock();
 
 	/*
@@ -3778,7 +3776,7 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	 * spuriously on a newly created cic but there's no harm.
 	 */
 	if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr))
-		return nonroot_cg;
+		return;
 
 	/*
 	 * Drop reference to queues.  New queues will be assigned in new
@@ -3799,12 +3797,10 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	}
 
 	cic->blkcg_serial_nr = serial_nr;
-	return nonroot_cg;
 }
 #else
-static inline bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
-	return false;
 }
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
@@ -4449,12 +4445,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
-	bool disable_wbt;
 
 	spin_lock_irq(q->queue_lock);
 
 	check_ioprio_changed(cic, bio);
-	disable_wbt = check_blkcg_changed(cic, bio);
+	check_blkcg_changed(cic, bio);
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
@@ -4491,9 +4486,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	rq->elv.priv[1] = cfqq->cfqg;
 	spin_unlock_irq(q->queue_lock);
 
-	if (disable_wbt)
-		wbt_disable_default(q);
-
 	return 0;
 }
 
@@ -4706,6 +4698,7 @@ static void cfq_registered_queue(struct request_queue *q)
 	 */
 	if (blk_queue_nonrot(q))
 		cfqd->cfq_slice_idle = 0;
+	wbt_disable_default(q);
 }
 
 /*
-- 
2.10.2

^ permalink raw reply related

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:24 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <2a6f8c30-616c-d6fe-1c3f-ab687c145cd7@profitbricks.com>

On 04/04/2017 12:23 PM, Michael Wang wrote:
[snip]
>> add something like
>>   if (wbio->bi_next)
>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
> 
> Just triggered again in 4.4, dmesg like:
> 
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

Is it possible that the fail fast who changed the 'bi_end_io' inside
fix_sync_read_error() help the used bio pass the check?

I'm not sure but if the read bio was supposed to be reused as write
for fail fast, maybe we should reset it like this?

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
                /* Don't try recovering from here - just fail it
                 * ... unless it is the last working device of course */
                md_error(mddev, rdev);
-               if (test_bit(Faulty, &rdev->flags))
+               if (test_bit(Faulty, &rdev->flags)) {
                        /* Don't try to read from here, but make sure
                         * put_buf does it's thing
                         */
                        bio->bi_end_io = end_sync_write;
+                       bio->bi_next = NULL;
+               }
        }
 
        while(sectors) {

Regards,
Michael Wang


> [  399.240363] ------------[ cut here ]------------
> [  399.240364] kernel BUG at block/blk-core.c:2147!
> [  399.240365] invalid opcode: 0000 [#1] SMP 
> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  399.240392] Stack:
> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
> [  399.240396] Call Trace:
> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240435]  RSP <ffff8800d72b7d10>
> 
> 
> Regards,
> Michael Wang
> 
>>
>> NeilBrown
>>

^ permalink raw reply related

* Re: [PATCH 2/7] bio-integrity: save original iterator for verify stage
From: Dmitry Monakhov @ 2017-04-04 12:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <20170404070122.GC12008@infradead.org>

Christoph Hellwig <hch@infradead.org> writes:

> This is a pretty big increase in the bio_integrity_payload size,
> but I guess we can't get around it..

Yes, everybody hate this solution, me too, but I've stated with
other approach and it is appeaded to be very ugly.

My idea was that we have two types of iterator incrementors: bio_advance()
and bio_xx_complete, First one is called during split, later is called
on completion ( req_bio_endio() ) . So we can add new field "bi_done" to
iterator which has similar meaning as bi_bvec_done, but at full iterator
scope.  It is incremented during completion, but before end_io.
Chain bios will propogate bi_done to parent bio to parent one.
On ->vefify_fn() iterator will be rewinded (counter part of bvec_advance) to
iter->bi_done bytes, so we will get oritinal iterator. 
I've even prepare a patch for this idea and it looks big and awful.
Even more it does not works if chained bios overlapts (raid1,raid10,
etc).

But... at the time I've wrote this email I've realized that I do not
care about what happen with chained bios. The only thing is important
is parent bio and how far it was advanced. If bi_done is incremented
inside bvec_iter_advance() I can be shure that at the moment
->bi_end_io()
original position can be restored by rewinding back to io_done bytes.

I'll try to implement this.

>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox