Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH 0/4] block: callback-based statistics
From: Omar Sandoval @ 2017-03-14 21:07 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489524591.git.osandov@fb.com>

On Tue, Mar 14, 2017 at 02:03:27PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This patchset generalizes the blk-stats infrastructure to allow users to
> register a callback to be called at a given time with the statistics of
> requests completed during that window. Writeback throttling and hybrid
> polling are converted to the new infrastructure. The ultimate goal is to
> use this infrastructure for the mq I/O scheduler.
> 
> The details are in patch 4, which is the actual conversion. Patches 1-3
> are preparation cleanups.

FYI, the diff of block/blk-stat.[ch] is a pain to read. It's easier to
just apply the patches and look at the whole files.

^ permalink raw reply

* [PATCH 4/4] blk-stats: convert to callback-based statistics reporting
From: Omar Sandoval @ 2017-03-14 21:03 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489524591.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
   statistics. Since the blk-stats statistics are reset after every
   window and the wbt windows don't line up with the blk-stats windows,
   wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
   on how the window lines up, we may miss some I/Os. It's also
   unnecessary overhead to get the statistics on every I/O; the hybrid
   polling heuristic would be just as happy with the statistics from the
   previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active. The
callbacks are kept on an RCU list, and each callback has percpu stats
buffers. There will only be a few users, so the overhead on the I/O
completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-core.c          |   6 +-
 block/blk-mq.c            |  77 ++++++++----
 block/blk-mq.h            |   1 -
 block/blk-stat.c          | 300 +++++++++++++++++++++-------------------------
 block/blk-stat.h          |  98 ++++++++++++---
 block/blk-sysfs.c         |   5 +
 block/blk-wbt.c           |  48 +++-----
 block/blk-wbt.h           |   2 +-
 include/linux/blk_types.h |   3 -
 include/linux/blkdev.h    |  10 +-
 10 files changed, 307 insertions(+), 243 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 82425017c9b8..0f4ae118e220 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -852,6 +852,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio);
 
 int blk_init_allocated_queue(struct request_queue *q)
 {
+	q->stats = blk_alloc_queue_stats();
+	if (!q->stats)
+		return -ENOMEM;
+
 	q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
 	if (!q->fq)
 		return -ENOMEM;
@@ -2692,7 +2696,7 @@ void blk_finish_request(struct request *req, int error)
 	struct request_queue *q = req->q;
 
 	if (req->rq_flags & RQF_STATS)
-		blk_stat_add(&q->rq_stats[rq_data_dir(req)], req);
+		blk_stat_add(req);
 
 	if (req->rq_flags & RQF_QUEUED)
 		blk_queue_end_tag(q, req);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e16f8d420683..559f9b0f24a1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -39,6 +39,10 @@
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+static void blk_mq_poll_stats_start(struct request_queue *q);
+static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb,
+				 struct blk_stats *stats);
+
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
@@ -432,15 +436,8 @@ static void blk_mq_ipi_complete_request(struct request *rq)
 static void blk_mq_stat_add(struct request *rq)
 {
 	if (rq->rq_flags & RQF_STATS) {
-		/*
-		 * We could rq->mq_ctx here, but there's less of a risk
-		 * of races if we have the completion event add the stats
-		 * to the local software queue.
-		 */
-		struct blk_mq_ctx *ctx;
-
-		ctx = __blk_mq_get_ctx(rq->q, raw_smp_processor_id());
-		blk_stat_add(&ctx->stat[rq_data_dir(rq)], rq);
+		blk_mq_poll_stats_start(rq->q);
+		blk_stat_add(rq);
 	}
 }
 
@@ -2039,8 +2036,6 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		spin_lock_init(&__ctx->lock);
 		INIT_LIST_HEAD(&__ctx->rq_list);
 		__ctx->queue = q;
-		blk_stat_init(&__ctx->stat[BLK_STAT_READ]);
-		blk_stat_init(&__ctx->stat[BLK_STAT_WRITE]);
 
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
 		if (!cpu_online(i))
@@ -2338,6 +2333,14 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
 
+	q->stats = blk_alloc_queue_stats();
+	if (!q->stats)
+		goto err_exit;
+
+	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn, q);
+	if (!q->poll_cb)
+		goto err_exit;
+
 	q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
 	if (!q->queue_ctx)
 		goto err_exit;
@@ -2739,28 +2742,54 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+/* Enable polling stats and return whether they were already enabled. */
+static bool blk_poll_stats_enable(struct request_queue *q)
+{
+	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) ||
+	    test_and_set_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
+		return true;
+	blk_stat_add_callback(q, q->poll_cb);
+	return false;
+}
+
+static void blk_mq_poll_stats_start(struct request_queue *q)
+{
+	/*
+	 * We don't arm the callback if polling stats are not enabled or the
+	 * callback has already been armed.
+	 */
+	if (!test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) ||
+	    timer_pending(&q->poll_cb->timer))
+		return;
+
+	blk_stat_arm_callback(q->poll_cb, jiffies + msecs_to_jiffies(100));
+}
+
+static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb,
+				 struct blk_stats *stats)
+{
+	struct request_queue *q = cb->data;
+
+	if (stats->read.nr_samples)
+		q->poll_stats.read = stats->read;
+	if (stats->write.nr_samples)
+		q->poll_stats.write = stats->write;
+}
+
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct blk_mq_hw_ctx *hctx,
 				       struct request *rq)
 {
-	struct blk_stats stats;
 	unsigned long ret = 0;
 
 	/*
 	 * If stats collection isn't on, don't sleep but turn it on for
 	 * future users
 	 */
-	if (!blk_stat_enable(q))
+	if (!blk_poll_stats_enable(q))
 		return 0;
 
 	/*
-	 * We don't have to do this once per IO, should optimize this
-	 * to just use the current window of stats until it changes
-	 */
-	memset(&stats, 0, sizeof(stats));
-	blk_hctx_stat_get(hctx, &stats);
-
-	/*
 	 * As an optimistic guess, use half of the mean service time
 	 * for this type of request. We can (and should) make this smarter.
 	 * For instance, if the completion latencies are tight, we can
@@ -2768,10 +2797,10 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	 * important on devices where the completion latencies are longer
 	 * than ~10 usec.
 	 */
-	if (req_op(rq) == REQ_OP_READ && stats.read.nr_samples)
-		ret = (stats.read.mean + 1) / 2;
-	else if (req_op(rq) == REQ_OP_WRITE && stats.write.nr_samples)
-		ret = (stats.write.mean + 1) / 2;
+	if (req_op(rq) == REQ_OP_READ && q->poll_stats.read.nr_samples)
+		ret = (q->poll_stats.read.mean + 1) / 2;
+	else if (req_op(rq) == REQ_OP_WRITE && q->poll_stats.write.nr_samples)
+		ret = (q->poll_stats.write.mean + 1) / 2;
 
 	return ret;
 }
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b79f9a7d8cf6..8d49c06fc520 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -20,7 +20,6 @@ struct blk_mq_ctx {
 
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
-	struct blk_rq_stat	stat[2];
 
 	struct request_queue	*queue;
 	struct kobject		kobj;
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 40f6c90e432a..a71cfb35712d 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -4,11 +4,24 @@
  * Copyright (C) 2016 Jens Axboe
  */
 #include <linux/kernel.h>
+#include <linux/rculist.h>
 #include <linux/blk-mq.h>
 
 #include "blk-stat.h"
 #include "blk-mq.h"
 
+struct blk_queue_stats {
+	struct list_head callbacks;
+	spinlock_t lock;
+};
+
+static void blk_stat_init(struct blk_rq_stat *stat)
+{
+	stat->min = -1ULL;
+	stat->max = stat->nr_samples = stat->mean = 0;
+	stat->batch = stat->nr_batch = 0;
+}
+
 static void blk_stat_flush_batch(struct blk_rq_stat *stat)
 {
 	const s32 nr_batch = READ_ONCE(stat->nr_batch);
@@ -48,209 +61,164 @@ static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
 	dst->nr_samples += src->nr_samples;
 }
 
-static void blk_mq_stat_get(struct request_queue *q, struct blk_stats *stats)
+static void __blk_stat_add(struct blk_rq_stat *stat, u64 value)
 {
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	uint64_t latest = 0;
-	int i, j, nr;
-
-	blk_stat_init(&stats->read);
-	blk_stat_init(&stats->write);
-
-	nr = 0;
-	do {
-		uint64_t newest = 0;
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			hctx_for_each_ctx(hctx, ctx, j) {
-				blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]);
-				blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]);
-
-				if (!ctx->stat[BLK_STAT_READ].nr_samples &&
-				    !ctx->stat[BLK_STAT_WRITE].nr_samples)
-					continue;
-				if (ctx->stat[BLK_STAT_READ].time > newest)
-					newest = ctx->stat[BLK_STAT_READ].time;
-				if (ctx->stat[BLK_STAT_WRITE].time > newest)
-					newest = ctx->stat[BLK_STAT_WRITE].time;
-			}
-		}
+	stat->min = min(stat->min, value);
+	stat->max = max(stat->max, value);
 
-		/*
-		 * No samples
-		 */
-		if (!newest)
-			break;
-
-		if (newest > latest)
-			latest = newest;
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			hctx_for_each_ctx(hctx, ctx, j) {
-				if (ctx->stat[BLK_STAT_READ].time == newest) {
-					blk_stat_sum(&stats->read,
-						     &ctx->stat[BLK_STAT_READ]);
-					nr++;
-				}
-				if (ctx->stat[BLK_STAT_WRITE].time == newest) {
-					blk_stat_sum(&stats->write,
-						     &ctx->stat[BLK_STAT_WRITE]);
-					nr++;
-				}
-			}
-		}
-		/*
-		 * If we race on finding an entry, just loop back again.
-		 * Should be very rare.
-		 */
-	} while (!nr);
+	if (stat->batch + value < stat->batch ||
+	    stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
+		blk_stat_flush_batch(stat);
 
-	stats->read.time = stats->write.time = latest;
+	stat->batch += value;
+	stat->nr_batch++;
 }
 
-void blk_queue_stat_get(struct request_queue *q, struct blk_stats *stats)
+static struct blk_rq_stat *blk_stat_ddir(struct blk_stats __percpu *stats,
+					 int ddir)
 {
-	if (q->mq_ops)
-		blk_mq_stat_get(q, stats);
-	else {
-		blk_stat_flush_batch(&q->rq_stats[BLK_STAT_READ]);
-		blk_stat_flush_batch(&q->rq_stats[BLK_STAT_WRITE]);
-		memcpy(&stats->read, &q->rq_stats[BLK_STAT_READ],
-		       sizeof(struct blk_rq_stat));
-		memcpy(&stats->write, &q->rq_stats[BLK_STAT_WRITE],
-		       sizeof(struct blk_rq_stat));
-	}
+	if (ddir == READ)
+		return &this_cpu_ptr(stats)->read;
+	else
+		return &this_cpu_ptr(stats)->write;
 }
 
-void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_stats *stats)
+void blk_stat_add(struct request *rq)
 {
-	struct blk_mq_ctx *ctx;
-	unsigned int i, nr;
-
-	nr = 0;
-	do {
-		uint64_t newest = 0;
-
-		hctx_for_each_ctx(hctx, ctx, i) {
-			blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]);
-			blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]);
+	struct request_queue *q = rq->q;
+	struct blk_stat_callback *cb;
+	struct blk_rq_stat *stat;
+	int ddir = rq_data_dir(rq);
+	s64 now, value;
 
-			if (!ctx->stat[BLK_STAT_READ].nr_samples &&
-			    !ctx->stat[BLK_STAT_WRITE].nr_samples)
-				continue;
+	now = __blk_stat_time(ktime_to_ns(ktime_get()));
+	if (now < blk_stat_time(&rq->issue_stat))
+		return;
 
-			if (ctx->stat[BLK_STAT_READ].time > newest)
-				newest = ctx->stat[BLK_STAT_READ].time;
-			if (ctx->stat[BLK_STAT_WRITE].time > newest)
-				newest = ctx->stat[BLK_STAT_WRITE].time;
-		}
+	value = now - blk_stat_time(&rq->issue_stat);
 
-		if (!newest)
-			break;
-
-		hctx_for_each_ctx(hctx, ctx, i) {
-			if (ctx->stat[BLK_STAT_READ].time == newest) {
-				blk_stat_sum(&stats->read,
-					     &ctx->stat[BLK_STAT_READ]);
-				nr++;
-			}
-			if (ctx->stat[BLK_STAT_WRITE].time == newest) {
-				blk_stat_sum(&stats->write,
-					     &ctx->stat[BLK_STAT_WRITE]);
-				nr++;
-			}
+	rcu_read_lock();
+	list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
+		if (timer_pending(&cb->timer)) {
+			stat = blk_stat_ddir(cb->stats, ddir);
+			__blk_stat_add(stat, value);
 		}
-		/*
-		 * If we race on finding an entry, just loop back again.
-		 * Should be very rare, as the window is only updated
-		 * occasionally
-		 */
-	} while (!nr);
+	}
+	rcu_read_unlock();
 }
 
-static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
+static void blk_stat_timer_fn(unsigned long data)
 {
-	stat->min = -1ULL;
-	stat->max = stat->nr_samples = stat->mean = 0;
-	stat->batch = stat->nr_batch = 0;
-	stat->time = time_now & BLK_STAT_NSEC_MASK;
-}
+	struct blk_stat_callback *cb = (void *)data;
+	struct blk_stats stats;
+	int i;
 
-void blk_stat_init(struct blk_rq_stat *stat)
-{
-	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
-}
+	blk_stat_init(&stats.read);
+	blk_stat_init(&stats.write);
 
-static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
-{
-	return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
+	for_each_online_cpu(i) {
+		struct blk_stats *cpu_stats;
+
+		cpu_stats = per_cpu_ptr(cb->stats, i);
+		blk_stat_sum(&stats.read, &cpu_stats->read);
+		blk_stat_sum(&stats.write, &cpu_stats->write);
+		blk_stat_init(&cpu_stats->read);
+		blk_stat_init(&cpu_stats->write);
+	}
+
+	cb->fn(cb, &stats);
 }
 
-bool blk_stat_is_current(struct blk_rq_stat *stat)
+struct blk_stat_callback *
+blk_stat_alloc_callback(void (*fn)(struct blk_stat_callback *, struct blk_stats *),
+			void *data)
 {
-	return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
+	struct blk_stat_callback *cb;
+
+	cb = kmalloc(sizeof(*cb), GFP_KERNEL);
+	if (!cb)
+		return NULL;
+
+	cb->stats = alloc_percpu(struct blk_stats);
+	if (!cb->stats) {
+		kfree(cb);
+		return NULL;
+	}
+
+	cb->fn = fn;
+	cb->data = data;
+	setup_timer(&cb->timer, blk_stat_timer_fn, (unsigned long)cb);
+
+	return cb;
 }
+EXPORT_SYMBOL_GPL(blk_stat_alloc_callback);
 
-void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
+void blk_stat_add_callback(struct request_queue *q,
+			   struct blk_stat_callback *cb)
 {
-	s64 now, value;
+	int i;
 
-	now = __blk_stat_time(ktime_to_ns(ktime_get()));
-	if (now < blk_stat_time(&rq->issue_stat))
-		return;
+	for_each_possible_cpu(i) {
+		struct blk_stats *stats = per_cpu_ptr(cb->stats, i);
 
-	if (!__blk_stat_is_current(stat, now))
-		__blk_stat_init(stat, now);
+		blk_stat_init(&stats->read);
+		blk_stat_init(&stats->write);
+	}
 
-	value = now - blk_stat_time(&rq->issue_stat);
-	if (value > stat->max)
-		stat->max = value;
-	if (value < stat->min)
-		stat->min = value;
+	spin_lock(&q->stats->lock);
+	list_add_tail_rcu(&cb->list, &q->stats->callbacks);
+	set_bit(QUEUE_FLAG_STATS, &q->queue_flags);
+	spin_unlock(&q->stats->lock);
+}
+EXPORT_SYMBOL_GPL(blk_stat_add_callback);
 
-	if (stat->batch + value < stat->batch ||
-	    stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
-		blk_stat_flush_batch(stat);
+void blk_stat_remove_callback(struct request_queue *q,
+			      struct blk_stat_callback *cb)
+{
+	spin_lock(&q->stats->lock);
+	list_del_rcu(&cb->list);
+	if (list_empty(&q->stats->callbacks))
+		clear_bit(QUEUE_FLAG_STATS, &q->queue_flags);
+	spin_unlock(&q->stats->lock);
 
-	stat->batch += value;
-	stat->nr_batch++;
+	del_timer_sync(&cb->timer);
 }
+EXPORT_SYMBOL_GPL(blk_stat_remove_callback);
 
-void blk_stat_clear(struct request_queue *q)
+static void blk_stat_free_callback_rcu(struct rcu_head *head)
 {
-	if (q->mq_ops) {
-		struct blk_mq_hw_ctx *hctx;
-		struct blk_mq_ctx *ctx;
-		int i, j;
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			hctx_for_each_ctx(hctx, ctx, j) {
-				blk_stat_init(&ctx->stat[BLK_STAT_READ]);
-				blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
-			}
-		}
-	} else {
-		blk_stat_init(&q->rq_stats[BLK_STAT_READ]);
-		blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]);
-	}
+	struct blk_stat_callback *cb;
+
+	cb = container_of(head, struct blk_stat_callback, rcu);
+	free_percpu(cb->stats);
 }
 
-void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+void blk_stat_free_callback(struct blk_stat_callback *cb)
 {
-	stat->time = (stat->time & BLK_STAT_MASK) |
-			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+	call_rcu(&cb->rcu, blk_stat_free_callback_rcu);
 }
+EXPORT_SYMBOL_GPL(blk_stat_free_callback);
 
-/*
- * Enable stat tracking, return whether it was enabled
- */
-bool blk_stat_enable(struct request_queue *q)
+struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
-	if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		set_bit(QUEUE_FLAG_STATS, &q->queue_flags);
-		return false;
-	}
+	struct blk_queue_stats *stats;
+
+	stats = kmalloc(sizeof(*stats), GFP_KERNEL);
+	if (!stats)
+		return NULL;
+
+	INIT_LIST_HEAD(&stats->callbacks);
+	spin_lock_init(&stats->lock);
+
+	return stats;
+}
+
+void blk_free_queue_stats(struct blk_queue_stats *stats)
+{
+	if (!stats)
+		return;
+
+	WARN_ON(!list_empty(&stats->callbacks));
 
-	return true;
+	kfree(stats);
 }
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a24439aab710..fd6ba9183b17 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -1,11 +1,11 @@
 #ifndef BLK_STAT_H
 #define BLK_STAT_H
 
-/*
- * ~0.13s window as a power-of-2 (2^27 nsecs)
- */
-#define BLK_STAT_NSEC		134217728ULL
-#define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/ktime.h>
+#include <linux/rcupdate.h>
+#include <linux/timer.h>
 
 /*
  * Upper 3 bits can be used elsewhere
@@ -15,19 +15,27 @@
 #define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
 #define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
 
-enum {
-	BLK_STAT_READ	= 0,
-	BLK_STAT_WRITE,
+#define BLK_RQ_STAT_BATCH	64
+
+struct blk_stat_callback {
+	struct list_head list;
+	struct timer_list timer;
+	struct blk_stats __percpu *stats;
+	void (*fn)(struct blk_stat_callback *, struct blk_stats *);
+	void *data;
+	struct rcu_head rcu;
 };
 
-void blk_stat_add(struct blk_rq_stat *, struct request *);
-void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_stats *);
-void blk_queue_stat_get(struct request_queue *, struct blk_stats *);
-void blk_stat_clear(struct request_queue *);
-void blk_stat_init(struct blk_rq_stat *);
-bool blk_stat_is_current(struct blk_rq_stat *);
-void blk_stat_set_issue_time(struct blk_issue_stat *);
-bool blk_stat_enable(struct request_queue *);
+struct blk_queue_stats *blk_alloc_queue_stats(void);
+void blk_free_queue_stats(struct blk_queue_stats *);
+
+void blk_stat_add(struct request *);
+
+static inline void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+{
+	stat->time = ((stat->time & BLK_STAT_MASK) |
+		      (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK));
+}
 
 static inline u64 __blk_stat_time(u64 time)
 {
@@ -39,4 +47,62 @@ static inline u64 blk_stat_time(struct blk_issue_stat *stat)
 	return __blk_stat_time(stat->time);
 }
 
+/**
+ * blk_stat_alloc_callback() - Allocate a block statistics callback.
+ * @fn: Callback function.
+ * @data: Value for the @data field of the &struct blk_stat_callback.
+ *
+ * Return: &struct blk_stat_callback on success or NULL on ENOMEM.
+ */
+struct blk_stat_callback *
+blk_stat_alloc_callback(void (*fn)(struct blk_stat_callback *, struct blk_stats *),
+			void *data);
+
+/**
+ * blk_stat_add_callback() - Add a block statistics callback to be run on a
+ * request queue.
+ * @q: The request queue.
+ * @cb: The callback.
+ *
+ * Note that a single &struct blk_stat_callback can only be added to a single
+ * &struct request_queue.
+ */
+void blk_stat_add_callback(struct request_queue *q,
+			   struct blk_stat_callback *cb);
+
+/**
+ * blk_stat_remove_callback() - Remove a block statistics callback from a
+ * request queue.
+ * @q: The request queue.
+ * @cb: The callback.
+ *
+ * When this returns, the callback is not running on any CPUs and will not be
+ * called again unless readded.
+ */
+void blk_stat_remove_callback(struct request_queue *q,
+			      struct blk_stat_callback *cb);
+
+/**
+ * blk_stat_free_callback() - Free a block statistics callback.
+ * @cb: The callback.
+ *
+ * @cb may be NULL, in which case this does nothing. If it is not NULL, @cb must
+ * not be associated with a request queue. I.e., if it was previously added with
+ * blk_stat_add_callback(), it must also have been removed since then with
+ * blk_stat_remove_callback().
+ */
+void blk_stat_free_callback(struct blk_stat_callback *cb);
+
+/**
+ * blk_stat_arm_callback() - Set a block statistics callback to fire at a given
+ * time.
+ * @cb: The callback.
+ * @expires: The time (in jiffies) at which to run.
+ */
+static inline void blk_stat_arm_callback(struct blk_stat_callback *cb,
+					 unsigned long expires)
+{
+	mod_timer(&cb->timer, expires);
+}
+
 #endif
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6e1cf622af21..fa831cb2fc30 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -785,6 +785,9 @@ static void blk_release_queue(struct kobject *kobj)
 		container_of(kobj, struct request_queue, kobj);
 
 	wbt_exit(q);
+	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
+		blk_stat_remove_callback(q, q->poll_cb);
+	blk_stat_free_callback(q->poll_cb);
 	bdi_put(q->backing_dev_info);
 	blkcg_exit_queue(q);
 
@@ -793,6 +796,8 @@ static void blk_release_queue(struct kobject *kobj)
 		elevator_exit(q->elevator);
 	}
 
+	blk_free_queue_stats(q->stats);
+
 	blk_exit_rl(&q->root_rl);
 
 	if (q->queue_tags)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 5e81068f7c52..695bb92be82b 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -277,7 +277,7 @@ enum {
 	LAT_EXCEEDED,
 };
 
-static int __latency_exceeded(struct rq_wb *rwb, struct blk_stats *stats)
+static int latency_exceeded(struct rq_wb *rwb, struct blk_stats *stats)
 {
 	struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
 	u64 thislat;
@@ -308,8 +308,8 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_stats *stats)
 		 * waited or still has writes in flights, consider us doing
 		 * just writes as well.
 		 */
-		if ((stats->write.nr_samples && blk_stat_is_current(&stats->read)) ||
-		    wb_recent_wait(rwb) || wbt_inflight(rwb))
+		if (stats->write.nr_samples || wb_recent_wait(rwb) ||
+		    wbt_inflight(rwb))
 			return LAT_UNKNOWN_WRITES;
 		return LAT_UNKNOWN;
 	}
@@ -329,14 +329,6 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_stats *stats)
 	return LAT_OK;
 }
 
-static int latency_exceeded(struct rq_wb *rwb)
-{
-	struct blk_stats stats;
-
-	blk_queue_stat_get(rwb->queue, &stats);
-	return __latency_exceeded(rwb, &stats);
-}
-
 static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
 {
 	struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
@@ -355,7 +347,6 @@ static void scale_up(struct rq_wb *rwb)
 
 	rwb->scale_step--;
 	rwb->unknown_cnt = 0;
-	blk_stat_clear(rwb->queue);
 
 	rwb->scaled_max = calc_wb_limits(rwb);
 
@@ -385,7 +376,6 @@ static void scale_down(struct rq_wb *rwb, bool hard_throttle)
 
 	rwb->scaled_max = false;
 	rwb->unknown_cnt = 0;
-	blk_stat_clear(rwb->queue);
 	calc_wb_limits(rwb);
 	rwb_trace_step(rwb, "step down");
 }
@@ -412,16 +402,16 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 	}
 
 	expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
-	mod_timer(&rwb->window_timer, expires);
+	blk_stat_arm_callback(rwb->cb, expires);
 }
 
-static void wb_timer_fn(unsigned long data)
+static void wb_timer_fn(struct blk_stat_callback *cb, struct blk_stats *stats)
 {
-	struct rq_wb *rwb = (struct rq_wb *) data;
+	struct rq_wb *rwb = cb->data;
 	unsigned int inflight = wbt_inflight(rwb);
 	int status;
 
-	status = latency_exceeded(rwb);
+	status = latency_exceeded(rwb, stats);
 
 	trace_wbt_timer(rwb->queue->backing_dev_info, status, rwb->scale_step,
 			inflight);
@@ -614,7 +604,7 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
 
 	__wbt_wait(rwb, bio->bi_opf, lock);
 
-	if (!timer_pending(&rwb->window_timer))
+	if (!timer_pending(&rwb->cb->timer))
 		rwb_arm_timer(rwb);
 
 	if (current_is_kswapd())
@@ -675,7 +665,7 @@ void wbt_disable_default(struct request_queue *q)
 	struct rq_wb *rwb = q->rq_wb;
 
 	if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT) {
-		del_timer_sync(&rwb->window_timer);
+		blk_stat_remove_callback(q, rwb->cb);
 		rwb->win_nsec = rwb->min_lat_nsec = 0;
 		wbt_update_limits(rwb);
 	}
@@ -699,24 +689,23 @@ int wbt_init(struct request_queue *q)
 	struct rq_wb *rwb;
 	int i;
 
-	/*
-	 * For now, we depend on the stats window being larger than
-	 * our monitoring window. Ensure that this isn't inadvertently
-	 * violated.
-	 */
-	BUILD_BUG_ON(RWB_WINDOW_NSEC > BLK_STAT_NSEC);
 	BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS);
 
 	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
 	if (!rwb)
 		return -ENOMEM;
 
+	rwb->cb = blk_stat_alloc_callback(wb_timer_fn, rwb);
+	if (!rwb->cb) {
+		kfree(rwb);
+		return -ENOMEM;
+	}
+
 	for (i = 0; i < WBT_NUM_RWQ; i++) {
 		atomic_set(&rwb->rq_wait[i].inflight, 0);
 		init_waitqueue_head(&rwb->rq_wait[i].wait);
 	}
 
-	setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
 	rwb->wc = 1;
 	rwb->queue_depth = RWB_DEF_DEPTH;
 	rwb->last_comp = rwb->last_issue = jiffies;
@@ -726,10 +715,10 @@ int wbt_init(struct request_queue *q)
 	wbt_update_limits(rwb);
 
 	/*
-	 * Assign rwb, and turn on stats tracking for this queue
+	 * Assign rwb and add the stats callback.
 	 */
 	q->rq_wb = rwb;
-	blk_stat_enable(q);
+	blk_stat_add_callback(q, rwb->cb);
 
 	rwb->min_lat_nsec = wbt_default_latency_nsec(q);
 
@@ -744,7 +733,8 @@ void wbt_exit(struct request_queue *q)
 	struct rq_wb *rwb = q->rq_wb;
 
 	if (rwb) {
-		del_timer_sync(&rwb->window_timer);
+		blk_stat_remove_callback(q, rwb->cb);
+		blk_stat_free_callback(rwb->cb);
 		q->rq_wb = NULL;
 		kfree(rwb);
 	}
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 65f1de519f67..591ff2f4b2ee 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -81,7 +81,7 @@ struct rq_wb {
 	u64 win_nsec;				/* default window size */
 	u64 cur_win_nsec;			/* current window size */
 
-	struct timer_list window_timer;
+	struct blk_stat_callback *cb;
 
 	s64 sync_issue;
 	void *sync_cookie;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 5e4c5380edf0..14789bdb3338 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -287,8 +287,6 @@ struct blk_issue_stat {
 	u64 time;
 };
 
-#define BLK_RQ_STAT_BATCH	64
-
 struct blk_rq_stat {
 	s64 mean;
 	u64 min;
@@ -296,7 +294,6 @@ struct blk_rq_stat {
 	s32 nr_samples;
 	s32 nr_batch;
 	u64 batch;
-	s64 time;
 };
 
 struct blk_stats {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..a093449adbb8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -40,6 +40,8 @@ struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
 struct rq_wb;
+struct blk_queue_stats;
+struct blk_stat_callback;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -388,6 +390,7 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;
 
 	/*
@@ -505,8 +508,6 @@ struct request_queue {
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
 
-	struct blk_rq_stat	rq_stats[2];
-
 	/*
 	 * Number of active block driver functions for which blk_drain_queue()
 	 * must wait. Must be incremented around functions that unlock the
@@ -516,6 +517,10 @@ struct request_queue {
 
 	unsigned int		rq_timeout;
 	int			poll_nsec;
+
+	struct blk_stat_callback	*poll_cb;
+	struct blk_stats	poll_stats;
+
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
@@ -611,6 +616,7 @@ struct request_queue {
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
 #define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
+#define QUEUE_FLAG_POLL_STATS  29	/* collecting stats for hybrid polling */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-- 
2.12.0

^ permalink raw reply related

* [PATCH 3/4] block: change stats from array type to struct type
From: Omar Sandoval @ 2017-03-14 21:03 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489524591.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This is cleanup in preparation for callback-based stats with a couple of
(minor) advantages:

1. Allocating a percpu array type is awkward.
2. It makes the callback type signature more clear. With `struct
   blk_rq_stat *`, the only way to know that you're getting an array of
   read and write stats is to look at the implementation or other users.
   With the struct type, it's immediately obvious.

Note that I left q->rq_stats and ctx->stat alone since those are going
away anyways.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c             | 14 +++++++-------
 block/blk-stat.c           | 34 +++++++++++++++++-----------------
 block/blk-stat.h           |  4 ++--
 block/blk-wbt.c            | 28 ++++++++++++++--------------
 include/linux/blk_types.h  |  5 +++++
 include/trace/events/wbt.h | 20 ++++++++++----------
 6 files changed, 55 insertions(+), 50 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index eb84daf550f4..e16f8d420683 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2743,7 +2743,7 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct blk_mq_hw_ctx *hctx,
 				       struct request *rq)
 {
-	struct blk_rq_stat stat[2];
+	struct blk_stats stats;
 	unsigned long ret = 0;
 
 	/*
@@ -2757,8 +2757,8 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	 * We don't have to do this once per IO, should optimize this
 	 * to just use the current window of stats until it changes
 	 */
-	memset(&stat, 0, sizeof(stat));
-	blk_hctx_stat_get(hctx, stat);
+	memset(&stats, 0, sizeof(stats));
+	blk_hctx_stat_get(hctx, &stats);
 
 	/*
 	 * As an optimistic guess, use half of the mean service time
@@ -2768,10 +2768,10 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	 * important on devices where the completion latencies are longer
 	 * than ~10 usec.
 	 */
-	if (req_op(rq) == REQ_OP_READ && stat[BLK_STAT_READ].nr_samples)
-		ret = (stat[BLK_STAT_READ].mean + 1) / 2;
-	else if (req_op(rq) == REQ_OP_WRITE && stat[BLK_STAT_WRITE].nr_samples)
-		ret = (stat[BLK_STAT_WRITE].mean + 1) / 2;
+	if (req_op(rq) == REQ_OP_READ && stats.read.nr_samples)
+		ret = (stats.read.mean + 1) / 2;
+	else if (req_op(rq) == REQ_OP_WRITE && stats.write.nr_samples)
+		ret = (stats.write.mean + 1) / 2;
 
 	return ret;
 }
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 9b43efb8933f..40f6c90e432a 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -48,15 +48,15 @@ static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
 	dst->nr_samples += src->nr_samples;
 }
 
-static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+static void blk_mq_stat_get(struct request_queue *q, struct blk_stats *stats)
 {
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
 	uint64_t latest = 0;
 	int i, j, nr;
 
-	blk_stat_init(&dst[BLK_STAT_READ]);
-	blk_stat_init(&dst[BLK_STAT_WRITE]);
+	blk_stat_init(&stats->read);
+	blk_stat_init(&stats->write);
 
 	nr = 0;
 	do {
@@ -89,12 +89,12 @@ static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
 		queue_for_each_hw_ctx(q, hctx, i) {
 			hctx_for_each_ctx(hctx, ctx, j) {
 				if (ctx->stat[BLK_STAT_READ].time == newest) {
-					blk_stat_sum(&dst[BLK_STAT_READ],
+					blk_stat_sum(&stats->read,
 						     &ctx->stat[BLK_STAT_READ]);
 					nr++;
 				}
 				if (ctx->stat[BLK_STAT_WRITE].time == newest) {
-					blk_stat_sum(&dst[BLK_STAT_WRITE],
+					blk_stat_sum(&stats->write,
 						     &ctx->stat[BLK_STAT_WRITE]);
 					nr++;
 				}
@@ -106,24 +106,24 @@ static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
 		 */
 	} while (!nr);
 
-	dst[BLK_STAT_READ].time = dst[BLK_STAT_WRITE].time = latest;
+	stats->read.time = stats->write.time = latest;
 }
 
-void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+void blk_queue_stat_get(struct request_queue *q, struct blk_stats *stats)
 {
 	if (q->mq_ops)
-		blk_mq_stat_get(q, dst);
+		blk_mq_stat_get(q, stats);
 	else {
 		blk_stat_flush_batch(&q->rq_stats[BLK_STAT_READ]);
 		blk_stat_flush_batch(&q->rq_stats[BLK_STAT_WRITE]);
-		memcpy(&dst[BLK_STAT_READ], &q->rq_stats[BLK_STAT_READ],
-				sizeof(struct blk_rq_stat));
-		memcpy(&dst[BLK_STAT_WRITE], &q->rq_stats[BLK_STAT_WRITE],
-				sizeof(struct blk_rq_stat));
+		memcpy(&stats->read, &q->rq_stats[BLK_STAT_READ],
+		       sizeof(struct blk_rq_stat));
+		memcpy(&stats->write, &q->rq_stats[BLK_STAT_WRITE],
+		       sizeof(struct blk_rq_stat));
 	}
 }
 
-void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_stats *stats)
 {
 	struct blk_mq_ctx *ctx;
 	unsigned int i, nr;
@@ -151,13 +151,13 @@ void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
 
 		hctx_for_each_ctx(hctx, ctx, i) {
 			if (ctx->stat[BLK_STAT_READ].time == newest) {
-				blk_stat_sum(&dst[BLK_STAT_READ],
-						&ctx->stat[BLK_STAT_READ]);
+				blk_stat_sum(&stats->read,
+					     &ctx->stat[BLK_STAT_READ]);
 				nr++;
 			}
 			if (ctx->stat[BLK_STAT_WRITE].time == newest) {
-				blk_stat_sum(&dst[BLK_STAT_WRITE],
-						&ctx->stat[BLK_STAT_WRITE]);
+				blk_stat_sum(&stats->write,
+					     &ctx->stat[BLK_STAT_WRITE]);
 				nr++;
 			}
 		}
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a2050a0a5314..a24439aab710 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -21,8 +21,8 @@ enum {
 };
 
 void blk_stat_add(struct blk_rq_stat *, struct request *);
-void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
-void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_stats *);
+void blk_queue_stat_get(struct request_queue *, struct blk_stats *);
 void blk_stat_clear(struct request_queue *);
 void blk_stat_init(struct blk_rq_stat *);
 bool blk_stat_is_current(struct blk_rq_stat *);
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 1aedb1f7ee0c..5e81068f7c52 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -247,7 +247,7 @@ static bool calc_wb_limits(struct rq_wb *rwb)
 	return ret;
 }
 
-static inline bool stat_sample_valid(struct blk_rq_stat *stat)
+static inline bool stat_sample_valid(struct blk_stats *stats)
 {
 	/*
 	 * We need at least one read sample, and a minimum of
@@ -255,8 +255,8 @@ static inline bool stat_sample_valid(struct blk_rq_stat *stat)
 	 * that it's writes impacting us, and not just some sole read on
 	 * a device that is in a lower power state.
 	 */
-	return stat[BLK_STAT_READ].nr_samples >= 1 &&
-		stat[BLK_STAT_WRITE].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+	return (stats->read.nr_samples >= 1 &&
+		stats->write.nr_samples >= RWB_MIN_WRITE_SAMPLES);
 }
 
 static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
@@ -277,7 +277,7 @@ enum {
 	LAT_EXCEEDED,
 };
 
-static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_stats *stats)
 {
 	struct backing_dev_info *bdi = rwb->queue->backing_dev_info;
 	u64 thislat;
@@ -293,7 +293,7 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 	 */
 	thislat = rwb_sync_issue_lat(rwb);
 	if (thislat > rwb->cur_win_nsec ||
-	    (thislat > rwb->min_lat_nsec && !stat[BLK_STAT_READ].nr_samples)) {
+	    (thislat > rwb->min_lat_nsec && !stats->read.nr_samples)) {
 		trace_wbt_lat(bdi, thislat);
 		return LAT_EXCEEDED;
 	}
@@ -301,14 +301,14 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 	/*
 	 * No read/write mix, if stat isn't valid
 	 */
-	if (!stat_sample_valid(stat)) {
+	if (!stat_sample_valid(stats)) {
 		/*
 		 * If we had writes in this stat window and the window is
 		 * current, we're only doing writes. If a task recently
 		 * waited or still has writes in flights, consider us doing
 		 * just writes as well.
 		 */
-		if ((stat[BLK_STAT_WRITE].nr_samples && blk_stat_is_current(stat)) ||
+		if ((stats->write.nr_samples && blk_stat_is_current(&stats->read)) ||
 		    wb_recent_wait(rwb) || wbt_inflight(rwb))
 			return LAT_UNKNOWN_WRITES;
 		return LAT_UNKNOWN;
@@ -317,24 +317,24 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
 	/*
 	 * If the 'min' latency exceeds our target, step down.
 	 */
-	if (stat[BLK_STAT_READ].min > rwb->min_lat_nsec) {
-		trace_wbt_lat(bdi, stat[BLK_STAT_READ].min);
-		trace_wbt_stat(bdi, stat);
+	if (stats->read.min > rwb->min_lat_nsec) {
+		trace_wbt_lat(bdi, stats->read.min);
+		trace_wbt_stat(bdi, stats);
 		return LAT_EXCEEDED;
 	}
 
 	if (rwb->scale_step)
-		trace_wbt_stat(bdi, stat);
+		trace_wbt_stat(bdi, stats);
 
 	return LAT_OK;
 }
 
 static int latency_exceeded(struct rq_wb *rwb)
 {
-	struct blk_rq_stat stat[2];
+	struct blk_stats stats;
 
-	blk_queue_stat_get(rwb->queue, stat);
-	return __latency_exceeded(rwb, stat);
+	blk_queue_stat_get(rwb->queue, &stats);
+	return __latency_exceeded(rwb, &stats);
 }
 
 static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..5e4c5380edf0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -299,4 +299,9 @@ struct blk_rq_stat {
 	s64 time;
 };
 
+struct blk_stats {
+	struct blk_rq_stat read;
+	struct blk_rq_stat write;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
index 3c518e455680..5b7f8ef73532 100644
--- a/include/trace/events/wbt.h
+++ b/include/trace/events/wbt.h
@@ -13,9 +13,9 @@
  */
 TRACE_EVENT(wbt_stat,
 
-	TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+	TP_PROTO(struct backing_dev_info *bdi, struct blk_stats *stats),
 
-	TP_ARGS(bdi, stat),
+	TP_ARGS(bdi, stats),
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
@@ -33,14 +33,14 @@ TRACE_EVENT(wbt_stat,
 
 	TP_fast_assign(
 		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->rmean		= stat[0].mean;
-		__entry->rmin		= stat[0].min;
-		__entry->rmax		= stat[0].max;
-		__entry->rnr_samples	= stat[0].nr_samples;
-		__entry->wmean		= stat[1].mean;
-		__entry->wmin		= stat[1].min;
-		__entry->wmax		= stat[1].max;
-		__entry->wnr_samples	= stat[1].nr_samples;
+		__entry->rmean		= stats->read.mean;
+		__entry->rmin		= stats->read.min;
+		__entry->rmax		= stats->read.max;
+		__entry->rnr_samples	= stats->read.nr_samples;
+		__entry->wmean		= stats->write.mean;
+		__entry->wmin		= stats->write.min;
+		__entry->wmax		= stats->write.max;
+		__entry->wnr_samples	= stats->write.nr_samples;
 	),
 
 	TP_printk("%s: rmean=%llu, rmin=%llu, rmax=%llu, rsamples=%llu, "
-- 
2.12.0

^ permalink raw reply related

* [PATCH 2/4] block: remove extra calls to wbt_exit()
From: Omar Sandoval @ 2017-03-14 21:03 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489524591.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 --
 2 files changed, 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..82425017c9b8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -889,7 +889,6 @@ int blk_init_allocated_queue(struct request_queue *q)
 		q->exit_rq_fn(q, q->fq->flush_rq);
 out_free_flush_queue:
 	blk_free_flush_queue(q->fq);
-	wbt_exit(q);
 	return -ENOMEM;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..eb84daf550f4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2430,8 +2430,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
-	wbt_exit(q);
-
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
-- 
2.12.0

^ permalink raw reply related

* [PATCH 1/4] block: tear out blk-stat debugfs/sysfs entries
From: Omar Sandoval @ 2017-03-14 21:03 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489524591.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

These made sense when there was a single stats buffer for the request
queue, but that's going away.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-debugfs.c | 55 --------------------------------------------------
 block/blk-sysfs.c      | 26 ------------------------
 2 files changed, 81 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f6d917977b33..18672225c417 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -322,60 +322,6 @@ static const struct file_operations hctx_io_poll_fops = {
 	.release	= single_release,
 };
 
-static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
-{
-	seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu",
-		   stat->nr_samples, stat->mean, stat->min, stat->max);
-}
-
-static int hctx_stats_show(struct seq_file *m, void *v)
-{
-	struct blk_mq_hw_ctx *hctx = m->private;
-	struct blk_rq_stat stat[2];
-
-	blk_stat_init(&stat[BLK_STAT_READ]);
-	blk_stat_init(&stat[BLK_STAT_WRITE]);
-
-	blk_hctx_stat_get(hctx, stat);
-
-	seq_puts(m, "read: ");
-	print_stat(m, &stat[BLK_STAT_READ]);
-	seq_puts(m, "\n");
-
-	seq_puts(m, "write: ");
-	print_stat(m, &stat[BLK_STAT_WRITE]);
-	seq_puts(m, "\n");
-	return 0;
-}
-
-static int hctx_stats_open(struct inode *inode, struct file *file)
-{
-	return single_open(file, hctx_stats_show, inode->i_private);
-}
-
-static ssize_t hctx_stats_write(struct file *file, const char __user *buf,
-				size_t count, loff_t *ppos)
-{
-	struct seq_file *m = file->private_data;
-	struct blk_mq_hw_ctx *hctx = m->private;
-	struct blk_mq_ctx *ctx;
-	int i;
-
-	hctx_for_each_ctx(hctx, ctx, i) {
-		blk_stat_init(&ctx->stat[BLK_STAT_READ]);
-		blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
-	}
-	return count;
-}
-
-static const struct file_operations hctx_stats_fops = {
-	.open		= hctx_stats_open,
-	.read		= seq_read,
-	.write		= hctx_stats_write,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
 static int hctx_dispatched_show(struct seq_file *m, void *v)
 {
 	struct blk_mq_hw_ctx *hctx = m->private;
@@ -646,7 +592,6 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"sched_tags", 0400, &hctx_sched_tags_fops},
 	{"sched_tags_bitmap", 0400, &hctx_sched_tags_bitmap_fops},
 	{"io_poll", 0600, &hctx_io_poll_fops},
-	{"stats", 0600, &hctx_stats_fops},
 	{"dispatched", 0600, &hctx_dispatched_fops},
 	{"queued", 0600, &hctx_queued_fops},
 	{"run", 0600, &hctx_run_fops},
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c44b321335f3..6e1cf622af21 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -503,26 +503,6 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
-static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
-{
-	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
-			pre, (long long) stat->nr_samples,
-			(long long) stat->mean, (long long) stat->min,
-			(long long) stat->max);
-}
-
-static ssize_t queue_stats_show(struct request_queue *q, char *page)
-{
-	struct blk_rq_stat stat[2];
-	ssize_t ret;
-
-	blk_queue_stat_get(q, stat);
-
-	ret = print_stat(page, &stat[BLK_STAT_READ], "read :");
-	ret += print_stat(page + ret, &stat[BLK_STAT_WRITE], "write:");
-	return ret;
-}
-
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -691,11 +671,6 @@ static struct queue_sysfs_entry queue_dax_entry = {
 	.show = queue_dax_show,
 };
 
-static struct queue_sysfs_entry queue_stats_entry = {
-	.attr = {.name = "stats", .mode = S_IRUGO },
-	.show = queue_stats_show,
-};
-
 static struct queue_sysfs_entry queue_wb_lat_entry = {
 	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_wb_lat_show,
@@ -733,7 +708,6 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
-	&queue_stats_entry.attr,
 	&queue_wb_lat_entry.attr,
 	&queue_poll_delay_entry.attr,
 	NULL,
-- 
2.12.0

^ permalink raw reply related

* [PATCH 0/4] block: callback-based statistics
From: Omar Sandoval @ 2017-03-14 21:03 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This patchset generalizes the blk-stats infrastructure to allow users to
register a callback to be called at a given time with the statistics of
requests completed during that window. Writeback throttling and hybrid
polling are converted to the new infrastructure. The ultimate goal is to
use this infrastructure for the mq I/O scheduler.

The details are in patch 4, which is the actual conversion. Patches 1-3
are preparation cleanups.

A couple of minor open issues:

- Since there is no single source of truth for stats anymore (it's
  per-callback now), I got rid of the sysfs/debugfs files. If we still
  want these, we could potentially expose the statistics that poll is
  using.
- Polling used to use per-hctx stats, but now the stats are gathered to
  a per-request_queue cache. This might make the stats noisier, but we
  can improve this in the future by splitting out stats by request size
  or something like that.

Omar Sandoval (4):
  block: tear out blk-stat debugfs/sysfs entries
  block: remove extra calls to wbt_exit()
  block: change stats from array type to struct type
  blk-stats: convert to callback-based statistics reporting

 block/blk-core.c           |   7 +-
 block/blk-mq-debugfs.c     |  55 ---------
 block/blk-mq.c             |  79 ++++++++----
 block/blk-mq.h             |   1 -
 block/blk-stat.c           | 300 ++++++++++++++++++++-------------------------
 block/blk-stat.h           |  98 ++++++++++++---
 block/blk-sysfs.c          |  31 +----
 block/blk-wbt.c            |  66 +++++-----
 block/blk-wbt.h            |   2 +-
 include/linux/blk_types.h  |   8 +-
 include/linux/blkdev.h     |  10 +-
 include/trace/events/wbt.h |  20 +--
 12 files changed, 331 insertions(+), 346 deletions(-)

-- 
2.12.0

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
From: Jens Axboe @ 2017-03-14 17:51 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170314174837.GB2352@vader.DHCP.thefacebook.com>

On 03/14/2017 11:48 AM, Omar Sandoval wrote:
> On Tue, Mar 14, 2017 at 08:57:50AM -0600, Jens Axboe wrote:
>> If we have scheduling enabled, we jump directly to insert-and-run.
>> That's fine, but we run the queue async and we don't pass in information
>> on whether we can block from this context or not. Fixup both these
>> cases.
> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> 
> Just one question: we call blk_mq_get_driver_tag() with wait=false in
> blk_mq_try_issue_directly(). Should we change that to wait=can_block?
> Maybe it's pointless to try a direct issue if we'd have to wait for a
> tag anyways, though.

Exactly, don't want to wait for a tag, at that point we are just
pointlessly stalling an app that could perhaps be submitting more IO.
So I don't think we should factor that in here, better to let the
blocking vs non-blocking drivers behave the same in that regard.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
From: Omar Sandoval @ 2017-03-14 17:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org
In-Reply-To: <3ac4f49e-3bc0-7fc6-e93a-db9ec7ad21cf@kernel.dk>

On Tue, Mar 14, 2017 at 08:57:50AM -0600, Jens Axboe wrote:
> If we have scheduling enabled, we jump directly to insert-and-run.
> That's fine, but we run the queue async and we don't pass in information
> on whether we can block from this context or not. Fixup both these
> cases.

Reviewed-by: Omar Sandoval <osandov@fb.com>

Just one question: we call blk_mq_get_driver_tag() with wait=false in
blk_mq_try_issue_directly(). Should we change that to wait=can_block?
Maybe it's pointless to try a direct issue if we'd have to wait for a
tag anyways, though.

> Signed-off-by: Jens Axboe <axboe@fb.com>
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..4196d6bee92d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1434,7 +1434,8 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>  }
>  
> -static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> +static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
> +				      bool can_block)
>  {
>  	struct request_queue *q = rq->q;
>  	struct blk_mq_queue_data bd = {
> @@ -1475,7 +1476,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
>  	}
>  
>  insert:
> -	blk_mq_sched_insert_request(rq, false, true, true, false);
> +	blk_mq_sched_insert_request(rq, false, true, false, can_block);
>  }
>  
>  /*
> @@ -1569,11 +1570,11 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  
>  		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
>  			rcu_read_lock();
> -			blk_mq_try_issue_directly(old_rq, &cookie);
> +			blk_mq_try_issue_directly(old_rq, &cookie, false);
>  			rcu_read_unlock();
>  		} else {
>  			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
> -			blk_mq_try_issue_directly(old_rq, &cookie);
> +			blk_mq_try_issue_directly(old_rq, &cookie, true);
>  			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
>  		}
>  		goto done;
> 
> -- 
> Jens Axboe
> 

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Bart Van Assche @ 2017-03-14 16:32 UTC (permalink / raw)
  To: paolo.valente@linaro.org
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
	avanzini.arianna@gmail.com, broonie@kernel.org, tj@kernel.org,
	ulf.hansson@linaro.org
In-Reply-To: <81048010-02AB-4A7A-8C10-FAF7E3242DCC@linaro.org>

On Tue, 2017-03-14 at 16:35 +0100, Paolo Valente wrote:
> > Il giorno 07 mar 2017, alle ore 02:00, Bart Van Assche <bart.vanassche@=
sandisk.com> ha scritto:
> >=20
> > Additionally, the complexity of the code is huge. Just like for CFQ,
> > sooner or later someone will run into a bug or a performance issue
> > and will post a patch to fix it. However, the complexity of BFQ is
> > such that a source code review alone won't be sufficient to verify
> > whether or not such a patch negatively affects a workload or device
> > that has not been tested by the author of the patch. This makes me
> > wonder what process should be followed to verify future BFQ patches?
>=20
> Third and last, a proposal: why don't we discuss this issue at LSF
> too?  In particular, we could talk about the parts of BFQ that seem
> more complex to understand, until they become clearer to you.  Then I
> could try to understand what helped make them clearer, and translate
> it into extra comments in the code or into other, more radical
> changes.

Hello Paolo,

Sorry if my comment was not clear enough. Suppose that e.g. someone would
like to modify the following code:

static int bfq_min_budget(struct bfq_data *bfqd)
{
       if (bfqd->budgets_assigned < bfq_stats_min_budgets)
               return bfq_default_max_budget / 32;
       else
               return bfqd->bfq_max_budget / 32;
}

How to predict the performance impact of any changes in e.g. this function?
It is really great that a performance benchmark is available. But what shou=
ld
a developer do who only has access to a small subset of all the storage
devices that are supported by the Linux kernel and hence who can not run th=
e
benchmark against every supported storage device? Do developers who do not
fully understand the BFQ algorithms and who run into a performance problem
have any other option than trial and error for fixing such performance issu=
es?

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Jens Axboe @ 2017-03-14 15:42 UTC (permalink / raw)
  To: Paolo Valente, Bart Van Assche
  Cc: tj@kernel.org, ulf.hansson@linaro.org,
	linux-kernel@vger.kernel.org, fchecconi@gmail.com,
	Arianna Avanzini, linux-block@vger.kernel.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <81048010-02AB-4A7A-8C10-FAF7E3242DCC@linaro.org>

On 03/14/2017 09:35 AM, Paolo Valente wrote:
> First, I've developed BFQ in a sort of
> first-the-problem-then-the-solution way.  That is, each time, I have
> first implemented a benchmark that enabled me to highlight the problem
> and get all relevant statistics on it, then I have worked on BFQ to
> try to solve that problem, using the benchmark as a support.  All
> those benchmarks are in the public S suite now.  In particular, by
> running one script, and waiting at most one hour, you get graphs of
> - throughput with read/write/random/sequential workloads
> - start-up times of bash, xterm, gnome terminal and libreoffice, when
> all the above combinations of workloads are executed in the background
> - frame drop rate for the playback of a movie, again with both all the
> above combinations of workloads and the recurrent start of a bash
> shell in the background
> - kernel-task execution times (compilation, merge, ...), again with
> all the above combinations of workloads in the background
> - fairness with various combinations of weights and processes
> - throughput against interleaved I/O, with a number of readers ranging
> from 2 to 9
> 
> Every time I fix a bug, add a new feature or port BFQ to a new kernel
> version, I just run that script and compare new graphs with previous
> ones.  Any regression shows up immediately.  We already have a
> similar, working script for Android too, although covering only
> throughput, responsiveness and frame drops for the moment.  Of course,
> the coverage of these scripts is limited to only the goals for which I
> have devised and tuned BFQ so far.  But I hope that it won't be too
> hard to extend them to other important use cases (e.g., dbms).

This is great, btw, and a really nice tool set to have when evaluating
new changes.

> Second, IMO BFQ is complex also because it contains a lot of features.
> We have adopted the usual approach for handling this type of
> complexity: find clean cuts to get independent pieces, and put each
> piece in a separate file, plus one header glue file.  The pieces were:
> scheduling engine, hierarchical-scheduling support (allowing the
> engine to scheduler generic nodes in the hierarchy), cgroups support.
> Yet, Tejun last year, and Jens more recently, have asked to put
> everything in one file; for other good reasons of course.  If you do
> think that turning back to multiple files may somehow help, and there
> are no strong objections from others, then I'm willing to resume this
> option and possibly find event better splits.
> 
> Third and last, a proposal: why don't we discuss this issue at LSF
> too?  In particular, we could talk about the parts of BFQ that seem
> more complex to understand, until they become clearer to you.  Then I
> could try to understand what helped make them clearer, and translate
> it into extra comments in the code or into other, more radical
> changes.

A big issue here is the lack of nicely structured code. It's one massive
file of code, 8751 lines, or almost 270K of code. It might be a lot
easier to read and understand if it was split into smaller files,
containing various parts of it.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 4/4] blk-mq: streamline blk_mq_make_request
From: Bart Van Assche @ 2017-03-14 15:40 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-5-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> Turn the different ways of merging or issuing I/O into a series of if/els=
e
> statements instead of the current maze of gotos.  Note that this means we
> pin the CPU a little longer for some cases as the CTX put is moved to
> common code at the end of the function.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c | 67 +++++++++++++++++++++++-----------------------------=
------
>  1 file changed, 27 insertions(+), 40 deletions(-)
>=20
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 48748cb799ed..18e449cc832f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct reques=
t_queue *q, struct bio *bio)
> =20
>  	cookie =3D request_to_qc_t(data.hctx, rq);
> =20
> +	plug =3D current->plug;
>  	if (unlikely(is_flush_fua)) {
> -		if (q->elevator)
> -			goto elv_insert;
>  		blk_mq_bio_to_request(rq, bio);
> -		blk_insert_flush(rq);
> -		goto run_queue;
> -	}
> -
> -	plug =3D current->plug;
> -	if (plug && q->nr_hw_queues =3D=3D 1) {
> +		if (q->elevator) {
> +			blk_mq_sched_insert_request(rq, false, true,
> +						!is_sync || is_flush_fua, true);
> +		} else {
> +			blk_insert_flush(rq);
> +			blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
> +		}
> +	} else if (plug && q->nr_hw_queues =3D=3D 1) {
>  		struct request *last =3D NULL;
> =20
>  		blk_mq_bio_to_request(rq, bio);

This change introduces the following construct:

if (is_flush_fua) {
	/* use is_flush_fua */
} else ...

Have you considered to simplify the code that uses is_flush_fua further?

> @@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_=
queue *q, struct bio *bio)
>  		else
>  			last =3D list_entry_rq(plug->mq_list.prev);
> =20
> -		blk_mq_put_ctx(data.ctx);
> -
>  		if (request_count >=3D BLK_MAX_REQUEST_COUNT || (last &&
>  		    blk_rq_bytes(last) >=3D BLK_PLUG_FLUSH_SIZE)) {
>  			blk_flush_plug_list(plug, false);
> @@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct reques=
t_queue *q, struct bio *bio)
>  		}
> =20
>  		list_add_tail(&rq->queuelist, &plug->mq_list);
> -		goto done;
> -	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> -		struct request *old_rq =3D NULL;
> -
> +	} else if (plug && !blk_queue_nomerges(q)) {
>  		blk_mq_bio_to_request(rq, bio);
> =20
>  		/*
>  		 * We do limited plugging. If the bio can be merged, do that.
>  		 * Otherwise the existing request in the plug list will be
>  		 * issued. So the plug list will have one request at most
> +		 *
> +		 * The plug list might get flushed before this. If that happens,
> +		 * the plug list is emptry and same_queue_rq is invalid.
>  		 */

"emptry" looks like a typo?

> -		if (plug) {
> -			/*
> -			 * The plug list might get flushed before this. If that
> -			 * happens, same_queue_rq is invalid and plug list is
> -			 * empty
> -			 */
> -			if (same_queue_rq && !list_empty(&plug->mq_list)) {
> -				old_rq =3D same_queue_rq;
> -				list_del_init(&old_rq->queuelist);
> -			}
> -			list_add_tail(&rq->queuelist, &plug->mq_list);
> -		} else /* is_sync */
> -			old_rq =3D rq;
> -		blk_mq_put_ctx(data.ctx);
> -		if (old_rq)
> -			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
> -		goto done;
> -	}
> +		if (!list_empty(&plug->mq_list))
> +			list_del_init(&same_queue_rq->queuelist);
> +		else
> +			same_queue_rq =3D NULL;
> =20
> -	if (q->elevator) {
> -elv_insert:
> -		blk_mq_put_ctx(data.ctx);
> +		list_add_tail(&rq->queuelist, &plug->mq_list);
> +		if (same_queue_rq)
> +			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
> +					&cookie);
> +	} else if (is_sync) {
> +		blk_mq_bio_to_request(rq, bio);
> +		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
> +	} else if (q->elevator) {
>  		blk_mq_bio_to_request(rq, bio);
>  		blk_mq_sched_insert_request(rq, false, true,
>  						!is_sync || is_flush_fua, true);
> -		goto done;
> -	}
> -	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
> +	} else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
>  		/*
>  		 * For a SYNC request, send it to the hardware immediately. For
>  		 * an ASYNC request, just ensure that we run it later on. The
>  		 * latter allows for merging opportunities and more efficient
>  		 * dispatching.
>  		 */
> -run_queue:
>  		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
>  	}

Since this code occurs in the else branch of an if (is_flush_fua) statement=
,
can "|| is_flush_fua" be left out?

What I noticed while reviewing this patch is that there are multiple change=
s in
this patch: not only the goto statements have been eliminated but the old_r=
q
variable has been eliminated too. I think this patch would be easier to rev=
iew
if it would be split in three patches: one that removes the old_rq variable=
,
one that eliminates the goto statements and one that removes dead code.

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Paolo Valente @ 2017-03-14 15:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj@kernel.org, axboe@kernel.dk, ulf.hansson@linaro.org,
	linux-kernel@vger.kernel.org, fchecconi@gmail.com,
	Arianna Avanzini, linux-block@vger.kernel.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <1488848390.3125.14.camel@sandisk.com>


> Il giorno 07 mar 2017, alle ore 02:00, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Sat, 2017-03-04 at 17:01 +0100, Paolo Valente wrote:
>> Finally, a few details on the patchset.
>>=20
>> The first two patches introduce BFQ-v0, which is more or less the
>> first version of BFQ submitted a few years ago [1].  The remaining
>> patches turn progressively BFQ-v0 into BFQ-v8r8, the current version
>> of BFQ.
>=20
> Hello Paolo,
>=20

Hi Bart,

> Thank you for having done the work to improve, test, fix and post the
> BFQ scheduler as a patch series. However, from what I have seen in the
> patches there is a large number of tunable constants in the code for
> which no scientific approach exists to chose an optimal value.

I'm very sorry about that, I have exported those parameters over the
years, just as an aid for debugging and tuning.  Then I have forgot to
remove them :(

They'll disappear in my next submission.

> Additionally, the complexity of the code is huge. Just like for CFQ,
> sooner or later someone will run into a bug or a performance issue
> and will post a patch to fix it. However, the complexity of BFQ is
> such that a source code review alone won't be sufficient to verify
> whether or not such a patch negatively affects a workload or device
> that has not been tested by the author of the patch. This makes me
> wonder what process should be followed to verify future BFQ patches?
>=20

I've a sort of triple reply for that.

First, I've developed BFQ in a sort of
first-the-problem-then-the-solution way.  That is, each time, I have
first implemented a benchmark that enabled me to highlight the problem
and get all relevant statistics on it, then I have worked on BFQ to
try to solve that problem, using the benchmark as a support.  All
those benchmarks are in the public S suite now.  In particular, by
running one script, and waiting at most one hour, you get graphs of
- throughput with read/write/random/sequential workloads
- start-up times of bash, xterm, gnome terminal and libreoffice, when
all the above combinations of workloads are executed in the background
- frame drop rate for the playback of a movie, again with both all the
above combinations of workloads and the recurrent start of a bash
shell in the background
- kernel-task execution times (compilation, merge, ...), again with
all the above combinations of workloads in the background
- fairness with various combinations of weights and processes
- throughput against interleaved I/O, with a number of readers ranging
from 2 to 9

Every time I fix a bug, add a new feature or port BFQ to a new kernel
version, I just run that script and compare new graphs with previous
ones.  Any regression shows up immediately.  We already have a
similar, working script for Android too, although covering only
throughput, responsiveness and frame drops for the moment.  Of course,
the coverage of these scripts is limited to only the goals for which I
have devised and tuned BFQ so far.  But I hope that it won't be too
hard to extend them to other important use cases (e.g., dbms).

Second, IMO BFQ is complex also because it contains a lot of features.
We have adopted the usual approach for handling this type of
complexity: find clean cuts to get independent pieces, and put each
piece in a separate file, plus one header glue file.  The pieces were:
scheduling engine, hierarchical-scheduling support (allowing the
engine to scheduler generic nodes in the hierarchy), cgroups support.
Yet, Tejun last year, and Jens more recently, have asked to put
everything in one file; for other good reasons of course.  If you do
think that turning back to multiple files may somehow help, and there
are no strong objections from others, then I'm willing to resume this
option and possibly find event better splits.

Third and last, a proposal: why don't we discuss this issue at LSF
too?  In particular, we could talk about the parts of BFQ that seem
more complex to understand, until they become clearer to you.  Then I
could try to understand what helped make them clearer, and translate
it into extra comments in the code or into other, more radical
changes.

Thanks,
Paolo

> Thanks,
>=20
> Bart.

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
From: Jens Axboe @ 2017-03-14 15:26 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA0EC183@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>

On 03/14/2017 09:17 AM, Bart Van Assche wrote:
> On 03/14/2017 07:57 AM, Jens Axboe wrote:
>> If we have scheduling enabled, we jump directly to insert-and-run.
>> That's fine, but we run the queue async and we don't pass in information
>> on whether we can block from this context or not. Fixup both these
>> cases.
> 
> How about renaming "can_block" into "may_sleep"? Otherwise this patch
> looks fine to me.

Sure, either one is fine with me, at least to me they convey the same
information.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
From: Bart Van Assche @ 2017-03-14 15:17 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <3ac4f49e-3bc0-7fc6-e93a-db9ec7ad21cf@kernel.dk>

On 03/14/2017 07:57 AM, Jens Axboe wrote:=0A=
> If we have scheduling enabled, we jump directly to insert-and-run.=0A=
> That's fine, but we run the queue async and we don't pass in information=
=0A=
> on whether we can block from this context or not. Fixup both these=0A=
> cases.=0A=
=0A=
How about renaming "can_block" into "may_sleep"? Otherwise this patch=0A=
looks fine to me.=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* [PATCH] blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
From: Jens Axboe @ 2017-03-14 14:57 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

If we have scheduling enabled, we jump directly to insert-and-run.
That's fine, but we run the queue async and we don't pass in information
on whether we can block from this context or not. Fixup both these
cases.

Signed-off-by: Jens Axboe <axboe@fb.com>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..4196d6bee92d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,8 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
+				      bool can_block)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1475,7 +1476,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, true, true, false);
+	blk_mq_sched_insert_request(rq, false, true, false, can_block);
 }
 
 /*
@@ -1569,11 +1570,11 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
 			rcu_read_lock();
-			blk_mq_try_issue_directly(old_rq, &cookie);
+			blk_mq_try_issue_directly(old_rq, &cookie, false);
 			rcu_read_unlock();
 		} else {
 			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
-			blk_mq_try_issue_directly(old_rq, &cookie);
+			blk_mq_try_issue_directly(old_rq, &cookie, true);
 			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
 		}
 		goto done;

-- 
Jens Axboe

^ permalink raw reply related

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Jens Axboe @ 2017-03-14 14:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Adrian Hunter, Linus Walleij, linux-mmc@vger.kernel.org,
	Ulf Hansson, Paolo Valente, Chunyan Zhang, Baolin Wang,
	linux-block, Arnd Bergmann
In-Reply-To: <20170314144357.GA31043@lst.de>

On 03/14/2017 08:43 AM, Christoph Hellwig wrote:
> On Tue, Mar 14, 2017 at 08:36:26AM -0600, Jens Axboe wrote:
>> There's one case that doesn't look like it was converted properly, but
>> that's a mistake. The general insert-and-run cases run inline if we can,
>> but the direct-issue needs a fixup, see below.
> 
> Note that blk_mq_try_issue_directly is only called for the multiple
> hardware queue case, so MMC would not hit it.

Right, which is why I said that the general case works fine, it's
only the explicit issue-direct that currently does not. Not by
design, just an error.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Christoph Hellwig @ 2017-03-14 14:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Adrian Hunter, Linus Walleij, linux-mmc@vger.kernel.org,
	Ulf Hansson, Paolo Valente, Chunyan Zhang, Baolin Wang,
	linux-block, Christoph Hellwig, Arnd Bergmann
In-Reply-To: <a3216031-0801-45e2-320c-bfc88fa742b4@kernel.dk>

On Tue, Mar 14, 2017 at 08:36:26AM -0600, Jens Axboe wrote:
> There's one case that doesn't look like it was converted properly, but
> that's a mistake. The general insert-and-run cases run inline if we can,
> but the direct-issue needs a fixup, see below.

Note that blk_mq_try_issue_directly is only called for the multiple
hardware queue case, so MMC would not hit it.

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Jens Axboe @ 2017-03-14 14:36 UTC (permalink / raw)
  To: Adrian Hunter, Linus Walleij
  Cc: linux-mmc@vger.kernel.org, Ulf Hansson, Paolo Valente,
	Chunyan Zhang, Baolin Wang, linux-block, Christoph Hellwig,
	Arnd Bergmann
In-Reply-To: <a3571278-b30d-c3cb-bea6-d1776eb53025@intel.com>

On 03/14/2017 06:59 AM, Adrian Hunter wrote:
> On 13/03/17 16:19, Jens Axboe wrote:
>> On 03/13/2017 03:25 AM, Adrian Hunter wrote:
>>> On 11/03/17 00:05, Jens Axboe wrote:
>>>> On 03/10/2017 07:21 AM, Adrian Hunter wrote:
>>>>>> Essentially I take out that thread and replace it with this one worker
>>>>>> introduced in this very patch. I agree the driver can block in many ways
>>>>>> and that is why I need to have it running in process context, and this
>>>>>> is what the worker introduced here provides.
>>>>>
>>>>> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
>>>>> qdepth requests from the I/O scheduler and left them on a local list while
>>>>> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
>>>>> number of requests in limbo (not issued but also not in the I/O scheduler)
>>>>> for that time.
>>>>
>>>> Look again, if we're not handling the requeued dispatches, we pull one
>>>> at the time from the scheduler.
>>>>
>>>
>>> That's good :-)
>>>
>>> Now the next thing ;-)
>>>
>>> It looks like we either set BLK_MQ_F_BLOCKING and miss the possibility of
>>> issuing synchronous requests immediately, or we don't set BLK_MQ_F_BLOCKING
>>> in which case we are never allowed to sleep in ->queue_rq().  Is that true?
>>
>> Only one of those statements is true - if you don't set BLK_MQ_F_BLOCKING,
>> then you may never block in your ->queue_rq() function. But if you do set it,
>> it does not preclude immediate issue of sync requests.
> 
> I meant it gets put to the workqueue rather than issued in the context of
> the submitter.

There's one case that doesn't look like it was converted properly, but
that's a mistake. The general insert-and-run cases run inline if we can,
but the direct-issue needs a fixup, see below.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..4196d6bee92d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,8 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
+				      bool can_block)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1475,7 +1476,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	}
 
 insert:
-	blk_mq_sched_insert_request(rq, false, true, true, false);
+	blk_mq_sched_insert_request(rq, false, true, false, can_block);
 }
 
 /*
@@ -1569,11 +1570,11 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
 			rcu_read_lock();
-			blk_mq_try_issue_directly(old_rq, &cookie);
+			blk_mq_try_issue_directly(old_rq, &cookie, false);
 			rcu_read_unlock();
 		} else {
 			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
-			blk_mq_try_issue_directly(old_rq, &cookie);
+			blk_mq_try_issue_directly(old_rq, &cookie, true);
 			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
 		}
 		goto done;

-- 
Jens Axboe

^ permalink raw reply related

* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-03-14 14:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Tejun Heo, Fabio Checconi, Arianna Avanzini,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	ulf.hansson@linaro.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA0DA145@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>


> Il giorno 06 mar 2017, alle ore 20:40, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20

Hi Bart,
thanks for such an accurate review.  I'm addressing the issues you
raised, and I'll get back in touch as soon as I have finished.

Paolo

> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> BFQ is a proportional-share I/O scheduler, whose general structure,
>> plus a lot of code, are borrowed from CFQ.
>> [ ... ]
>=20
> This description is very useful. However, since it is identical to the
> description this patch adds to Documentation/block/bfq-iosched.txt I
> propose to leave it out from the patch description.
>=20
> What seems missing to me is an overview of the limitations of BFQ. =
Does
> BFQ e.g. support multiple hardware queues?
>=20
>> +3. What are BFQ's tunable?
>> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
>> +[ ... ]
>=20
> A thorough knowledge of BFQ is required to tune it properly. Users =
don't
> want to tune I/O schedulers. Has it been considered to invent =
algorithms
> to tune these parameters automatically?
>=20
>> + * Licensed under GPL-2.
>=20
> The COPYING file at the top of the tree mentions that GPL-v2 licensing
> should be specified as follows close to the start of each source file:
>=20
>    This program is free software; you can redistribute it and/or =
modify
>    it under the terms of the GNU General Public License as published =
by
>    the Free Software Foundation; either version 2 of the License, or
>    (at your option) any later version.
>=20
>    This program is distributed in the hope that it will be useful,
>    but WITHOUT ANY WARRANTY; without even the implied warranty of
>    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>    GNU General Public License for more details.
>=20
>    You should have received a copy of the GNU General Public License
>    along with this program; if not, write to the Free Software
>    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
>    02110-1301 USA
>=20
>> + * BFQ is a proportional-share I/O scheduler, with some extra
>> + * low-latency capabilities. BFQ also supports full hierarchical
>> + * scheduling through cgroups. Next paragraphs provide an =
introduction
>> + * on BFQ inner workings. Details on BFQ benefits and usage can be
>> + * found in Documentation/block/bfq-iosched.txt.
>=20
> That reference should be sufficient - please do not duplicate
> Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.
>=20
>> +/**
>> + * struct bfq_service_tree - per ioprio_class service tree.
>> + *
>> + * Each service tree represents a B-WF2Q+ scheduler on its own.  =
Each
>> + * ioprio_class has its own independent scheduler, and so its own
>> + * bfq_service_tree.  All the fields are protected by the queue lock
>> + * of the containing bfqd.
>> + */
>> +struct bfq_service_tree {
>> +	/* tree for active entities (i.e., those backlogged) */
>> +	struct rb_root active;
>> +	/* tree for idle entities (i.e., not backlogged, with V <=3D =
F_i)*/
>> +	struct rb_root idle;
>> +
>> +	struct bfq_entity *first_idle;	/* idle entity with minimum F_i =
*/
>> +	struct bfq_entity *last_idle;	/* idle entity with maximum F_i =
*/
>> +
>> +	u64 vtime; /* scheduler virtual time */
>> +	/* scheduler weight sum; active and idle entities contribute to =
it */
>> +	unsigned long wsum;
>> +};
>=20
> Inline comments next to structure members are ugly and make the
> structure definition hard to read. Please follow the instructions in
> Documentation/kernel-doc-nano-HOWTO.txt for documenting structure =
members.
>=20
>> +	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
>> +	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */
>=20
> For all times and timestamps, please document the time unit (e.g. s, =
ms,
> us, ns, jiffies, ...).
>=20
>> +enum bfq_device_speed {
>> +	BFQ_BFQD_FAST,
>> +	BFQ_BFQD_SLOW,
>> +};
>=20
> What is the meaning of "fast" and "slow" devices in this context?
> Anyway, since the first patch that uses this enum is patch 6, please
> defer introduction of this enum until patch 6.
>=20
>> +
>> +/**
>> + * struct bfq_data - per-device data structure.
>> + *
>> + * All the fields are protected by @lock.
>> + */
>> +struct bfq_data {
>> +	/* device request queue */
>> +	struct request_queue *queue;
>> +	[ ... ]
>> +
>> +	/* on-disk position of the last served request */
>> +	sector_t last_position;
>=20
> What is the relevance of last_position if there are multiple hardware
> queues? Will the BFQ algorithm fail to realize its guarantees in that =
case?
>=20
> What is the relevance of this structure member for block devices that
> have multiple spindles, e.g. arrays of hard disks?
>=20
>> +enum bfqq_state_flags {
>> +	BFQ_BFQQ_FLAG_busy =3D 0,		/* has requests or is in =
service */
>> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
>> +	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
>> +					     * waiting for a request
>> +					     * without idling the device
>> +					     */
>> +	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
>> +	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
>> +	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
>> +	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this =
budget */
>> +	BFQ_BFQQ_FLAG_IO_bound,		/*
>> +					 * bfqq has timed-out at least =
once
>> +					 * having consumed at most 2/10 =
of
>> +					 * its budget
>> +					 */
>> +};
>=20
> The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about
> e.g. using the prefix "BFQQF_" instead?
>=20
>> +#define BFQ_BFQQ_FNS(name)						=
\
>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		=
\
>> +{									=
\
>> +	(bfqq)->flags |=3D (1 << BFQ_BFQQ_FLAG_##name);			=
\
>> +}									=
\
>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		=
\
>> +{									=
\
>> +	(bfqq)->flags &=3D ~(1 << BFQ_BFQQ_FLAG_##name);			=
\
>> +}									=
\
>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		=
\
>> +{									=
\
>> +	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) !=3D 0;	=
\
>> +}
>=20
> Are the bodies of the above functions duplicates of __set_bit(),
> __clear_bit() and test_bit()?
>=20
>> +/* Expiration reasons. */
>> +enum bfqq_expiration {
>> +	BFQ_BFQQ_TOO_IDLE =3D 0,		/*
>> +					 * queue has been idling for
>> +					 * too long
>> +					 */
>> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be =
used */
>> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
>> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more =
requests */
>> +	BFQ_BFQQ_PREEMPTED		/* preemption in progress */
>> +};
>=20
> The prefix of these constants refers twice to "BFQ" and does not make =
it
> clear that these constants are about expiration. How about using the
> "BFQQE_" prefix instead?
>=20
>> +/* Maximum backwards seek, in KiB. */
>> +static const int bfq_back_max =3D 16 * 1024;
>=20
> Where does this constant come from? Should it depend on geometry data
> like e.g. the number of sectors in a cylinder?
>=20
>> +#define for_each_entity(entity)	\
>> +	for (; entity ; entity =3D NULL)
>=20
> Why has this confusing #define been introduced? Shouldn't all
> occurrences of this macro be changed into the equivalent "if =
(entity)"?
> We don't want silly macros like this in the Linux kernel.
>=20
>> +#define for_each_entity_safe(entity, parent) \
>> +	for (parent =3D NULL; entity ; entity =3D parent)
>=20
> Same question here - why has this macro been introduced and how has =
its
> name been chosen? Since this macro is used only once and since no =
value
> is assigned to 'parent' in the code controlled by this construct, =
please
> remove this macro and use something that is less confusing than a =
"for"
> loop for something that is not a loop.
>=20
>> +/**
>> + * bfq_weight_to_ioprio - calc an ioprio from a weight.
>> + * @weight: the weight value to convert.
>> + *
>> + * To preserve as much as possible the old only-ioprio user =
interface,
>> + * 0 is used as an escape ioprio value for weights (numerically) =
equal or
>> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
>> + */
>> +static unsigned short bfq_weight_to_ioprio(int weight)
>> +{
>> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
>> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
>> +}
>=20
> Please consider using max() or max_t() to make this function less =
verbose.
>=20
>> +
>> +/**
>> + * bfq_active_extract - remove an entity from the active tree.
>> + * @st: the service_tree containing the tree.
>> + * @entity: the entity being removed.
>> + */
>> +static void bfq_active_extract(struct bfq_service_tree *st,
>> +			       struct bfq_entity *entity)
>> +{
>> +	struct bfq_queue *bfqq =3D bfq_entity_to_bfqq(entity);
>> +	struct rb_node *node;
>> +
>> +	node =3D bfq_find_deepest(&entity->rb_node);
>> +	bfq_extract(&st->active, entity);
>> +
>> +	if (node)
>> +		bfq_update_active_tree(node);
>> +
>> +	if (bfqq)
>> +		list_del(&bfqq->bfqq_list);
>> +}
>=20
> Which locks protect the data structures manipulated by this and other
> functions? Have you considered to use lockdep_assert_held() to =
document
> these assumptions?
>=20
>> +	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
>=20
> Please don't use parentheses if no confusion is possible. =
Additionally,
> checkpatch should have requested you to insert a space before and =
after
> the logical or operator.
>=20
>> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
>> +				       struct bfq_queue *bfqq)
>> +{
>> +	if (bfqq) {
>> +		bfq_mark_bfqq_budget_new(bfqq);
>> +		bfq_clear_bfqq_fifo_expire(bfqq);
>> +
>> +		bfqd->budgets_assigned =3D (bfqd->budgets_assigned*7 + =
256) / 8;
>=20
> Checkpatch should have asked you to insert spaces around the
> multiplication operator.
>=20
>> +/*
>> + * bfq_default_budget - return the default budget for @bfqq on =
@bfqd.
>> + * @bfqd: the device descriptor.
>> + * @bfqq: the queue to consider.
>> + *
>> + * We use 3/4 of the @bfqd maximum budget as the default value
>> + * for the max_budget field of the queues.  This lets the feedback
>> + * mechanism to start from some middle ground, then the behavior
>> + * of the process will drive the heuristics towards high values, if
>> + * it behaves as a greedy sequential reader, or towards small values
>> + * if it shows a more intermittent behavior.
>> + */
>> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
>> +					struct bfq_queue *bfqq)
>> +{
>> +	unsigned long budget;
>> +
>> +	/*
>> +	 * When we need an estimate of the peak rate we need to avoid
>> +	 * to give budgets that are too short due to previous =
measurements.
>> +	 * So, in the first 10 assignments use a ``safe'' budget value.
>> +	 */
>> +	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget =3D=3D=
 0)
>> +		budget =3D bfq_default_max_budget;
>> +	else
>> +		budget =3D bfqd->bfq_max_budget;
>> +
>> +	return budget - budget / 4;
>> +}
>=20
> Where does the magic constant "194" come from?
>=20
>=20
>> +	} else
>> +		/*
>> +		 * Async queues get always the maximum possible
>> +		 * budget, as for them we do not care about latency
>> +		 * (in addition, their ability to dispatch is limited
>> +		 * by the charging factor).
>> +		 */
>> +		budget =3D bfqd->bfq_max_budget;
>> +
>=20
> Please balance braces. Checkpatch should have warned about the use of =
"}
> else" instead of "} else {".
>=20
>> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
>> +{
>> +	unsigned long max_budget;
>> +
>> +	/*
>> +	 * The max_budget calculated when autotuning is equal to the
>> +	 * amount of sectors transferred in timeout at the
>> +	 * estimated peak rate.
>> +	 */
>> +	max_budget =3D (unsigned long)(peak_rate * 1000 *
>> +				     timeout >> BFQ_RATE_SHIFT);
>> +
>> +	return max_budget;
>> +}
>=20
> Where does the constant 1000 come from? What are the units of =
peak_rate
> and timeout? What is the maximum value of peak_rate? Can the
> multiplication overflow?
>=20
>> +/*
>> + * In addition to updating the peak rate, checks whether the process
>> + * is "slow", and returns 1 if so. This slow flag is used, in =
addition
>> + * to the budget timeout, to reduce the amount of service provided =
to
>> + * seeky processes, and hence reduce their chances to lower the
>> + * throughput. See the code for more details.
>> + */
>> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct =
bfq_queue *bfqq,
>> +				 bool compensate)
>> +{
>> +	u64 bw, usecs, expected, timeout;
>> +	ktime_t delta;
>> +	int update =3D 0;
>> +
>> +	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> +		return false;
>> +
>> +	if (compensate)
>> +		delta =3D bfqd->last_idling_start;
>> +	else
>> +		delta =3D ktime_get();
>> +	delta =3D ktime_sub(delta, bfqd->last_budget_start);
>> +	usecs =3D ktime_to_us(delta);
>> +
>> +	/* Don't trust short/unrealistic values. */
>> +	if (usecs < 100 || usecs >=3D LONG_MAX)
>> +		return false;
>=20
> If usecs >=3D LONG_MAX that indicates a kernel bug. Please consider
> triggering a kernel warning in that case.
>=20
>> +/*
>> + * Budget timeout is not implemented through a dedicated timer, but
>> + * just checked on request arrivals and completions, as well as on
>> + * idle timer expirations.
>> + */
>> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
>> +{
>> +	if (bfq_bfqq_budget_new(bfqq) ||
>> +	    time_before(jiffies, bfqq->budget_timeout))
>> +		return false;
>> +	return true;
>> +}
>=20
> Have you considered to use time_is_after_jiffies() instead of
> time_before(jiffies, ...)?
>=20
>> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue =
*bfqq,
>> +			  struct bfq_io_cq *bic, pid_t pid, int is_sync)
>> +{
>> +	RB_CLEAR_NODE(&bfqq->entity.rb_node);
>> +	INIT_LIST_HEAD(&bfqq->fifo);
>> +
>> +	bfqq->ref =3D 0;
>> +	bfqq->bfqd =3D bfqd;
>> +
>> +	if (bic)
>> +		bfq_set_next_ioprio_data(bfqq, bic);
>> +
>> +	if (is_sync) {
>> +		if (!bfq_class_idle(bfqq))
>> +			bfq_mark_bfqq_idle_window(bfqq);
>> +		bfq_mark_bfqq_sync(bfqq);
>> +	} else
>> +		bfq_clear_bfqq_sync(bfqq);
>> +
>> +	bfqq->ttime.last_end_request =3D ktime_get_ns() - (1ULL<<32);
>> +
>> +	bfq_mark_bfqq_IO_bound(bfqq);
>> +
>> +	bfqq->pid =3D pid;
>> +
>> +	/* Tentative initial value to trade off between thr and lat */
>> +	bfqq->max_budget =3D bfq_default_budget(bfqd, bfqq);
>> +	bfqq->budget_timeout =3D bfq_smallest_from_now();
>> +	bfqq->pid =3D pid;
>> +
>> +	/* first request is almost certainly seeky */
>> +	bfqq->seek_history =3D 1;
>> +}
>=20
> What is the meaning of the 1ULL << 32 constant?
>=20
>> +static int __init bfq_init(void)
>> +{
>> +	int ret;
>> +	char msg[50] =3D "BFQ I/O-scheduler: v0";
>=20
> Please leave out "[50]" and use "static const char" instead of "char".
>=20
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 01139f5..786fdcd 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char =
*name)
>>=20
>> 	if (!e) {
>> 		/*
>> -		 * For blk-mq devices, we default to using mq-deadline,
>> -		 * if available, for single queue devices. If deadline
>> -		 * isn't available OR we have multiple queues, default
>> -		 * to "none".
>> +		 * For blk-mq devices, we default to using bfq, if
>> +		 * available, for single queue devices. If bfq isn't
>> +		 * available, we try mq-deadline. If neither is
>> +		 * available, OR we have multiple queues, default to
>> +		 * "none".
>> 		 */
>> 		if (q->mq_ops) {
>> +			if (q->nr_hw_queues =3D=3D 1) {
>> +				e =3D elevator_get("bfq", false);
>> +				if (!e)
>> +					e =3D =
elevator_get("mq-deadline", false);
>> +			}
>> 			if (q->nr_hw_queues =3D=3D 1)
>> -				e =3D elevator_get("mq-deadline", =
false);
>> +				e =3D elevator_get("bfq", false);
>> 			if (!e)
>> 				return 0;
>> 		} else
>>=20
>=20
> As Jens wrote, it's way too early to make BFQ the default scheduler.
>=20
> Bart.

^ permalink raw reply

* Re: [PATCH RFC 13/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
From: Paolo Valente @ 2017-03-14 14:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj@kernel.org, axboe@kernel.dk, ulf.hansson@linaro.org,
	linux-kernel@vger.kernel.org, fchecconi@gmail.com,
	Arianna Avanzini, linux-block@vger.kernel.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <1488848033.3125.12.camel@sandisk.com>


> Il giorno 07 mar 2017, alle ore 01:54, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Sat, 2017-03-04 at 17:01 +0100, Paolo Valente wrote:
>> @@ -8301,7 +8297,7 @@ static struct blkcg_policy blkcg_policy_bfq =3D =
{
>> static int __init bfq_init(void)
>> {
>> 	int ret;
>> -	char msg[50] =3D "BFQ I/O-scheduler: v6";
>> +	char msg[50] =3D "BFQ I/O-scheduler: v7r3";
>>=20
>> #ifdef CONFIG_BFQ_GROUP_IOSCHED
>> 	ret =3D blkcg_policy_register(&blkcg_policy_bfq);
>=20
> In the Linux kernel the kernel as a whole has a version number but we
> do not assign a version number to individual core components.
>=20

I'll remove that.

Thanks,
Paolo

> Bart.

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Paolo Valente @ 2017-03-14 14:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Tejun Heo, Fabio Checconi, Arianna Avanzini,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	ulf.hansson@linaro.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA0DA827@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>


> Il giorno 07 mar 2017, alle ore 01:22, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> Some patch generates WARNINGS with checkpatch.pl, but these WARNINGS
>> seem to be either unavoidable for the involved pieces of code (which
>> the patch just extends), or false positives.
>=20
> The code in this series looks reasonably clean from a code style point
> of view,

Great, thanks!

> but please address all checkpatch warnings that can be
> addressed easily. A few examples of such checkpatch warnings:
>=20
> ERROR: "foo * bar" should be "foo *bar"
>=20

The offending line is:
*(__PTR) =3D (u64)__data * NSEC_PER_USEC;

so this seems a false positive.

> WARNING: Symbolic permissions 'S_IRUGO|S_IWUSR' are not preferred.
> Consider using octal permissions '0644'.
>=20

I have used symbolic permissions because I find them much easier to
remember and decode than numeric constants, and because it is done so
in cfq-iosched.c, deadline-iosched.c and now mq-deadline.c.  But,
since you share this checkpatch complain, I will switch to constants.

Thanks,
Paolo

> Thanks,
>=20
> Bart.

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Adrian Hunter @ 2017-03-14 12:59 UTC (permalink / raw)
  To: Jens Axboe, Linus Walleij
  Cc: linux-mmc@vger.kernel.org, Ulf Hansson, Paolo Valente,
	Chunyan Zhang, Baolin Wang, linux-block, Christoph Hellwig,
	Arnd Bergmann
In-Reply-To: <97f2a4ea-0802-9f23-6ac7-b9d6c6afbfcc@kernel.dk>

On 13/03/17 16:19, Jens Axboe wrote:
> On 03/13/2017 03:25 AM, Adrian Hunter wrote:
>> On 11/03/17 00:05, Jens Axboe wrote:
>>> On 03/10/2017 07:21 AM, Adrian Hunter wrote:
>>>>> Essentially I take out that thread and replace it with this one worker
>>>>> introduced in this very patch. I agree the driver can block in many ways
>>>>> and that is why I need to have it running in process context, and this
>>>>> is what the worker introduced here provides.
>>>>
>>>> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
>>>> qdepth requests from the I/O scheduler and left them on a local list while
>>>> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
>>>> number of requests in limbo (not issued but also not in the I/O scheduler)
>>>> for that time.
>>>
>>> Look again, if we're not handling the requeued dispatches, we pull one
>>> at the time from the scheduler.
>>>
>>
>> That's good :-)
>>
>> Now the next thing ;-)
>>
>> It looks like we either set BLK_MQ_F_BLOCKING and miss the possibility of
>> issuing synchronous requests immediately, or we don't set BLK_MQ_F_BLOCKING
>> in which case we are never allowed to sleep in ->queue_rq().  Is that true?
> 
> Only one of those statements is true - if you don't set BLK_MQ_F_BLOCKING,
> then you may never block in your ->queue_rq() function. But if you do set it,
> it does not preclude immediate issue of sync requests.

I meant it gets put to the workqueue rather than issued in the context of
the submitter.

^ permalink raw reply

* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-03-14 11:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Fabio Checconi, Arianna Avanzini, linux-block,
	Linux-Kernal, Ulf Hansson, Linus Walleij, broonie
In-Reply-To: <7654009c-f38b-20f8-7536-2982338a8228@kernel.dk>


> Il giorno 06 mar 2017, alle ore 21:46, Jens Axboe <axboe@kernel.dk> ha =
scritto:
>=20
> On 03/05/2017 09:02 AM, Paolo Valente wrote:
>>=20
>>> Il giorno 05 mar 2017, alle ore 16:16, Jens Axboe <axboe@kernel.dk> =
ha scritto:
>>>=20
>>> On 03/04/2017 09:01 AM, Paolo Valente wrote:
>>>> We tag as v0 the version of BFQ containing only BFQ's engine plus
>>>> hierarchical support. BFQ's engine is introduced by this commit, =
while
>>>> hierarchical support is added by next commit. We use the v0 tag to
>>>> distinguish this minimal version of BFQ from the versions =
containing
>>>> also the features and the improvements added by next commits. =
BFQ-v0
>>>> coincides with the version of BFQ submitted a few years ago [1], =
apart
>>>> from the introduction of preemption, described below.
>>>>=20
>>>> BFQ is a proportional-share I/O scheduler, whose general structure,
>>>> plus a lot of code, are borrowed from CFQ.
>>>=20
>>> I'll take a closer look at this in the coming week.
>>=20
>> ok
>>=20
>>> But one quick
>>> comment - don't default to BFQ. Both because it might not be fully
>>> stable yet, and also because the performance limitation of it is
>>> quite severe. Whereas deadline doesn't really hurt single queue
>>> flash at all, BFQ will.
>>>=20
>>=20
>> Ok, sorry.  I was doubtful on what to do, but, to not bother you on
>> every details, I went for setting it as default, because I thought
>> people would have preferred to test it, even from boot, in this
>> preliminary stage.  I reset elevator.c in the submission, unless you
>> want me to do it even before receiving your and others' reviews.
>=20
> I don't think it's stable enough for that yet, it's seen very little
> testing outside of your own testing.

Hi Jens.

Yes, sorry, in general people of course use a kernel for a little bit
more than just testing bfq ...

> Given that, it's much better that
> people opt in to testing BFQ, so they at least know they can expect
> crashes.
>=20
> Speaking of testing, I ran into this bug:
>=20
> [ 9469.621413] general protection fault: 0000 [#1] PREEMPT SMP
> [ 9469.627872] Modules linked in: loop dm_mod xfs libcrc32c =
bfq_iosched x86_pkg_temp_thermal btrfe
> [ 9469.648196] CPU: 0 PID: 2114 Comm: kworker/0:1H Tainted: G        W =
      4.11.0-rc1+ #249
> [ 9469.657873] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS =
2.3.4 11/09/2016
> [ 9469.666742] Workqueue: xfs-log/nvme4n1p1 xfs_buf_ioend_work [xfs]
> [ 9469.674213] task: ffff881fe97646c0 task.stack: ffff881ff13d0000
> [ 9469.681053] RIP: 0010:__bfq_bfqq_expire+0xb3/0x110 [bfq_iosched]
> [ 9469.687991] RSP: 0018:ffff881fff603dd8 EFLAGS: 00010082
> [ 9469.694052] RAX: 6b6b6b6b6b6b6b6b RBX: ffff883fe8e0eb58 RCX: =
0000000000010004
> [ 9469.702251] RDX: 0000000000010004 RSI: 0000000000000000 RDI: =
00000000ffffffff
> [ 9469.710456] RBP: ffff881fff603de8 R08: 0000000000000000 R09: =
0000000000000001
> [ 9469.718659] R10: 0000000000000001 R11: 0000000000000000 R12: =
ffff883fe8dbf4e8
> [ 9469.726863] R13: 0000000000000000 R14: 0000000000000001 R15: =
0000000000007904
> [ 9469.735063] FS:  0000000000000000(0000) GS:ffff881fff600000(0000) =
knlGS:0000000000000000
> [ 9469.744539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9469.751189] CR2: 00000000020c0018 CR3: 0000001fec1db000 CR4: =
00000000003406f0
> [ 9469.759392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: =
0000000000000000
> [ 9469.767596] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: =
0000000000000400
> [ 9469.775794] Call Trace:
> [ 9469.778748]  <IRQ>
> [ 9469.781222]  bfq_bfqq_expire+0x104/0x2f0 [bfq_iosched]
> [ 9469.787193]  ? bfq_idle_slice_timer+0x2a/0xc0 [bfq_iosched]
> [ 9469.793650]  bfq_idle_slice_timer+0x7c/0xc0 [bfq_iosched]
> [ 9469.799914]  __hrtimer_run_queues+0xd9/0x500
> [ 9469.804911]  ? bfq_rq_enqueued+0x340/0x340 [bfq_iosched]
> [ 9469.811072]  hrtimer_interrupt+0xb0/0x200
> [ 9469.815781]  local_apic_timer_interrupt+0x31/0x50
> [ 9469.821264]  smp_apic_timer_interrupt+0x33/0x50
> [ 9469.826555]  apic_timer_interrupt+0x90/0xa0
>=20
> just running the xfstest suite. It's test generic/299. Have you done
> full runs of xfstest? I'd greatly recommend that for shaking out bugs.
> Run a full loop with xfs, one with btrfs, and one with ext4 for better
> confidence in the stability of the code.
>=20

Thanks for this information and suggestions.  I'm running xfstests
right now, to reproduce at least the failure you spotted.

Thanks,
Paolo

> (gdb) l *__bfq_bfqq_expire+0xb3
> 0x5983 is in __bfq_bfqq_expire (block/bfq-iosched.c:2664).
> 2659		 * been properly deactivated or requeued, so we can =
safely
> 2660		 * execute the final step: reset in_service_entity along =
the
> 2661		 * path from entity to the root.
> 2662		 */
> 2663		for_each_entity(entity)
> 2664			entity->sched_data->in_service_entity =3D NULL;
> 2665	}
> 2666=09
> 2667	static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct =
bfq_queue *bfqq,
> 2668					bo

^ permalink raw reply

* Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: hch @ 2017-03-13 23:17 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org
In-Reply-To: <1489438828.2658.23.camel@sandisk.com>

On Mon, Mar 13, 2017 at 09:01:08PM +0000, Bart Van Assche wrote:
> > +		/*
> > +		 * @request_count may become stale because of schedule
> > +		 * out, so check the list again.
> > +		 */
> 
> The above comment was relevant as long as there was a request_count assignment
> above blk_mq_sched_get_request(). This patch moves that assignment inside if
> (plug && q->nr_hw_queues == 1). Does that mean that the above comment should be
> removed entirely?

I don't think so - for the !blk_queue_nomerges cases we still rely
on blk_attempt_plug_merge calculatіng request_count, so the comment
still applies.

^ permalink raw reply

* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: hch @ 2017-03-13 23:16 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org
In-Reply-To: <1489438361.2658.21.camel@sandisk.com>

On Mon, Mar 13, 2017 at 08:52:54PM +0000, Bart Van Assche wrote:
> > -	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> > -	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > +	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> 
> A minor comment: due to this change the outer pair of parentheses
> became superfluous. Please consider removing these.

The last patch in the series removes the statement in this form. But
if I have to respin the series for some reason I'll make sure it
gets removed here already.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox