Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH v2 4/5] blk-mq: export blk_mq_finish_request()
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This is required for schedulers that define their own put_request().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3dbfed6d0e5b..e75fa7a1a653 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -368,6 +368,7 @@ void blk_mq_finish_request(struct request *rq)
 {
 	blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
 }
+EXPORT_SYMBOL_GPL(blk_mq_finish_request);
 
 void blk_mq_free_request(struct request *rq)
 {
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 3/5] blk-mq-sched: make completed_request() callback more useful
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.h     | 11 +++--------
 block/blk-mq.c           |  5 ++++-
 include/linux/elevator.h |  2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index c6e760df0fb4..d846e07bb564 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -82,17 +82,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 	return true;
 }
 
-static inline void
-blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct elevator_queue *e = rq->q->elevator;
 
 	if (e && e->type->ops.mq.completed_request)
-		e->type->ops.mq.completed_request(hctx, rq);
-
-	BUG_ON(rq->internal_tag == -1);
-
-	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
+		e->type->ops.mq.completed_request(rq);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b49fdfde7e06..3dbfed6d0e5b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -350,7 +350,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
-		blk_mq_sched_completed_request(hctx, rq);
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart_queues(hctx);
 	blk_queue_exit(q);
 }
@@ -443,6 +443,9 @@ static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	if (rq->internal_tag != -1)
+		blk_mq_sched_completed_request(rq);
+
 	blk_mq_stat_add(rq);
 
 	if (!q->softirq_done_fn)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b7ec315ee7e7..3a216318ae73 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -106,7 +106,7 @@ struct elevator_mq_ops {
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
 	bool (*has_work)(struct blk_mq_hw_ctx *);
-	void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+	void (*completed_request)(struct request *);
 	void (*started_request)(struct request *);
 	void (*requeue_request)(struct request *);
 	struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 0/5] blk-mq: Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is v2 of Kyber, an I/O scheduler for multiqueue devices combining
several techniques: the scalable bitmap library, the new blk-stats API,
and queue depth throttling similar to blk-wbt. v1 was here [1].

v2 adds a tunable target synchronous write latency. The heuristics are
more fleshed out and balance read and synchronous write latencies based
on the latency targets. The queueing and dispatch mechanism is the same
as before.

This series is based on block/for-next + the initialization fix series I
sent out yesterday [2]. Patches 1 and 2 implement a new sbitmap
operation. Patch 3 moves a scheduler callback to somewhere more useful.
Patch 4 exports a required function. Patch 5 implements the new
scheduler.

Thanks!

1: http://marc.info/?l=linux-block&m=148978871820916&w=2
2: http://marc.info/?l=linux-block&m=149125578724683&w=2

Omar Sandoval (5):
  sbitmap: add sbitmap_get_shallow() operation
  blk-mq: add shallow depth option for blk_mq_get_tag()
  blk-mq-sched: make completed_request() callback more useful
  blk-mq: export blk_mq_finish_request()
  blk-mq: introduce Kyber multiqueue I/O scheduler

 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched                 |   9 +
 block/Makefile                        |   1 +
 block/blk-mq-sched.h                  |  11 +-
 block/blk-mq-tag.c                    |   5 +-
 block/blk-mq.c                        |   6 +-
 block/blk-mq.h                        |   1 +
 block/kyber-iosched.c                 | 706 ++++++++++++++++++++++++++++++++++
 include/linux/elevator.h              |   2 +-
 include/linux/sbitmap.h               |  55 +++
 lib/sbitmap.c                         |  75 +++-
 11 files changed, 867 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

-- 
2.12.2

^ permalink raw reply

* [PATCH v2 2/5] blk-mq: add shallow depth option for blk_mq_get_tag()
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-tag.c | 5 ++++-
 block/blk-mq.h     | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..d0be72ccb091 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
 	    !hctx_may_queue(data->hctx, bt))
 		return -1;
-	return __sbitmap_queue_get(bt);
+	if (data->shallow_depth)
+		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
+	else
+		return __sbitmap_queue_get(bt);
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..77ec66369f21 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -141,6 +141,7 @@ struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
 	unsigned int flags;
+	unsigned int shallow_depth;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 1/5] sbitmap: add sbitmap_get_shallow() operation
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/sbitmap.h | 55 ++++++++++++++++++++++++++++++++++++
 lib/sbitmap.c           | 75 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 123 insertions(+), 7 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d4e0a204c118..a1904aadbc45 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -176,6 +176,25 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth);
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin);
 
 /**
+ * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
+ * limiting the depth used from each word.
+ * @sb: Bitmap to allocate from.
+ * @alloc_hint: Hint for where to start searching for a free bit.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ *
+ * This rather specific operation allows for having multiple users with
+ * different allocation limits. E.g., there can be a high-priority class that
+ * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
+ * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
+ * class can only allocate half of the total bits in the bitmap, preventing it
+ * from starving out the high-priority class.
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth);
+
+/**
  * sbitmap_any_bit_set() - Check for a set bit in a &struct sbitmap.
  * @sb: Bitmap to check.
  *
@@ -326,6 +345,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth);
 int __sbitmap_queue_get(struct sbitmap_queue *sbq);
 
 /**
+ * __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word, with preemption
+ * already disabled.
+ * @sbq: Bitmap queue to allocate from.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth);
+
+/**
  * sbitmap_queue_get() - Try to allocate a free bit from a &struct
  * sbitmap_queue.
  * @sbq: Bitmap queue to allocate from.
@@ -346,6 +378,29 @@ static inline int sbitmap_queue_get(struct sbitmap_queue *sbq,
 }
 
 /**
+ * sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word.
+ * @sbq: Bitmap queue to allocate from.
+ * @cpu: Output parameter; will contain the CPU we ran on (e.g., to be passed to
+ *       sbitmap_queue_clear()).
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+static inline int sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+					    unsigned int *cpu,
+					    unsigned int shallow_depth)
+{
+	int nr;
+
+	*cpu = get_cpu();
+	nr = __sbitmap_queue_get_shallow(sbq, shallow_depth);
+	put_cpu();
+	return nr;
+}
+
+/**
  * sbitmap_queue_clear() - Free an allocated bit and wake up waiters on a
  * &struct sbitmap_queue.
  * @sbq: Bitmap to free from.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 60e800e0b5a0..80aa8d5463fa 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -79,15 +79,15 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
 
-static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
-			      bool wrap)
+static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
+			      unsigned int hint, bool wrap)
 {
 	unsigned int orig_hint = hint;
 	int nr;
 
 	while (1) {
-		nr = find_next_zero_bit(&word->word, word->depth, hint);
-		if (unlikely(nr >= word->depth)) {
+		nr = find_next_zero_bit(word, depth, hint);
+		if (unlikely(nr >= depth)) {
 			/*
 			 * We started with an offset, and we didn't reset the
 			 * offset to 0 in a failure case, so start from 0 to
@@ -100,11 +100,11 @@ static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
 			return -1;
 		}
 
-		if (!test_and_set_bit(nr, &word->word))
+		if (!test_and_set_bit(nr, word))
 			break;
 
 		hint = nr + 1;
-		if (hint >= word->depth - 1)
+		if (hint >= depth - 1)
 			hint = 0;
 	}
 
@@ -119,7 +119,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 	index = SB_NR_TO_INDEX(sb, alloc_hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = __sbitmap_get_word(&sb->map[index],
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					sb->map[index].depth,
 					SB_NR_TO_BIT(sb, alloc_hint),
 					!round_robin);
 		if (nr != -1) {
@@ -141,6 +142,37 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
 }
 EXPORT_SYMBOL_GPL(sbitmap_get);
 
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+			unsigned long shallow_depth)
+{
+	unsigned int i, index;
+	int nr = -1;
+
+	index = SB_NR_TO_INDEX(sb, alloc_hint);
+
+	for (i = 0; i < sb->map_nr; i++) {
+		nr = __sbitmap_get_word(&sb->map[index].word,
+					min(sb->map[index].depth, shallow_depth),
+					SB_NR_TO_BIT(sb, alloc_hint), true);
+		if (nr != -1) {
+			nr += index << sb->shift;
+			break;
+		}
+
+		/* Jump to next index. */
+		index++;
+		alloc_hint = index << sb->shift;
+
+		if (index >= sb->map_nr) {
+			index = 0;
+			alloc_hint = 0;
+		}
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(sbitmap_get_shallow);
+
 bool sbitmap_any_bit_set(const struct sbitmap *sb)
 {
 	unsigned int i;
@@ -342,6 +374,35 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq)
 }
 EXPORT_SYMBOL_GPL(__sbitmap_queue_get);
 
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+				unsigned int shallow_depth)
+{
+	unsigned int hint, depth;
+	int nr;
+
+	hint = this_cpu_read(*sbq->alloc_hint);
+	depth = READ_ONCE(sbq->sb.depth);
+	if (unlikely(hint >= depth)) {
+		hint = depth ? prandom_u32() % depth : 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+	nr = sbitmap_get_shallow(&sbq->sb, hint, shallow_depth);
+
+	if (nr == -1) {
+		/* If the map is full, a hint won't do us much good. */
+		this_cpu_write(*sbq->alloc_hint, 0);
+	} else if (nr == hint || unlikely(sbq->round_robin)) {
+		/* Only update the hint if we used it. */
+		hint = nr + 1;
+		if (hint >= depth - 1)
+			hint = 0;
+		this_cpu_write(*sbq->alloc_hint, hint);
+	}
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(__sbitmap_queue_get_shallow);
+
 static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 {
 	int i, wake_index;
-- 
2.12.2

^ permalink raw reply related

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Bart Van Assche @ 2017-04-04 16:25 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, linux-block, Linux SCSI List
In-Reply-To: <CACVXFVPyWL8XyFKC9uT7r4q+QT1+_RvnvJnOskCO3aSnUEVY1w@mail.gmail.com>

On 04/04/2017 09:00 AM, Ming Lei wrote:=0A=
> Just be curious, why is multi hard queue used for this case? Are there=0A=
> some real cases in SCSI?=0A=
=0A=
Hello Ming,=0A=
=0A=
Yes, there is a real need for this. Background information is available=0A=
in the following e-mail thread: Arun Easi, "scsi-mq - tag# and=0A=
can_queue, performance", April 2, 2017, linux-scsi=0A=
(http://www.spinics.net/lists/linux-scsi/msg106853.html).=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Ming Lei @ 2017-04-04 15:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, Linux SCSI List
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

On Tue, Apr 4, 2017 at 8:07 PM, Hannes Reinecke <hare@suse.de> wrote:
> Hi all,
>
> as discussed recently most existing HBAs have a host-wide tagset which
> does not map easily onto the per-queue tagset model of block mq.
> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
> enables the use of a shared tagset for all hardware queues.
> The second patch adds a flag 'host_tagset' to the SCSI host template,
> which allows drivers to enable the use of the global tagset.
>
> This patchset probably has some performance implications as
> there is a quite high probability of cache-bouncing when allocating
> tags. Also I'm not quite sure if the implemented tagset sharing
> is the correct way to handle things.
> So this can be considered an RFC.
>
> As usual, comments and reviews are welcome.

Just be curious, why is multi hard queue used for this case? Are there
some real cases in SCSI?


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Bart Van Assche @ 2017-04-04 15:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Martin K . Petersen,
	James Bottomley, Hannes Reinecke
In-Reply-To: <20170404064252.GD9168@lst.de>

On 04/03/2017 11:42 PM, Christoph Hellwig wrote:=0A=
>> +static void scsi_restart_hctx(struct request_queue *q,=0A=
>> +			      struct blk_mq_hw_ctx *hctx)=0A=
>> +{=0A=
>> +	struct blk_mq_tags *tags =3D hctx->tags;=0A=
>> +	struct blk_mq_tag_set *set =3D q->tag_set;=0A=
>> +	int i;=0A=
>> +=0A=
>> +	rcu_read_lock();=0A=
>> +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)=0A=
>> +		queue_for_each_hw_ctx(q, hctx, i)=0A=
>> +			if (hctx->tags =3D=3D tags)=0A=
>> +				blk_mq_sched_restart_hctx(hctx);=0A=
>> +	rcu_read_unlock();=0A=
>> +}=0A=
> =0A=
> This looks like generic block layer code, why is it in SCSI?=0A=
=0A=
Hello Christoph,=0A=
=0A=
That's an excellent question. I assume that you are fine with moving=0A=
this code to the block layer?=0A=
=0A=
Bart.=0A=
=0A=

^ permalink raw reply

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Ming Lei @ 2017-04-04 15:56 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <871st8kxnk.fsf@dmlp.sw.ru>

On Tue, Apr 4, 2017 at 11:19 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> Ming Lei <tom.leiming@gmail.com> writes:
>
>> On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>>> Currently if some one try to advance bvec beyond it's size we simply
>>> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
>>> This simply means that we endup dereferencing/corrupting random memory
>>> region.
>>>
>>> Sane reaction would be to propagate error back to calling context
>>> But bvec_iter_advance's calling context is not always good for error
>>> handling. For safity reason let truncate iterator size to zero which
>>
>> IMO, we can avoid continuing to iterate by checking the return value,
>> and looks it is rude to just set iterator size as 0.
> But situation itself is horrible already. IMHO this is BUG_ON situation,

Not sure it is a real BUG_ON() since corrupt bvec array shouldn't happen
because we usually don't modify bvec array directly and just copy to local
variable for further access, but dereferencing random memory might happen.

> but since Linus hate bugons, I try to replace with something loud, but
> very safe. Since there is no guarantee that caller will ignore an error
> and try to dereference bvec the only safe thing we can to is to clamp
> iterator to zero to prevent any possible usage in future.

Since you prevent the case from happening, no dereference invalid bvec
can happen any more, but may cause dead loop. Setting iterator size as
zero can break the dead loop, but the driver still may not know this error
and continue to do its following work.

Don't have a better idea now, and looks it is fine to set iter.bi_size as
zero at the beginning of guarding bvec iteration.

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-04 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Martin K . Petersen,
	James Bottomley, Hannes Reinecke
In-Reply-To: <20170404064038.GB9168@lst.de>

On 04/03/2017 11:40 PM, Christoph Hellwig wrote:=0A=
> On Mon, Apr 03, 2017 at 04:22:25PM -0700, Bart Van Assche wrote:=0A=
>> A later patch in this series will namely use RCU to iterate over=0A=
>> this list.=0A=
> =0A=
> It also adds a couple lockdep_assert_held calls, which might be worth=0A=
> mentioning.=0A=
> =0A=
> Otherwise looks good:=0A=
> =0A=
> Reviewed-by: Christoph Hellwig <hch@lst.de>=0A=
=0A=
Thanks for the review. I will update the patch description.=0A=
=0A=
Bart.=0A=
=0A=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Bart Van Assche @ 2017-04-04 15:46 UTC (permalink / raw)
  To: osandov@osandov.com, hare@suse.de
  Cc: hch@lst.de, james.bottomley@hansenpartnership.com,
	linux-scsi@vger.kernel.org, osandov@fb.com,
	linux-block@vger.kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
In-Reply-To: <20170404153234.GA3234@vader>

On Tue, 2017-04-04 at 08:32 -0700, Omar Sandoval wrote:
> blk-mq already supports a shared tagset, and scsi-mq already uses that.
> When we initialize a request queue, we add it to a tagset with
> blk_mq_add_queue_set(), where we automatically mark the tagset as shared
> if there is more than one queue using it. What does this do that
> BLK_MQ_F_TAG_SHARED doesn't cover?

Hello Omar,

Today blk-mq creates one tag set per hardware queue. The sharing by
scsi-mq is between request queues for hardware queues that have the same
index but not between hardware queues of a single request queue. My
understanding is that the goal of this patch series is to make it possible
to use a single tag set for all hardware queues and all request queues that
share the same SCSI host.

Bart.=

^ permalink raw reply

* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Omar Sandoval @ 2017-04-04 15:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

On Tue, Apr 04, 2017 at 02:07:43PM +0200, Hannes Reinecke wrote:
> Hi all,
> 
> as discussed recently most existing HBAs have a host-wide tagset which
> does not map easily onto the per-queue tagset model of block mq.
> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
> enables the use of a shared tagset for all hardware queues.
> The second patch adds a flag 'host_tagset' to the SCSI host template,
> which allows drivers to enable the use of the global tagset.
> 
> This patchset probably has some performance implications as
> there is a quite high probability of cache-bouncing when allocating
> tags. Also I'm not quite sure if the implemented tagset sharing
> is the correct way to handle things.
> So this can be considered an RFC.
> 
> As usual, comments and reviews are welcome.

Hi, Hannes,

blk-mq already supports a shared tagset, and scsi-mq already uses that.
When we initialize a request queue, we add it to a tagset with
blk_mq_add_queue_set(), where we automatically mark the tagset as shared
if there is more than one queue using it. What does this do that
BLK_MQ_F_TAG_SHARED doesn't cover?

^ permalink raw reply

* Re: [PATCH] Block SQ/MQ Request Priority
From: Bart Van Assche @ 2017-04-04 15:31 UTC (permalink / raw)
  To: Adam Manzanares, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, axboe@kernel.dk
  Cc: hch@infradead.org
In-Reply-To: <1491319514-6671-1-git-send-email-adam.manzanares@wdc.com>

On Tue, 2017-04-04 at 08:25 -0700, adam.manzanares@wdc.com wrote:
>=20

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator
From: Bart Van Assche @ 2017-04-04 15:28 UTC (permalink / raw)
  To: paolo.valente@linaro.org
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
	avanzini.arianna@gmail.com, broonie@kernel.org, tj@kernel.org,
	ulf.hansson@linaro.org
In-Reply-To: <867BBC1E-080B-4A4B-88CE-BD103610BA6C@linaro.org>

On Tue, 2017-04-04 at 12:42 +0200, Paolo Valente wrote:
> > Il giorno 31 mar 2017, alle ore 17:31, Bart Van Assche <bart.vanassche@=
sandisk.com> ha scritto:
> >=20
> > On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
> > > +               delta_ktime =3D ktime_get();
> > > +       delta_ktime =3D ktime_sub(delta_ktime, bfqd->last_budget_star=
t);
> > > +       delta_usecs =3D ktime_to_us(delta_ktime);
> >=20
> > This patch changes the type of the variable in which the result of ktim=
e_to_us()
> > is stored from u64 into u32 and next compares that result with LONG_MAX=
. Since
> > ktime_to_us() returns a signed 64-bit number, are you sure you want to =
store that
> > result in a 32-bit variable? If ktime_to_us() would e.g. return 0xfffff=
fff00000100
> > or 0x100000100 then the assignment will truncate these numbers to 0x100=
.
>=20
> The instruction above the assignment you highlight stores in
> delta_ktime the difference between 'now' and the last budget start.
> The latter may have happened at most about 100 ms before 'now'.  So
> there should be no overflow issue.

Hello Paolo,

Please double check the following code: if (delta_usecs < 1000 || delta_use=
cs >=3D LONG_MAX)
Since delta_usecs is a 32-bit variable and LONG_MAX a 64-bit constant on 64=
-bit systems
I'm not sure that code will do what it is intended to do.

Thanks,

Bart.=

^ permalink raw reply

* [PATCH] Block SQ/MQ Request Priority
From: adam.manzanares @ 2017-04-04 15:25 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel; +Cc: hch, Bart.VanAssche, Adam Manzanares

From: Adam Manzanares <adam.manzanares@wdc.com>

In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.

I have tested this patch with the null block device driver with the following
parameters.

null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1

I have not seen a performance regression with this patch and I would appreciate
any feedback or additional testing.

I have also verified that io priorities are passed to the device when using
the SQ and MQ path to a SATA HDD that supports io priorities.

Thanks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..316a539 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1149,7 +1149,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,

 	blk_rq_init(q, rq);
 	blk_rq_set_rl(rq, rl);
-	blk_rq_set_prio(rq, ioc);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;

@@ -1636,6 +1635,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)

 	req->errors = 0;
 	req->__sector = bio->bi_iter.bi_sector;
+	blk_rq_set_prio(req, rq_ioc(bio));
 	if (ioprio_valid(bio_prio(bio)))
 		req->ioprio = bio_prio(bio);
 	blk_rq_bio_prep(req->q, req, bio);
-- 
2.7.4

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

^ permalink raw reply related

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Dmitry Monakhov @ 2017-04-04 15:19 UTC (permalink / raw)
  To: Ming Lei; +Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <CACVXFVPWe0ABh5CsXryqXi7Umtvvzcq=jp2n+t=JDRWSZJsbFA@mail.gmail.com>

Ming Lei <tom.leiming@gmail.com> writes:

> On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>> Currently if some one try to advance bvec beyond it's size we simply
>> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
>> This simply means that we endup dereferencing/corrupting random memory
>> region.
>>
>> Sane reaction would be to propagate error back to calling context
>> But bvec_iter_advance's calling context is not always good for error
>> handling. For safity reason let truncate iterator size to zero which
>
> IMO, we can avoid continuing to iterate by checking the return value,
> and looks it is rude to just set iterator size as 0.
But situation itself is horrible already. IMHO this is BUG_ON situation,
but since Linus hate bugons, I try to replace with something loud, but
very safe. Since there is no guarantee that caller will ignore an error
and try to dereference bvec the only safe thing we can to is to clamp
iterator to zero to prevent any possible usage in future.
>
>> will break external iteration loop which prevent us from unpredictable
>> memory range corruption. And even it caller ignores an error, it will
>> corrupt it's own bvecs, not others.
>>
>> This patch does:
>> - Return error back to caller with hope that it will react on this
>> - Truncate iterator size
>>
>> Code was added long time ago here 4550dd6c, luckily no one hit it
>> in real life :)
>>
>> changes since V1:
>>  - Replace  BUG_ON with error logic.
>>
>> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
>> ---
>>  drivers/nvdimm/blk.c |  4 +++-
>>  drivers/nvdimm/btt.c |  4 +++-
>>  include/linux/bio.h  |  8 ++++++--
>>  include/linux/bvec.h | 11 ++++++++---
>>  4 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
>> index 1edb3f3..04c3075 100644
>> --- a/drivers/nvdimm/blk.c
>> +++ b/drivers/nvdimm/blk.c
>> @@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
>>
>>                 len -= cur_len;
>>                 dev_offset += cur_len;
>> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               if (err)
>> +                       return err;
>>         }
>>
>>         return err;
>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>> index 03ded8d..3f3aa7b 100644
>> --- a/drivers/nvdimm/btt.c
>> +++ b/drivers/nvdimm/btt.c
>> @@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
>>
>>                 len -= cur_len;
>>                 meta_nsoff += cur_len;
>> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
>> +               if (ret)
>> +                       return ret;
>>         }
>>
>>         return ret;
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index 0c1c95c..8bf1564 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
>>
>>         if (bio_no_advance_iter(bio))
>>                 iter->bi_size -= bytes;
>> -       else
>> -               bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +       else {
>> +               int err;
>> +               err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>> +               if (unlikely(err))
>> +                       bio->bi_error = err;
>> +       }
>>  }
>>
>>  #define __bio_for_each_segment(bvl, bio, iter, start)                  \
>> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
>> index 89b65b8..c117f1a 100644
>> --- a/include/linux/bvec.h
>> +++ b/include/linux/bvec.h
>> @@ -22,6 +22,7 @@
>>
>>  #include <linux/kernel.h>
>>  #include <linux/bug.h>
>> +#include <linux/errno.h>
>>
>>  /*
>>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
>> @@ -66,12 +67,15 @@ struct bvec_iter {
>>         .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
>>  })
>>
>> -static inline void bvec_iter_advance(const struct bio_vec *bv,
>> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>>                                      struct bvec_iter *iter,
>>                                      unsigned bytes)
>>  {
>> -       WARN_ONCE(bytes > iter->bi_size,
>> -                 "Attempted to advance past end of bvec iter\n");
>> +       if(unlikely(bytes > iter->bi_size)) {
>> +               WARN(1, "Attempted to advance past end of bvec iter\n");
>> +               iter->bi_size = 0;
>> +               return -EINVAL;
>> +       }
>>
>>         while (bytes) {
>>                 unsigned iter_len = bvec_iter_len(bv, *iter);
>> @@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
>>                         iter->bi_idx++;
>>                 }
>>         }
>> +       return 0;
>>  }
>>
>>  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
>> --
>> 2.9.3
>>
>
>
>
> -- 
> Ming Lei

^ permalink raw reply

* Re: [PATCH 7/7] Guard bvec iteration logic v2
From: Ming Lei @ 2017-04-04 15:03 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Linux Kernel Mailing List, linux-block, Martin K. Petersen
In-Reply-To: <1491204212-9952-8-git-send-email-dmonakhov@openvz.org>

On Mon, Apr 3, 2017 at 3:23 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> Currently if some one try to advance bvec beyond it's size we simply
> dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
> This simply means that we endup dereferencing/corrupting random memory
> region.
>
> Sane reaction would be to propagate error back to calling context
> But bvec_iter_advance's calling context is not always good for error
> handling. For safity reason let truncate iterator size to zero which

IMO, we can avoid continuing to iterate by checking the return value,
and looks it is rude to just set iterator size as 0.

> will break external iteration loop which prevent us from unpredictable
> memory range corruption. And even it caller ignores an error, it will
> corrupt it's own bvecs, not others.
>
> This patch does:
> - Return error back to caller with hope that it will react on this
> - Truncate iterator size
>
> Code was added long time ago here 4550dd6c, luckily no one hit it
> in real life :)
>
> changes since V1:
>  - Replace  BUG_ON with error logic.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  drivers/nvdimm/blk.c |  4 +++-
>  drivers/nvdimm/btt.c |  4 +++-
>  include/linux/bio.h  |  8 ++++++--
>  include/linux/bvec.h | 11 ++++++++---
>  4 files changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
> index 1edb3f3..04c3075 100644
> --- a/drivers/nvdimm/blk.c
> +++ b/drivers/nvdimm/blk.c
> @@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
>
>                 len -= cur_len;
>                 dev_offset += cur_len;
> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               if (err)
> +                       return err;
>         }
>
>         return err;
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 03ded8d..3f3aa7b 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
>
>                 len -= cur_len;
>                 meta_nsoff += cur_len;
> -               bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
> +               if (ret)
> +                       return ret;
>         }
>
>         return ret;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 0c1c95c..8bf1564 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
>
>         if (bio_no_advance_iter(bio))
>                 iter->bi_size -= bytes;
> -       else
> -               bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> +       else {
> +               int err;
> +               err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> +               if (unlikely(err))
> +                       bio->bi_error = err;
> +       }
>  }
>
>  #define __bio_for_each_segment(bvl, bio, iter, start)                  \
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 89b65b8..c117f1a 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -22,6 +22,7 @@
>
>  #include <linux/kernel.h>
>  #include <linux/bug.h>
> +#include <linux/errno.h>
>
>  /*
>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
> @@ -66,12 +67,15 @@ struct bvec_iter {
>         .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
>  })
>
> -static inline void bvec_iter_advance(const struct bio_vec *bv,
> +static inline int bvec_iter_advance(const struct bio_vec *bv,
>                                      struct bvec_iter *iter,
>                                      unsigned bytes)
>  {
> -       WARN_ONCE(bytes > iter->bi_size,
> -                 "Attempted to advance past end of bvec iter\n");
> +       if(unlikely(bytes > iter->bi_size)) {
> +               WARN(1, "Attempted to advance past end of bvec iter\n");
> +               iter->bi_size = 0;
> +               return -EINVAL;
> +       }
>
>         while (bytes) {
>                 unsigned iter_len = bvec_iter_len(bv, *iter);
> @@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
>                         iter->bi_idx++;
>                 }
>         }
> +       return 0;
>  }
>
>  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
> --
> 2.9.3
>



-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Ming Lei @ 2017-04-04 14:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <871staffus.fsf@notabene.neil.brown.name>

On Mon, Apr 3, 2017 at 9:18 AM, NeilBrown <neilb@suse.com> wrote:
>
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
>
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
>
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
>
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
>
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>         struct loop_cmd *cmd =
>                 container_of(work, struct loop_cmd, work);
> +       int oldflags = current->flags & PF_LESS_THROTTLE;
>
> +       current->flags |= PF_LESS_THROTTLE;
>         loop_handle_cmd(cmd);
> +       current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
>  }

You can do it against 'lo->worker_task' instead of doing it in each
loop_queue_work(),
and this flag needn't to be restored because the kernel thread is loop
specialized.


thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Christoph Hellwig @ 2017-04-04 13:09 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, linux-rdma, linux-nvme, linux-block, netdev,
	Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <becb84ac-7819-a207-56b1-70f16cb80e42@mellanox.com>

On Tue, Apr 04, 2017 at 10:46:54AM +0300, Max Gurtovoy wrote:
>> +	if (set->nr_hw_queues > dev->num_comp_vectors)
>> +		goto fallback;
>> +
>> +	for (queue = 0; queue < set->nr_hw_queues; queue++) {
>> +		mask = ib_get_vector_affinity(dev, first_vec + queue);
>> +		if (!mask)
>> +			goto fallback;
>
> Christoph,
> we can use fallback also in the blk-mq-pci.c in case pci_irq_get_affinity 
> fails, right ?

For PCI it shouldn't fail as the driver calling pci_irq_get_affinity
knows how it set up the interrupts.  So I don't think it's necessary there.

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:48 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <d84a1dcf-6f60-d089-f81d-85df5a504c19@profitbricks.com>



On 04/04/2017 02:24 PM, Michael Wang wrote:
> On 04/04/2017 12:23 PM, Michael Wang wrote:
> [snip]
>>> add something like
>>>   if (wbio->bi_next)
>>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>>
>>> that might help narrow down what is happening.
>>
>> Just triggered again in 4.4, dmesg like:
>>
>> [  399.240230] md: super_written gets error=-5
>> [  399.240286] md: super_written gets error=-5
>> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
>> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
> 
> Is it possible that the fail fast who changed the 'bi_end_io' inside
> fix_sync_read_error() help the used bio pass the check?

Hi, NeilBrown, below patch fixed the issue in our testing, I'll post a md
RFC patch so we can continue the discussion there.

Regards,
Michael Wang

> 
> I'm not sure but if the read bio was supposed to be reused as write
> for fail fast, maybe we should reset it like this?
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7d67235..0554110 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>                 /* Don't try recovering from here - just fail it
>                  * ... unless it is the last working device of course */
>                 md_error(mddev, rdev);
> -               if (test_bit(Faulty, &rdev->flags))
> +               if (test_bit(Faulty, &rdev->flags)) {
>                         /* Don't try to read from here, but make sure
>                          * put_buf does it's thing
>                          */
>                         bio->bi_end_io = end_sync_write;
> +                       bio->bi_next = NULL;
> +               }
>         }
>  
>         while(sectors) {
> 
> Regards,
> Michael Wang
> 
> 
>> [  399.240363] ------------[ cut here ]------------
>> [  399.240364] kernel BUG at block/blk-core.c:2147!
>> [  399.240365] invalid opcode: 0000 [#1] SMP 
>> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
>> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
>> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
>> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
>> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
>> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
>> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
>> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
>> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
>> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
>> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
>> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
>> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  399.240392] Stack:
>> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
>> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
>> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
>> [  399.240396] Call Trace:
>> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
>> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
>> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
>> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
>> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
>> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
>> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
>> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
>> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
>> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
>> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
>> [  399.240435]  RSP <ffff8800d72b7d10>
>>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> NeilBrown
>>>

^ permalink raw reply

* Re: [PATCH 0/9] convert genericirq.tmpl and kernel-api.tmpl to DocBook
From: Mauro Carvalho Chehab @ 2017-04-04 12:34 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Linux Media Mailing List, Linux Doc Mailing List,
	Mauro Carvalho Chehab, Noam Camus, James Morris, zijun_hu,
	Markus Heiser, linux-clk, Jani Nikula, Andrew Morton, Jens Axboe,
	Nicholas Piggin, Russell King, linux-block, Kirill A. Shutemov,
	Mauro Carvalho Chehab, Joonsoo Kim, Ingo Molnar, Bjorn Helgaas,
	Serge E. Hallyn, Michal Hocko, Ross Zwisler, Chris Wilson,
	linux-mm, linux-security-module, Silvio Fricke, Takashi Iwai,
	Sebastian Andrzej Siewior, Jan Kara, Vlastimil Babka, linux-pci,
	Matt Fleming, Johannes Weiner, Andrey Ryabinin, Andy Lutomirski,
	Mel Gorman, Andy Shevchenko, Alexey Dobriyan, Hillf Danton
In-Reply-To: <20170402143418.3de75239@lwn.net>

Em Sun, 2 Apr 2017 14:34:18 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> On Thu, 30 Mar 2017 17:11:27 -0300
> Mauro Carvalho Chehab <mchehab@s-opensource.com> wrote:
> 
> > This series converts just two documents, adding them to the
> > core-api.rst book. It addresses the errors/warnings that popup
> > after the conversion.
> > 
> > I had to add two fixes to scripts/kernel-doc, in order to solve
> > some of the issues.  
> 
> I've applied the set, including the add-on to move some stuff to
> driver-api - thanks.

Thanks!

> For whatever reason, I had a hard time applying a few of these; "git am"
> would tell me this:
> 
> > Applying: docs-rst: core_api: move driver-specific stuff to drivers_api
> > fatal: sha1 information is lacking or useless (Documentation/driver-api/index.rst).
> > Patch failed at 0001 docs-rst: core_api: move driver-specific stuff to drivers_api
> > The copy of the patch that failed is found in: .git/rebase-apply/patch  
> 
> I was able to get around this, but it took some hand work.  How are you
> generating these?

That's weird. I'm using this to generate the patches:

	git format-patch -o $tmp_dir --stat --summary --patience --signoff --thread=shallow

plus some scripting that runs scripts/get_maintainers.

After that, I run:

	git send-email $tmp_dir

Then, exim sends the patches to a smart SMTP server (currently,
infradead.org, but I'm switching to s-opensource.org, as I'm getting
some troubles because the IP doesn't match the From: line).

Thanks,
Mauro

^ permalink raw reply

* [PATCH] cfq: Disable writeback throttling by default
From: Jan Kara @ 2017-04-04 12:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jan Kara

Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
postgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:

  8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
  8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
  8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
  8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
  8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
  8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
  8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
  8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
  8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
  8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
  8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
...
  8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
  8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
  8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
  8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
  8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
  8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
...
  8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
  8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
  8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
  8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
  8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
  8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
  8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
  8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
  8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
...
  8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
  8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
  8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
  8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
  8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
  8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
  8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1

So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.

Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
a read throughput which roughly matches readers/writers ratio and
overall the system stalls are much shorter.

Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/cfq-iosched.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 440b95ee593c..da69b079725f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3761,16 +3761,14 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
 	uint64_t serial_nr;
-	bool nonroot_cg;
 
 	rcu_read_lock();
 	serial_nr = bio_blkcg(bio)->css.serial_nr;
-	nonroot_cg = bio_blkcg(bio) != &blkcg_root;
 	rcu_read_unlock();
 
 	/*
@@ -3778,7 +3776,7 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	 * spuriously on a newly created cic but there's no harm.
 	 */
 	if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr))
-		return nonroot_cg;
+		return;
 
 	/*
 	 * Drop reference to queues.  New queues will be assigned in new
@@ -3799,12 +3797,10 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	}
 
 	cic->blkcg_serial_nr = serial_nr;
-	return nonroot_cg;
 }
 #else
-static inline bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
-	return false;
 }
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
@@ -4449,12 +4445,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
-	bool disable_wbt;
 
 	spin_lock_irq(q->queue_lock);
 
 	check_ioprio_changed(cic, bio);
-	disable_wbt = check_blkcg_changed(cic, bio);
+	check_blkcg_changed(cic, bio);
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
@@ -4491,9 +4486,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	rq->elv.priv[1] = cfqq->cfqg;
 	spin_unlock_irq(q->queue_lock);
 
-	if (disable_wbt)
-		wbt_disable_default(q);
-
 	return 0;
 }
 
@@ -4706,6 +4698,7 @@ static void cfq_registered_queue(struct request_queue *q)
 	 */
 	if (blk_queue_nonrot(q))
 		cfqd->cfq_slice_idle = 0;
+	wbt_disable_default(q);
 }
 
 /*
-- 
2.10.2

^ permalink raw reply related

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 12:24 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <2a6f8c30-616c-d6fe-1c3f-ab687c145cd7@profitbricks.com>

On 04/04/2017 12:23 PM, Michael Wang wrote:
[snip]
>> add something like
>>   if (wbio->bi_next)
>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>           i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
> 
> Just triggered again in 4.4, dmesg like:
> 
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

Is it possible that the fail fast who changed the 'bi_end_io' inside
fix_sync_read_error() help the used bio pass the check?

I'm not sure but if the read bio was supposed to be reused as write
for fail fast, maybe we should reset it like this?

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7d67235..0554110 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1986,11 +1986,13 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
                /* Don't try recovering from here - just fail it
                 * ... unless it is the last working device of course */
                md_error(mddev, rdev);
-               if (test_bit(Faulty, &rdev->flags))
+               if (test_bit(Faulty, &rdev->flags)) {
                        /* Don't try to read from here, but make sure
                         * put_buf does it's thing
                         */
                        bio->bi_end_io = end_sync_write;
+                       bio->bi_next = NULL;
+               }
        }
 
        while(sectors) {

Regards,
Michael Wang


> [  399.240363] ------------[ cut here ]------------
> [  399.240364] kernel BUG at block/blk-core.c:2147!
> [  399.240365] invalid opcode: 0000 [#1] SMP 
> [  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
> [  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
> [  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
> [  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
> [  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
> [  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
> [  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
> [  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
> [  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
> [  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
> [  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
> [  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
> [  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  399.240392] Stack:
> [  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
> [  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
> [  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
> [  399.240396] Call Trace:
> [  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
> [  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
> [  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
> [  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
> [  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
> [  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
> [  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
> [  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
> [  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
> [  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
> [  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
> [  399.240435]  RSP <ffff8800d72b7d10>
> 
> 
> Regards,
> Michael Wang
> 
>>
>> NeilBrown
>>

^ permalink raw reply related

* Re: [PATCH 2/7] bio-integrity: save original iterator for verify stage
From: Dmitry Monakhov @ 2017-04-04 12:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <20170404070122.GC12008@infradead.org>

Christoph Hellwig <hch@infradead.org> writes:

> This is a pretty big increase in the bio_integrity_payload size,
> but I guess we can't get around it..

Yes, everybody hate this solution, me too, but I've stated with
other approach and it is appeaded to be very ugly.

My idea was that we have two types of iterator incrementors: bio_advance()
and bio_xx_complete, First one is called during split, later is called
on completion ( req_bio_endio() ) . So we can add new field "bi_done" to
iterator which has similar meaning as bi_bvec_done, but at full iterator
scope.  It is incremented during completion, but before end_io.
Chain bios will propogate bi_done to parent bio to parent one.
On ->vefify_fn() iterator will be rewinded (counter part of bvec_advance) to
iter->bi_done bytes, so we will get oritinal iterator. 
I've even prepare a patch for this idea and it looks big and awful.
Even more it does not works if chained bios overlapts (raid1,raid10,
etc).

But... at the time I've wrote this email I've realized that I do not
care about what happen with chained bios. The only thing is important
is parent bio and how far it was advanced. If bi_done is incremented
inside bvec_iter_advance() I can be shure that at the moment
->bi_end_io()
original position can be restored by rewinding back to io_done bytes.

I'll try to implement this.

>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* [PATCH 2/2] scsi: Add template flag 'host_tagset'
From: Hannes Reinecke @ 2017-04-04 12:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi,
	Hannes Reinecke, Hannes Reinecke
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

Add a host template flag 'host_tagset' to enable the use of a
global tagmap for block-mq.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba22866..00036cb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2193,6 +2193,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.cmd_size = cmd_size;
 	shost->tag_set.numa_node = NUMA_NO_NODE;
 	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	if (shost->hostt->host_tagset)
+		shost->tag_set.flags |= BLK_MQ_F_GLOBAL_TAGS;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	shost->tag_set.driver_data = shost;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 3cd8c3b..dff3ec1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -457,6 +457,11 @@ struct scsi_host_template {
 	unsigned no_async_abort:1;
 
 	/*
+	 * True if the host supports a host-wide tagspace
+	 */
+	unsigned host_tagset:1;
+
+	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
 	unsigned int max_host_blocked;
-- 
1.8.5.6

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox