Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH 8/8] block: use bio_has_data to check if a bio has bvecs
From: Christoph Hellwig @ 2017-04-10 16:08 UTC (permalink / raw)
  To: axboe, martin.petersen, philipp.reisner, lars.ellenberg,
	target-devel
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel
In-Reply-To: <20170410160807.23674-1-hch@lst.de>

Now that Write Same is gone and discard bios never have a payload we
can simply use bio_has_data as an indicator that the bio has bvecs
that need to be handled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bio.c         |  8 +-------
 block/blk-merge.c   |  9 +--------
 include/linux/bio.h | 21 +++++----------------
 3 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b310e7ef3fbf..1c9f04c30ba9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,15 +679,9 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
-	switch (bio_op(bio)) {
-	case REQ_OP_DISCARD:
-	case REQ_OP_SECURE_ERASE:
-	case REQ_OP_WRITE_ZEROES:
-		break;
-	default:
+	if (bio_has_data(bio)) {
 		__bio_for_each_segment(bv, bio_src, iter, iter_src)
 			bio->bi_io_vec[bio->bi_vcnt++] = bv;
-		break;
 	}
 
 	if (bio_integrity(bio_src)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d6c86bfc5722..549d060097f1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -232,16 +232,9 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
 
-	if (!bio)
+	if (!bio || !bio_has_data(bio))
 		return 0;
 
-	switch (bio_op(bio)) {
-	case REQ_OP_DISCARD:
-	case REQ_OP_SECURE_ERASE:
-	case REQ_OP_WRITE_ZEROES:
-		return 0;
-	}
-
 	fbio = bio;
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7a24a1a24967..86bf531f97aa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -178,26 +178,15 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 static inline unsigned __bio_segments(struct bio *bio, struct bvec_iter *bvec)
 {
 	unsigned segs = 0;
-	struct bio_vec bv;
-	struct bvec_iter iter;
 
-	/*
-	 * We special case discard/write same/write zeroes, because they
-	 * interpret bi_size differently:
-	 */
+	if (bio_has_data(bio)) {
+		struct bio_vec bv;
+		struct bvec_iter iter;
 
-	switch (bio_op(bio)) {
-	case REQ_OP_DISCARD:
-	case REQ_OP_SECURE_ERASE:
-	case REQ_OP_WRITE_ZEROES:
-		return 0;
-	default:
-		break;
+		__bio_for_each_segment(bv, bio, iter, *bvec)
+			segs++;
 	}
 
-	__bio_for_each_segment(bv, bio, iter, *bvec)
-		segs++;
-
 	return segs;
 }
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work
From: Christoph Hellwig @ 2017-04-10 16:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, osandov, hch, bart.vanassche
In-Reply-To: <1491839696-24783-2-git-send-email-axboe@fb.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 2/3] block: add kblock_mod_delayed_work_on()
From: Christoph Hellwig @ 2017-04-10 16:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, osandov, hch, bart.vanassche
In-Reply-To: <1491839696-24783-3-git-send-email-axboe@fb.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work
From: Bart Van Assche @ 2017-04-10 16:17 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, axboe@fb.com; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <1491839696-24783-2-git-send-email-axboe@fb.com>

On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
> They serve the exact same purpose. Get rid of the non-delayed
> work variant, and just run it without delay for the normal case.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>=

^ permalink raw reply

* Re: [PATCH 2/3] block: add kblock_mod_delayed_work_on()
From: Bart Van Assche @ 2017-04-10 16:17 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, axboe@fb.com; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <1491839696-24783-3-git-send-email-axboe@fb.com>

On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
> This modifies (or adds, if not currently pending) an existing
> delayed work item.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>=

^ permalink raw reply

* Re: [PATCH 3/3] blk-mq: unify hctx delay_work and run_work
From: Bart Van Assche @ 2017-04-10 16:21 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, axboe@fb.com; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <1491839696-24783-4-git-send-email-axboe@fb.com>

On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
> @@ -1281,27 +1280,39 @@ static void blk_mq_run_work_fn(struct work_struct=
 *work)
>  	struct blk_mq_hw_ctx *hctx;
> =20
>  	hctx =3D container_of(work, struct blk_mq_hw_ctx, run_work.work);
> -	__blk_mq_run_hw_queue(hctx);
> -}
> =20
> -static void blk_mq_delay_work_fn(struct work_struct *work)
> -{
> -	struct blk_mq_hw_ctx *hctx;
> +	/*
> +	 * If we are stopped, don't run the queue. The exception is if
> +	 * BLK_MQ_S_START_ON_RUN is set. For that case, we auto-clear
> +	 * the STOPPED bit and run it.
> +	 */
> +	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
> +		if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
> +			return;
> =20
> -	hctx =3D container_of(work, struct blk_mq_hw_ctx, delay_work.work);
> +		clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
> +		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
> +	}
> =20
> -	if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state))
> -		__blk_mq_run_hw_queue(hctx);
> +	__blk_mq_run_hw_queue(hctx);
>  }
> =20
> +
>  void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>  {
>  	if (unlikely(!blk_mq_hw_queue_mapped(hctx)))
>  		return;
> =20
> +	/*
> +	 * Stop the hw queue, then modify currently delayed work.
> +	 * This should prevent us from running the queue prematurely.
> +	 * Mark the queue as auto-clearing STOPPED when it runs.
> +	 */
>  	blk_mq_stop_hw_queue(hctx);
> -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> -			&hctx->delay_work, msecs_to_jiffies(msecs));
> +	set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
> +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +					&hctx->run_work,
> +					msecs_to_jiffies(msecs));
>  }
>  EXPORT_SYMBOL(blk_mq_delay_queue);

Hello Jens,

Is it possible for a block driver to call blk_mq_delay_queue() while
blk_mq_delay_work_fn() is running? Can that result in BLK_MQ_S_STOPPED
and=A0BLK_MQ_S_START_ON_RUN being checked by=A0blk_mq_delay_work_fn() after
blk_mq_delay_queue() has set BLK_MQ_S_STOPPED and before it has set
BLK_MQ_S_START_ON_RUN?

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH 3/3] blk-mq: unify hctx delay_work and run_work
From: Jens Axboe @ 2017-04-10 16:53 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <1491841316.4199.12.camel@sandisk.com>

On 04/10/2017 10:21 AM, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
>> @@ -1281,27 +1280,39 @@ static void blk_mq_run_work_fn(struct work_struct *work)
>>  	struct blk_mq_hw_ctx *hctx;
>>  
>>  	hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
>> -	__blk_mq_run_hw_queue(hctx);
>> -}
>>  
>> -static void blk_mq_delay_work_fn(struct work_struct *work)
>> -{
>> -	struct blk_mq_hw_ctx *hctx;
>> +	/*
>> +	 * If we are stopped, don't run the queue. The exception is if
>> +	 * BLK_MQ_S_START_ON_RUN is set. For that case, we auto-clear
>> +	 * the STOPPED bit and run it.
>> +	 */
>> +	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
>> +		if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
>> +			return;
>>  
>> -	hctx = container_of(work, struct blk_mq_hw_ctx, delay_work.work);
>> +		clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
>> +		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
>> +	}
>>  
>> -	if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state))
>> -		__blk_mq_run_hw_queue(hctx);
>> +	__blk_mq_run_hw_queue(hctx);
>>  }
>>  
>> +
>>  void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>  {
>>  	if (unlikely(!blk_mq_hw_queue_mapped(hctx)))
>>  		return;
>>  
>> +	/*
>> +	 * Stop the hw queue, then modify currently delayed work.
>> +	 * This should prevent us from running the queue prematurely.
>> +	 * Mark the queue as auto-clearing STOPPED when it runs.
>> +	 */
>>  	blk_mq_stop_hw_queue(hctx);
>> -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>> -			&hctx->delay_work, msecs_to_jiffies(msecs));
>> +	set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
>> +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>> +					&hctx->run_work,
>> +					msecs_to_jiffies(msecs));
>>  }
>>  EXPORT_SYMBOL(blk_mq_delay_queue);
> 
> Hello Jens,
> 
> Is it possible for a block driver to call blk_mq_delay_queue() while
> blk_mq_delay_work_fn() is running? Can that result in BLK_MQ_S_STOPPED
> and BLK_MQ_S_START_ON_RUN being checked by blk_mq_delay_work_fn() after
> blk_mq_delay_queue() has set BLK_MQ_S_STOPPED and before it has set
> BLK_MQ_S_START_ON_RUN?

Yeah, I don't think it's bullet proof. I just looked at the in-kernel
users, and there's just one, nvme/fc.c. And it looks really buggy,
since it manually stops _all_ queues, then delays the one hw queue.
So we'll end up with potentially a bunch of stopped queues, and only
one getting restarted.

I think for blk_mq_delay_queue(), what we really care about is "this
queue has to run again sometime in the future". If that happens
much sooner than asked for, that's OK, the caller will just delay
again if it needs it. For most cases, we'll be close.

Obviously we can't have ordering issues and end up in a bad state,
we have to prevent that.

I'll fiddle with this a bit more.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH V2 00/16] Introduce the BFQ I/O scheduler
From: Bart Van Assche @ 2017-04-10 16:56 UTC (permalink / raw)
  To: tj@kernel.org, paolo.valente@linaro.org, axboe@kernel.dk
  Cc: ulf.hansson@linaro.org, linux-kernel@vger.kernel.org,
	fchecconi@gmail.com, avanzini.arianna@gmail.com,
	linux-block@vger.kernel.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <20170331124743.3530-1-paolo.valente@linaro.org>

On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
> [ ... ]

Hello Paolo,

Is the git tree that is available at https://github.com/Algodev-github/bfq-=
mq
appropriate for testing BFQ? If I merge that tree with v4.11-rc6 and if I r=
un
the srp-test software against that tree as follows:

    ./run_tests -e bfq-mq -t 02-mq

then the following appears on the console:

[ 2748.650352] BUG: unable to handle kernel NULL pointer dereference at 000=
00000000000d0
[ 2748.650442] IP: __bfq_insert_request+0x26/0x650 [bfq_mq_iosched]
[ 2748.650509] PGD 0=A0
[ 2748.650511]=A0
[ 2748.650585] Oops: 0000 [#1] SMP
[ 2748.651107] CPU: 9 PID: 10772 Comm: kworker/9:2H Tainted: G=A0=A0=A0=A0=
=A0=A0=A0=A0=A0=A0I=A0=A0=A0=A0=A04.11.0-rc6-dbg+ #1
[ 2748.651191] Workqueue: kblockd blk_mq_requeue_work
[ 2748.651228] task: ffff88037c808040 task.stack: ffffc90003b4c000
[ 2748.651268] RIP: 0010:__bfq_insert_request+0x26/0x650 [bfq_mq_iosched]
[ 2748.651307] RSP: 0018:ffffc90003b4f9d8 EFLAGS: 00010002
[ 2748.651345] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000000=
00001
[ 2748.651383] RDX: 0000000000000001 RSI: ffff880377f52e80 RDI: ffff880401f=
774e8
[ 2748.651423] RBP: ffffc90003b4fa80 R08: 9093955f00000000 R09: 00000000000=
00001
[ 2748.651464] R10: ffffc90003b4fa00 R11: ffffffffa06d0d53 R12: ffff880401f=
77840
[ 2748.651506] R13: ffff880401f774e8 R14: ffff880378a451e0 R15: 00000000000=
00000
[ 2748.651547] FS:=A0=A00000000000000000(0000) GS:ffff88046f040000(0000) kn=
lGS:0000000000000000
[ 2748.651588] CS:=A0=A00010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2748.651626] CR2: 00000000000000d0 CR3: 0000000001c0f000 CR4: 00000000001=
406e0
[ 2748.651664] Call Trace:
[ 2748.651778]=A0=A0bfq_insert_request+0x83/0x280 [bfq_mq_iosched]
[ 2748.651934]=A0=A0bfq_insert_requests+0x50/0x70 [bfq_mq_iosched]
[ 2748.651975]=A0=A0blk_mq_sched_insert_request+0x11e/0x170
[ 2748.652015]=A0=A0blk_insert_cloned_request+0xb6/0x1f0
[ 2748.652361]=A0=A0map_request+0x13c/0x290 [dm_mod]
[ 2748.652403]=A0=A0dm_mq_queue_rq+0x90/0x160 [dm_mod]
[ 2748.652441]=A0=A0blk_mq_dispatch_rq_list+0x1f2/0x3e0
[ 2748.652479]=A0=A0blk_mq_sched_dispatch_requests+0xf1/0x190
[ 2748.652516]=A0=A0__blk_mq_run_hw_queue+0x12d/0x1c0
[ 2748.652553]=A0=A0__blk_mq_delay_run_hw_queue+0xe3/0xf0
[ 2748.652593]=A0=A0blk_mq_run_hw_queues+0x5c/0x80
[ 2748.652632]=A0=A0blk_mq_requeue_work+0x132/0x150
[ 2748.652671]=A0=A0process_one_work+0x206/0x6a0
[ 2748.652709]=A0=A0worker_thread+0x49/0x4a0
[ 2748.652745]=A0=A0kthread+0x107/0x140
[ 2748.652854]=A0=A0ret_from_fork+0x2e/0x40
[ 2748.652891] Code: ff 0f 1f 40 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 =
48 83 c4 80 8b 87 58 03 00 00 48 8b 9e b0 00 00 00 85 c0 0f 84 8b 04 00 00 =
<48> 8b 83 d0 00 00 00 48 85 c0 0f 84 63 04 00 00
48 83 e8 10 48=A0
[ 2748.653049] RIP: __bfq_insert_request+0x26/0x650 [bfq_mq_iosched] RSP: f=
fffc90003b4f9d8
[ 2748.653090] CR2: 00000000000000d0

The crash address corresponds to the following source code according to gdb=
:

(gdb) list *(__bfq_insert_request+0x26)
0xd6f6 is in __bfq_insert_request (block/bfq-mq-iosched.c:4430).
4425
4426 =A0=A0=A0static void __bfq_insert_request(struct bfq_data *bfqd, struc=
t request *rq)
4427 =A0=A0=A0{
4428 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0struct bfq_queue *bfqq =3D RQ_BFQQ(rq=
), *new_bfqq;
4429
4430 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0assert_spin_locked(&bfqd->lock);
4431
4432 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0bfq_log_bfqq(bfqd, bfqq, "__insert_re=
q: rq %p bfqq %p", rq, bfqq);
4433
4434 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0/*

Bart.=

^ permalink raw reply

* Re: [PATCH 3/3] blk-mq: unify hctx delay_work and run_work
From: Jens Axboe @ 2017-04-10 17:23 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: hch@lst.de, osandov@fb.com
In-Reply-To: <54caf130-6a6c-ca81-1b58-dd413079fcf2@fb.com>

On 04/10/2017 10:53 AM, Jens Axboe wrote:
> On 04/10/2017 10:21 AM, Bart Van Assche wrote:
>> On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
>>> @@ -1281,27 +1280,39 @@ static void blk_mq_run_work_fn(struct work_struct *work)
>>>  	struct blk_mq_hw_ctx *hctx;
>>>  
>>>  	hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
>>> -	__blk_mq_run_hw_queue(hctx);
>>> -}
>>>  
>>> -static void blk_mq_delay_work_fn(struct work_struct *work)
>>> -{
>>> -	struct blk_mq_hw_ctx *hctx;
>>> +	/*
>>> +	 * If we are stopped, don't run the queue. The exception is if
>>> +	 * BLK_MQ_S_START_ON_RUN is set. For that case, we auto-clear
>>> +	 * the STOPPED bit and run it.
>>> +	 */
>>> +	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
>>> +		if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
>>> +			return;
>>>  
>>> -	hctx = container_of(work, struct blk_mq_hw_ctx, delay_work.work);
>>> +		clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
>>> +		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
>>> +	}
>>>  
>>> -	if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state))
>>> -		__blk_mq_run_hw_queue(hctx);
>>> +	__blk_mq_run_hw_queue(hctx);
>>>  }
>>>  
>>> +
>>>  void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>>  {
>>>  	if (unlikely(!blk_mq_hw_queue_mapped(hctx)))
>>>  		return;
>>>  
>>> +	/*
>>> +	 * Stop the hw queue, then modify currently delayed work.
>>> +	 * This should prevent us from running the queue prematurely.
>>> +	 * Mark the queue as auto-clearing STOPPED when it runs.
>>> +	 */
>>>  	blk_mq_stop_hw_queue(hctx);
>>> -	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>>> -			&hctx->delay_work, msecs_to_jiffies(msecs));
>>> +	set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
>>> +	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>>> +					&hctx->run_work,
>>> +					msecs_to_jiffies(msecs));
>>>  }
>>>  EXPORT_SYMBOL(blk_mq_delay_queue);
>>
>> Hello Jens,
>>
>> Is it possible for a block driver to call blk_mq_delay_queue() while
>> blk_mq_delay_work_fn() is running? Can that result in BLK_MQ_S_STOPPED
>> and BLK_MQ_S_START_ON_RUN being checked by blk_mq_delay_work_fn() after
>> blk_mq_delay_queue() has set BLK_MQ_S_STOPPED and before it has set
>> BLK_MQ_S_START_ON_RUN?
> 
> Yeah, I don't think it's bullet proof. I just looked at the in-kernel
> users, and there's just one, nvme/fc.c. And it looks really buggy,
> since it manually stops _all_ queues, then delays the one hw queue.
> So we'll end up with potentially a bunch of stopped queues, and only
> one getting restarted.
> 
> I think for blk_mq_delay_queue(), what we really care about is "this
> queue has to run again sometime in the future". If that happens
> much sooner than asked for, that's OK, the caller will just delay
> again if it needs it. For most cases, we'll be close.
> 
> Obviously we can't have ordering issues and end up in a bad state,
> we have to prevent that.
> 
> I'll fiddle with this a bit more.

Spent a bit of time looking at the workqueue code. Looks like we're
guaranteed that we'll have at least one run of the handler after
calling kblockd_mod_delayed_work_on(). If the handler is currently
running, the we will sucessfully queue a new invocation. That's the
troublesome case. If it's not currently running, we just push the run
sometime into the future. In both cases, we know it'll run _after_
setting BLK_MQ_S_START_ON_RUN, which is the important part.

Hence I think the patch is fine as-is. Let me know if you disagree!

-- 
Jens Axboe

^ permalink raw reply

* [PATCH for-4.4 15/16] blk-mq: Avoid memory reclaim when remapping queues
From: Sumit Semwal @ 2017-04-10 17:44 UTC (permalink / raw)
  To: stable
  Cc: Gabriel Krisman Bertazi, Brian King, Douglas Miller, linux-block,
	linux-scsi, Jens Axboe, Sumit Semwal
In-Reply-To: <1491846272-14882-1-git-send-email-sumit.semwal@linaro.org>

From: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

[ Upstream commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 ]

While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8d63c3..0d1af3e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1470,7 +1470,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&tags->page_list);
 
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
 		blk_mq_free_tags(tags);
@@ -1496,7 +1496,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
 		do {
 			page = alloc_pages_node(set->numa_node,
-				GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
 				this_order);
 			if (page)
 				break;
@@ -1517,7 +1517,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		 * Allow kmemleak to scan these pages as they contain pointers
 		 * to additional allocations like via ops->init_request().
 		 */
-		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
+		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
 		entries_per_page = order_to_size(this_order) / rq_size;
 		to_do = min(entries_per_page, set->queue_depth - i);
 		left -= to_do * rq_size;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3] lightnvm: physical block device (pblk) target
From: Matias Bjørling @ 2017-04-10 17:47 UTC (permalink / raw)
  To: Bart Van Assche, jg@lightnvm.io
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org
In-Reply-To: <1491839741.4199.7.camel@sandisk.com>

On 04/10/2017 05:55 PM, Bart Van Assche wrote:
> On Sun, 2017-04-09 at 11:15 +0200, Javier Gonzï¿½lez wrote:
>> On 8 Apr 2017, at 22.56, Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>>> On 04/07/17 11:50, Javier Gonzï¿½lez wrote:
>> struct ppa_addr, which is the physical address format is not affected,
>> but pblk's internal L2P address representation (pblk_addr) is. You can
>> see that the type either represents struct ppa_addr or ppa_addr_32. How
>> would you define a type that can either be u64 or u32 with different bit
>> offsets at run-time? Note that address conversions to this type is in
>> the fast path and this format allows us to only use bit shifts.
>
> Selecting the appropriate representation at run-time would require to pass
> pblk_addr by reference instead of by value to any function that expects a
> pblk_addr. It would also require to have two versions of every data structure
> that depends on pblk_addr and to use casts to convert to the appropritate
> version. However, this approach is probably only worth to be implemented if
> it doesn't introduce too much additional complexity.
>

Hi Bart, thanks for the feedback!

I've spent the time today to implement it such that 32bit entries are 
used if the device addressing can fit. We do see a slight overhead, 
however not enough to be a deal-breaker. We'll post an updated version 
with the flag removed.

>>>> +#ifdef CONFIG_NVM_DEBUG
>>>> +	atomic_add(nr_entries, &pblk->inflight_writes);
>>>> +	atomic_add(nr_entries, &pblk->req_writes);
>>>> +#endif
>>>
>>> Has it been considered to use the "static key" feature such that
>>> consistency checks can be enabled at run-time instead of having to
>>> rebuild the kernel to enable CONFIG_NVM_DEBUG?
>>
>> I haven't considered it. I'll look into it. I would like to have this
>> counters and the corresponding sysfs entry only available on debug mode
>> since it allows us to have a good picture of the FTL state.
>
> If there are sysfs entries that depend on CONFIG_NVM_DEBUG then the static
> key mechanism is probably not a good alternative for CONFIG_NVM_DEBUG.
>
>>> Has it been considered to add support for keeping a subset of the L2P
>>> translation table in memory instead of keeping it in memory in its entirety?
>>
>> Yes. L2P caching is on our roadmap and will be included in the future.
>
> That's great. This will also help with reducing the time between discovery of
> a lightnvm device and the time at which I/O can start since the L2P table must
> be available before I/O can start.

Definitely. I would love if one could jump in and implement it. We have 
a bunch of features that we want in before implementing a translation cache.
>
> Bart.
>

^ permalink raw reply

* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
From: Steve Wise @ 2017-04-10 18:05 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

On 4/2/2017 8:41 AM, Sagi Grimberg wrote:
> This patch set is aiming to automatically find the optimal
> queue <-> irq multi-queue assignments in storage ULPs (demonstrated
> on nvme-rdma) based on the underlying rdma device irq affinity
> settings.
>
> First two patches modify mlx5 core driver to use generic API
> to allocate array of irq vectors with automatic affinity
> settings instead of open-coding exactly what it does (and
> slightly worse).
>
> Then, in order to obtain an affinity map for a given completion
> vector, we expose a new RDMA core API, and implement it in mlx5.
>
> The third part is addition of a rdma-based queue mapping helper
> to blk-mq that maps the tagset hctx's according to the device
> affinity mappings.
>
> I'd happily convert some more drivers, but I'll need volunteers
> to test as I don't have access to any other devices.

I'll test cxgb4 if you convert it. :)

^ permalink raw reply

* Re: [PATCH 0/2] Export more queue state information through debugfs
From: Jens Axboe @ 2017-04-10 18:28 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block
In-Reply-To: <20170330182127.24288-1-bart.vanassche@sandisk.com>

On 03/30/2017 12:21 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> This is a short patch series with one patch that exports the queue state
> and another patch that shows symbolic names for hctx state and flags
> instead of a numerical bitmask.
> 
> Please consider these patches for kernel v4.12.

Thanks, added for 4.12.


-- 
Jens Axboe

^ permalink raw reply

* [PATCH v4] lightnvm: pblk
From: Javier González @ 2017-04-10 18:36 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Bart.VanAssche, Javier González

This patch introduces pblk, a new target for LightNVM implementing a
full host-based FTL. Details on the commit message.

Changes since v3:
* Apply Bart's feedback [1]
* Implement dynamic L2P optimizations for > 32-bit physical media
* geometry (from Matias Bjørling)
* Fix memory leak on GC (Reported by Simon A. F. Lund)
* 8064 is a perfectly round number of lines :)

[1] https://lkml.org/lkml/2017/4/8/172

Changes since v2:
* Rebase on top of Matias' for-4.12/core
* Implement L2P scan recovery to recover L2P table in case of power
 failure.
* Re-design disk format to be more flexible in future versions (from
  Matias Bjørling)
* Implement per-instance uuid to allow correct recovery without
  forcing line erases (from Matias Bjørling)
* Re-design GC threading to have several GC readers and a single
  writer that places data on the write buffer. This allows to maximize
  the GC write buffer budget without having unnecessary GC writers
  competing for the write buffer lock.
* Simplify sysfs interface.
* Refactoring and several code improvements (together with Matias
 Bjørling)

Changes since v1:
* Rebase on top of Matias' for-4.12/core
* Move from per-LUN block allocation to a line model. This means that a
  whole lines across all LUNs is allocated at a time. Data is still
  stripped in a round-robin fashion at a page granurality.
* Implement new disk format scheme, where metadata is stored per line
  instead of per LUN. This allows for space optimizations.
* Improvements on GC workqueue management and victim selection.
* Implement sysfs interface to query pblk's operation and statistics.
* Implement a user - GC I/O rate-limiter
* Various bug fixes

Javier González (1):
  lightnvm: physical block device (pblk) target

 Documentation/lightnvm/pblk.txt  |   21 +
 drivers/lightnvm/Kconfig         |    9 +
 drivers/lightnvm/Makefile        |    5 +
 drivers/lightnvm/pblk-cache.c    |  114 +++
 drivers/lightnvm/pblk-core.c     | 1658 ++++++++++++++++++++++++++++++++++++++
 drivers/lightnvm/pblk-gc.c       |  555 +++++++++++++
 drivers/lightnvm/pblk-init.c     |  948 ++++++++++++++++++++++
 drivers/lightnvm/pblk-map.c      |  134 +++
 drivers/lightnvm/pblk-rb.c       |  856 ++++++++++++++++++++
 drivers/lightnvm/pblk-read.c     |  529 ++++++++++++
 drivers/lightnvm/pblk-recovery.c | 1003 +++++++++++++++++++++++
 drivers/lightnvm/pblk-rl.c       |  182 +++++
 drivers/lightnvm/pblk-sysfs.c    |  502 ++++++++++++
 drivers/lightnvm/pblk-write.c    |  412 ++++++++++
 drivers/lightnvm/pblk.h          | 1136 ++++++++++++++++++++++++++
 15 files changed, 8064 insertions(+)
 create mode 100644 Documentation/lightnvm/pblk.txt
 create mode 100644 drivers/lightnvm/pblk-cache.c
 create mode 100644 drivers/lightnvm/pblk-core.c
 create mode 100644 drivers/lightnvm/pblk-gc.c
 create mode 100644 drivers/lightnvm/pblk-init.c
 create mode 100644 drivers/lightnvm/pblk-map.c
 create mode 100644 drivers/lightnvm/pblk-rb.c
 create mode 100644 drivers/lightnvm/pblk-read.c
 create mode 100644 drivers/lightnvm/pblk-recovery.c
 create mode 100644 drivers/lightnvm/pblk-rl.c
 create mode 100644 drivers/lightnvm/pblk-sysfs.c
 create mode 100644 drivers/lightnvm/pblk-write.c
 create mode 100644 drivers/lightnvm/pblk.h

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 0/2] Export more queue state information through debugfs
From: Bart Van Assche @ 2017-04-10 18:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org
In-Reply-To: <826af944-2316-8352-118c-91887d747eca@kernel.dk>

On 04/10/2017 11:28 AM, Jens Axboe wrote:
> On 03/30/2017 12:21 PM, Bart Van Assche wrote:
>> This is a short patch series with one patch that exports the queue state
>> and another patch that shows symbolic names for hctx state and flags
>> instead of a numerical bitmask.
>>
>> Please consider these patches for kernel v4.12.
> 
> Thanks, added for 4.12.

Hello Jens,

Thanks! This infrastructure was essential while analyzing queue stalls.

After I had posted this series I improved and extended the blk-mq debugfs
functionality further. Please consider including the patch below in v4.12.

Thanks,

Bart.


From: Bart Van Assche <bart.vanassche@sandisk.com>
Subject: [PATCH] blk-mq: Two fixes for the code that exports the queue state

Remove the array entry for QUEUE_FLAG_RESTART since that flag has
been removed after the blk-mq-debugfs patch that introduced this
array entry was posted.

Avoid that querying the queue state of a dead queue triggers a
kernel crash.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 block/blk-mq-debugfs.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 91d09f58a596..a1ce823578c7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -92,7 +92,6 @@ static const char *const blk_queue_flag_name[] = {
 	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
 	[QUEUE_FLAG_DAX]	 = "DAX",
 	[QUEUE_FLAG_STATS]	 = "STATS",
-	[QUEUE_FLAG_RESTART]	 = "RESTART",
 	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
 	[QUEUE_FLAG_REGISTERED]	 = "REGISTERED",
 };
@@ -112,6 +111,14 @@ static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
 	struct request_queue *q = file_inode(file)->i_private;
 	char op[16] = { }, *s;
 
+	/*
+	 * The debugfs attributes are removed after blk_cleanup_queue() has
+	 * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
+	 * to avoid triggering a use-after-free.
+	 */
+	if (blk_queue_dead(q))
+		return -ENOENT;
+
 	len = min(len, sizeof(op) - 1);
 	if (copy_from_user(op, ubuf, len))
 		return -EFAULT;
-- 
2.12.0

^ permalink raw reply related

* [PATCH 1/3] lightnvm: convert sprintf into snprintf
From: Javier González @ 2017-04-10 18:51 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González

Convert sprintf calls to snprintf in order to make possible buffer
overflow more obvious.

Signed-off-by: Javier González <javier@cnexlabs.com>
---
 drivers/lightnvm/core.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c3340ef..bdbb333 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -272,7 +272,8 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 		goto err_disk;
 	blk_queue_make_request(tqueue, tt->make_rq);
 
-	sprintf(tdisk->disk_name, "%s", create->tgtname);
+	snprintf(tdisk->disk_name, sizeof(tdisk->disk_name), "%s",
+							create->tgtname);
 	tdisk->flags = GENHD_FL_EXT_DEVT;
 	tdisk->major = 0;
 	tdisk->first_minor = 0;
@@ -1195,13 +1196,13 @@ static long nvm_ioctl_get_devices(struct file *file, void __user *arg)
 	list_for_each_entry(dev, &nvm_devices, devices) {
 		struct nvm_ioctl_device_info *info = &devices->info[i];
 
-		sprintf(info->devname, "%s", dev->name);
+		snprintf(info->devname, sizeof(info->devname), "%s", dev->name);
 
 		/* kept for compatibility */
 		info->bmversion[0] = 1;
 		info->bmversion[1] = 0;
 		info->bmversion[2] = 0;
-		sprintf(info->bmname, "%s", "gennvm");
+		snprintf(info->bmname, sizeof(info->bmname), "%s", "gennvm");
 		i++;
 
 		if (i > 31) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/3] lightnvm: make nvm_free static
From: Javier González @ 2017-04-10 18:51 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González
In-Reply-To: <1491850297-18235-1-git-send-email-javier@cnexlabs.com>

Prefix the nvm_free static function with a missing static keyword.

Signed-off-by: Javier González <javier@cnexlabs.com>
---
 drivers/lightnvm/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index bdbb333..3e51a05 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -999,7 +999,7 @@ static int nvm_core_init(struct nvm_dev *dev)
 	return ret;
 }
 
-void nvm_free(struct nvm_dev *dev)
+static void nvm_free(struct nvm_dev *dev)
 {
 	if (!dev)
 		return;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH 1/3] lightnvm: convert sprintf into snprintf
From: Bart Van Assche @ 2017-04-10 18:56 UTC (permalink / raw)
  To: mb@lightnvm.io, jg@lightnvm.io
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	javier@cnexlabs.com
In-Reply-To: <1491850297-18235-1-git-send-email-javier@cnexlabs.com>

On Mon, 2017-04-10 at 20:51 +0200, Javier Gonz=E1lez wrote:
> Convert sprintf calls to snprintf in order to make possible buffer
> overflow more obvious.
>=20
> Signed-off-by: Javier Gonz=E1lez <javier@cnexlabs.com>
> ---
>  drivers/lightnvm/core.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>=20
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index c3340ef..bdbb333 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -272,7 +272,8 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct=
 nvm_ioctl_create *create)
>  		goto err_disk;
>  	blk_queue_make_request(tqueue, tt->make_rq);
> =20
> -	sprintf(tdisk->disk_name, "%s", create->tgtname);
> +	snprintf(tdisk->disk_name, sizeof(tdisk->disk_name), "%s",
> +							create->tgtname);
>  	tdisk->flags =3D GENHD_FL_EXT_DEVT;
>  	tdisk->major =3D 0;
>  	tdisk->first_minor =3D 0;
> @@ -1195,13 +1196,13 @@ static long nvm_ioctl_get_devices(struct file *fi=
le, void __user *arg)
>  	list_for_each_entry(dev, &nvm_devices, devices) {
>  		struct nvm_ioctl_device_info *info =3D &devices->info[i];
> =20
> -		sprintf(info->devname, "%s", dev->name);
> +		snprintf(info->devname, sizeof(info->devname), "%s", dev->name);
> =20
>  		/* kept for compatibility */
>  		info->bmversion[0] =3D 1;
>  		info->bmversion[1] =3D 0;
>  		info->bmversion[2] =3D 0;
> -		sprintf(info->bmname, "%s", "gennvm");
> +		snprintf(info->bmname, sizeof(info->bmname), "%s", "gennvm");
>  		i++;
> =20
>  		if (i > 31) {

Hello Javier,

Although the above changes look fine to me, have you considered to use strl=
cpy()
instead of snprintf() for these string copy operations?

Bart.=

^ permalink raw reply

* Re: [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
From: Anthony Youngman @ 2017-04-10 19:40 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170405142205.6477-4-hch@lst.de>

s/past/paste/

On 05/04/17 15:21, Christoph Hellwig wrote:
> Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
> that limit the write zeroes size.

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 0/2] Export more queue state information through debugfs
From: Omar Sandoval @ 2017-04-10 20:00 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <51f5cd27-4ae4-0a21-63e2-c7a2ec95e257@sandisk.com>

On Mon, Apr 10, 2017 at 11:40:49AM -0700, Bart Van Assche wrote:
> On 04/10/2017 11:28 AM, Jens Axboe wrote:
> > On 03/30/2017 12:21 PM, Bart Van Assche wrote:
> >> This is a short patch series with one patch that exports the queue state
> >> and another patch that shows symbolic names for hctx state and flags
> >> instead of a numerical bitmask.
> >>
> >> Please consider these patches for kernel v4.12.
> > 
> > Thanks, added for 4.12.
> 
> Hello Jens,
> 
> Thanks! This infrastructure was essential while analyzing queue stalls.
> 
> After I had posted this series I improved and extended the blk-mq debugfs
> functionality further. Please consider including the patch below in v4.12.
> 
> Thanks,
> 
> Bart.
> 
> 
> From: Bart Van Assche <bart.vanassche@sandisk.com>
> Subject: [PATCH] blk-mq: Two fixes for the code that exports the queue state
> 
> Remove the array entry for QUEUE_FLAG_RESTART since that flag has
> been removed after the blk-mq-debugfs patch that introduced this
> array entry was posted.
> 
> Avoid that querying the queue state of a dead queue triggers a
> kernel crash.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> ---
>  block/blk-mq-debugfs.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 91d09f58a596..a1ce823578c7 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -92,7 +92,6 @@ static const char *const blk_queue_flag_name[] = {
>  	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
>  	[QUEUE_FLAG_DAX]	 = "DAX",
>  	[QUEUE_FLAG_STATS]	 = "STATS",
> -	[QUEUE_FLAG_RESTART]	 = "RESTART",
>  	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
>  	[QUEUE_FLAG_REGISTERED]	 = "REGISTERED",
>  };
> @@ -112,6 +111,14 @@ static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
>  	struct request_queue *q = file_inode(file)->i_private;
>  	char op[16] = { }, *s;
>  
> +	/*
> +	 * The debugfs attributes are removed after blk_cleanup_queue() has
> +	 * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
> +	 * to avoid triggering a use-after-free.
> +	 */
> +	if (blk_queue_dead(q))
> +		return -ENOENT;
> +

Can you just move blk_queue_flags_fops to blk_mq_debugfs_queue_attrs
instead of adding blk_queue_attrs?

^ permalink raw reply

* Re: [PATCH 0/2] Export more queue state information through debugfs
From: Bart Van Assche @ 2017-04-10 20:12 UTC (permalink / raw)
  To: osandov@osandov.com; +Cc: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <20170410200045.GA18846@vader.DHCP.thefacebook.com>

On Mon, 2017-04-10 at 13:00 -0700, Omar Sandoval wrote:
> Can you just move blk_queue_flags_fops to blk_mq_debugfs_queue_attrs
> instead of adding blk_queue_attrs?

Hello Omar,

Are you aware that that change will move the "state" attribute one level
down in the hierarchy? blk_mq_debugfs_queue_attrs attributes are created
under "mq" while blk_queue_flags_fops attributes are created at the same
level as the "mq" attribute. I had added blk_queue_flags_fops because the
"state" attribute is not related to blk-mq. That attribute is also relevant
for single-queue block layer queues.

Bart.

^ permalink raw reply

* Re: [PATCH 0/2] Export more queue state information through debugfs
From: Omar Sandoval @ 2017-04-10 20:21 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <1491855119.4199.21.camel@sandisk.com>

On Mon, Apr 10, 2017 at 08:12:00PM +0000, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 13:00 -0700, Omar Sandoval wrote:
> > Can you just move blk_queue_flags_fops to blk_mq_debugfs_queue_attrs
> > instead of adding blk_queue_attrs?
> 
> Hello Omar,
> 
> Are you aware that that change will move the "state" attribute one level
> down in the hierarchy? blk_mq_debugfs_queue_attrs attributes are created
> under "mq" while blk_queue_flags_fops attributes are created at the same
> level as the "mq" attribute. I had added blk_queue_flags_fops because the
> "state" attribute is not related to blk-mq. That attribute is also relevant
> for single-queue block layer queues.

Yes, I am aware of that. We don't set up debugfs for single-queue queues
anyways, so the top-level directory is really just for blktrace. It
simplifies the lifetime and cleanup to have everything under mq, so
please move it there.

^ permalink raw reply

* Re: [PATCH v4] lightnvm: pblk
From: Bart Van Assche @ 2017-04-10 20:35 UTC (permalink / raw)
  To: Javier González, mb@lightnvm.io
  Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Javier González
In-Reply-To: <1491849399-12798-1-git-send-email-javier@cnexlabs.com>

On 04/10/2017 11:36 AM, Javier González wrote:
> Changes since v3:
> * Apply Bart's feedback [1]

Thanks for having addressed these comments. But please also make sure
that the pblk driver builds cleanly with W=1 C=2. When running "make
M=drivers/lightnvm W=1 C=2" several warnings are reported that should be
reviewed. At least the endianness warnings should be addressed. An example:

  CHECK   drivers/lightnvm/pblk-gc.c
drivers/lightnvm/pblk-gc.c:254:18: warning: incorrect type in assignment
(different base types)
drivers/lightnvm/pblk-gc.c:254:18:    expected unsigned long long
[usertype] *lba_list
drivers/lightnvm/pblk-gc.c:254:18:    got restricted __le64 [usertype] *

Please also review the warnings reported by smatch (make
M=drivers/lightnvm C=2 CHECK="smatch -p=kernel"). A few examples that
most likely indicate bugs:

  CHECK   drivers/lightnvm/pblk-init.c
drivers/lightnvm/pblk-init.c:915: pblk_init() error: passing non
negative 1 to ERR_PTR
drivers/lightnvm/pblk-rb.c:782: pblk_rb_tear_down_check() error: we
previously assumed 'rb->entries' could be null (see line 778)  CHECK
drivers/lightnvm/pblk-read.c
drivers/lightnvm/pblk-read.c:486: pblk_submit_read_gc() error: 'bio'
dereferencing possible ERR_PTR()

Thanks,

Bart.

^ permalink raw reply

* for-next branch and mq-deadline
From: Bart Van Assche @ 2017-04-10 23:06 UTC (permalink / raw)
  To: osandov@osandov.com, axboe@fb.com; +Cc: linux-block@vger.kernel.org

Hello Jens and Omar,

A few minutes ago I ran into the following issue with a kernel based on the
block for-next branch (commit 602c7aa1f2fa), mq-deadline scheduler as defau=
lt
scheduler and running the srp-test software with scheduler none. This test
involves switching from the mq-deadline scheduler to "none" before every te=
st
starts. From the logs it looks like this crash was triggered while removing
block devices. The last output produced by the test script was:

Running test /home/bart/software/infiniband/srp-test/tests/03 ...
Test large transfer sizes with cmd_sg_entries=3D255
Using /dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b31000000000 -> ..=
/../dm-1
SRP LUN /sys/class/scsi_device/10:0:0:0 / sdc: removing /dev/dm-1:=A0[ test=
 script hangs ]

>From the netconsole output:

[ 2325.133052] general protection fault: 0000 [#1] SMP
[ 2325.133100] Modules linked in: ib_srp scsi_transport_srp dm_service_time=
 target_core_user uio target_core_pscsi target_core_file ib_srpt target_cor=
e_iblock target_core_mod brd netconsole
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_na=
t nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_connt=
rack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp
tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptab=
le_filter ip_tables x_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad af_=
packet =A0
rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core msr sb_edac edac_core x86_pkg_=
temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass cr=
ct10dif_pclmul crc32_pclmul crc32c_intel mlx4_core
ghash_clmulni_intel tg3 pcbc ptp pps_core aesni_intel devlink=A0iTCO_wdt li=
bphy iTCO_vendor_support aes_x86_64 ioatdma crypto_simd ipmi_si lpc_ich mei=
_me dcdbas glue_helper mei shpchp dca mfd_core
cryptd ipmi_devintf pcspkr ipmi_msghandler tpm_tis tpm_tis_core tpm wmi but=
ton acpi_pad hid_generic usbhid ehci_pci ehci_hcd mgag200 i2c_algo_bit drm_=
kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm sr_mod cdrom xhci_pci xhci_hcd usbcore usb_common sg dm=
_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4 [last unloa=
ded: scsi_transport_srp]
[ 2325.133403] CPU: 6 PID: 163 Comm: kworker/6:1H Tainted: G =A0=A0=A0=A0=
=A0=A0=A0=A0=A0I =A0=A0=A0=A04.11.0-rc6-dbg+ #1
[ 2325.133480] Workqueue: kblockd blk_mq_run_work_fn
[ 2325.133509] task: ffff88017b183140 task.stack: ffffc90003034000
[ 2325.133547] RIP: 0010:__lock_acquire+0xfe/0x1280
[ 2325.133577] RSP: 0018:ffffc90003037c10 EFLAGS: 00010002
[ 2325.133613] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000001 RCX: 00000000000=
00000
[ 2325.133648] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000=
00000
[ 2325.133682] RBP: ffffc90003037cc0 R08: 0000000000000001 R09: 00000000000=
00000
[ 2325.133718] R10: 0000000000000000 R11: ffffffff8134d909 R12: ffff8803726=
fbc50
[ 2325.133753] R13: ffff88017b183140 R14: 0000000000000001 R15: 00000000000=
00000
[ 2325.133936] FS: =A00000000000000000(0000) GS:ffff88046ef80000(0000) knlG=
S:0000000000000000
[ 2325.133974] CS: =A00010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2325.134013] CR2: 00007f6d4214cf90 CR3: 000000032341c000 CR4: 00000000001=
406e0
[ 2325.134047] Call Trace:
[ 2325.134086] =A0lock_acquire+0xd5/0x1c0
[ 2325.134197] =A0_raw_spin_lock+0x2a/0x40
[ 2325.134261] =A0dd_dispatch_request+0x29/0x1e0
[ 2325.134292] =A0blk_mq_sched_dispatch_requests+0x139/0x190
[ 2325.134328] =A0__blk_mq_run_hw_queue+0x12d/0x1c0
[ 2325.134360] =A0blk_mq_run_work_fn+0xd/0x10
[ 2325.134392] =A0process_one_work+0x206/0x6a0
[ 2325.134429] =A0worker_thread+0x49/0x4a0
[ 2325.134465] =A0kthread+0x107/0x140
[ 2325.134579] =A0ret_from_fork+0x2e/0x40
[ 2325.134616] Code: 8d 62 f8 c3 49 81 3c 24 80 f4 e5 81 41 be 00 00 00 00 =
45 0f 45 f0 83 fe 01 77 89 89 f0 49 8b 44 c4 08 48 85 c0 0f 84 79 ff ff ff =
<f0> ff 80 98 01 00 00 8b 15 ed e3 9e 01 45 8b 85
d8 07 00 00 85 =A0
[ 2325.134723] RIP: __lock_acquire+0xfe/0x1280 RSP: ffffc90003037c10
[ 2325.142207] ---[ end trace 0a93cb11fc868acd ]---
[ 2325.185765] BUG: sleeping function called from invalid context at ./incl=
ude/linux/percpu-rwsem.h:33
[ 2325.185805] in_atomic(): 1, irqs_disabled(): 1, pid: 163, name: kworker/=
6:1H
[ 2325.185837] INFO: lockdep is turned off.
[ 2325.186078] irq event stamp: 1927795
[ 2325.186111] hardirqs last =A0enabled at (1927795): [<ffffffff81680347>] =
_raw_spin_unlock_irq+0x27/0x40
[ 2325.186141] hardirqs last disabled at (1927794): [<ffffffff8168010e>] _r=
aw_spin_lock_irq+0xe/0x40
[ 2325.186177] softirqs last =A0enabled at (1927064): [<ffffffff8106dd2c>] =
__do_softirq+0x37c/0x4c0
[ 2325.186207] softirqs last disabled at (1927011): [<ffffffff8106dffe>] ir=
q_exit+0xbe/0xd0

(gdb) list *(dd_dispatch_request+0x29)
0xffffffff8134d909 is in dd_dispatch_request (block/mq-deadline.c:199).
194 =A0=A0=A0=A0=A0* deadline_dispatch_requests selects the best request ac=
cording to
195 =A0=A0=A0=A0=A0* read/write expire, fifo_batch, etc
196 =A0=A0=A0=A0=A0*/
197 =A0=A0=A0=A0static struct request *__dd_dispatch_request(struct blk_mq_=
hw_ctx *hctx)
198 =A0=A0=A0=A0{
199 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0struct deadline_data *dd =3D hctx->=
queue->elevator->elevator_data;
200 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0struct request *rq;
201 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0bool reads, writes;
202 =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0int data_dir;
203

Bart.=

^ permalink raw reply

* Re: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart
From: Bart Van Assche @ 2017-04-10 23:47 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	kys@microsoft.com, longli@microsoft.com, axboe@kernel.dk
  Cc: sthemmin@microsoft.com
In-Reply-To: <DM5PR03MB2490E70D9B18A701A81005C8A00D0@DM5PR03MB2490.namprd03.prod.outlook.com>

On Thu, 2017-04-06 at 04:21 +0000, KY Srinivasan wrote:
> > -----Original Message-----
> > From: Bart Van Assche [mailto:Bart.VanAssche@sandisk.com]
> > Sent: Wednesday, April 5, 2017 8:46 PM
> > To: linux-kernel@vger.kernel.org; linux-block@vger.kernel.org; Long Li
> > <longli@microsoft.com>; axboe@kernel.dk
> > Cc: Stephen Hemminger <sthemmin@microsoft.com>; KY Srinivasan
> > <kys@microsoft.com>
> > Subject: Re: [PATCH] block-mq: set both block queue and hardware queue
> > restart bit for restart
> >=20
> > On Thu, 2017-04-06 at 03:38 +0000, Long Li wrote:
> > > > -----Original Message-----
> > > > From: Bart Van Assche [mailto:Bart.VanAssche@sandisk.com]
> > > >=20
> > > > Please drop this patch. I'm working on a better solution.
> > >=20
> > > Thank you. Looking forward to your patch.
> >=20
> > Hello Long,
> >=20
> > It would help if you could share the name of the block or SCSI driver w=
ith
> > which you ran into that lockup and also if you could share the name of =
the
> > I/O scheduler used in your test.
>=20
> The tests that indicated the issue were run Hyper-V. The driver is storvs=
c_drv.c
> The I/O scheduler was I think noop.

Hello Long and K.Y.,

Thank you for the feedback. Can you repeat your test with kernel v4.11-rc6?=
 The
patches that went into the block layer for v4.11-rc6 should be sufficient t=
o fix
this:

$ PAGER=3D git log --format=3Dshort v4.11-rc5..v4.11-rc6 block include/linu=
x/blk* =A0
commit 6d8c6c0f97ad8a3517c42b179c1dc8e77397d0e2
Author: Bart Van Assche <bart.vanassche@sandisk.com>

=A0=A0=A0=A0blk-mq: Restart a single queue if tag sets are shared

commit 7587a5ae7eef0439f7be31f1b5959af062bbc5ec
Author: Bart Van Assche <bart.vanassche@sandisk.com>

=A0=A0=A0=A0blk-mq: Introduce blk_mq_delay_run_hw_queue()

commit ebe8bddb6e30d7a02775b9972099271fc5910f37
Author: Omar Sandoval <osandov@fb.com>

=A0=A0=A0=A0blk-mq: remap queues when adding/removing hardware queues

commit 54d5329d425650fafaf90660a139c771d2d49cae
Author: Omar Sandoval <osandov@fb.com>

=A0=A0=A0=A0blk-mq-sched: fix crash in switch error path

commit 93252632e828da3e90241a1c0e766556abf71598
Author: Omar Sandoval <osandov@fb.com>

=A0=A0=A0=A0blk-mq-sched: set up scheduler tags when bringing up new queues

commit 6917ff0b5bd4139e08a3f3146529dcb3b95ba7a6
Author: Omar Sandoval <osandov@fb.com>

=A0=A0=A0=A0blk-mq-sched: refactor scheduler initialization

commit 81380ca10778b99dce98940cfc993214712df335
Author: Omar Sandoval <osandov@fb.com>

=A0=A0=A0=A0blk-mq: use the right hctx when getting a driver tag fails

commit ac77a0c463c1d7d659861f7b6d1261970dd3282a
Author: Minchan Kim <minchan@kernel.org>

=A0=A0=A0=A0block: do not put mq context in blk_mq_alloc_request_hctx

Bart.=

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox