Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: hch @ 2017-03-13 23:16 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org
In-Reply-To: <1489438361.2658.21.camel@sandisk.com>

On Mon, Mar 13, 2017 at 08:52:54PM +0000, Bart Van Assche wrote:
> > -	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> > -	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > +	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> 
> A minor comment: due to this change the outer pair of parentheses
> became superfluous. Please consider removing these.

The last patch in the series removes the statement in this form. But
if I have to respin the series for some reason I'll make sure it
gets removed here already.

^ permalink raw reply

* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 21:46 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <54f833f1-ab98-5d22-e7d4-5e6059a4c467@fb.com>

> Are you saying your code works on top of 4.11-rc2, but not on top of my
> for-linus?

I was actually on Linus 4.11-rc1 before I rebased on top of your
for-linus.

> That seems odd. Looking at the oops, you are crashing with
> !tags in __blk_mq_tag_idle. The below should work around it, but I'm
> puzzled why this is new.

I got it just once (out of a single run :)), but maybe it is
possible that its racy and not really new.

But another example where this can happen:
blk_mq_realloc_hw_ctxs explicitly checks on hctx->tags != NULL
but right after calls blk_mq_exit_hctx() which goes in the
same route, won't this happen there too? Or is it assumed that
hctx->state does not have BLK_MQ_S_TAG_ACTIVE on here?

> Is it related to the other path you fixed in this patch:
>
> commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Mon Mar 13 16:10:11 2017 +0200
>
>     blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
>
> Since that's also handling hctx->tags == NULL.

The above patch prevented a NULL deref earlier when the
tags were reinitialized, now we are all setup and we
happen to remove an old namespace.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9d97bfc4d465..1283f74bfdfb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>  	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>  		return;
>
> -	atomic_dec(&tags->active_queues);
> +	if (tags) {
> +		atomic_dec(&tags->active_queues);
>
> -	blk_mq_tag_wakeup_all(tags, false);
> +		blk_mq_tag_wakeup_all(tags, false);
> +	}
>  }
>
>  /*
>

I'll see if I can test it out later this week. thanks.

^ permalink raw reply

* Re: [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Bart Van Assche @ 2017-03-13 21:02 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-4-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
> new wrapper that takes care of RCU / SRCU locking to avoid having
> boileplate code in the caller which would get duplicated with new callers=
.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>=

^ permalink raw reply

* Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Bart Van Assche @ 2017-03-13 21:01 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-3-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> @@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request=
_queue *q, struct bio *bio)
>  	}
> =20
>  	plug =3D current->plug;
> -	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> +	if (plug && q->nr_hw_queues =3D=3D 1) {
> +		struct request *last =3D NULL;
> +
> +		blk_mq_bio_to_request(rq, bio);
> +
> +		/*
> +		 * @request_count may become stale because of schedule
> +		 * out, so check the list again.
> +		 */

The above comment was relevant as long as there was a request_count assignm=
ent
above blk_mq_sched_get_request(). This patch moves that assignment inside i=
f
(plug && q->nr_hw_queues =3D=3D 1). Does that mean that the above comment s=
hould be
removed entirely?

> +		if (list_empty(&plug->mq_list))
> +			request_count =3D 0;
> +		else if (blk_queue_nomerges(q))
> +			request_count =3D blk_plug_queued_count(q);
> +
> +		if (!request_count)
> +			trace_block_plug(q);
> +		else
> +			last =3D list_entry_rq(plug->mq_list.prev);
> +
> +		blk_mq_put_ctx(data.ctx);
> +
> +		if (request_count >=3D BLK_MAX_REQUEST_COUNT || (last &&
> +		    blk_rq_bytes(last) >=3D BLK_PLUG_FLUSH_SIZE)) {
> +			blk_flush_plug_list(plug, false);
> +			trace_block_plug(q);
> +		}
> +
> +		list_add_tail(&rq->queuelist, &plug->mq_list);
> +		goto done;
> +	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
>  		struct request *old_rq =3D NULL;
> =20
>  		blk_mq_bio_to_request(rq, bio);

Bart.=

^ permalink raw reply

* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Bart Van Assche @ 2017-03-13 20:52 UTC (permalink / raw)
  To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-2-hch@lst.de>

On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> This flag was never used since it was introduced.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c         | 8 +-------
>  include/linux/blk-mq.h | 1 -
>  2 files changed, 1 insertion(+), 8 deletions(-)
>=20
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..acf0ddf4af52 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request=
_queue *q, struct bio *bio)
>  	}
> =20
>  	plug =3D current->plug;
> -	/*
> -	 * If the driver supports defer issued based on 'last', then
> -	 * queue it up like normal since we can potentially save some
> -	 * CPU this way.
> -	 */
> -	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> -	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> +	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {

A minor comment: due to this change the outer pair of parentheses
became superfluous. Please consider removing these.

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH] brd: make rd_size static
From: Bart Van Assche @ 2017-03-13 20:07 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, yanaijie@huawei.com, axboe@kernel.dk
  Cc: zhaohongjiang@huawei.com, miaoxie@huawei.com
In-Reply-To: <4dd2367b-61b9-d390-5bb1-955e90fef0b0@kernel.dk>

On Sat, 2017-03-11 at 15:29 -0700, Jens Axboe wrote:
> On 03/10/2017 12:32 AM, Jason Yan wrote:
> > Fixes the following sparse warning:
> >=20
> > drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
> > Should it be static?
>=20
> If you do a search on this topic, you'll find others that attempted
> to do the same. Arm uses it for tag parsing, for some reason, your
> patch below would break it.
>=20
> It'd be great if this was fixed up for real, though.

How about something like the (untested) patch below?


Subject: [PATCH] arch/arm/kernel/atags_parse.c: Fix rd_size declaration

Ensure that the ARM setup code treats "rd_size" as unsigned long instead of=
 int.

---
 arch/arm/kernel/atags_parse.c | 3 ++-
 drivers/block/brd.c           | 2 ++
 drivers/block/brd.h           | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/brd.h

diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index 68c6ae0b9e4c..f18b6deaf050 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -30,6 +30,7 @@
 #include <asm/mach/arch.h>
=20
 #include "atags.h"
+#include "../../../drivers/block/brd.h"
=20
 static char default_command_line[COMMAND_LINE_SIZE] __initdata =3D CONFIG_=
CMDLINE;
=20
@@ -91,7 +92,7 @@ __tagtable(ATAG_VIDEOTEXT, parse_tag_videotext);
 #ifdef CONFIG_BLK_DEV_RAM
 static int __init parse_tag_ramdisk(const struct tag *tag)
 {
-	extern int rd_size, rd_image_start, rd_prompt, rd_doload;
+	extern int rd_image_start, rd_prompt, rd_doload;
=20
 	rd_image_start =3D tag->u.ramdisk.start;
 	rd_doload =3D (tag->u.ramdisk.flags & 1) =3D=3D 0;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..f1f9f0338fbd 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,8 @@
=20
 #include <linux/uaccess.h>
=20
+#include "brd.h"
+
 #define SECTOR_SHIFT		9
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
diff --git a/drivers/block/brd.h b/drivers/block/brd.h
new file mode 100644
index 000000000000..dbb0f92fefc8
--- /dev/null
+++ b/drivers/block/brd.h
@@ -0,0 +1 @@
+extern unsigned long rd_size;
--=20
2.12.0

^ permalink raw reply related

* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Dan Williams @ 2017-03-13 18:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Thiago Jung Bauermann,
	Tejun Heo, Tahsin Erdogan, Omar Sandoval
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
> that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
> problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
> code where get_gendisk() can return already freed gendisk structure (again
> triggered by Omar's stress test).
>
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.

Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.

^ permalink raw reply

* Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0
From: Abdul Haleem @ 2017-03-13 17:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Brian Foster, linux-xfs, linux-block, mpe, linuxppc-dev,
	linux-kernel
In-Reply-To: <158127ce-d71d-56a5-3dd3-f676b106c65d@kernel.dk>

On Sat, 2017-03-11 at 15:46 -0700, Jens Axboe wrote:
> On 03/09/2017 05:59 AM, Brian Foster wrote:
> > cc linux-block
> > 
> > On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
> >> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
> >>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
> >>>>
> >>>> Issue is not reproducible all the time.
> 
> Is that still the case with -git as of yesterday? Check that you
> have this merge:
> 
> 34bbce9e344b47e8871273409632f525973afad4
> 
> in your tree.
> 

Thanks for pointing out, with the below merge commit warnings disappear.

commit 34bbce9e344b47e8871273409632f525973afad4
Merge: bb61ce5 672a2c8
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Mar 9 15:53:25 2017 -0800

    Merge branch 'for-linus' of git://git.kernel.dk/linux-block


Thanks for the fix !

Reported-by & Tested-by : Abdul Haleem <abdhalee@linux.vnet.ibm.com>

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre

^ permalink raw reply

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-13 16:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, David Rientjes, linux-kernel
In-Reply-To: <c514c076-18ea-8605-d439-e46730aa29e0@kernel.dk>

>> Do you mean, you prefer the approach that was taken in v1 patch or
>> something else?
>
> I can no longer find v1 of the patch, just v2 and on. Can you send a
> link to it?

https://lkml.org/lkml/2017/2/28/8

^ permalink raw reply

* [PATCH 4/4] blk-mq: streamline blk_mq_make_request
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 67 +++++++++++++++++++++++-----------------------------------
 1 file changed, 27 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 48748cb799ed..18e449cc832f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	cookie = request_to_qc_t(data.hctx, rq);
 
+	plug = current->plug;
 	if (unlikely(is_flush_fua)) {
-		if (q->elevator)
-			goto elv_insert;
 		blk_mq_bio_to_request(rq, bio);
-		blk_insert_flush(rq);
-		goto run_queue;
-	}
-
-	plug = current->plug;
-	if (plug && q->nr_hw_queues == 1) {
+		if (q->elevator) {
+			blk_mq_sched_insert_request(rq, false, true,
+						!is_sync || is_flush_fua, true);
+		} else {
+			blk_insert_flush(rq);
+			blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+		}
+	} else if (plug && q->nr_hw_queues == 1) {
 		struct request *last = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
@@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		else
 			last = list_entry_rq(plug->mq_list.prev);
 
-		blk_mq_put_ctx(data.ctx);
-
 		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
 		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
 			blk_flush_plug_list(plug, false);
@@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		}
 
 		list_add_tail(&rq->queuelist, &plug->mq_list);
-		goto done;
-	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
-		struct request *old_rq = NULL;
-
+	} else if (plug && !blk_queue_nomerges(q)) {
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
 		 * We do limited plugging. If the bio can be merged, do that.
 		 * Otherwise the existing request in the plug list will be
 		 * issued. So the plug list will have one request at most
+		 *
+		 * The plug list might get flushed before this. If that happens,
+		 * the plug list is emptry and same_queue_rq is invalid.
 		 */
-		if (plug) {
-			/*
-			 * The plug list might get flushed before this. If that
-			 * happens, same_queue_rq is invalid and plug list is
-			 * empty
-			 */
-			if (same_queue_rq && !list_empty(&plug->mq_list)) {
-				old_rq = same_queue_rq;
-				list_del_init(&old_rq->queuelist);
-			}
-			list_add_tail(&rq->queuelist, &plug->mq_list);
-		} else /* is_sync */
-			old_rq = rq;
-		blk_mq_put_ctx(data.ctx);
-		if (old_rq)
-			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
-		goto done;
-	}
+		if (!list_empty(&plug->mq_list))
+			list_del_init(&same_queue_rq->queuelist);
+		else
+			same_queue_rq = NULL;
 
-	if (q->elevator) {
-elv_insert:
-		blk_mq_put_ctx(data.ctx);
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		if (same_queue_rq)
+			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
+					&cookie);
+	} else if (is_sync) {
+		blk_mq_bio_to_request(rq, bio);
+		blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+	} else if (q->elevator) {
 		blk_mq_bio_to_request(rq, bio);
 		blk_mq_sched_insert_request(rq, false, true,
 						!is_sync || is_flush_fua, true);
-		goto done;
-	}
-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+	} else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
 		/*
 		 * For a SYNC request, send it to the hardware immediately. For
 		 * an ASYNC request, just ensure that we run it later on. The
 		 * latter allows for merging opportunities and more efficient
 		 * dispatching.
 		 */
-run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
+
 	blk_mq_put_ctx(data.ctx);
-done:
 	return cookie;
 }
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53e49a3f6f0a..48748cb799ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
@@ -1478,13 +1478,27 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	blk_mq_sched_insert_request(rq, false, true, true, false);
 }
 
+static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, blk_qc_t *cookie)
+{
+	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+		rcu_read_lock();
+		__blk_mq_try_issue_directly(rq, cookie);
+		rcu_read_unlock();
+	} else {
+		unsigned int srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+		__blk_mq_try_issue_directly(rq, cookie);
+		srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
+	}
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
 	struct blk_mq_alloc_data data = { .flags = 0 };
 	struct request *rq;
-	unsigned int request_count = 0, srcu_idx;
+	unsigned int request_count = 0;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
@@ -1582,18 +1596,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		} else /* is_sync */
 			old_rq = rq;
 		blk_mq_put_ctx(data.ctx);
-		if (!old_rq)
-			goto done;
-
-		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
-			rcu_read_lock();
-			blk_mq_try_issue_directly(old_rq, &cookie);
-			rcu_read_unlock();
-		} else {
-			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
-			blk_mq_try_issue_directly(old_rq, &cookie);
-			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
-		}
+		if (old_rq)
+			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
 		goto done;
 	}
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 164 +++++++++++----------------------------------------------
 1 file changed, 31 insertions(+), 133 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf0ddf4af52..53e49a3f6f0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1478,11 +1478,6 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 	blk_mq_sched_insert_request(rq, false, true, true, false);
 }
 
-/*
- * Multiple hardware queue variant. This will not use per-process plugs,
- * but will attempt to bypass the hctx queueing if we can go straight to
- * hardware for SYNC IO.
- */
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
@@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	plug = current->plug;
-	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
+	if (plug && q->nr_hw_queues == 1) {
+		struct request *last = NULL;
+
+		blk_mq_bio_to_request(rq, bio);
+
+		/*
+		 * @request_count may become stale because of schedule
+		 * out, so check the list again.
+		 */
+		if (list_empty(&plug->mq_list))
+			request_count = 0;
+		else if (blk_queue_nomerges(q))
+			request_count = blk_plug_queued_count(q);
+
+		if (!request_count)
+			trace_block_plug(q);
+		else
+			last = list_entry_rq(plug->mq_list.prev);
+
+		blk_mq_put_ctx(data.ctx);
+
+		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+			blk_flush_plug_list(plug, false);
+			trace_block_plug(q);
+		}
+
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		goto done;
+	} else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
 		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
@@ -1596,119 +1620,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	return cookie;
 }
 
-/*
- * Single hardware queue variant. This will attempt to use any per-process
- * plug for merging and IO deferral.
- */
-static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
-{
-	const int is_sync = op_is_sync(bio->bi_opf);
-	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_plug *plug;
-	unsigned int request_count = 0;
-	struct blk_mq_alloc_data data = { .flags = 0 };
-	struct request *rq;
-	blk_qc_t cookie;
-	unsigned int wb_acct;
-
-	blk_queue_bounce(q, &bio);
-
-	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
-		bio_io_error(bio);
-		return BLK_QC_T_NONE;
-	}
-
-	blk_queue_split(q, &bio, q->bio_split);
-
-	if (!is_flush_fua && !blk_queue_nomerges(q)) {
-		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
-			return BLK_QC_T_NONE;
-	} else
-		request_count = blk_plug_queued_count(q);
-
-	if (blk_mq_sched_bio_merge(q, bio))
-		return BLK_QC_T_NONE;
-
-	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
-
-	trace_block_getrq(q, bio, bio->bi_opf);
-
-	rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
-	if (unlikely(!rq)) {
-		__wbt_done(q->rq_wb, wb_acct);
-		return BLK_QC_T_NONE;
-	}
-
-	wbt_track(&rq->issue_stat, wb_acct);
-
-	cookie = request_to_qc_t(data.hctx, rq);
-
-	if (unlikely(is_flush_fua)) {
-		if (q->elevator)
-			goto elv_insert;
-		blk_mq_bio_to_request(rq, bio);
-		blk_insert_flush(rq);
-		goto run_queue;
-	}
-
-	/*
-	 * A task plug currently exists. Since this is completely lockless,
-	 * utilize that to temporarily store requests until the task is
-	 * either done or scheduled away.
-	 */
-	plug = current->plug;
-	if (plug) {
-		struct request *last = NULL;
-
-		blk_mq_bio_to_request(rq, bio);
-
-		/*
-		 * @request_count may become stale because of schedule
-		 * out, so check the list again.
-		 */
-		if (list_empty(&plug->mq_list))
-			request_count = 0;
-		if (!request_count)
-			trace_block_plug(q);
-		else
-			last = list_entry_rq(plug->mq_list.prev);
-
-		blk_mq_put_ctx(data.ctx);
-
-		if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
-		    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
-			blk_flush_plug_list(plug, false);
-			trace_block_plug(q);
-		}
-
-		list_add_tail(&rq->queuelist, &plug->mq_list);
-		return cookie;
-	}
-
-	if (q->elevator) {
-elv_insert:
-		blk_mq_put_ctx(data.ctx);
-		blk_mq_bio_to_request(rq, bio);
-		blk_mq_sched_insert_request(rq, false, true,
-						!is_sync || is_flush_fua, true);
-		goto done;
-	}
-	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
-		/*
-		 * For a SYNC request, send it to the hardware immediately. For
-		 * an ASYNC request, just ensure that we run it later on. The
-		 * latter allows for merging opportunities and more efficient
-		 * dispatching.
-		 */
-run_queue:
-		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
-	}
-
-	blk_mq_put_ctx(data.ctx);
-done:
-	return cookie;
-}
-
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
@@ -2366,10 +2277,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
 
-	if (q->nr_hw_queues > 1)
-		blk_queue_make_request(q, blk_mq_make_request);
-	else
-		blk_queue_make_request(q, blk_sq_make_request);
+	blk_queue_make_request(q, blk_mq_make_request);
 
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
@@ -2717,16 +2625,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 	set->nr_hw_queues = nr_hw_queues;
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
-
-		/*
-		 * Manually set the make_request_fn as blk_queue_make_request
-		 * resets a lot of the queue settings.
-		 */
-		if (q->nr_hw_queues > 1)
-			q->make_request_fn = blk_mq_make_request;
-		else
-			q->make_request_fn = blk_sq_make_request;
-
 		blk_mq_queue_reinit(q, cpu_online_mask);
 	}
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>

This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 8 +-------
 include/linux/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..acf0ddf4af52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	plug = current->plug;
-	/*
-	 * If the driver supports defer issued based on 'last', then
-	 * queue it up like normal since we can potentially save some
-	 * CPU this way.
-	 */
-	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
-	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+	if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
 		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a9006117..5b3e201c8d4f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,7 +152,6 @@ enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
-	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.11.0

^ permalink raw reply related

* unify and streamline the blk-mq make_request implementations
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

A bunch of cleanups to get us a nice I/O submission path.

^ permalink raw reply

* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Jens Axboe @ 2017-03-13 15:42 UTC (permalink / raw)
  To: Sagi Grimberg, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <a4e0cb7c-c779-b681-e66c-2159c5f2b09f@grimberg.me>

On 03/13/2017 09:24 AM, Sagi Grimberg wrote:
> Hey Jens,
> 
> After some fixes to nvme-rdma in the area of cpu hot unplug and
> rebase to jens for-linus branch I get the following NULL deref [1]
> 
> This crash did not happen before I rebased to for-linus (unless I
> screwed up something).
> 
> I'm on my way out so I just send it out in hope that someone can
> figure it out before I do...
> 
> After I offlined a cpu, I got the nvmf target to disconnect
> from the host, the host then schedules a reconnect. after the
> host reconnects it issues a namespace scanning which removes
> an old namespace. Then we get to blk_cleanup_queue which
> then triggers the NULL deref.
> 
> The strange thing is that we pass the
>   (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...
> 
> [1]
> --
> [   55.865818] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000008
> [   55.867094] IP: __blk_mq_tag_idle+0x19/0x30
> [   55.867825] PGD 0
> 
> [   55.868477] Oops: 0002 [#1] SMP
> [   55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
> mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper 
> cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid 
> ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp 
> libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper 
> syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm 
> floppy pps_core pata_acpi
> [   55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
> [   55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   55.879055] Workqueue: events nvme_scan_work [nvme_core]
> [   55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
> [   55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
> [   55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
> [   55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX: 
> ffffa0b13fc11d00
> [   55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI: 
> ffffa0b13376f400
> [   55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09: 
> ffffffffbee42e20
> [   55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12: 
> ffffa0b1384c6018
> [   55.887343] R13: 0000000000000001 R14: 0000000000000080 R15: 
> 0000000000000000
> [   55.888517] FS:  0000000000000000(0000) GS:ffffa0b13fc00000(0000) 
> knlGS:0000000000000000
> [   55.889816] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4: 
> 00000000003406f0
> [   55.891878] Call Trace:
> [   55.892285]  blk_mq_exit_hctx.isra.41+0xc4/0xd0
> [   55.893020]  blk_mq_free_queue+0x110/0x130
> [   55.893693]  blk_cleanup_queue+0xe0/0x150
> [   55.894346]  nvme_ns_remove+0x78/0xd0 [nvme_core]
> [   55.895109]  nvme_validate_ns+0x8c/0x290 [nvme_core]
> [   55.895911]  ? nvme_scan_work+0x28a/0x370 [nvme_core]
> [   55.896726]  nvme_scan_work+0x2ad/0x370 [nvme_core]
> [   55.897523]  process_one_work+0x16b/0x480
> [   55.898174]  worker_thread+0x4b/0x500
> [   55.898771]  kthread+0x101/0x140
> [   55.899299]  ? process_one_work+0x480/0x480
> [   55.899977]  ? kthread_create_on_node+0x40/0x40
> [   55.900711]  ? start_kernel+0x3bc/0x461
> [   55.901336]  ? acpi_early_init+0x83/0xf9
> [   55.901980]  ? acpi_load_tables+0x31/0x85
> [   55.902632]  ret_from_fork+0x2c/0x40
> [   55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66 
> 90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48 
> 89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
> [   55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
> [   55.907209] CR2: 0000000000000008
> [   55.907750] ---[ end trace f016dee1082237cf ]---

Are you saying your code works on top of 4.11-rc2, but not on top of my
for-linus? That seems odd. Looking at the oops, you are crashing with
!tags in __blk_mq_tag_idle. The below should work around it, but I'm
puzzled why this is new. Is it related to the other path you fixed in
this patch:

commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Mon Mar 13 16:10:11 2017 +0200

    blk-mq: Fix tagset reinit in the presence of cpu hot-unplug

Since that's also handling hctx->tags == NULL.


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..1283f74bfdfb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return;
 
-	atomic_dec(&tags->active_queues);
+	if (tags) {
+		atomic_dec(&tags->active_queues);
 
-	blk_mq_tag_wakeup_all(tags, false);
+		blk_mq_tag_wakeup_all(tags, false);
+	}
 }
 
 /*

-- 
Jens Axboe

^ permalink raw reply related

* NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 15:24 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, Jens Axboe, linux-nvme

Hey Jens,

After some fixes to nvme-rdma in the area of cpu hot unplug and
rebase to jens for-linus branch I get the following NULL deref [1]

This crash did not happen before I rebased to for-linus (unless I
screwed up something).

I'm on my way out so I just send it out in hope that someone can
figure it out before I do...

After I offlined a cpu, I got the nvmf target to disconnect
from the host, the host then schedules a reconnect. after the
host reconnects it issues a namespace scanning which removes
an old namespace. Then we get to blk_cleanup_queue which
then triggers the NULL deref.

The strange thing is that we pass the
  (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...

[1]
--
[   55.865818] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[   55.867094] IP: __blk_mq_tag_idle+0x19/0x30
[   55.867825] PGD 0

[   55.868477] Oops: 0002 [#1] SMP
[   55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper 
cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid 
ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp 
libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm 
floppy pps_core pata_acpi
[   55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
[   55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   55.879055] Workqueue: events nvme_scan_work [nvme_core]
[   55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
[   55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
[   55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
[   55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX: 
ffffa0b13fc11d00
[   55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI: 
ffffa0b13376f400
[   55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09: 
ffffffffbee42e20
[   55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12: 
ffffa0b1384c6018
[   55.887343] R13: 0000000000000001 R14: 0000000000000080 R15: 
0000000000000000
[   55.888517] FS:  0000000000000000(0000) GS:ffffa0b13fc00000(0000) 
knlGS:0000000000000000
[   55.889816] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4: 
00000000003406f0
[   55.891878] Call Trace:
[   55.892285]  blk_mq_exit_hctx.isra.41+0xc4/0xd0
[   55.893020]  blk_mq_free_queue+0x110/0x130
[   55.893693]  blk_cleanup_queue+0xe0/0x150
[   55.894346]  nvme_ns_remove+0x78/0xd0 [nvme_core]
[   55.895109]  nvme_validate_ns+0x8c/0x290 [nvme_core]
[   55.895911]  ? nvme_scan_work+0x28a/0x370 [nvme_core]
[   55.896726]  nvme_scan_work+0x2ad/0x370 [nvme_core]
[   55.897523]  process_one_work+0x16b/0x480
[   55.898174]  worker_thread+0x4b/0x500
[   55.898771]  kthread+0x101/0x140
[   55.899299]  ? process_one_work+0x480/0x480
[   55.899977]  ? kthread_create_on_node+0x40/0x40
[   55.900711]  ? start_kernel+0x3bc/0x461
[   55.901336]  ? acpi_early_init+0x83/0xf9
[   55.901980]  ? acpi_load_tables+0x31/0x85
[   55.902632]  ret_from_fork+0x2c/0x40
[   55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66 
90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48 
89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
[   55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
[   55.907209] CR2: 0000000000000008
[   55.907750] ---[ end trace f016dee1082237cf ]---
--

^ permalink raw reply

* [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-13 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara

Hello,

this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
  on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
  accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2

Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun

Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback

								Honza

^ permalink raw reply

* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 -
 mm/backing-dev.c                 | 22 +---------------------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
-	atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b67be4fc12c4..8c30b1a7aae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
 /*
  * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
  * blkcg->cgwb_list, and memcg->cgwb_list.  bdi->cgwb_tree is also RCU
- * protected.  cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
-	struct backing_dev_info *bdi = wb->bdi;
 
 	wb_shutdown(wb);
 
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
-
-	if (atomic_dec_and_test(&bdi->usage_cnt))
-		wake_up_all(&cgwb_release_wait);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
 		/* we might have raced another instance of this function */
 		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
 		if (!ret) {
-			atomic_inc(&bdi->usage_cnt);
 			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
-	atomic_set(&bdi->usage_cnt, 1);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -728,18 +720,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
-
-	/*
-	 * All cgwb's must be shutdown and released before returning.  Drain
-	 * the usage counter to wait for all cgwb's ever created on @bdi.
-	 */
-	atomic_dec(&bdi->usage_cnt);
-	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
-	/*
-	 * Grab back our reference so that we hold it when @bdi gets
-	 * re-registered.
-	 */
-	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
-- 
2.10.2

^ permalink raw reply related

* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
 
 static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
-	cgwb_bdi_destroy(bdi);
+	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
-- 
2.10.2

^ permalink raw reply related

* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara, Greg Kroah-Hartman
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Make the function available for outside use and fortify it against NULL
kobject.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/kobject.h | 2 ++
 lib/kobject.c           | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
 extern int __must_check kobject_move(struct kobject *, struct kobject *);
 
 extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+						struct kobject *kobj);
 extern void kobject_put(struct kobject *kobj);
 
 extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
 }
 EXPORT_SYMBOL(kobject_get);
 
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
 {
+	if (!kobj)
+		return NULL;
 	if (!kref_get_unless_zero(&kobj->kref))
 		kobj = NULL;
 	return kobj;
 }
+EXPORT_SYMBOL(kobject_get_unless_zero);
 
 /*
  * kobject_cleanup - free kobject resources.
-- 
2.10.2

^ permalink raw reply related

* [PATCH 11/11] block: Fix oops scsi_disk_get()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

When device open races with device shutdown, we can get the following
oops in scsi_disk_get():

[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217]  ? disk_get_part+0x22/0x1f0
[11863.059217]  sd_open+0x39/0x130
[11863.059217]  __blkdev_get+0x69/0x430
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? bd_acquire+0x96/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_get+0x126/0x350
[11863.059217]  ? _raw_spin_unlock+0x2b/0x40
[11863.059217]  ? bd_acquire+0x7f/0xc0
[11863.059217]  ? blkdev_get+0x350/0x350
[11863.059217]  blkdev_open+0x65/0x80
...

As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.

We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
 	owner = disk->fops->owner;
 	if (owner && !try_module_get(owner))
 		return NULL;
-	kobj = kobject_get(&disk_to_dev(disk)->kobj);
+	kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
 	if (kobj == NULL) {
 		module_put(owner);
 		return NULL;
-- 
2.10.2

^ permalink raw reply related

* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c            | 8 ++------
 include/linux/writeback.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ec8750f5332..c66f5dd4a02a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
+	/* Detach inode from wb early as bdi_put() may free bdi->wb */
+	inode_detach_wb(inode);
 	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
 		bdev->bd_bdi = &noop_backing_dev_info;
@@ -1880,12 +1882,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
 	if (inode->i_wb) {
+		WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
 		wb_put(inode->i_wb);
 		inode->i_wb = NULL;
 	}
-- 
2.10.2

^ permalink raw reply related

* [PATCH 05/11] bdi: Unify bdi->wb_list handling for root wb_writeback
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently root wb_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
wb_writeback structures which get added to the list when created and
removed from it before wb_shutdown().

So move list addition of root bdi_writeback to bdi_register() and list
removal of all wb_writeback structures to wb_shutdown(). That way a
wb_writeback structure is on bdi->wb_list if and only if it can handle
writeback and it will make it easier for us to handle shutdown of all
wb_writeback structures in bdi_unregister().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03d4ba27c133..e3d56dba4da8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	return err;
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	}
 	spin_unlock_bh(&wb->work_lock);
 
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
 						release_work);
 	struct backing_dev_info *bdi = wb->bdi;
 
-	spin_lock_irq(&cgwb_lock);
-	list_del_rcu(&wb->bdi_node);
-	spin_unlock_irq(&cgwb_lock);
-
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
 	percpu_ref_kill(&wb->refcnt);
 }
 
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_del_rcu(&wb->bdi_node);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 static int cgwb_create(struct backing_dev_info *bdi,
 		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
 {
@@ -766,6 +772,13 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	spin_lock_irq(&cgwb_lock);
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+	spin_unlock_irq(&cgwb_lock);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -793,6 +806,16 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 	wb_congested_put(bdi->wb_congested);
 }
 
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+}
+
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+	list_del_rcu(&wb->bdi_node);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 int bdi_init(struct backing_dev_info *bdi)
@@ -811,8 +834,6 @@ int bdi_init(struct backing_dev_info *bdi)
 
 	ret = cgwb_bdi_init(bdi);
 
-	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
-
 	return ret;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -848,6 +869,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	if (IS_ERR(dev))
 		return PTR_ERR(dev);
 
+	cgwb_bdi_register(bdi);
 	bdi->dev = dev;
 
 	bdi_debug_register(bdi, dev_name(dev));
-- 
2.10.2

^ permalink raw reply related

* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 mm/backing-dev.c                 | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
+	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
+	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	/*
+	 * Make sure bit gets cleared after shutdown is finished. Matches with
+	 * the barrier provided by test_and_clear_bit() above.
+	 */
+	smp_wmb();
+	clear_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
+	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+
+	while (!list_empty(&bdi->wb_list)) {
+		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+				      bdi_node);
+		spin_unlock_irq(&cgwb_lock);
+		wb_shutdown(wb);
+		spin_lock_irq(&cgwb_lock);
+	}
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-- 
2.10.2

^ permalink raw reply related

* [PATCH 04/11] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox