* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: hch @ 2017-03-13 23:16 UTC (permalink / raw)
To: Bart Van Assche; +Cc: hch@lst.de, axboe@kernel.dk, linux-block@vger.kernel.org
In-Reply-To: <1489438361.2658.21.camel@sandisk.com>
On Mon, Mar 13, 2017 at 08:52:54PM +0000, Bart Van Assche wrote:
> > - if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> > - !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > + if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
>
> A minor comment: due to this change the outer pair of parentheses
> became superfluous. Please consider removing these.
The last patch in the series removes the statement in this form. But
if I have to respin the series for some reason I'll make sure it
gets removed here already.
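As an illustration of the review comment in this exchange: once the BLK_MQ_F_DEFER_ISSUE term is dropped, the outermost parentheses enclose the whole condition and no longer group anything. The sketch below is a userspace model only; plain booleans stand in for the kernel expressions, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as it reads after patch 1/4, with the redundant outer pair. */
static bool cond_with_outer_parens(bool plug, bool nomerges, bool is_sync)
{
	return (((plug && !nomerges) || is_sync));
}

/* The same condition with the outer pair removed. */
static bool cond_without_outer_parens(bool plug, bool nomerges, bool is_sync)
{
	return (plug && !nomerges) || is_sync;
}

/* Compare both forms over all eight input combinations. */
static void check_forms_agree(void)
{
	for (int i = 0; i < 8; i++) {
		bool plug = i & 1, nomerges = i & 2, is_sync = i & 4;

		assert(cond_with_outer_parens(plug, nomerges, is_sync) ==
		       cond_without_outer_parens(plug, nomerges, is_sync));
	}
}
```

Calling check_forms_agree() from a test harness passes silently, which is all the review comment claims: the change is cosmetic.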
* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 21:46 UTC (permalink / raw)
To: Jens Axboe, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <54f833f1-ab98-5d22-e7d4-5e6059a4c467@fb.com>
> Are you saying your code works on top of 4.11-rc2, but not on top of my
> for-linus?
I was actually on Linus 4.11-rc1 before I rebased on top of your
for-linus.
> That seems odd. Looking at the oops, you are crashing with
> !tags in __blk_mq_tag_idle. The below should work around it, but I'm
> puzzled why this is new.
I got it just once (out of a single run :)), but maybe it is
possible that it's racy and not really new.
But another example where this can happen:
blk_mq_realloc_hw_ctxs explicitly checks on hctx->tags != NULL
but right after calls blk_mq_exit_hctx() which goes in the
same route, won't this happen there too? Or is it assumed that
hctx->state does not have BLK_MQ_S_TAG_ACTIVE on here?
> Is it related to the other path you fixed in this patch:
>
> commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date: Mon Mar 13 16:10:11 2017 +0200
>
> blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
>
> Since that's also handling hctx->tags == NULL.
The above patch prevented a NULL deref earlier when the
tags were reinitialized, now we are all setup and we
happen to remove an old namespace.
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9d97bfc4d465..1283f74bfdfb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> return;
>
> - atomic_dec(&tags->active_queues);
> + if (tags) {
> + atomic_dec(&tags->active_queues);
>
> - blk_mq_tag_wakeup_all(tags, false);
> + blk_mq_tag_wakeup_all(tags, false);
> + }
> }
>
> /*
>
I'll see if I can test it out later this week. thanks.
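For readers following the quoted workaround, here is a rough userspace model of the guarded idle path. It assumes C11 atomics as a stand-in for the kernel's test_and_clear_bit() and atomic_dec(); the struct and function names are hypothetical and the wakeup step is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_TAG_ACTIVE (1u << 0)	/* stands in for BLK_MQ_S_TAG_ACTIVE */

struct model_tags {
	atomic_int active_queues;
};

struct model_hctx {
	atomic_uint state;
	struct model_tags *tags;	/* may be NULL, as in the reported oops */
};

/* Returns true if this call dropped an active-queue reference. */
static bool model_tag_idle(struct model_hctx *hctx)
{
	struct model_tags *tags;

	assert(hctx != NULL);
	tags = hctx->tags;

	/* Like test_and_clear_bit(): only the caller that clears the bit goes on. */
	if (!(atomic_fetch_and(&hctx->state, ~MODEL_TAG_ACTIVE) & MODEL_TAG_ACTIVE))
		return false;

	/* The guard from the workaround: skip the accounting when tags is NULL. */
	if (!tags)
		return false;

	atomic_fetch_sub(&tags->active_queues, 1);
	return true;
}
```

Note that in this model the active bit is cleared even when tags is NULL, which matches the quoted diff: the NULL check sits after the test-and-clear, so it hides the crash without explaining how the bit came to be set on an hctx without tags.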
* Re: [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Bart Van Assche @ 2017-03-13 21:02 UTC (permalink / raw)
To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-4-hch@lst.de>
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
> new wrapper that takes care of RCU / SRCU locking to avoid having
> boilerplate code in the caller which would get duplicated with new callers.
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
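The wrapper pattern described in the quoted commit message can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel code: counters stand in for rcu_read_lock() and srcu_read_lock(), and every name prefixed with model_ or MODEL_ is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_F_BLOCKING (1u << 5)	/* mirrors BLK_MQ_F_BLOCKING = 1 << 5 */

struct model_hctx {
	unsigned int flags;
	int rcu_sections;	/* counts non-sleeping read-side sections */
	int srcu_sections;	/* counts sleepable read-side sections */
	int issued;		/* counts requests handed to the driver */
};

/* Stand-in for __blk_mq_try_issue_directly(): the issue step itself. */
static void model_issue_unlocked(struct model_hctx *hctx)
{
	hctx->issued++;
}

/* Stand-in for the new wrapper: the locking choice lives in one place. */
static void model_issue(struct model_hctx *hctx)
{
	assert(hctx != NULL);

	if (!(hctx->flags & MODEL_F_BLOCKING)) {
		hctx->rcu_sections++;	/* rcu_read_lock() in the kernel */
		model_issue_unlocked(hctx);
		/* rcu_read_unlock() would go here */
	} else {
		hctx->srcu_sections++;	/* srcu_read_lock() in the kernel */
		model_issue_unlocked(hctx);
		/* srcu_read_unlock() would go here */
	}
}
```

Callers then only ever call model_issue(), so a new call site cannot forget the read-side protection or pick the wrong flavour for a blocking driver.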
* Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Bart Van Assche @ 2017-03-13 21:01 UTC (permalink / raw)
To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-3-hch@lst.de>
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> @@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> }
>
> plug = current->plug;
> - if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> + if (plug && q->nr_hw_queues == 1) {
> + struct request *last = NULL;
> +
> + blk_mq_bio_to_request(rq, bio);
> +
> + /*
> + * @request_count may become stale because of schedule
> + * out, so check the list again.
> + */
The above comment was relevant as long as there was a request_count assignment
above blk_mq_sched_get_request(). This patch moves that assignment inside if
(plug && q->nr_hw_queues == 1). Does that mean that the above comment should be
removed entirely?
> + if (list_empty(&plug->mq_list))
> + request_count = 0;
> + else if (blk_queue_nomerges(q))
> + request_count = blk_plug_queued_count(q);
> +
> + if (!request_count)
> + trace_block_plug(q);
> + else
> + last = list_entry_rq(plug->mq_list.prev);
> +
> + blk_mq_put_ctx(data.ctx);
> +
> + if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
> + blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> + blk_flush_plug_list(plug, false);
> + trace_block_plug(q);
> + }
> +
> + list_add_tail(&rq->queuelist, &plug->mq_list);
> + goto done;
> + } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> struct request *old_rq = NULL;
>
> blk_mq_bio_to_request(rq, bio);
Bart.
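To make the quoted hunk easier to reason about, the flush decision it encodes can be modelled in userspace as below. This is a sketch only: the two limits are assumed values (check include/linux/blkdev.h in the tree under review), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed limits standing in for BLK_MAX_REQUEST_COUNT and BLK_PLUG_FLUSH_SIZE. */
#define MODEL_MAX_REQUEST_COUNT 16u
#define MODEL_PLUG_FLUSH_SIZE ((size_t)128 * 1024)

struct model_plug {
	unsigned int request_count;	/* possibly stale cached count */
	size_t last_bytes;		/* size of the most recently plugged request */
	bool list_empty;		/* current state of the plug list */
};

/* Returns true if the plug list should be flushed before adding a request. */
static bool model_should_flush(struct model_plug *plug)
{
	assert(plug != NULL);

	/* The cached count may be stale: an empty list always means zero. */
	if (plug->list_empty)
		plug->request_count = 0;

	if (plug->request_count >= MODEL_MAX_REQUEST_COUNT)
		return true;

	/* "last" only exists when at least one request is plugged. */
	return plug->request_count > 0 &&
	       plug->last_bytes >= MODEL_PLUG_FLUSH_SIZE;
}
```

The recheck against the list state is the part the code comment under discussion refers to: without it, a count taken before a reschedule could trigger a flush of a list that is already empty.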
* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Bart Van Assche @ 2017-03-13 20:52 UTC (permalink / raw)
To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313154833.14165-2-hch@lst.de>
On Mon, 2017-03-13 at 09:48 -0600, Christoph Hellwig wrote:
> This flag was never used since it was introduced.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> block/blk-mq.c | 8 +-------
> include/linux/blk-mq.h | 1 -
> 2 files changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..acf0ddf4af52 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> }
>
> plug = current->plug;
> - /*
> - * If the driver supports defer issued based on 'last', then
> - * queue it up like normal since we can potentially save some
> - * CPU this way.
> - */
> - if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
> - !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> + if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
A minor comment: due to this change the outer pair of parentheses
became superfluous. Please consider removing these.
Thanks,
Bart.
* Re: [PATCH] brd: make rd_size static
From: Bart Van Assche @ 2017-03-13 20:07 UTC (permalink / raw)
To: linux-block@vger.kernel.org, yanaijie@huawei.com, axboe@kernel.dk
Cc: zhaohongjiang@huawei.com, miaoxie@huawei.com
In-Reply-To: <4dd2367b-61b9-d390-5bb1-955e90fef0b0@kernel.dk>
On Sat, 2017-03-11 at 15:29 -0700, Jens Axboe wrote:
> On 03/10/2017 12:32 AM, Jason Yan wrote:
> > Fixes the following sparse warning:
> >
> > drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
> > Should it be static?
>
> If you do a search on this topic, you'll find others that attempted
> to do the same. Arm uses it for tag parsing, for some reason, your
> patch below would break it.
>
> It'd be great if this was fixed up for real, though.
How about something like the (untested) patch below?
Subject: [PATCH] arch/arm/kernel/atags_parse.c: Fix rd_size declaration
Ensure that the ARM setup code treats "rd_size" as unsigned long instead of int.
---
arch/arm/kernel/atags_parse.c | 3 ++-
drivers/block/brd.c | 2 ++
drivers/block/brd.h | 1 +
3 files changed, 5 insertions(+), 1 deletion(-)
create mode 100644 drivers/block/brd.h
diff --git a/arch/arm/kernel/atags_parse.c b/arch/arm/kernel/atags_parse.c
index 68c6ae0b9e4c..f18b6deaf050 100644
--- a/arch/arm/kernel/atags_parse.c
+++ b/arch/arm/kernel/atags_parse.c
@@ -30,6 +30,7 @@
#include <asm/mach/arch.h>
#include "atags.h"
+#include "../../../drivers/block/brd.h"
static char default_command_line[COMMAND_LINE_SIZE] __initdata = CONFIG_CMDLINE;
@@ -91,7 +92,7 @@ __tagtable(ATAG_VIDEOTEXT, parse_tag_videotext);
#ifdef CONFIG_BLK_DEV_RAM
static int __init parse_tag_ramdisk(const struct tag *tag)
{
- extern int rd_size, rd_image_start, rd_prompt, rd_doload;
+ extern int rd_image_start, rd_prompt, rd_doload;
rd_image_start = tag->u.ramdisk.start;
rd_doload = (tag->u.ramdisk.flags & 1) == 0;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..f1f9f0338fbd 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,8 @@
#include <linux/uaccess.h>
+#include "brd.h"
+
#define SECTOR_SHIFT 9
#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT)
diff --git a/drivers/block/brd.h b/drivers/block/brd.h
new file mode 100644
index 000000000000..dbb0f92fefc8
--- /dev/null
+++ b/drivers/block/brd.h
@@ -0,0 +1 @@
+extern unsigned long rd_size;
--
2.12.0
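The idea behind the untested patch above, a single shared declaration instead of a function-local extern with a different type, can be illustrated in one file. This is a sketch under assumptions: the three sections stand in for brd.h, brd.c and the ATAG parser, the default value is a placeholder, and the update rule in model_parse_ramdisk_size() is assumed rather than taken from the quoted hunk.

```c
#include <assert.h>

/* Stands in for drivers/block/brd.h: the one shared declaration. */
extern unsigned long rd_size;

/* Stands in for drivers/block/brd.c: the definition (placeholder default). */
unsigned long rd_size = 4096;

/*
 * Because declaration and definition are now seen together, changing either
 * of them to "int" makes the compiler reject the file with a conflicting
 * types error instead of silently miscompiling.
 */
static_assert(sizeof(rd_size) == sizeof(unsigned long),
	      "rd_size must have the width declared in the shared header");

/* Stands in for the ATAG parser; the "only if non-zero" rule is an assumption. */
static void model_parse_ramdisk_size(unsigned long size_kb)
{
	if (size_kb)
		rd_size = size_kb;
}
```

With the old function-local "extern int rd_size", the parser and the driver were compiled separately and nothing checked that their types agreed, which is the mismatch the patch description points at.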
* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Dan Williams @ 2017-03-13 18:10 UTC (permalink / raw)
To: Jan Kara
Cc: Jens Axboe, linux-block, Christoph Hellwig, Thiago Jung Bauermann,
Tejun Heo, Tahsin Erdogan, Omar Sandoval
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
On Mon, Mar 13, 2017 at 8:13 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to fix several
> different races and issues I've found when testing device shutdown and reuse.
> The first two patches fix possible (theoretical) problems when opening of a
> block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
> that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
> problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
> code where get_gendisk() can return already freed gendisk structure (again
> triggered by Omar's stress test).
>
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.
Passes testing with the libnvdimm unit tests that have been tripped up
by block-unplug bugs in the past.
* Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0
From: Abdul Haleem @ 2017-03-13 17:30 UTC (permalink / raw)
To: Jens Axboe
Cc: Brian Foster, linux-xfs, linux-block, mpe, linuxppc-dev,
linux-kernel
In-Reply-To: <158127ce-d71d-56a5-3dd3-f676b106c65d@kernel.dk>
On Sat, 2017-03-11 at 15:46 -0700, Jens Axboe wrote:
> On 03/09/2017 05:59 AM, Brian Foster wrote:
> > cc linux-block
> >
> > On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
> >> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
> >>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
> >>>>
> >>>> Issue is not reproducible all the time.
>
> Is that still the case with -git as of yesterday? Check that you
> have this merge:
>
> 34bbce9e344b47e8871273409632f525973afad4
>
> in your tree.
>
Thanks for pointing that out; with the merge commit below, the warnings disappear.
commit 34bbce9e344b47e8871273409632f525973afad4
Merge: bb61ce5 672a2c8
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu Mar 9 15:53:25 2017 -0800
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Thanks for the fix !
Reported-by & Tested-by : Abdul Haleem <abdhalee@linux.vnet.ibm.com>
--
Regards
Abdul Haleem
IBM Linux Technology Centre
* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-13 16:17 UTC (permalink / raw)
To: Jens Axboe; +Cc: Tejun Heo, linux-block, David Rientjes, linux-kernel
In-Reply-To: <c514c076-18ea-8605-d439-e46730aa29e0@kernel.dk>
>> Do you mean, you prefer the approach that was taken in v1 patch or
>> something else?
>
> I can no longer find v1 of the patch, just v2 and on. Can you send a
> link to it?
https://lkml.org/lkml/2017/2/28/8
* [PATCH 4/4] blk-mq: streamline blk_mq_make_request
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos. Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 67 +++++++++++++++++++++++-----------------------------------
1 file changed, 27 insertions(+), 40 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 48748cb799ed..18e449cc832f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
cookie = request_to_qc_t(data.hctx, rq);
+ plug = current->plug;
if (unlikely(is_flush_fua)) {
- if (q->elevator)
- goto elv_insert;
blk_mq_bio_to_request(rq, bio);
- blk_insert_flush(rq);
- goto run_queue;
- }
-
- plug = current->plug;
- if (plug && q->nr_hw_queues == 1) {
+ if (q->elevator) {
+ blk_mq_sched_insert_request(rq, false, true,
+ !is_sync || is_flush_fua, true);
+ } else {
+ blk_insert_flush(rq);
+ blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+ }
+ } else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
blk_mq_bio_to_request(rq, bio);
@@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
else
last = list_entry_rq(plug->mq_list.prev);
- blk_mq_put_ctx(data.ctx);
-
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
@@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
list_add_tail(&rq->queuelist, &plug->mq_list);
- goto done;
- } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
- struct request *old_rq = NULL;
-
+ } else if (plug && !blk_queue_nomerges(q)) {
blk_mq_bio_to_request(rq, bio);
/*
* We do limited plugging. If the bio can be merged, do that.
* Otherwise the existing request in the plug list will be
* issued. So the plug list will have one request at most
+ *
+ * The plug list might get flushed before this. If that happens,
+ * the plug list is empty and same_queue_rq is invalid.
*/
- if (plug) {
- /*
- * The plug list might get flushed before this. If that
- * happens, same_queue_rq is invalid and plug list is
- * empty
- */
- if (same_queue_rq && !list_empty(&plug->mq_list)) {
- old_rq = same_queue_rq;
- list_del_init(&old_rq->queuelist);
- }
- list_add_tail(&rq->queuelist, &plug->mq_list);
- } else /* is_sync */
- old_rq = rq;
- blk_mq_put_ctx(data.ctx);
- if (old_rq)
- blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
- goto done;
- }
+ if (!list_empty(&plug->mq_list))
+ list_del_init(&same_queue_rq->queuelist);
+ else
+ same_queue_rq = NULL;
- if (q->elevator) {
-elv_insert:
- blk_mq_put_ctx(data.ctx);
+ list_add_tail(&rq->queuelist, &plug->mq_list);
+ if (same_queue_rq)
+ blk_mq_try_issue_directly(data.hctx, same_queue_rq,
+ &cookie);
+ } else if (is_sync) {
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+ } else if (q->elevator) {
blk_mq_bio_to_request(rq, bio);
blk_mq_sched_insert_request(rq, false, true,
!is_sync || is_flush_fua, true);
- goto done;
- }
- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+ } else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
* an ASYNC request, just ensure that we run it later on. The
* latter allows for merging opportunities and more efficient
* dispatching.
*/
-run_queue:
blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
}
+
blk_mq_put_ctx(data.ctx);
-done:
return cookie;
}
--
2.11.0
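The control flow that patch 4/4 ends up with can be summarised by a small userspace model. It is a sketch only: the booleans stand in for the kernel conditions, the enum values and struct are invented for the example, and the branch bodies are reduced to naming the path taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_path {
	MODEL_FLUSH,		/* flush/FUA request */
	MODEL_PLUG_SQ,		/* plug list, single hardware queue */
	MODEL_PLUG_LIMITED,	/* limited plugging, merges allowed */
	MODEL_ISSUE_SYNC,	/* sync request issued directly */
	MODEL_ELEVATOR,		/* handed to the I/O scheduler */
	MODEL_SW_QUEUE,		/* merged into or queued on the software queue */
};

struct model_req {
	bool is_flush_fua, is_sync, has_plug, nomerges, has_elevator;
	unsigned int nr_hw_queues;
	int ctx_puts;		/* how many times the ctx was released */
};

/* One if/else chain in the order used by the patch, with a single exit. */
static enum model_path model_make_request(struct model_req *r)
{
	enum model_path path;

	assert(r != NULL);

	if (r->is_flush_fua)
		path = MODEL_FLUSH;
	else if (r->has_plug && r->nr_hw_queues == 1)
		path = MODEL_PLUG_SQ;
	else if (r->has_plug && !r->nomerges)
		path = MODEL_PLUG_LIMITED;
	else if (r->is_sync)
		path = MODEL_ISSUE_SYNC;
	else if (r->has_elevator)
		path = MODEL_ELEVATOR;
	else
		path = MODEL_SW_QUEUE;

	/* Common tail: the ctx is put exactly once, whichever branch ran. */
	r->ctx_puts++;
	return path;
}
```

The single ctx_puts increment at the end corresponds to the remark in the commit message that the CTX put moves to common code, at the cost of holding the CPU pinned slightly longer on some paths.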
* [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boilerplate code in the caller which would get duplicated with new callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53e49a3f6f0a..48748cb799ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
}
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
{
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1478,13 +1478,27 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
blk_mq_sched_insert_request(rq, false, true, true, false);
}
+static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq, blk_qc_t *cookie)
+{
+ if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+ rcu_read_lock();
+ __blk_mq_try_issue_directly(rq, cookie);
+ rcu_read_unlock();
+ } else {
+ unsigned int srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+ __blk_mq_try_issue_directly(rq, cookie);
+ srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
+ }
+}
+
static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = op_is_sync(bio->bi_opf);
const int is_flush_fua = op_is_flush(bio->bi_opf);
struct blk_mq_alloc_data data = { .flags = 0 };
struct request *rq;
- unsigned int request_count = 0, srcu_idx;
+ unsigned int request_count = 0;
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
@@ -1582,18 +1596,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else /* is_sync */
old_rq = rq;
blk_mq_put_ctx(data.ctx);
- if (!old_rq)
- goto done;
-
- if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
- rcu_read_lock();
- blk_mq_try_issue_directly(old_rq, &cookie);
- rcu_read_unlock();
- } else {
- srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
- blk_mq_try_issue_directly(old_rq, &cookie);
- srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
- }
+ if (old_rq)
+ blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
goto done;
}
--
2.11.0
* [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>
They are mostly the same code anyway - there is just one small conditional
for the plug case that is different for both variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 164 +++++++++++----------------------------------------------
1 file changed, 31 insertions(+), 133 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf0ddf4af52..53e49a3f6f0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1478,11 +1478,6 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
blk_mq_sched_insert_request(rq, false, true, true, false);
}
-/*
- * Multiple hardware queue variant. This will not use per-process plugs,
- * but will attempt to bypass the hctx queueing if we can go straight to
- * hardware for SYNC IO.
- */
static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = op_is_sync(bio->bi_opf);
@@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
plug = current->plug;
- if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
+ if (plug && q->nr_hw_queues == 1) {
+ struct request *last = NULL;
+
+ blk_mq_bio_to_request(rq, bio);
+
+ /*
+ * @request_count may become stale because of schedule
+ * out, so check the list again.
+ */
+ if (list_empty(&plug->mq_list))
+ request_count = 0;
+ else if (blk_queue_nomerges(q))
+ request_count = blk_plug_queued_count(q);
+
+ if (!request_count)
+ trace_block_plug(q);
+ else
+ last = list_entry_rq(plug->mq_list.prev);
+
+ blk_mq_put_ctx(data.ctx);
+
+ if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+ blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+ blk_flush_plug_list(plug, false);
+ trace_block_plug(q);
+ }
+
+ list_add_tail(&rq->queuelist, &plug->mq_list);
+ goto done;
+ } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
struct request *old_rq = NULL;
blk_mq_bio_to_request(rq, bio);
@@ -1596,119 +1620,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
return cookie;
}
-/*
- * Single hardware queue variant. This will attempt to use any per-process
- * plug for merging and IO deferral.
- */
-static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
-{
- const int is_sync = op_is_sync(bio->bi_opf);
- const int is_flush_fua = op_is_flush(bio->bi_opf);
- struct blk_plug *plug;
- unsigned int request_count = 0;
- struct blk_mq_alloc_data data = { .flags = 0 };
- struct request *rq;
- blk_qc_t cookie;
- unsigned int wb_acct;
-
- blk_queue_bounce(q, &bio);
-
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio_io_error(bio);
- return BLK_QC_T_NONE;
- }
-
- blk_queue_split(q, &bio, q->bio_split);
-
- if (!is_flush_fua && !blk_queue_nomerges(q)) {
- if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
- return BLK_QC_T_NONE;
- } else
- request_count = blk_plug_queued_count(q);
-
- if (blk_mq_sched_bio_merge(q, bio))
- return BLK_QC_T_NONE;
-
- wb_acct = wbt_wait(q->rq_wb, bio, NULL);
-
- trace_block_getrq(q, bio, bio->bi_opf);
-
- rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
- if (unlikely(!rq)) {
- __wbt_done(q->rq_wb, wb_acct);
- return BLK_QC_T_NONE;
- }
-
- wbt_track(&rq->issue_stat, wb_acct);
-
- cookie = request_to_qc_t(data.hctx, rq);
-
- if (unlikely(is_flush_fua)) {
- if (q->elevator)
- goto elv_insert;
- blk_mq_bio_to_request(rq, bio);
- blk_insert_flush(rq);
- goto run_queue;
- }
-
- /*
- * A task plug currently exists. Since this is completely lockless,
- * utilize that to temporarily store requests until the task is
- * either done or scheduled away.
- */
- plug = current->plug;
- if (plug) {
- struct request *last = NULL;
-
- blk_mq_bio_to_request(rq, bio);
-
- /*
- * @request_count may become stale because of schedule
- * out, so check the list again.
- */
- if (list_empty(&plug->mq_list))
- request_count = 0;
- if (!request_count)
- trace_block_plug(q);
- else
- last = list_entry_rq(plug->mq_list.prev);
-
- blk_mq_put_ctx(data.ctx);
-
- if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
- blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
- blk_flush_plug_list(plug, false);
- trace_block_plug(q);
- }
-
- list_add_tail(&rq->queuelist, &plug->mq_list);
- return cookie;
- }
-
- if (q->elevator) {
-elv_insert:
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
- blk_mq_sched_insert_request(rq, false, true,
- !is_sync || is_flush_fua, true);
- goto done;
- }
- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
- /*
- * For a SYNC request, send it to the hardware immediately. For
- * an ASYNC request, just ensure that we run it later on. The
- * latter allows for merging opportunities and more efficient
- * dispatching.
- */
-run_queue:
- blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
- }
-
- blk_mq_put_ctx(data.ctx);
-done:
- return cookie;
-}
-
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx)
{
@@ -2366,10 +2277,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&q->requeue_list);
spin_lock_init(&q->requeue_lock);
- if (q->nr_hw_queues > 1)
- blk_queue_make_request(q, blk_mq_make_request);
- else
- blk_queue_make_request(q, blk_sq_make_request);
+ blk_queue_make_request(q, blk_mq_make_request);
/*
* Do this after blk_queue_make_request() overrides it...
@@ -2717,16 +2625,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
set->nr_hw_queues = nr_hw_queues;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
-
- /*
- * Manually set the make_request_fn as blk_queue_make_request
- * resets a lot of the queue settings.
- */
- if (q->nr_hw_queues > 1)
- q->make_request_fn = blk_mq_make_request;
- else
- q->make_request_fn = blk_sq_make_request;
-
blk_mq_queue_reinit(q, cpu_online_mask);
}
--
2.11.0
* [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
To: axboe; +Cc: linux-block
In-Reply-To: <20170313154833.14165-1-hch@lst.de>
This flag was never used since it was introduced.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 8 +-------
include/linux/blk-mq.h | 1 -
2 files changed, 1 insertion(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..acf0ddf4af52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
plug = current->plug;
- /*
- * If the driver supports defer issued based on 'last', then
- * queue it up like normal since we can potentially save some
- * CPU this way.
- */
- if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
- !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+ if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
struct request *old_rq = NULL;
blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a9006117..5b3e201c8d4f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,7 +152,6 @@ enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
BLK_MQ_F_SG_MERGE = 1 << 2,
- BLK_MQ_F_DEFER_ISSUE = 1 << 4,
BLK_MQ_F_BLOCKING = 1 << 5,
BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
--
2.11.0
* unify and streamline the blk-mq make_request implementations
From: Christoph Hellwig @ 2017-03-13 15:48 UTC (permalink / raw)
To: axboe; +Cc: linux-block
A bunch of cleanups to get us a nice I/O submission path.
* Re: NULL deref in cpu hot unplug on jens for-linus branch
From: Jens Axboe @ 2017-03-13 15:42 UTC (permalink / raw)
To: Sagi Grimberg, linux-block@vger.kernel.org, linux-nvme
In-Reply-To: <a4e0cb7c-c779-b681-e66c-2159c5f2b09f@grimberg.me>
On 03/13/2017 09:24 AM, Sagi Grimberg wrote:
> Hey Jens,
>
> After some fixes to nvme-rdma in the area of cpu hot unplug and
> rebase to jens for-linus branch I get the following NULL deref [1]
>
> This crash did not happen before I rebased to for-linus (unless I
> screwed up something).
>
> I'm on my way out so I just send it out in hope that someone can
> figure it out before I do...
>
> After I offlined a cpu, I got the nvmf target to disconnect
> from the host, the host then schedules a reconnect. after the
> host reconnects it issues a namespace scanning which removes
> an old namespace. Then we get to blk_cleanup_queue which
> then triggers the NULL deref.
>
> The strange thing is that we pass the
> (blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...
>
> [1]
> --
> [ 55.865818] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
> [ 55.867094] IP: __blk_mq_tag_idle+0x19/0x30
> [ 55.867825] PGD 0
>
> [ 55.868477] Oops: 0002 [#1] SMP
> [ 55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core
> mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul
> ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper
> cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid
> ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp
> libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper
> syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm
> floppy pps_core pata_acpi
> [ 55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
> [ 55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [ 55.879055] Workqueue: events nvme_scan_work [nvme_core]
> [ 55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
> [ 55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
> [ 55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
> [ 55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX:
> ffffa0b13fc11d00
> [ 55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI:
> ffffa0b13376f400
> [ 55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09:
> ffffffffbee42e20
> [ 55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12:
> ffffa0b1384c6018
> [ 55.887343] R13: 0000000000000001 R14: 0000000000000080 R15:
> 0000000000000000
> [ 55.888517] FS: 0000000000000000(0000) GS:ffffa0b13fc00000(0000)
> knlGS:0000000000000000
> [ 55.889816] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4:
> 00000000003406f0
> [ 55.891878] Call Trace:
> [ 55.892285] blk_mq_exit_hctx.isra.41+0xc4/0xd0
> [ 55.893020] blk_mq_free_queue+0x110/0x130
> [ 55.893693] blk_cleanup_queue+0xe0/0x150
> [ 55.894346] nvme_ns_remove+0x78/0xd0 [nvme_core]
> [ 55.895109] nvme_validate_ns+0x8c/0x290 [nvme_core]
> [ 55.895911] ? nvme_scan_work+0x28a/0x370 [nvme_core]
> [ 55.896726] nvme_scan_work+0x2ad/0x370 [nvme_core]
> [ 55.897523] process_one_work+0x16b/0x480
> [ 55.898174] worker_thread+0x4b/0x500
> [ 55.898771] kthread+0x101/0x140
> [ 55.899299] ? process_one_work+0x480/0x480
> [ 55.899977] ? kthread_create_on_node+0x40/0x40
> [ 55.900711] ? start_kernel+0x3bc/0x461
> [ 55.901336] ? acpi_early_init+0x83/0xf9
> [ 55.901980] ? acpi_load_tables+0x31/0x85
> [ 55.902632] ret_from_fork+0x2c/0x40
> [ 55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66
> 90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48
> 89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
> [ 55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
> [ 55.907209] CR2: 0000000000000008
> [ 55.907750] ---[ end trace f016dee1082237cf ]---
Are you saying your code works on top of 4.11-rc2, but not on top of my
for-linus? That seems odd. Looking at the oops, you are crashing with
!tags in __blk_mq_tag_idle. The below should work around it, but I'm
puzzled why this is new. Is it related to the other path you fixed in
this patch:
commit 0067d4b020ea07a58540acb2c5fcd3364bf326e0
Author: Sagi Grimberg <sagi@grimberg.me>
Date: Mon Mar 13 16:10:11 2017 +0200
blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
Since that's also handling hctx->tags == NULL.
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..1283f74bfdfb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -54,9 +54,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return;
- atomic_dec(&tags->active_queues);
+ if (tags) {
+ atomic_dec(&tags->active_queues);
- blk_mq_tag_wakeup_all(tags, false);
+ blk_mq_tag_wakeup_all(tags, false);
+ }
}
/*
--
Jens Axboe
* NULL deref in cpu hot unplug on jens for-linus branch
From: Sagi Grimberg @ 2017-03-13 15:24 UTC (permalink / raw)
To: linux-block@vger.kernel.org, Jens Axboe, linux-nvme
Hey Jens,
After some fixes to nvme-rdma in the area of cpu hot unplug and a
rebase onto Jens' for-linus branch, I get the following NULL deref [1].
This crash did not happen before I rebased onto for-linus (unless I
screwed something up).
I'm on my way out, so I'm just sending this out in the hope that someone
can figure it out before I do...
After I offlined a cpu, I got the nvmf target to disconnect
from the host; the host then scheduled a reconnect. After the
host reconnects, it issues a namespace scan, which removes
an old namespace. Then we get to blk_cleanup_queue, which
triggers the NULL deref.
The strange thing is that we pass the
(blk_mq_hw_queue_mapped(hctx)) condition but still hit a NULL...
[1]
--
[ 55.865818] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[ 55.867094] IP: __blk_mq_tag_idle+0x19/0x30
[ 55.867825] PGD 0
[ 55.868477] Oops: 0002 [#1] SMP
[ 55.869010] Modules linked in: nvme_rdma nvme_fabrics nvme_core
mlx5_ib ppdev kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper
cryptd joydev input_leds serio_raw i2c_piix4 parport_pc parport mac_hid
ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp
libiscsi sunrpc scsi_transport_iscsi autofs4 cirrus ttm drm_kms_helper
syscopyarea sysfillrect sysimgblt mlx5_core fb_sys_fops ptp psmouse drm
floppy pps_core pata_acpi
[ 55.876358] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 4.11.0-rc1+ #136
[ 55.877492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 55.879055] Workqueue: events nvme_scan_work [nvme_core]
[ 55.879940] task: ffffa0b13e1d9080 task.stack: ffffad2000244000
[ 55.880921] RIP: 0010:__blk_mq_tag_idle+0x19/0x30
[ 55.881713] RSP: 0018:ffffad2000247c70 EFLAGS: 00010203
[ 55.882582] RAX: 0000000000000000 RBX: ffffa0b13376f400 RCX:
ffffa0b13fc11d00
[ 55.883808] RDX: 0000000000000001 RSI: ffffa0b13376f400 RDI:
ffffa0b13376f400
[ 55.884983] RBP: ffffad2000247c70 R08: 0000000000000000 R09:
ffffffffbee42e20
[ 55.886168] R10: ffffad2000247b88 R11: 0000000000000008 R12:
ffffa0b1384c6018
[ 55.887343] R13: 0000000000000001 R14: 0000000000000080 R15:
0000000000000000
[ 55.888517] FS: 0000000000000000(0000) GS:ffffa0b13fc00000(0000)
knlGS:0000000000000000
[ 55.889816] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.890738] CR2: 0000000000000008 CR3: 000000003ba2f000 CR4:
00000000003406f0
[ 55.891878] Call Trace:
[ 55.892285] blk_mq_exit_hctx.isra.41+0xc4/0xd0
[ 55.893020] blk_mq_free_queue+0x110/0x130
[ 55.893693] blk_cleanup_queue+0xe0/0x150
[ 55.894346] nvme_ns_remove+0x78/0xd0 [nvme_core]
[ 55.895109] nvme_validate_ns+0x8c/0x290 [nvme_core]
[ 55.895911] ? nvme_scan_work+0x28a/0x370 [nvme_core]
[ 55.896726] nvme_scan_work+0x2ad/0x370 [nvme_core]
[ 55.897523] process_one_work+0x16b/0x480
[ 55.898174] worker_thread+0x4b/0x500
[ 55.898771] kthread+0x101/0x140
[ 55.899299] ? process_one_work+0x480/0x480
[ 55.899977] ? kthread_create_on_node+0x40/0x40
[ 55.900711] ? start_kernel+0x3bc/0x461
[ 55.901336] ? acpi_early_init+0x83/0xf9
[ 55.901980] ? acpi_load_tables+0x31/0x85
[ 55.902632] ret_from_fork+0x2c/0x40
[ 55.903215] Code: 74 09 48 8d 7b 48 e8 67 4b 06 00 5b 41 5c 5d c3 66
90 0f 1f 44 00 00 48 8b 87 08 01 00 00 f0 0f ba 77 18 01 72 01 c3 55 48
89 e5 <f0> ff 48 08 48 8d 78 10 e8 3a 4b 06 00 5d c3 0f 1f 84 00 00 00
[ 55.906220] RIP: __blk_mq_tag_idle+0x19/0x30 RSP: ffffad2000247c70
[ 55.907209] CR2: 0000000000000008
[ 55.907750] ---[ end trace f016dee1082237cf ]---
--
^ permalink raw reply
* [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-13 15:13 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
Hello,
this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix an oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix an oops due to a bug in gendisk
code where get_gendisk() can return an already freed gendisk structure (again
triggered by Omar's stress test).
People, please have a look at the patches. They are mostly simple, however the
interactions are rather complex, so I may have missed something. Also, I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, and tested static (not udev managed)
device inodes.
Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
on Tejun's feedback
* Significantly updated patch 5 (and dropped Tejun's previous ack) to
accommodate fixes to SCSI re-registration of BDI that went into 4.11-rc2
Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun
Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback
Honza
^ permalink raw reply
* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now that
cgwb->bdi is a proper refcounted reference (thus the bdi cannot get
released before all cgwbs are released) and cgwb_bdi_destroy()
shuts down writeback directly.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 -
mm/backing-dev.c | 22 +---------------------
2 files changed, 1 insertion(+), 22 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
#ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
- atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
#else
struct bdi_writeback_congested *wb_congested;
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b67be4fc12c4..8c30b1a7aae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
/*
* cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
* blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
- * protected. cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
/**
* wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
- struct backing_dev_info *bdi = wb->bdi;
wb_shutdown(wb);
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
-
- if (atomic_dec_and_test(&bdi->usage_cnt))
- wake_up_all(&cgwb_release_wait);
}
static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
/* we might have raced another instance of this function */
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
- atomic_inc(&bdi->usage_cnt);
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
- atomic_set(&bdi->usage_cnt, 1);
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
@@ -728,18 +720,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
spin_lock_irq(&cgwb_lock);
}
spin_unlock_irq(&cgwb_lock);
-
- /*
- * All cgwb's must be shutdown and released before returning. Drain
- * the usage counter to wait for all cgwb's ever created on @bdi.
- */
- atomic_dec(&bdi->usage_cnt);
- wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
- /*
- * Grab back our reference so that we hold it when @bdi gets
- * re-registered.
- */
- atomic_inc(&bdi->usage_cnt);
}
/**
--
2.10.2
^ permalink raw reply related
* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
static void cgwb_bdi_exit(struct backing_dev_info *bdi)
{
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
- cgwb_bdi_destroy(bdi);
+ cgwb_bdi_unregister(bdi);
if (bdi->dev) {
bdi_debug_unregister(bdi);
--
2.10.2
^ permalink raw reply related
* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara, Greg Kroah-Hartman
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Make the function available for outside use and fortify it against NULL
kobject.
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/kobject.h | 2 ++
lib/kobject.c | 5 ++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
extern int __must_check kobject_move(struct kobject *, struct kobject *);
extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+ struct kobject *kobj);
extern void kobject_put(struct kobject *kobj);
extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
}
EXPORT_SYMBOL(kobject_get);
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
{
+ if (!kobj)
+ return NULL;
if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
}
+EXPORT_SYMBOL(kobject_get_unless_zero);
/*
* kobject_cleanup - free kobject resources.
--
2.10.2
^ permalink raw reply related
* [PATCH 11/11] block: Fix oops scsi_disk_get()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():
[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217] ? disk_get_part+0x22/0x1f0
[11863.059217] sd_open+0x39/0x130
[11863.059217] __blkdev_get+0x69/0x430
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? bd_acquire+0x96/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_get+0x126/0x350
[11863.059217] ? _raw_spin_unlock+0x2b/0x40
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_open+0x65/0x80
...
As you can see, the RAX value is already poisoned, showing that the gendisk
we got is already freed. The problem is that get_gendisk() looks up the device
number in ext_devt_idr and then calls get_disk(), which does kobject_get()
on the disk's kobject. However, the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()), at which moment it
already has a refcount of 0 and is on its way to being freed. Indeed, we
got a warning from kobject_get() about a 0 refcount shortly before the
oops.
We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get a reference to a disk that is already being
freed.
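The "get unless zero" idea described above can be sketched in userspace with C11 atomics. This is an illustration only, not the kobject or kref implementation: struct mock_obj and both mock_* helpers are hypothetical names, and a compare-and-swap loop stands in for kref_get_unless_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a kobject. */
struct mock_obj {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is still non-zero. A count of zero
 * means the release path is already running, so the object must not be
 * revived.
 */
static bool mock_ref_get_unless_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* NULL-tolerant wrapper, mirroring the shape of the patched helper. */
static struct mock_obj *mock_obj_get_unless_zero(struct mock_obj *obj)
{
	if (!obj)
		return NULL;
	if (!mock_ref_get_unless_zero(&obj->refcount))
		return NULL;
	return obj;
}
```

A lookup that can still find an object whose count has dropped to zero gets NULL back from mock_obj_get_unless_zero() and has to treat that as "not found", whereas an unconditional increment would hand out a pointer to an object that is about to be freed.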
Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/genhd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
- kobj = kobject_get(&disk_to_dev(disk)->kobj);
+ kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
--
2.10.2
^ permalink raw reply related
* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
When a block device is closed, we call inode_detach_wb() in __blkdev_put(),
which sets inode->i_wb to NULL. That is contrary to the expectation that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to an oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.
The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay around until we call bdi_put() from
bdev_evict_inode(), so we can postpone calling inode_detach_wb() to that
moment.
Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/block_dev.c | 8 ++------
include/linux/writeback.h | 1 +
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5ec8750f5332..c66f5dd4a02a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
spin_lock(&bdev_lock);
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+ /* Detach inode from wb early as bdi_put() may free bdi->wb */
+ inode_detach_wb(inode);
if (bdev->bd_bdi != &noop_backing_dev_info) {
bdi_put(bdev->bd_bdi);
bdev->bd_bdi = &noop_backing_dev_info;
@@ -1880,12 +1882,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);
bdev_write_inode(bdev);
- /*
- * Detaching bdev inode from its wb in __destroy_inode()
- * is too late: the queue which embeds its bdi (along with
- * root wb) can be gone as soon as we put_disk() below.
- */
- inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
static inline void inode_detach_wb(struct inode *inode)
{
if (inode->i_wb) {
+ WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
--
2.10.2
^ permalink raw reply related
* [PATCH 05/11] bdi: Unify bdi->wb_list handling for root wb_writeback
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Currently the root wb_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
wb_writeback structures, which get added to the list when created and
removed from it before wb_shutdown().
So move list addition of root bdi_writeback to bdi_register() and list
removal of all wb_writeback structures to wb_shutdown(). That way a
wb_writeback structure is on bdi->wb_list if and only if it can handle
writeback and it will make it easier for us to handle shutdown of all
wb_writeback structures in bdi_unregister().
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03d4ba27c133..e3d56dba4da8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
return err;
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
spin_unlock_bh(&wb->work_lock);
+ cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
release_work);
struct backing_dev_info *bdi = wb->bdi;
- spin_lock_irq(&cgwb_lock);
- list_del_rcu(&wb->bdi_node);
- spin_unlock_irq(&cgwb_lock);
-
wb_shutdown(wb);
css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
percpu_ref_kill(&wb->refcnt);
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+ spin_lock_irq(&cgwb_lock);
+ list_del_rcu(&wb->bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+}
+
static int cgwb_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css, gfp_t gfp)
{
@@ -766,6 +772,13 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
spin_unlock_irq(&cgwb_lock);
}
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+ spin_lock_irq(&cgwb_lock);
+ list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+ spin_unlock_irq(&cgwb_lock);
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */
static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -793,6 +806,16 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
wb_congested_put(bdi->wb_congested);
}
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+ list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+}
+
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+ list_del_rcu(&wb->bdi_node);
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */
int bdi_init(struct backing_dev_info *bdi)
@@ -811,8 +834,6 @@ int bdi_init(struct backing_dev_info *bdi)
ret = cgwb_bdi_init(bdi);
- list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
-
return ret;
}
EXPORT_SYMBOL(bdi_init);
@@ -848,6 +869,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
if (IS_ERR(dev))
return PTR_ERR(dev);
+ cgwb_bdi_register(bdi);
bdi->dev = dev;
bdi_debug_register(bdi, dev_name(dev));
--
2.10.2
^ permalink raw reply related
* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
which also means that writeback has been shut down on them. Since this
wait is going away, directly shut down writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shut down.
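The drain loop this patch adds (take the first entry, drop the lock, shut the entry down, re-take the lock) can be sketched single-threaded in userspace. Everything here is a hypothetical stand-in: the mock_* names, the boolean "lock" that only checks lock/unlock pairing, and the singly linked list. The sketch does not model the concurrent caller that the WB_shutting_down bit is meant to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical writeback structure on a singly linked list. */
struct mock_wb {
	struct mock_wb *next;
	bool registered;
	int shutdown_calls;
};

/* Hypothetical bdi; "locked" only verifies lock/unlock pairing. */
struct mock_bdi {
	struct mock_wb *wb_list;
	bool locked;
};

static void mock_lock(struct mock_bdi *bdi)
{
	assert(!bdi->locked);
	bdi->locked = true;
}

static void mock_unlock(struct mock_bdi *bdi)
{
	assert(bdi->locked);
	bdi->locked = false;
}

/* Unlink wb from the list under the lock. */
static void mock_remove_from_list(struct mock_bdi *bdi, struct mock_wb *wb)
{
	mock_lock(bdi);
	for (struct mock_wb **pp = &bdi->wb_list; *pp; pp = &(*pp)->next) {
		if (*pp == wb) {
			*pp = wb->next;
			wb->next = NULL;
			break;
		}
	}
	mock_unlock(bdi);
}

/* Shutdown takes the lock itself, so callers must not hold it. */
static void mock_wb_shutdown(struct mock_bdi *bdi, struct mock_wb *wb)
{
	if (!wb->registered)
		return;
	wb->registered = false;
	mock_remove_from_list(bdi, wb);
	wb->shutdown_calls++;
}

/*
 * Drain the list: each pass shuts down the current head, which unlinks
 * itself. Assumes every listed wb is still registered; otherwise this
 * simplified loop would not make progress.
 */
static void mock_bdi_shutdown_all(struct mock_bdi *bdi)
{
	mock_lock(bdi);
	while (bdi->wb_list) {
		struct mock_wb *wb = bdi->wb_list;

		mock_unlock(bdi);
		mock_wb_shutdown(bdi, wb);
		mock_lock(bdi);
	}
	mock_unlock(bdi);
}
```

Re-reading the list head on every pass, rather than walking the list with an iterator, keeps the loop valid even though each shutdown modifies the list while the lock is dropped.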
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 +
mm/backing-dev.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
*/
enum wb_state {
WB_registered, /* bdi_register() was done */
+ WB_shutting_down, /* wb_shutdown() in progress */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+ /*
+ * Wait for wb shutdown to finish if someone else is just
+ * running wb_shutdown(). Otherwise we could proceed to wb /
+ * bdi destruction before wb_shutdown() is finished.
+ */
+ wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+ set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ /*
+ * Make sure bit gets cleared after shutdown is finished. Matches with
+ * the barrier provided by test_and_clear_bit() above.
+ */
+ smp_wmb();
+ clear_bit(WB_shutting_down, &wb->state);
}
static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
void **slot;
+ struct bdi_writeback *wb;
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
spin_lock_irq(&cgwb_lock);
radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
cgwb_kill(*slot);
+
+ while (!list_empty(&bdi->wb_list)) {
+ wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+ wb_shutdown(wb);
+ spin_lock_irq(&cgwb_lock);
+ }
spin_unlock_irq(&cgwb_lock);
/*
--
2.10.2
^ permalink raw reply related
* [PATCH 04/11] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.
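The asymmetry between embedded and non-embedded structures can be shown with a small userspace sketch. The mock_* names and the plain integer counter are assumptions for illustration; the real code uses bdi_get() and bdi_put(). The embedded structure presumably takes no reference because a bdi holding a reference on itself could never drop to zero.

```c
#include <assert.h>
#include <stddef.h>

struct mock_bdi;

/* Hypothetical writeback structure holding a pointer to its bdi. */
struct mock_wb {
	struct mock_bdi *bdi;
};

/* Hypothetical bdi with an embedded root writeback structure. */
struct mock_bdi {
	int refcnt;		/* stands in for the bdi reference count */
	struct mock_wb wb;	/* embedded root wb, takes no reference */
};

/* Take a bdi reference for every wb except the embedded one. */
static void mock_wb_init(struct mock_wb *wb, struct mock_bdi *bdi)
{
	assert(wb != NULL && bdi != NULL);
	if (wb != &bdi->wb)
		bdi->refcnt++;
	wb->bdi = bdi;
}

/* Drop the reference taken in mock_wb_init(), if there was one. */
static void mock_wb_exit(struct mock_wb *wb)
{
	assert(wb != NULL && wb->bdi != NULL);
	if (wb != &wb->bdi->wb)
		wb->bdi->refcnt--;
}
```

With this pairing, the counter rises only while a separately allocated wb exists, so the bdi cannot be released underneath such a wb, while the embedded root wb simply shares the lifetime of its bdi.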
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
memset(wb, 0, sizeof(*wb));
+ if (wb != &bdi->wb)
+ bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
- if (!wb->congested)
- return -ENOMEM;
+ if (!wb->congested) {
+ err = -ENOMEM;
+ goto out_put_bdi;
+ }
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
fprop_local_destroy_percpu(&wb->completions);
out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+ if (wb != &bdi->wb)
+ bdi_put(bdi);
return err;
}
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+ if (wb != &wb->bdi->wb)
+ bdi_put(wb->bdi);
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.10.2
^ permalink raw reply related