* Re: [PATCH] blk-mq: don't complete un-started request in timeout handler
From: Ming Lei @ 2017-03-23 12:14 UTC (permalink / raw)
To: Keith Busch
Cc: Jens Axboe, linux-kernel, linux-block, Christoph Hellwig,
Yi Zhang, Bart Van Assche, Hannes Reinecke
In-Reply-To: <20170322155817.GA18960@localhost.localdomain>
On Wed, Mar 22, 2017 at 11:58:17AM -0400, Keith Busch wrote:
> On Tue, Mar 21, 2017 at 11:03:59PM -0400, Jens Axboe wrote:
> > On 03/21/2017 10:14 PM, Ming Lei wrote:
> > > When iterating busy requests in timeout handler,
> > > if the STARTED flag of one request isn't set, that means
> > > the request is being processed in block layer or driver, and
> > > isn't submitted to hardware yet.
> > >
> > > In current implementation of blk_mq_check_expired(),
> > > if the request queue becomes dying, un-started requests are
> > > handled as being completed/freed immediately. This way is
> > > wrong, and can cause rq corruption or double allocation[1][2],
> > > when doing I/O and removing&resetting NVMe device at the sametime.
> >
> > I agree, completing it looks bogus. If the request is in a scheduler or
> > on a software queue, this won't end well at all. Looks like it was
> > introduced by this patch:
> >
> > commit eb130dbfc40eabcd4e10797310bda6b9f6dd7e76
> > Author: Keith Busch <keith.busch@intel.com>
> > Date: Thu Jan 8 08:59:53 2015 -0700
> >
> > blk-mq: End unstarted requests on a dying queue
> >
> > Before that, we just ignored it. Keith?
>
> The above was intended for a stopped hctx on a dying queue such that
> there's nothing in flight to the driver. Nvme had been relying on this
> to end unstarted requests so we may progress when a controller dies.
So the brokenness started just from the begining.
>
> We've since obviated the need: we restart the hw queues to flush entered
> requests to failure, so we don't need that brokenness.
Looks the following commit need to be backported too if we port this patch.
commit 69d9a99c258eb1d6478fd9608a2070890797eed7
Author: Keith Busch <keith.busch@intel.com>
Date: Wed Feb 24 09:15:56 2016 -0700
NVMe: Move error handling to failed reset handler
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH v2] block: trace completion of all bios.
From: Ming Lei @ 2017-03-23 10:43 UTC (permalink / raw)
To: NeilBrown
Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-raid, dm-devel,
Alasdair Kergon, Mike Snitzer, Shaohua Li, linux-kernel,
Martin K . Petersen
In-Reply-To: <87shm4a4lt.fsf@notabene.neil.brown.name>
On Thu, Mar 23, 2017 at 05:29:02PM +1100, NeilBrown wrote:
>
> Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
> Now that we have bio_chain(), it is not possible, in general, for a
> driver to know when the bio is really complete. Only bio_endio()
> knows that.
>
> So move the trace_block_bio_complete() call to bio_endio().
>
> Now trace_block_bio_complete() pairs with trace_block_bio_queue().
> Any bio for which a 'queue' event is traced, will subsequently
> generate a 'complete' event.
>
> There are a few cases where completion tracing is not wanted.
> 1/ If blk_update_request() has already generated a completion
> trace event at the 'request' level, there is no point generating
> one at the bio level too. In this case the bi_sector and bi_size
> will have changed, so the bio level event would be wrong
>
> 2/ If the bio hasn't actually been queued yet, but is being aborted
> early, then a trace event could be confusing. Some filesystems
> call bio_endio() and will need to use a different interface to
> avoid tracing
>
> 3/ The bio_integrity code interposes itself by replacing bi_end_io,
> then restores it and calls bio_endio() again. This would produce
> two identical trace events if left like that.
>
> To handle these, we provide bio_endio_notrace(). This patch only adds
> uses of this in core code. Separate patches will be needed to update
> the filesystems to avoid tracing.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> block/bio-integrity.c | 4 ++--
> block/bio.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> block/blk-core.c | 2 +-
> drivers/md/dm.c | 1 -
> drivers/md/raid5.c | 8 --------
> include/linux/bio.h | 1 +
> 6 files changed, 50 insertions(+), 12 deletions(-)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5384713d48bc..28581e2f68fb 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -370,7 +370,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
>
> /* Restore original bio completion handler */
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio(bio);
> + bio_endio_notrace(bio);
> }
>
> /**
> @@ -397,7 +397,7 @@ void bio_integrity_endio(struct bio *bio)
> */
> if (bio->bi_error) {
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio(bio);
> + bio_endio_notrace(bio);
>
> return;
> }
> diff --git a/block/bio.c b/block/bio.c
> index 5eec5e08417f..c8e5d24abd52 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1811,6 +1811,45 @@ static inline bool bio_remaining_done(struct bio *bio)
> }
>
> /**
> + * bio_endio_notrace - end I/O on a bio without tracing
> + * @bio: bio
> + *
> + * Description:
> + * bio_endio_notrace() will end I/O on the whole bio.
> + * bio_endio_notrace() should only be call if a completion trace
> + * event is not needed. This can be the case if a request-level
> + * completion event has already been generated, if the bio is
> + * being completed early, before it was even queued.
> + *
> + **/
> +void bio_endio_notrace(struct bio *bio)
> +{
> +again:
> + if (!bio_remaining_done(bio))
> + return;
> +
> + /*
> + * Need to have a real endio function for chained bios, otherwise
> + * various corner cases will break (like stacking block devices that
> + * save/restore bi_end_io) - however, we want to avoid unbounded
> + * recursion and blowing the stack. Tail call optimization would
> + * handle this, but compiling with frame pointers also disables
> + * gcc's sibling call optimization.
> + */
> + if (bio->bi_end_io == bio_chain_endio) {
> + bio = __bio_chain_endio(bio);
> + goto again;
> + }
> +
> + if (bio->bi_bdev)
> + trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
> + bio, bio->bi_error);
The notrace version still traces?
> + if (bio->bi_end_io)
> + bio->bi_end_io(bio);
> +}
> +EXPORT_SYMBOL(bio_endio_notrace);
It isn't a good idea to duplicate bio_endio here, and any change on
bio_endio() may be needed for this _notrace version too in future.
Thanks,
Ming
^ permalink raw reply
* [PATCH] block: make nr_iovecs unsigned in bio_alloc_bioset()
From: Dan Carpenter @ 2017-03-23 10:24 UTC (permalink / raw)
To: Jens Axboe
Cc: Ming Lei, Mike Christie, Kent Overstreet, Hannes Reinecke,
Chaitanya Kulkarni, Mike Snitzer, Adrian Hunter, linux-block,
kernel-janitors
There isn't a bug here, but Smatch is not smart enough to know that
"nr_iovecs" can't be negative so it complains about underflows.
Really, it's slightly cleaner to make this parameter unsigned.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..4931756d86d9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -383,7 +383,7 @@ extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);
extern mempool_t *biovec_create_pool(int pool_entries);
-extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
+extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
diff --git a/block/bio.c b/block/bio.c
index e75878f8b14a..6194a8cf2aab 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -427,7 +427,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
* RETURNS:
* Pointer to new bio on success, NULL on failure.
*/
-struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
+struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
+ struct bio_set *bs)
{
gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
^ permalink raw reply related
* [PATCH v2] block: trace completion of all bios.
From: NeilBrown @ 2017-03-23 6:29 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: linux-block, linux-raid, dm-devel, Alasdair Kergon, Mike Snitzer,
Shaohua Li, linux-kernel, Martin K . Petersen
In-Reply-To: <20170322125149.GA29606@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 7116 bytes --]
Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
Now that we have bio_chain(), it is not possible, in general, for a
driver to know when the bio is really complete. Only bio_endio()
knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() and will need to use a different interface to
avoid tracing
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restores it and calls bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we provide bio_endio_notrace(). This patch only adds
uses of this in core code. Separate patches will be needed to update
the filesystems to avoid tracing.
Signed-off-by: NeilBrown <neilb@suse.com>
---
block/bio-integrity.c | 4 ++--
block/bio.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
block/blk-core.c | 2 +-
drivers/md/dm.c | 1 -
drivers/md/raid5.c | 8 --------
include/linux/bio.h | 1 +
6 files changed, 50 insertions(+), 12 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713d48bc..28581e2f68fb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -370,7 +370,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
- bio_endio(bio);
+ bio_endio_notrace(bio);
}
/**
@@ -397,7 +397,7 @@ void bio_integrity_endio(struct bio *bio)
*/
if (bio->bi_error) {
bio->bi_end_io = bip->bip_end_io;
- bio_endio(bio);
+ bio_endio_notrace(bio);
return;
}
diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..c8e5d24abd52 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1811,6 +1811,45 @@ static inline bool bio_remaining_done(struct bio *bio)
}
/**
+ * bio_endio_notrace - end I/O on a bio without tracing
+ * @bio: bio
+ *
+ * Description:
+ * bio_endio_notrace() will end I/O on the whole bio.
+ * bio_endio_notrace() should only be call if a completion trace
+ * event is not needed. This can be the case if a request-level
+ * completion event has already been generated, if the bio is
+ * being completed early, before it was even queued.
+ *
+ **/
+void bio_endio_notrace(struct bio *bio)
+{
+again:
+ if (!bio_remaining_done(bio))
+ return;
+
+ /*
+ * Need to have a real endio function for chained bios, otherwise
+ * various corner cases will break (like stacking block devices that
+ * save/restore bi_end_io) - however, we want to avoid unbounded
+ * recursion and blowing the stack. Tail call optimization would
+ * handle this, but compiling with frame pointers also disables
+ * gcc's sibling call optimization.
+ */
+ if (bio->bi_end_io == bio_chain_endio) {
+ bio = __bio_chain_endio(bio);
+ goto again;
+ }
+
+ if (bio->bi_bdev)
+ trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+ bio, bio->bi_error);
+ if (bio->bi_end_io)
+ bio->bi_end_io(bio);
+}
+EXPORT_SYMBOL(bio_endio_notrace);
+
+/**
* bio_endio - end I/O on a bio
* @bio: bio
*
@@ -1818,6 +1857,10 @@ static inline bool bio_remaining_done(struct bio *bio)
* bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
* way to end I/O on a bio. No one should call bi_end_io() directly on a
* bio unless they own it and thus know that it has an end_io function.
+ *
+ * bio_endio() can be called several times on a bio that has been chained
+ * using bio_chain(). The ->bi_end_io() function will only be call the
+ * time. At this point the BLK_TA_COMPLETE tracing event will be generated.
**/
void bio_endio(struct bio *bio)
{
@@ -1838,6 +1881,9 @@ void bio_endio(struct bio *bio)
goto again;
}
+ if (bio->bi_bdev)
+ trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+ bio, bio->bi_error);
if (bio->bi_end_io)
bio->bi_end_io(bio);
}
diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..b6c76580a796 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -142,7 +142,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
/* don't actually finish bio if it's part of flush sequence */
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
- bio_endio(bio);
+ bio_endio_notrace(bio);
}
void blk_dump_rq_flags(struct request *rq, char *msg)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..f5f09ace690a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -810,7 +810,6 @@ static void dec_pending(struct dm_io *io, int error)
queue_io(md, bio);
} else {
/* done with normal IO or empty flush */
- trace_block_bio_complete(md->queue, bio, io_error);
bio->bi_error = io_error;
bio_endio(bio);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a3b7da34137..f684cb566721 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5141,8 +5141,6 @@ static void raid5_align_endio(struct bio *bi)
rdev_dec_pending(rdev, conf->mddev);
if (!error) {
- trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
- raid_bi, 0);
bio_endio(raid_bi);
if (atomic_dec_and_test(&conf->active_aligned_reads))
wake_up(&conf->wait_for_quiescent);
@@ -5727,10 +5725,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
md_write_end(mddev);
remaining = raid5_dec_bi_active_stripes(bi);
if (remaining == 0) {
-
-
- trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
- bi, 0);
bio_endio(bi);
}
}
@@ -6138,8 +6132,6 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
}
remaining = raid5_dec_bi_active_stripes(raid_bio);
if (remaining == 0) {
- trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
- raid_bio, 0);
bio_endio(raid_bio);
}
if (atomic_dec_and_test(&conf->active_aligned_reads))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..e0552bee227b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -418,6 +418,7 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
extern blk_qc_t submit_bio(struct bio *);
extern void bio_endio(struct bio *);
+extern void bio_endio_notrace(struct bio *);
static inline void bio_io_error(struct bio *bio)
{
--
2.12.0
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply related
* Re: [PATCH] block: trace completion of all bios.
From: NeilBrown @ 2017-03-23 6:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, linux-raid, dm-devel, Alasdair Kergon,
Mike Snitzer, Shaohua Li, linux-kernel
In-Reply-To: <20170322125149.GA29606@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 901 bytes --]
On Wed, Mar 22 2017, Christoph Hellwig wrote:
> On Wed, Mar 22, 2017 at 01:38:09PM +1100, NeilBrown wrote:
>>
>> Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
>> Now that we have bio_chain(), it is not possible, in general, for a
>> driver to know when the bio is really complete. Only bio_endio()
>> knows that.
>>
>> So move the trace_block_bio_complete() call to bio_endio().
>
> This will cause duplicate events for request based drivers. You'll
> need to have a bio_endio_notrace or similar without that the request
> completion path can call.
Ah... I hadn't noticed that the request completion was the same event
type as the bio completion... Thanks. Also after being processed by the
request handler, bi_sector and bi_size have changed so the trace
messsage would be wrong.
I've sorted that out and will repost.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: unify and streamline the blk-mq make_request implementations V3
From: Jens Axboe @ 2017-03-23 2:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170322190153.12217-1-hch@lst.de>
On 03/22/2017 01:01 PM, Christoph Hellwig wrote:
> A bunch of cleanups to get us a nice I/O submission path.
>
> Changes since V2:
> - really address the comments from Bart
> - address another comment from Bart
>
> Changes since V1:
> - rebase on top of the recent blk_mq_try_issue_directly changes
> - incorporate comments from Bart
Added for 4.12, thanks.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 0/10 v5] block: Fix block device shutdown related races
From: Jens Axboe @ 2017-03-23 2:14 UTC (permalink / raw)
To: Jan Kara
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
On 03/22/2017 06:36 PM, Jan Kara wrote:
> Hello,
>
> this is a series with the remaining patches (on top of 4.11-rc2) to
> fix several different races and issues I've found when testing device
> shutdown and reuse. The first patch fixes possible (theoretical)
> problems when opening of a block device races with shutdown of a
> gendisk structure. Patches 2-8 fix oops that is triggered by
> __blkdev_put() calling inode_detach_wb() too early (the problem
> reported by Thiago). Patches 9 and 10 fix oops due to a bug in gendisk
> code where get_gendisk() can return already freed gendisk structure
> (again triggered by Omar's stress test).
>
> All patches got reviewed by Tejun and also tested by Thiago (thanks!).
> Jens, can you please queue these fixes for the next merge window?
> Thanks!
Done! Thanks a lot for working on this, Jan.
--
Jens Axboe
^ permalink raw reply
* [PATCH 07/10] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8c30b1a7aae5..3ea3bbd921d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
void **slot;
@@ -801,7 +801,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }
static void cgwb_bdi_exit(struct backing_dev_info *bdi)
{
@@ -925,7 +925,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
- cgwb_bdi_destroy(bdi);
+ cgwb_bdi_unregister(bdi);
if (bdi->dev) {
bdi_debug_unregister(bdi);
--
2.10.2
^ permalink raw reply related
* [PATCH 10/10] block: Fix oops scsi_disk_get()
From: Jan Kara @ 2017-03-23 0:37 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():
[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217] ? disk_get_part+0x22/0x1f0
[11863.059217] sd_open+0x39/0x130
[11863.059217] __blkdev_get+0x69/0x430
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? bd_acquire+0x96/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_get+0x126/0x350
[11863.059217] ? _raw_spin_unlock+0x2b/0x40
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_open+0x65/0x80
...
As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.
We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.
Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/genhd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/genhd.c b/block/genhd.c
index a9c516a8b37d..510aac1486cb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1352,7 +1352,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
- kobj = kobject_get(&disk_to_dev(disk)->kobj);
+ kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
--
2.10.2
^ permalink raw reply related
* [PATCH 08/10] block: Fix oops in locked_inode_to_wb_and_lock_list()
From: Jan Kara @ 2017-03-23 0:37 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.
The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.
Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/block_dev.c | 8 ++------
include/linux/writeback.h | 1 +
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..f2d59f143ef4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
spin_lock(&bdev_lock);
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+ /* Detach inode from wb early as bdi_put() may free bdi->wb */
+ inode_detach_wb(inode);
if (bdev->bd_bdi != &noop_backing_dev_info) {
bdi_put(bdev->bd_bdi);
bdev->bd_bdi = &noop_backing_dev_info;
@@ -1875,12 +1877,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);
bdev_write_inode(bdev);
- /*
- * Detaching bdev inode from its wb in __destroy_inode()
- * is too late: the queue which embeds its bdi (along with
- * root wb) can be gone as soon as we put_disk() below.
- */
- inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
static inline void inode_detach_wb(struct inode *inode)
{
if (inode->i_wb) {
+ WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
--
2.10.2
^ permalink raw reply related
* [PATCH 02/10] bdi: Mark congested->bdi as internal
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.
We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 4 +++-
mm/backing-dev.c | 10 +++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
atomic_t refcnt; /* nr of attached wb's and blkg */
#ifdef CONFIG_CGROUP_WRITEBACK
- struct backing_dev_info *bdi; /* the associated bdi */
+ struct backing_dev_info *__bdi; /* the associated bdi, set to NULL
+ * on bdi unregistration. For memcg-wb
+ * internal use only! */
int blkcg_id; /* ID of the associated blkcg */
struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
return NULL;
atomic_set(&new_congested->refcnt, 0);
- new_congested->bdi = bdi;
+ new_congested->__bdi = bdi;
new_congested->blkcg_id = blkcg_id;
goto retry;
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
}
/* bdi might already have been destroyed leaving @congested unlinked */
- if (congested->bdi) {
+ if (congested->__bdi) {
rb_erase(&congested->rb_node,
- &congested->bdi->cgwb_congested_tree);
- congested->bdi = NULL;
+ &congested->__bdi->cgwb_congested_tree);
+ congested->__bdi = NULL;
}
spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
rb_entry(rbn, struct bdi_writeback_congested, rb_node);
rb_erase(rbn, &bdi->cgwb_congested_tree);
- congested->bdi = NULL; /* mark @congested unlinked */
+ congested->__bdi = NULL; /* mark @congested unlinked */
}
spin_unlock_irq(&cgwb_lock);
}
--
2.10.2
^ permalink raw reply related
* [PATCH 09/10] kobject: Export kobject_get_unless_zero()
From: Jan Kara @ 2017-03-23 0:37 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara, Greg Kroah-Hartman
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Make the function available for outside use and fortify it against NULL
kobject.
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/kobject.h | 2 ++
lib/kobject.c | 5 ++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
extern int __must_check kobject_move(struct kobject *, struct kobject *);
extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+ struct kobject *kobj);
extern void kobject_put(struct kobject *kobj);
extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
}
EXPORT_SYMBOL(kobject_get);
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
{
+ if (!kobj)
+ return NULL;
if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
}
+EXPORT_SYMBOL(kobject_get_unless_zero);
/*
* kobject_cleanup - free kobject resources.
--
2.10.2
^ permalink raw reply related
* [PATCH 06/10] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 -
mm/backing-dev.c | 22 +---------------------
2 files changed, 1 insertion(+), 22 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
#ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
- atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
#else
struct bdi_writeback_congested *wb_congested;
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b67be4fc12c4..8c30b1a7aae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
/*
* cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
* blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
- * protected. cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
/**
* wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
- struct backing_dev_info *bdi = wb->bdi;
wb_shutdown(wb);
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
-
- if (atomic_dec_and_test(&bdi->usage_cnt))
- wake_up_all(&cgwb_release_wait);
}
static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
/* we might have raced another instance of this function */
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
- atomic_inc(&bdi->usage_cnt);
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
- atomic_set(&bdi->usage_cnt, 1);
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
@@ -728,18 +720,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
spin_lock_irq(&cgwb_lock);
}
spin_unlock_irq(&cgwb_lock);
-
- /*
- * All cgwb's must be shutdown and released before returning. Drain
- * the usage counter to wait for all cgwb's ever created on @bdi.
- */
- atomic_dec(&bdi->usage_cnt);
- wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
- /*
- * Grab back our reference so that we hold it when @bdi gets
- * re-registered.
- */
- atomic_inc(&bdi->usage_cnt);
}
/**
--
2.10.2
^ permalink raw reply related
* [PATCH 05/10] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 +
mm/backing-dev.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
*/
enum wb_state {
WB_registered, /* bdi_register() was done */
+ WB_shutting_down, /* wb_shutdown() in progress */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e3d56dba4da8..b67be4fc12c4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+ /*
+ * Wait for wb shutdown to finish if someone else is just
+ * running wb_shutdown(). Otherwise we could proceed to wb /
+ * bdi destruction before wb_shutdown() is finished.
+ */
+ wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+ set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ /*
+ * Make sure bit gets cleared after shutdown is finished. Matches with
+ * the barrier provided by test_and_clear_bit() above.
+ */
+ smp_wmb();
+ clear_bit(WB_shutting_down, &wb->state);
}
static void wb_exit(struct bdi_writeback *wb)
@@ -699,12 +712,21 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
void **slot;
+ struct bdi_writeback *wb;
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
spin_lock_irq(&cgwb_lock);
radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
cgwb_kill(*slot);
+
+ while (!list_empty(&bdi->wb_list)) {
+ wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+ wb_shutdown(wb);
+ spin_lock_irq(&cgwb_lock);
+ }
spin_unlock_irq(&cgwb_lock);
/*
--
2.10.2
^ permalink raw reply related
* [PATCH 04/10] bdi: Unify bdi->wb_list handling for root wb_writeback
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Currently root wb_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
wb_writeback structures which get added to the list when created and
removed from it before wb_shutdown().
So move list addition of root bdi_writeback to bdi_register() and list
removal of all wb_writeback structures to wb_shutdown(). That way a
wb_writeback structure is on bdi->wb_list if and only if it can handle
writeback and it will make it easier for us to handle shutdown of all
wb_writeback structures in bdi_unregister().
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 03d4ba27c133..e3d56dba4da8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
return err;
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
spin_unlock_bh(&wb->work_lock);
+ cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
release_work);
struct backing_dev_info *bdi = wb->bdi;
- spin_lock_irq(&cgwb_lock);
- list_del_rcu(&wb->bdi_node);
- spin_unlock_irq(&cgwb_lock);
-
wb_shutdown(wb);
css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
percpu_ref_kill(&wb->refcnt);
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+ spin_lock_irq(&cgwb_lock);
+ list_del_rcu(&wb->bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+}
+
static int cgwb_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css, gfp_t gfp)
{
@@ -766,6 +772,13 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
spin_unlock_irq(&cgwb_lock);
}
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+ spin_lock_irq(&cgwb_lock);
+ list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+ spin_unlock_irq(&cgwb_lock);
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */
static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -793,6 +806,16 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
wb_congested_put(bdi->wb_congested);
}
+static void cgwb_bdi_register(struct backing_dev_info *bdi)
+{
+ list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
+}
+
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+ list_del_rcu(&wb->bdi_node);
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */
int bdi_init(struct backing_dev_info *bdi)
@@ -811,8 +834,6 @@ int bdi_init(struct backing_dev_info *bdi)
ret = cgwb_bdi_init(bdi);
- list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
-
return ret;
}
EXPORT_SYMBOL(bdi_init);
@@ -848,6 +869,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
if (IS_ERR(dev))
return PTR_ERR(dev);
+ cgwb_bdi_register(bdi);
bdi->dev = dev;
bdi_debug_register(bdi, dev_name(dev));
--
2.10.2
^ permalink raw reply related
* [PATCH 03/10] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
memset(wb, 0, sizeof(*wb));
+ if (wb != &bdi->wb)
+ bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
- if (!wb->congested)
- return -ENOMEM;
+ if (!wb->congested) {
+ err = -ENOMEM;
+ goto out_put_bdi;
+ }
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
fprop_local_destroy_percpu(&wb->completions);
out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+ if (wb != &bdi->wb)
+ bdi_put(bdi);
return err;
}
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+ if (wb != &wb->bdi->wb)
+ bdi_put(wb->bdi);
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.10.2
^ permalink raw reply related
* [PATCH 01/10] block: Fix bdi assignment to bdev inode when racing with disk delete
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/block_dev.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
- if (bdev->bd_bdi == &noop_backing_dev_info)
- bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
if (!partno) {
ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
+
+ if (bdev->bd_bdi == &noop_backing_dev_info)
+ bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
- bdi_put(bdev->bd_bdi);
- bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
--
2.10.2
^ permalink raw reply related
* [PATCH 0/10 v5] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-23 0:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
Jan Kara
Hello,
this is a series with the remaining patches (on top of 4.11-rc2) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first patch fixes possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 2-8 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 9 and 10 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).
All patches got reviewed by Tejun and also tested by Thiago (thanks!). Jens,
can you please queue these fixes for the next merge window? Thanks!
Changes since v4:
* Dropped patch fixing possible gendisk shutdown vs blkdev_open race - needs
more thinking
* Added Tejun's ack
Changes since v3:
* Rebased on top of 4.11-rc2
* Reworked patch 2 (block: Fix race of bdev open with gendisk shutdown) based
on Tejun's feedback
* Significantly updated patch 5 (and dropped previous Tejun's ack) to
accommodate for fixes to SCSI re-registration of BDI that went to 4.11-rc2
Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun
Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback
Honza
^ permalink raw reply
* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Jan Kara @ 2017-03-23 0:21 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
Dan Williams, Thiago Jung Bauermann, Tahsin Erdogan,
Omar Sandoval
In-Reply-To: <20170321161907.GB30919@htj.duckdns.org>
Hello Tejun,
> On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed block device inode (and thus
> > bd_acquire() will end up creating new block device inode) however
> > gen_gendisk() will still return the gendisk that is being destroyed.
> > Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> > before we get to get_gendisk() and get_gendisk() will return new gendisk
> > that got allocated for device that reused the device number.
> >
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> >
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> >
> > Also add a warning that will tell us about unexpected inconsistencies
> > between bdi associated with the bdev inode and bdi associated with the
> > disk.
>
> Hmm... let's see if I got this straight.
>
> It used to be that blockdevs are essentially stateless while nobody
> has it open. It acquires all its actual associations during the
> initial open and loses them on the last release. The immediate
> release semantics got us into trouble because upper layers had nothing
> to serve as the proper sever point when the underlying qblock device
> goes away.
>
> So, we decided that bdi should serve that purpose, which makes perfect
> sense as it's what upper layers talk to when they wanna reach to the
> block device, so we decoupled its lifetime from the request_queue and
> implements sever there.
>
> Now that bdis are persistent, we can solve bdev-access-while-not-open
> problem by making bdi point to the respective bdi from the beginning
> until it's released, which means that bdevs are now stateful objects
> which are associated with specific bdis and thus request_queues.
Yes, perfect summary.
> Because there are multiple ways that these objects are looked up and
> handled, now we can get into situations where the request_queue
> (gendisk) we look up during open may not match the bdi that a bdev is
> associated with, and this patch solves that problem by detecting the
> conditions where these mismatches would take place. More
> specifically, we don't want to be using a dead bdev during open.
Yes.
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 53e2389ae4d4..5ec8750f5332 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> > if (!partno) {
> > ret = -ENXIO;
> > bdev->bd_part = disk_get_part(disk, partno);
> > - if (!bdev->bd_part)
> > + if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> > + inode_unhashed(bdev->bd_inode))
> > goto out_clear;
>
> This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
> whether the former is meaningful here. GENHD flag updates are not
> synchronized when seen from outside the party which is updating it.
> The actual synchronization against dead device should happen inside
> disk->fops->open().
Hum, now that I look at the code again it seems my patch is still racy if
we get fresh inode (lookup_bdev() already allocated new inode and returned
it to us) but get_gendisk() returned gendisk that is shutting down but
we'll still see GENHD_FL_UP set and thus associate new inode with gendisk
that is being destroyed.
> > ret = 0;
> > @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> > bdev->bd_contains = whole;
> > bdev->bd_part = disk_get_part(disk, partno);
> > if (!(disk->flags & GENHD_FL_UP) ||
> > - !bdev->bd_part || !bdev->bd_part->nr_sects) {
> > + !bdev->bd_part || !bdev->bd_part->nr_sects ||
> > + inode_unhashed(bdev->bd_inode)) {
> > ret = -ENXIO;
> > goto out_clear;
> > }
>
> Which is different from here. As the device is already open, we're
> just incrementing the ref before granting another open without
> consulting the driver. As the device is open already, whether
> granting a new ref or not can't break anything. Block layer is just
> doing a best effort thing to not give out new ref on a dead one with
> the UP test, which is completely fine.
Yeah.
> > @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >
> > if (bdev->bd_bdi == &noop_backing_dev_info)
> > bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> > + else
> > + WARN_ON_ONCE(bdev->bd_bdi !=
> > + disk->queue->backing_dev_info);
>
> While I can't think of cases where this wouldn't work, it seems a bit
> roundabout to me.
>
> * A bdi is the object which knows which request_queue it's associated
> and when that association dies.
So bdi is associated with a request_queue but it does not have any
direct pointer or reference to it. It is the request_queue that points to
bdi. However you are right that bdi_unregister() call is telling bdi that
the device is going away.
> * bdev is permanently associated with a bdi.
Yes.
> So, it's kinda weird to look up the request_queue again on each open
> and then verify that the bd_bdi and request_queue match. I think it
> would make more sense to skip disk look up and just go through bdi to
> determine the associated request_queue as that's the primary nexus
> object tying everything up now.
So I was thinking about this as well. The problem is that you cannot really
keep a reference to request_queue (or gendisk) from block device inode or
bdi as that could unexpectedly pin the driver in memory after user thinks
everything should be cleaned up. So I think we really do have to establish
the connection between block device inode and request_queue / gendisk on
opening of the block device and drop the reference when the block device is
closed.
> That said, it's totally possible that that would take too much
> restructuring to implement right now with everything else, but if we
> think that that's the right long term direction, I think it would make
> more sense to just test whether bdev->bd_bdi matches the disk right
> after looking up the disk and fail the open if not. That's the
> ultimate condition we wanna avoid after all and it also would ease
> replacing it with going through bdi instead of looking up again.
The problem with this is that you could still associate fresh block device
inode with a bdi corresponding to gendisk that is going away (and that has
already went through bdev_unhash_inode()). Such block device inode may than
stay in case associated with that bdi even after the device number gets
reused and that will cause false the check for matching bdi to fail ever
after that moment.
So since I'm not actually able to trigger any of these races in my testing,
I guess I'll just drop this patch (it isn't needed by anything else in the
series) and let the reset of the series be merged. When I come up with some
better solution to this problem, I'll send a fix separately. Thanks for
review!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [Nbd] [RFC PATCH 1/1] nbd: replace kill_bdev() with __invalidate_device()
From: Josef Bacik @ 2017-03-22 21:43 UTC (permalink / raw)
To: Ming Lin
Cc: nbd-general@lists.sourceforge.net, Ratna Manoj Bolla, lkml,
linux-block@vger.kernel.org, jianshu.ljs@alibaba-inc.com,
LIU, Fei, Markus Pargmann, xiongwei.jiang@alibaba-inc.com
In-Reply-To: <CAF1ivSb2nwBAKjbsFhca3cvJk=-B_gtu-ZdqkyAetJSer5CuYw@mail.gmail.com>
SGV5IHNvcnJ5IEkganVzdCBnb3QgYmFjayBmcm9tIExTRiwgSeKAmWxsIGxvb2sgYXQgdGhpcyBp
biB0aGUgbW9ybmluZy4gIFRoYW5rcywNCg0KSm9zZWYNCg0KT24gMy8yMi8xNywgNDo0OCBQTSwg
Ik1pbmcgTGluIiA8bWxpbkBrZXJuZWwub3JnPiB3cm90ZToNCg0KT24gTW9uLCBNYXIgMjAsIDIw
MTcgYXQgMzo1OCBQTSwgTWluZyBMaW4gPG1saW5Aa2VybmVsLm9yZz4gd3JvdGU6DQo+IEZyb206
IFJhdG5hIE1hbm9qIEJvbGxhIDxtYW5vai5ickBnbWFpbC5jb20+DQo+DQo+IFdoZW4gYSBmaWxl
c3lzdGVtIGlzIG1vdW50ZWQgb24gYSBuYmQgZGV2aWNlIGFuZCBvbiBhIGRpc2Nvbm5lY3QsIGJl
Y2F1c2UNCj4gb2Yga2lsbF9iZGV2KCksIGFuZCByZXNldHRpbmcgYmRldiBzaXplIHRvIHplcm8s
IGJ1ZmZlcl9oZWFkIG1hcHBpbmdzIGFyZQ0KPiBnZXR0aW5nIGRlc3Ryb3llZCB1bmRlciBtb3Vu
dGVkIGZpbGVzeXN0ZW0uDQo+DQo+IEFmdGVyIGEgYmRldiBzaXplIHJlc2V0KGkuZSBiZGV2LT5i
ZF9pbm9kZS0+aV9zaXplID0gMCkgb24gYSBkaXNjb25uZWN0LA0KPiBmb2xsb3dlZCBieSBhIHN5
c191bW91bnQoKSwNCj4gICAgICAgICBnZW5lcmljX3NodXRkb3duX3N1cGVyKCktPi4uLg0KPiAg
ICAgICAgIC0+X19zeW5jX2Jsb2NrZGV2KCktPi4uLg0KPiAgICAgICAgIC1ibGtkZXZfd3JpdGVw
YWdlcygpLT4uLi4NCj4gICAgICAgICAtPmRvX2ludmFsaWRhdGVwYWdlKCktPi4uLg0KPiAgICAg
ICAgIC1kaXNjYXJkX2J1ZmZlcigpICAgaXMgZGlzY2FyZGluZyBzdXBlcmJsb2NrIGJ1ZmZlcl9o
ZWFkIGFzc3VtZWQNCj4gdG8gYmUgaW4gbWFwcGVkIHN0YXRlIGJ5IGV4dDRfY29tbWl0X3N1cGVy
KCkuDQo+DQo+IFttbGluOiBwb3J0ZWQgdG8gNC4xMS1yYzJdDQo+IFNpZ25lZC1vZmYtYnk6IFJh
dG5hIE1hbm9qIEJvbGxhIDxtYW5vai5ickBnbWFpbC5jb20NCj4gLS0tDQo+ICBkcml2ZXJzL2Js
b2NrL25iZC5jIHwgOCArKysrKystLQ0KPiAgMSBmaWxlIGNoYW5nZWQsIDYgaW5zZXJ0aW9ucygr
KSwgMiBkZWxldGlvbnMoLSkNCj4NCj4gZGlmZiAtLWdpdCBhL2RyaXZlcnMvYmxvY2svbmJkLmMg
Yi9kcml2ZXJzL2Jsb2NrL25iZC5jDQo+IGluZGV4IGNiNGNjZmMuLmE2YTM2NDMgMTAwNjQ0DQo+
IC0tLSBhL2RyaXZlcnMvYmxvY2svbmJkLmMNCj4gKysrIGIvZHJpdmVycy9ibG9jay9uYmQuYw0K
PiBAQCAtMTI1LDcgKzEyNSw4IEBAIHN0YXRpYyBjb25zdCBjaGFyICpuYmRjbWRfdG9fYXNjaWko
aW50IGNtZCkNCj4NCj4gIHN0YXRpYyBpbnQgbmJkX3NpemVfY2xlYXIoc3RydWN0IG5iZF9kZXZp
Y2UgKm5iZCwgc3RydWN0IGJsb2NrX2RldmljZSAqYmRldikNCj4gIHsNCj4gLSAgICAgICBiZF9z
ZXRfc2l6ZShiZGV2LCAwKTsNCj4gKyAgICAgICBpZiAoYmRldi0+YmRfb3BlbmVycyA8PSAxKQ0K
PiArICAgICAgICAgICAgICAgYmRfc2V0X3NpemUoYmRldiwgMCk7DQo+ICAgICAgICAgc2V0X2Nh
cGFjaXR5KG5iZC0+ZGlzaywgMCk7DQo+ICAgICAgICAga29iamVjdF91ZXZlbnQoJm5iZF90b19k
ZXYobmJkKS0+a29iaiwgS09CSl9DSEFOR0UpOw0KPg0KPiBAQCAtNjAzLDYgKzYwNCw4IEBAIHN0
YXRpYyB2b2lkIG5iZF9yZXNldChzdHJ1Y3QgbmJkX2RldmljZSAqbmJkKQ0KPg0KPiAgc3RhdGlj
IHZvaWQgbmJkX2JkZXZfcmVzZXQoc3RydWN0IGJsb2NrX2RldmljZSAqYmRldikNCj4gIHsNCj4g
KyAgICAgICBpZiAoYmRldi0+YmRfb3BlbmVycyA+IDEpDQo+ICsgICAgICAgICAgICAgICByZXR1
cm47DQo+ICAgICAgICAgc2V0X2RldmljZV9ybyhiZGV2LCBmYWxzZSk7DQo+ICAgICAgICAgYmRl
di0+YmRfaW5vZGUtPmlfc2l6ZSA9IDA7DQo+ICAgICAgICAgaWYgKG1heF9wYXJ0ID4gMCkgew0K
PiBAQCAtNjY2LDcgKzY2OSw4IEBAIHN0YXRpYyBpbnQgbmJkX2NsZWFyX3NvY2soc3RydWN0IG5i
ZF9kZXZpY2UgKm5iZCwgc3RydWN0IGJsb2NrX2RldmljZSAqYmRldikNCj4gIHsNCj4gICAgICAg
ICBzb2NrX3NodXRkb3duKG5iZCk7DQo+ICAgICAgICAgbmJkX2NsZWFyX3F1ZShuYmQpOw0KPiAt
ICAgICAgIGtpbGxfYmRldihiZGV2KTsNCj4gKw0KPiArICAgICAgIF9faW52YWxpZGF0ZV9kZXZp
Y2UoYmRldiwgdHJ1ZSk7DQo+ICAgICAgICAgbmJkX2JkZXZfcmVzZXQoYmRldik7DQo+ICAgICAg
ICAgLyoNCj4gICAgICAgICAgKiBXZSB3YW50IHRvIGdpdmUgdGhlIHJ1biB0aHJlYWQgYSBjaGFu
Y2UgdG8gd2FpdCBmb3IgZXZlcnlib2R5DQo+IC0tDQo+IDEuOC4zLjENCg0KSGkgSm9zZWYsDQoN
CkFueSBjb21tZW50cz8NCg0KVGhhbmtzLA0KTWluZw0KDQoNCg==
^ permalink raw reply
* Re: [Nbd] [RFC PATCH 1/1] nbd: replace kill_bdev() with __invalidate_device()
From: Ming Lin @ 2017-03-22 20:48 UTC (permalink / raw)
To: Josef Bacik
Cc: nbd-general, Ratna Manoj Bolla, lkml, linux-block, jianshu.ljs,
LIU, Fei, Markus Pargmann, xiongwei.jiang
In-Reply-To: <1490050729-3578-2-git-send-email-mlin@kernel.org>
On Mon, Mar 20, 2017 at 3:58 PM, Ming Lin <mlin@kernel.org> wrote:
> From: Ratna Manoj Bolla <manoj.br@gmail.com>
>
> When a filesystem is mounted on a nbd device and on a disconnect, because
> of kill_bdev(), and resetting bdev size to zero, buffer_head mappings are
> getting destroyed under mounted filesystem.
>
> After a bdev size reset(i.e bdev->bd_inode->i_size = 0) on a disconnect,
> followed by a sys_umount(),
> generic_shutdown_super()->...
> ->__sync_blockdev()->...
> -blkdev_writepages()->...
> ->do_invalidatepage()->...
> -discard_buffer() is discarding superblock buffer_head assumed
> to be in mapped state by ext4_commit_super().
>
> [mlin: ported to 4.11-rc2]
> Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com
> ---
> drivers/block/nbd.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index cb4ccfc..a6a3643 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -125,7 +125,8 @@ static const char *nbdcmd_to_ascii(int cmd)
>
> static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
> {
> - bd_set_size(bdev, 0);
> + if (bdev->bd_openers <= 1)
> + bd_set_size(bdev, 0);
> set_capacity(nbd->disk, 0);
> kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
>
> @@ -603,6 +604,8 @@ static void nbd_reset(struct nbd_device *nbd)
>
> static void nbd_bdev_reset(struct block_device *bdev)
> {
> + if (bdev->bd_openers > 1)
> + return;
> set_device_ro(bdev, false);
> bdev->bd_inode->i_size = 0;
> if (max_part > 0) {
> @@ -666,7 +669,8 @@ static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
> {
> sock_shutdown(nbd);
> nbd_clear_que(nbd);
> - kill_bdev(bdev);
> +
> + __invalidate_device(bdev, true);
> nbd_bdev_reset(bdev);
> /*
> * We want to give the run thread a chance to wait for everybody
> --
> 1.8.3.1
Hi Josef,
Any comments?
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH rfc 03/10] nvme-pci: open-code polling logic in nvme_poll
From: Christoph Hellwig @ 2017-03-22 19:09 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-block, linux-rdma, target-devel, linux-nvme
In-Reply-To: <1489065402-14757-4-git-send-email-sagi@grimberg.me>
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply
* Re: [PATCH rfc 02/10] nvme-pci: Add budget to __nvme_process_cq
From: Christoph Hellwig @ 2017-03-22 19:08 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-block, linux-nvme, linux-rdma, target-devel
In-Reply-To: <1489065402-14757-3-git-send-email-sagi@grimberg.me>
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH rfc 01/10] nvme-pci: Split __nvme_process_cq to poll and handle
From: Christoph Hellwig @ 2017-03-22 19:07 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-block, linux-rdma, target-devel, linux-nvme
In-Reply-To: <1489065402-14757-2-git-send-email-sagi@grimberg.me>
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply
* [PATCH 5/5] blk-mq: streamline blk_mq_make_request
From: Christoph Hellwig @ 2017-03-22 19:01 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170322190153.12217-1-hch@lst.de>
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos. Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 51 +++++++++++++++------------------------------------
1 file changed, 15 insertions(+), 36 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a9d2e02a8e02..45b9bebf8436 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1531,16 +1531,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
cookie = request_to_qc_t(data.hctx, rq);
+ plug = current->plug;
if (unlikely(is_flush_fua)) {
- if (q->elevator)
- goto elv_insert;
blk_mq_bio_to_request(rq, bio);
- blk_insert_flush(rq);
- goto run_queue;
- }
-
- plug = current->plug;
- if (plug && q->nr_hw_queues == 1) {
+ if (q->elevator) {
+ blk_mq_sched_insert_request(rq, false, true, true,
+ true);
+ } else {
+ blk_insert_flush(rq);
+ blk_mq_run_hw_queue(data.hctx, true);
+ }
+ } else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
blk_mq_bio_to_request(rq, bio);
@@ -1559,8 +1560,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
else
last = list_entry_rq(plug->mq_list.prev);
- blk_mq_put_ctx(data.ctx);
-
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
@@ -1568,7 +1567,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
list_add_tail(&rq->queuelist, &plug->mq_list);
- goto done;
} else if (plug && !blk_queue_nomerges(q)) {
blk_mq_bio_to_request(rq, bio);
@@ -1585,39 +1583,20 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
list_del_init(&same_queue_rq->queuelist);
list_add_tail(&rq->queuelist, &plug->mq_list);
- blk_mq_put_ctx(data.ctx);
if (same_queue_rq)
blk_mq_try_issue_directly(data.hctx, same_queue_rq,
&cookie);
- goto done;
- } else if (is_sync) {
+ } else if (q->nr_hw_queues > 1 && is_sync) {
blk_mq_bio_to_request(rq, bio);
-
- blk_mq_put_ctx(data.ctx);
blk_mq_try_issue_directly(data.hctx, rq, &cookie);
- goto done;
- }
-
- if (q->elevator) {
-elv_insert:
- blk_mq_put_ctx(data.ctx);
+ } else if (q->elevator) {
blk_mq_bio_to_request(rq, bio);
- blk_mq_sched_insert_request(rq, false, true,
- !is_sync || is_flush_fua, true);
- goto done;
- }
- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
- /*
- * For a SYNC request, send it to the hardware immediately. For
- * an ASYNC request, just ensure that we run it later on. The
- * latter allows for merging opportunities and more efficient
- * dispatching.
- */
-run_queue:
- blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+ blk_mq_sched_insert_request(rq, false, true, true, true);
+ } else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+ blk_mq_run_hw_queue(data.hctx, true);
}
+
blk_mq_put_ctx(data.ctx);
-done:
return cookie;
}
--
2.11.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox