Linux block layer
 help / color / mirror / Atom feed
From: Tejun Heo <tj@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>,
	Tahsin Erdogan <tahsin@google.com>,
	Omar Sandoval <osandov@osandov.com>
Subject: Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
Date: Tue, 21 Mar 2017 12:19:07 -0400	[thread overview]
Message-ID: <20170321161907.GB30919@htj.duckdns.org> (raw)
In-Reply-To: <20170313151410.5586-3-jack@suse.cz>

Hello, Jan.

On Mon, Mar 13, 2017 at 04:14:01PM +0100, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> gen_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
> 
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).
> 
> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.
> 
> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.

Hmm... let's see if I got this straight.

It used to be that blockdevs are essentially stateless while nobody
has it open.  It acquires all its actual associations during the
initial open and loses them on the last release.  The immediate
release semantics got us into trouble because upper layers had nothing
to serve as the proper sever point when the underlying qblock device
goes away.

So, we decided that bdi should serve that purpose, which makes perfect
sense as it's what upper layers talk to when they wanna reach to the
block device, so we decoupled its lifetime from the request_queue and
implements sever there.

Now that bdis are persistent, we can solve bdev-access-while-not-open
problem by making bdi point to the respective bdi from the beginning
until it's released, which means that bdevs are now stateful objects
which are associated with specific bdis and thus request_queues.

Because there are multiple ways that these objects are looked up and
handled, now we can get into situations where the request_queue
(gendisk) we look up during open may not match the bdi that a bdev is
associated with, and this patch solves that problem by detecting the
conditions where these mismatches would take place.  More
specifically, we don't want to be using a dead bdev during open.

If I misunderstood something, please let me know.

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -			if (!bdev->bd_part)
> +			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> +			    inode_unhashed(bdev->bd_inode))
>  				goto out_clear;

This adds both GENHD_FL_UP and unhashed testing; however, I'm not sure
whether the former is meaningful here.  GENHD flag updates are not
synchronized when seen from outside the party which is updating it.
The actual synchronization against dead device should happen inside
disk->fops->open().

>  			ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  			bdev->bd_contains = whole;
>  			bdev->bd_part = disk_get_part(disk, partno);
>  			if (!(disk->flags & GENHD_FL_UP) ||
> -			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
> +			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
> +			    inode_unhashed(bdev->bd_inode)) {
>  				ret = -ENXIO;
>  				goto out_clear;
>  			}

Which is different from here.  As the device is already open, we're
just incrementing the ref before granting another open without
consulting the driver.  As the device is open already, whether
granting a new ref or not can't break anything.  Block layer is just
doing a best effort thing to not give out new ref on a dead one with
the UP test, which is completely fine.

> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  
>  		if (bdev->bd_bdi == &noop_backing_dev_info)
>  			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> +		else
> +			WARN_ON_ONCE(bdev->bd_bdi !=
> +				     disk->queue->backing_dev_info);

While I can't think of cases where this wouldn't work, it seems a bit
roundabout to me.

* A bdi is the object which knows which request_queue it's associated
  and when that association dies.

* bdev is permanently associated with a bdi.

So, it's kinda weird to look up the request_queue again on each open
and then verify that the bd_bdi and request_queue match.  I think it
would make more sense to skip disk look up and just go through bdi to
determine the associated request_queue as that's the primary nexus
object tying everything up now.

That said, it's totally possible that that would take too much
restructuring to implement right now with everything else, but if we
think that that's the right long term direction, I think it would make
more sense to just test whether bdev->bd_bdi matches the disk right
after looking up the disk and fail the open if not.  That's the
ultimate condition we wanna avoid after all and it also would ease
replacing it with going through bdi instead of looking up again.

Thanks.

-- 
tejun

  reply	other threads:[~2017-03-21 16:19 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-13 15:13 [PATCH 0/11 v4] block: Fix block device shutdown related races Jan Kara
2017-03-13 15:14 ` [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete Jan Kara
2017-03-13 15:14 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
2017-03-21 16:19   ` Tejun Heo [this message]
2017-03-23  0:21     ` Jan Kara
2017-03-23 15:38       ` Tejun Heo
2017-11-17  6:51   ` Hou Tao
2017-11-20 16:43     ` Jan Kara
2017-11-24  3:12       ` Hou Tao
2017-03-13 15:14 ` [PATCH 03/11] bdi: Mark congested->bdi as internal Jan Kara
2017-03-13 15:14 ` [PATCH 04/11] bdi: Make wb->bdi a proper reference Jan Kara
2017-03-13 15:14 ` [PATCH 05/11] bdi: Unify bdi->wb_list handling for root wb_writeback Jan Kara
2017-03-21 16:21   ` Tejun Heo
2017-03-13 15:14 ` [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() Jan Kara
2017-03-13 15:14 ` [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister() Jan Kara
2017-03-13 15:14 ` [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() Jan Kara
2017-03-13 15:14 ` [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list() Jan Kara
2017-03-13 15:14 ` [PATCH 10/11] kobject: Export kobject_get_unless_zero() Jan Kara
2017-03-13 15:14 ` [PATCH 11/11] block: Fix oops scsi_disk_get() Jan Kara
2017-03-13 18:10 ` [PATCH 0/11 v4] block: Fix block device shutdown related races Dan Williams
2017-03-21  0:57 ` Thiago Jung Bauermann
  -- strict thread matches above, loose matches on Subject: below --
2017-03-06 16:33 [PATCH 0/11 v3] " Jan Kara
2017-03-06 16:33 ` [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown Jan Kara
2017-03-06 22:04   ` Tejun Heo
2017-03-07 11:14     ` Jan Kara
2017-03-07 19:43       ` Tejun Heo
2017-03-08 14:25         ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170321161907.GB30919@htj.duckdns.org \
    --to=tj@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bauerman@linux.vnet.ibm.com \
    --cc=dan.j.williams@intel.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-block@vger.kernel.org \
    --cc=osandov@osandov.com \
    --cc=tahsin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox