md related oops triggered in bdev_inode_switch

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* md related oops triggered in bdev_inode_switch_bdi
@ 2011-08-31  6:22 NeilBrown
  2011-09-01  3:30 ` Wu Fengguang
  0 siblings, 1 reply; 5+ messages in thread
From: NeilBrown @ 2011-08-31  6:22 UTC (permalink / raw)
  To: Christoph Hellwig, Hugh Dickins, Andrew Morton, Wu Fengguang; +Cc: lkml

Hi Christoph et. al.,

 My testing recently triggered an oops in bdi_lock_two called from
 bdev_inode_switch_bdi.
 The bdi and the request_queue that contains it had been freed.

 This happens with md which can free the md device and request queue
 immediately after last close.

 It seems that this is caused by your patch f758eeabeb96f8.
 Prior to that the 'old' bdi was never dereferenced in
 bdev_inode_switch_bdi.  Now it is.

 I think we can fix that by simply moving the call to bdev_inode_switch_bdi
 before the call to ->release as in the patch below.

 Do you see any problem with this patch?

Thanks,
NeilBrown

>From bcc5851cbb6876c97cce214feddd0ec092f7d71c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 31 Aug 2011 15:37:22 +1000
Subject: [PATCH] Avoid dereferencing a 'request_queue' after last close.

On the last close of an 'md' device which as been stopped, the device
is destroyed and in particular the request_queue is freed.  The free
is done in a separate thread so it might happen a short time later.

__blkdev_put calls bdev_inode_switch_bdi *after* ->release has been
called.

Since commit f758eeabeb96f878c860e8f110f94ec8820822a9
bdev_inode_switch_bdi will dereference the 'old' bdi, which lives
inside a request_queue, to get a spin lock.  This causes the last
close on an md device to sometime take a spin_lock which lives in
freed memory - which results in an oops.

So move the called to bdev_inode_switch_bdi before the call to
->release.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/fs/block_dev.c b/fs/block_dev.c
index ff77262..d8a753f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1430,6 +1430,12 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
 	}
+	if (!bdev->bd_openers)
+		/* ->release can cause the old bdi to disappear,
+		 * so must switch it out first
+		 */
+		bdev_inode_switch_bdi(bdev->bd_inode,
+					&default_backing_dev_info);
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
 			ret = disk->fops->release(disk, mode);
@@ -1442,8 +1448,6 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
-		bdev_inode_switch_bdi(bdev->bd_inode,
-					&default_backing_dev_info);
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: md related oops triggered in bdev_inode_switch_bdi
  2011-08-31  6:22 md related oops triggered in bdev_inode_switch_bdi NeilBrown
@ 2011-09-01  3:30 ` Wu Fengguang
  2011-09-01  5:49   ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: Wu Fengguang @ 2011-09-01  3:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, lkml

Hi Neil,

> Subject: [PATCH] Avoid dereferencing a 'request_queue' after last close.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

with comments below.

> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1430,6 +1430,12 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>  		sync_blockdev(bdev);
>  		kill_bdev(bdev);
>  	}
> +	if (!bdev->bd_openers)
> +		/* ->release can cause the old bdi to disappear,
> +		 * so must switch it out first
> +		 */
> +		bdev_inode_switch_bdi(bdev->bd_inode,
> +					&default_backing_dev_info);
>  	if (bdev->bd_contains == bdev) {
>  		if (disk->fops->release)
>  			ret = disk->fops->release(disk, mode);

The bdev_inode_switch_bdi() call can be further moved into the
previous if block, like this:

        if (!--bdev->bd_openers) {
                WARN_ON_ONCE(bdev->bd_holders);
                sync_blockdev(bdev);
                kill_bdev(bdev);
+
+               /* ->release can cause the old bdi to disappear,
+                * so must switch it out first
+                */
+               bdev_inode_switch_bdi(bdev->bd_inode,
+                                       &default_backing_dev_info);
        }

Then it's obvious that kill_bdev() will truncate all inode pages
and there won't be further interactions with dirty writes.

Although there are dozens of disk->fops->release functions, however
it's very unlikely they need to access some inode on top of the disk
(which is illogical thing).

So I don't see any problems. It makes sense to push it to next for
broader test ASAP. Will you do it, or me?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: md related oops triggered in bdev_inode_switch_bdi
  2011-09-01  3:30 ` Wu Fengguang
@ 2011-09-01  5:49   ` NeilBrown
  2011-09-09  8:56     ` Lin Ming
  0 siblings, 1 reply; 5+ messages in thread
From: NeilBrown @ 2011-09-01  5:49 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, lkml

On Thu, 1 Sep 2011 11:30:56 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> Hi Neil,
> 
> > Subject: [PATCH] Avoid dereferencing a 'request_queue' after last close.
> 
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks.


> 
> with comments below.
> 
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1430,6 +1430,12 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
> >  		sync_blockdev(bdev);
> >  		kill_bdev(bdev);
> >  	}
> > +	if (!bdev->bd_openers)
> > +		/* ->release can cause the old bdi to disappear,
> > +		 * so must switch it out first
> > +		 */
> > +		bdev_inode_switch_bdi(bdev->bd_inode,
> > +					&default_backing_dev_info);
> >  	if (bdev->bd_contains == bdev) {
> >  		if (disk->fops->release)
> >  			ret = disk->fops->release(disk, mode);
> 
> The bdev_inode_switch_bdi() call can be further moved into the
> previous if block, like this:
> 
>         if (!--bdev->bd_openers) {
>                 WARN_ON_ONCE(bdev->bd_holders);
>                 sync_blockdev(bdev);
>                 kill_bdev(bdev);
> +
> +               /* ->release can cause the old bdi to disappear,
> +                * so must switch it out first
> +                */
> +               bdev_inode_switch_bdi(bdev->bd_inode,
> +                                       &default_backing_dev_info);
>         }

Yes, and obvious improvement now that you have pointed it out - thanks.


> 
> Then it's obvious that kill_bdev() will truncate all inode pages
> and there won't be further interactions with dirty writes.
> 
> Although there are dozens of disk->fops->release functions, however
> it's very unlikely they need to access some inode on top of the disk
> (which is illogical thing).
> 
> So I don't see any problems. It makes sense to push it to next for
> broader test ASAP. Will you do it, or me?

I've just push it into my for-next.  If I heard nothing else by mid next week
I'll push it to Linus

Thanks,
NeilBrown


> 
> Thanks,
> Fengguang


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: md related oops triggered in bdev_inode_switch_bdi
  2011-09-01  5:49   ` NeilBrown
@ 2011-09-09  8:56     ` Lin Ming
  2011-09-11  7:01       ` Sitsofe Wheeler
  0 siblings, 1 reply; 5+ messages in thread
From: Lin Ming @ 2011-09-09  8:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Wu Fengguang, Christoph Hellwig, Hugh Dickins, Andrew Morton,
	lkml

On Thu, Sep 1, 2011 at 1:49 PM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 1 Sep 2011 11:30:56 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
>> Hi Neil,
>>
>> > Subject: [PATCH] Avoid dereferencing a 'request_queue' after last close.
>>
>> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>
> Thanks.
>
>
>>
>> with comments below.
>>
>> > --- a/fs/block_dev.c
>> > +++ b/fs/block_dev.c
>> > @@ -1430,6 +1430,12 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
>> >             sync_blockdev(bdev);
>> >             kill_bdev(bdev);
>> >     }
>> > +   if (!bdev->bd_openers)
>> > +           /* ->release can cause the old bdi to disappear,
>> > +            * so must switch it out first
>> > +            */
>> > +           bdev_inode_switch_bdi(bdev->bd_inode,
>> > +                                   &default_backing_dev_info);
>> >     if (bdev->bd_contains == bdev) {
>> >             if (disk->fops->release)
>> >                     ret = disk->fops->release(disk, mode);
>>
>> The bdev_inode_switch_bdi() call can be further moved into the
>> previous if block, like this:
>>
>>         if (!--bdev->bd_openers) {
>>                 WARN_ON_ONCE(bdev->bd_holders);
>>                 sync_blockdev(bdev);
>>                 kill_bdev(bdev);
>> +
>> +               /* ->release can cause the old bdi to disappear,
>> +                * so must switch it out first
>> +                */
>> +               bdev_inode_switch_bdi(bdev->bd_inode,
>> +                                       &default_backing_dev_info);
>>         }
>
> Yes, and obvious improvement now that you have pointed it out - thanks.
>
>
>>
>> Then it's obvious that kill_bdev() will truncate all inode pages
>> and there won't be further interactions with dirty writes.
>>
>> Although there are dozens of disk->fops->release functions, however
>> it's very unlikely they need to access some inode on top of the disk
>> (which is illogical thing).
>>
>> So I don't see any problems. It makes sense to push it to next for
>> broader test ASAP. Will you do it, or me?
>
> I've just push it into my for-next.  If I heard nothing else by mid next week
> I'll push it to Linus

Just FYI,
this patch also fixes below bug which can be reproduced by unplug the
usb disk before umount.

BUG blkdev_queue: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xffff880106640308-0xffff880106640309. First byte 0x6c instead of 0x6b
INFO: Allocated in blk_alloc_queue_node+0x1f/0x1a9 age=5794 cpu=0 pid=3449
        __slab_alloc+0x503/0x56e
        kmem_cache_alloc_node+0x61/0x18e
        blk_alloc_queue_node+0x1f/0x1a9
        blk_init_queue_node+0x24/0x61
        blk_init_queue+0xc/0xe
        __scsi_alloc_queue+0x21/0x145
        scsi_alloc_queue+0x18/0x64
        scsi_alloc_sdev+0x1bb/0x282
        scsi_probe_and_add_lun+0x13b/0xb6a
        __scsi_scan_target+0xc5/0x60c
        scsi_scan_channel+0x58/0x80
        scsi_scan_host_selected+0xed/0x136
        do_scsi_scan_host+0x6b/0x70
        scsi_scan_host+0x1a0/0x1c5
        usb_stor_scan_thread+0x16d/0x17b
        kthread+0x7d/0x85
INFO: Freed in blk_release_queue+0x60/0x65 age=2267 cpu=1 pid=3484
        __slab_free+0x2c/0x30d
        kmem_cache_free+0xbc/0x127
        blk_release_queue+0x60/0x65
        kobject_release+0x51/0x67
        kref_put+0x43/0x4d
        kobject_put+0x47/0x4b
        blk_put_queue+0x10/0x12
        scsi_device_dev_release_usercontext+0xbf/0x10a
        execute_in_process_context+0x2a/0x61
        scsi_device_dev_release+0x17/0x19
        device_release+0x49/0x7e
        kobject_release+0x51/0x67
        kref_put+0x43/0x4d
        kobject_put+0x47/0x4b
        put_device+0x12/0x14
        scsi_device_put+0x3d/0x42

Thanks,
Lin Ming

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: md related oops triggered in bdev_inode_switch_bdi
  2011-09-09  8:56     ` Lin Ming
@ 2011-09-11  7:01       ` Sitsofe Wheeler
  0 siblings, 0 replies; 5+ messages in thread
From: Sitsofe Wheeler @ 2011-09-11  7:01 UTC (permalink / raw)
  To: Lin Ming
  Cc: NeilBrown, Wu Fengguang, Christoph Hellwig, Hugh Dickins,
	Andrew Morton, lkml

On Fri, Sep 09, 2011 at 04:56:30PM +0800, Lin Ming wrote:
> On Thu, Sep 1, 2011 at 1:49 PM, NeilBrown <neilb@suse.de> wrote:
> > On Thu, 1 Sep 2011 11:30:56 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> >> Hi Neil,
> >>
> >> > Subject: [PATCH] Avoid dereferencing a 'request_queue' after last close.
> >>
> >> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> >
> > Thanks.
 
> Just FYI,
> this patch also fixes below bug which can be reproduced by unplug the
> usb disk before umount.

Agreed - this fixes the oops I was seeing pulling out a USB key with a
mounted FAT32 filesystem (https://lkml.org/lkml/2011/9/5/339 ). Thanks
to Yang RuiRui and akpm for point me to this patch.

Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>

-- 
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-09-11  7:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-31  6:22 md related oops triggered in bdev_inode_switch_bdi NeilBrown
2011-09-01  3:30 ` Wu Fengguang
2011-09-01  5:49   ` NeilBrown
2011-09-09  8:56     ` Lin Ming
2011-09-11  7:01       ` Sitsofe Wheeler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox