linux-fsdevel.vger.kernel.org archive mirror
* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
       [not found]           ` <1299516418.15258.4.camel@mulgrave.site>
@ 2011-03-07 22:44             ` NeilBrown
  2011-03-07 22:56               ` James Bottomley
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2011-03-07 22:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: device-mapper development, Jens Axboe, linux-raid, linux-kernel,
	Christoph Hellwig, linux-fsdevel

(linux-fsdevel added - seems relevant)

On Mon, 07 Mar 2011 10:46:58 -0600 James Bottomley <James.Bottomley@suse.de>
wrote:

> On Sun, 2011-03-06 at 21:22 -0700, Andrew Patterson wrote:
> > On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote:
> > >  Would you be uncomfortable if I asked Linus to revert both my fix and your
> > >  original patch??
> > 
> > James Bottomley wanted me to put this functionality in. I have no
> > problem with reverting it myself, especially if it is causing other
> > problems.  I would have to say that you need to ask him (or rather, I am
> > not qualified to render an opinion here).
> 
> So it seems we have a couple of problems: the first being that
> flush_disk() doesn't actually do what it says (flush the disk).  If it's
> just discarding all data, dirty or clean, then its use in the
> grow/shrink interface is definitely wrong.
> 
> The idea is that before we complete the grow/shrink, we make sure that
> the device doesn't have any errors, so we want to try to write out all
> dirty buffers to make sure they still have a home.  If flush_disk()
> doesn't do that, then we need to use a different interface ... what's
> the interface to flush a disk?
> 

Hi James,

I *always* want to make sure that my device doesn't have any errors, not just
when it changes size... but I'm not sure that regularly flushing out data is
the right way to do it.
But maybe I still misunderstand what the real point of this is.

As for the correct interface to flush a disk - there isn't one.
One doesn't flush a storage device, one flushes a cache - to a storage
device.  And there is not a 1-to-1 mapping.

A block device *is* associated with one cache - one which is used for
caching accesses through /dev/XXX and also by some filesystems to cache some
metadata.  You can flush this cache with sync_blockdev().  This just
flushes out dirty pages in that cache, it doesn't discard clean pages.
invalidate_bdev() can discard clean pages.  Call both, and get both outcomes.

If a filesystem is mounted directly on a given block_device, then it
should have a valid s_bdev pointer and it is possible to find that filesystem
from the block_device using get_active_super().  You could then call
sync_filesystem() to flush out dirty data.  There isn't really a good
interface to discard clean data. shrink_dcache_sb() does some of it, other
bits of code do other bits.

Note that a block_device can also have a pointer to a super_block (bd_super).
This does not seem to be widely used .... ext3 and ext4 use it so that memory
pressure felt by the block-device cache can be transmitted to the fs, so the
fs can associate private data with the block device's cache, I guess.
I don't think bd_super is a sufficiently persistent reference to be usable
for sync_filesystem() (it could race with unmount).


But if the block device is within a dm or md device, or is an external
journal for e.g. ext3, or is in some more complex relationship with a
filesystem, then there is no way to request that any particular cache get
flushed to the block device.  And if some user-space app has cached data
for the device .... there is no way to get at that either.

If we wanted a "proper" solution for this we probably need to leverage
the 'bd_claim' concept.  When a device is claimed, allow an 'operations'
structure to be provided with operations like:
   flush_caches  (writes out data)
   purge_caches  (discards clean data)
   discard_caches  (discards all data)
   prepare_resize (is allowed to fail)
   commit_resize
   freeze_io
   thaw_io

But flush_disk is definitely a wrong interface.  It has a meaningless name
(as discussed above), and purges some caches while discarding others.

What do we *really* lose if we just revert that original patch?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
  2011-03-07 22:44             ` [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size NeilBrown
@ 2011-03-07 22:56               ` James Bottomley
  2011-03-08  0:04                 ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: James Bottomley @ 2011-03-07 22:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: device-mapper development, Jens Axboe, linux-raid, linux-kernel,
	Christoph Hellwig, linux-fsdevel

On Tue, 2011-03-08 at 09:44 +1100, NeilBrown wrote:
> (linux-fsdevel added - seems relevant)

Heh, Christoph will be pleased.

> On Mon, 07 Mar 2011 10:46:58 -0600 James Bottomley <James.Bottomley@suse.de>
> wrote:
> 
> > On Sun, 2011-03-06 at 21:22 -0700, Andrew Patterson wrote:
> > > On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote:
> > > >  Would you be uncomfortable if I asked Linus to revert both my fix and your
> > > >  original patch??
> > > 
> > > James Bottomley wanted me to put this functionality in. I have no
> > > problem with reverting it myself, especially if it is causing other
> > > problems.  I would have to say that you need to ask him (or rather, I am
> > > not qualified to render an opinion here).
> > 
> > So it seems we have a couple of problems: the first being that
> > flush_disk() doesn't actually do what it says (flush the disk).  If it's
> > just discarding all data, dirty or clean, then its use in the
> > grow/shrink interface is definitely wrong.
> > 
> > The idea is that before we complete the grow/shrink, we make sure that
> > the device doesn't have any errors, so we want to try to write out all
> > dirty buffers to make sure they still have a home.  If flush_disk()
> > doesn't do that, then we need to use a different interface ... what's
> > the interface to flush a disk?
> > 
> 
> Hi James,
> 
> I *always* want to make sure that my device doesn't have any errors, not just
> when it changes size... but I'm not sure that regularly flushing out data is
> the right way to do it.
> But maybe I still misunderstand what the real point of this is.

I actually have no idea what you're talking about now.  Let me start at
the beginning:

The idea behind flushing a disk on size changes is to ensure we get
immediate detection of screw-ups ("screw-ups" here specifically means dirty
data in the lost region which we can no longer write back).

If the device is reducing in size and the FS supports this, some data
migrates from the soon to be lost blocks at the end.  The point about
running a full page cache flush for the device after it has been shrunk
is supposed to be to detect a screw up (i.e. a dirty page that just lost
its backing store).

My original thought was that this only needed to be done on shrink, but
HP had some pathological case where grow was implemented as shrink
first, so it got done on all size changes.

> As for the correct interface to flush a disk - there isn't one.
> One doesn't flush a storage device, one flushes a cache - to a storage
> device.  And there is not a 1-to-1 mapping.
> 
> A block device *is* associated with one cache - one which is used for
> caching accesses through /dev/XXX and also by some filesystems to cache some
> metadata.  You can flush this cache with sync_blockdev().  This just
> flushes out dirty pages in that cache, it doesn't discard clean pages.
> invalidate_bdev() can discard clean pages.  Call both, and get both outcomes.
> 
> If a filesystem is mounted directly on a given block_device, then it
> should have a valid s_bdev pointer and it is possible to find that filesystem
> from the block_device using get_active_super().  You could then call
> sync_filesystem() to flush out dirty data.  There isn't really a good
> interface to discard clean data. shrink_dcache_sb() does some of it, other
> bits of code do other bits.
> 
> Note that a block_device can also have a pointer to a super_block (bd_super).
> This does not seem to be widely used .... ext3 and ext4 use it so that memory
> pressure felt by the block-device cache can be transmitted to the fs, so the
> fs can associate private data with the block device's cache, I guess.
> I don't think bd_super is a sufficiently persistent reference to be usable
> for sync_filesystem() (it could race with unmount).
> 
> 
> But if the block device is within a dm or md device, or is an external
> journal for e.g. ext3, or is in some more complex relationship with a
> filesystem, then there is no way to request that any particular cache get
> flushed to the block device.  And if some user-space app has cached data
> for the device .... there is no way to get at that either.
> 
> If we wanted a "proper" solution for this we probably need to leverage
> the 'bd_claim' concept.  When a device is claimed, allow an 'operations'
> structure to be provided with operations like:
>    flush_caches  (writes out data)
>    purge_caches  (discards clean data)
>    discard_caches  (discards all data)
>    prepare_resize (is allowed to fail)
>    commit_resize
>    freeze_io
>    thaw_io

What makes you think we participate in the resize?  Most often we just
get notified.  There may or may not have been some preparation, but it's
mostly not something we participate in.

> But flush_disk is definitely a wrong interface.  It has a meaningless name
> (as discussed above), and purges some caches while discarding others.
> 
> What do we *really* lose if we just revert that original patch?

Synchronous notification of errors.  If we don't try to write everything
back immediately after the size change, we don't see dirty pages in
zapped regions until the writeout/page cache management takes it into
its head to try to clean the pages.

James


* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
  2011-03-07 22:56               ` James Bottomley
@ 2011-03-08  0:04                 ` NeilBrown
  2011-03-16 20:30                   ` Jeff Moyer
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2011-03-08  0:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: device-mapper development, Jens Axboe, linux-raid, linux-kernel,
	Christoph Hellwig, linux-fsdevel

On Mon, 07 Mar 2011 16:56:12 -0600 James Bottomley <James.Bottomley@suse.de>
wrote:

> On Tue, 2011-03-08 at 09:44 +1100, NeilBrown wrote:
> > (linux-fsdevel added - seems relevant)
> 
> Heh, Christoph will be pleased.
> 
> > On Mon, 07 Mar 2011 10:46:58 -0600 James Bottomley <James.Bottomley@suse.de>
> > wrote:
> > 
> > > On Sun, 2011-03-06 at 21:22 -0700, Andrew Patterson wrote:
> > > > On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote:
> > > > >  Would you be uncomfortable if I asked Linus to revert both my fix and your
> > > > >  original patch??
> > > > 
> > > > James Bottomley wanted me to put this functionality in. I have no
> > > > problem with reverting it myself, especially if it is causing other
> > > > problems.  I would have to say that you need to ask him (or rather, I am
> > > > not qualified to render an opinion here).
> > > 
> > > So it seems we have a couple of problems: the first being that
> > > flush_disk() doesn't actually do what it says (flush the disk).  If it's
> > > just discarding all data, dirty or clean, then its use in the
> > > grow/shrink interface is definitely wrong.
> > > 
> > > The idea is that before we complete the grow/shrink, we make sure that
> > > the device doesn't have any errors, so we want to try to write out all
> > > dirty buffers to make sure they still have a home.  If flush_disk()
> > > doesn't do that, then we need to use a different interface ... what's
> > > the interface to flush a disk?
> > > 
> > 
> > Hi James,
> > 
> > I *always* want to make sure that my device doesn't have any errors, not just
> > when it changes size... but I'm not sure that regularly flushing out data is
> > the right way to do it.
> > But maybe I still misunderstand what the real point of this is.
> 
> I actually have no idea what you're talking about now.  Let me start at
> the beginning:

Thanks!

> 
> The idea behind flushing a disk on size changes is to ensure we get
> immediate detection of screw-ups ("screw-ups" here specifically means dirty
> data in the lost region which we can no longer write back).
> 
> If the device is reducing in size and the FS supports this, some data
> migrates from the soon to be lost blocks at the end.  The point about
> running a full page cache flush for the device after it has been shrunk
> is supposed to be to detect a screw up (i.e. a dirty page that just lost
> its backing store).

Makes sense.  So what you *really* want to do is actively tell the filesystem
that the device is now smaller and that it should complain loudly if that is
a problem.
You cannot directly do that, so a second best is to tell it to flush all
data in the hope that, if there is a problem, this has at least some
chance of causing it to be reported.

So you don't need to purge any caches - just flush them.

I'm having trouble seeing the relevance of the bit about "some data migrates
from the soon to be lost blocks at the end".
If you have previously requested the fs to resize downwards, then I would
expect all of this to be fully completed - maybe you are trying to detect
bugs in the filesystem's 'shrink' function?  Maybe it isn't directly relevant
and you are just providing it as general context ??

> 
> My original thought was that this only needed to be done on shrink, but
> HP had some pathological case where grow was implemented as shrink
> first, so it got done on all size changes.

I cannot see the relevance of this to the "flush caches" approach.
If the size of the device shrinks and then grows while data is in a cache,
and then you flush, it will simply result in random data appearing
in the "new" part of the device, with no reason to expect errors.

It could be relevant if you were purging caches as well, as that makes sure
you don't see stale data from before a shrink.

So you seem to be saying that "purge" is important too - correct?

> 
> > As for the correct interface to flush a disk - there isn't one.
> > One doesn't flush a storage device, one flushes a cache - to a storage
> > device.  And there is not a 1-to-1 mapping.
> > 
> > A block device *is* associated with one cache - one which is used for
> > caching accesses through /dev/XXX and also by some filesystems to cache some
> > metadata.  You can flush this cache with sync_blockdev().  This just
> > flushes out dirty pages in that cache, it doesn't discard clean pages.
> > invalidate_bdev() can discard clean pages.  Call both, and get both outcomes.
> > 
> > If a filesystem is mounted directly on a given block_device, then it
> > should have a valid s_bdev pointer and it is possible to find that filesystem
> > from the block_device using get_active_super().  You could then call
> > sync_filesystem() to flush out dirty data.  There isn't really a good
> > interface to discard clean data. shrink_dcache_sb() does some of it, other
> > bits of code do other bits.
> > 
> > Note that a block_device can also have a pointer to a super_block (bd_super).
> > This does not seem to be widely used .... ext3 and ext4 use it so that memory
> > pressure felt by the block-device cache can be transmitted to the fs, so the
> > fs can associate private data with the block device's cache, I guess.
> > I don't think bd_super is a sufficiently persistent reference to be usable
> > for sync_filesystem() (it could race with unmount).
> > 
> > 
> > But if the block device is within a dm or md device, or is an external
> > journal for e.g. ext3, or is in some more complex relationship with a
> > filesystem, then there is no way to request that any particular cache get
> > flushed to the block device.  And if some user-space app has cached data
> > for the device .... there is no way to get at that either.
> > 
> > If we wanted a "proper" solution for this we probably need to leverage
> > the 'bd_claim' concept.  When a device is claimed, allow an 'operations'
> > structure to be provided with operations like:
> >    flush_caches  (writes out data)
> >    purge_caches  (discards clean data)
> >    discard_caches  (discards all data)
> >    prepare_resize (is allowed to fail)
> >    commit_resize
> >    freeze_io
> >    thaw_io
> 
> What makes you think we participate in the resize?  Most often we just
> get notified.  There may or may not have been some preparation, but it's
> mostly not something we participate in.

You may not participate in the resize, but dm and md do and the
"prepare_resize" part of the interface would be useful to them.  I agree
that it might not be useful to SCSI devices.

> 
> > But flush_disk is definitely a wrong interface.  It has a meaningless name
> > (as discussed above), and purges some caches while discarding others.
> > 
> > What do we *really* lose if we just revert that original patch?
> 
> Synchronous notification of errors.  If we don't try to write everything
> back immediately after the size change, we don't see dirty pages in
> zapped regions until the writeout/page cache management takes it into
> its head to try to clean the pages.
> 

So if you just want synchronous errors, I think you want:
    fsync_bdev()

which calls sync_filesystem() if it can find a filesystem, else
sync_blockdev();  (sync_filesystem itself calls sync_blockdev too).

However this doesn't purge clean pages from the caches so there
might be the stale data issue.

Do you see that as a real issue?

(Just a bit of clarification:  fsync_bdev flushes but does not purge.
 flush_disk purges but does not flush.  So with flush_disk there, I don't
 think you will have been getting the errors you want.
 Also - fsync_bdev() takes a lot longer than flush_disk.  I hope
 revalidate_disk is only called from places where it is safe to block
 for FS writeout ... I think it is.).

Thanks,
NeilBrown


* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
  2011-03-08  0:04                 ` NeilBrown
@ 2011-03-16 20:30                   ` Jeff Moyer
  2011-03-17  1:28                     ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Moyer @ 2011-03-16 20:30 UTC (permalink / raw)
  To: NeilBrown
  Cc: James Bottomley, device-mapper development, Jens Axboe,
	linux-raid, linux-kernel, Christoph Hellwig, linux-fsdevel

NeilBrown <neilb@suse.de> writes:

>> Synchronous notification of errors.  If we don't try to write everything
>> back immediately after the size change, we don't see dirty pages in
>> zapped regions until the writeout/page cache management takes it into
>> its head to try to clean the pages.
>> 
>
> So if you just want synchronous errors, I think you want:
>     fsync_bdev()
>
> which calls sync_filesystem() if it can find a filesystem, else
> sync_blockdev();  (sync_filesystem itself calls sync_blockdev too).

... which deadlocks md.  ;-)  writeback_inodes_sb_nr is waiting for the
flusher thread to write back the dirty data.  The flusher thread is
stuck in md_write_start, here:

        wait_event(mddev->sb_wait,
                   !test_bit(MD_CHANGE_PENDING, &mddev->flags));

This is after reverting your change, and replacing the flush_disk call
in check_disk_size_change with a call to fsync_bdev.  I'm not familiar
enough with md to really suggest a way forward.  Neil?

Cheers,
Jeff

md127: detected capacity change from 267386880 to 401080320
INFO: task md127_raid5:2255 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
md127_raid5     D ffff88011d010690  5416  2255      2 0x00000080
 ffff88010fcc7990 0000000000000046 ffff880100000000 ffffffff812070c9
 0000000000014d00 ffff88011d010100 ffff88011d010690 ffff88010fcc7fd8
 ffff88011d010698 0000000000014d00 ffff88010fcc6010 0000000000014d00
Call Trace:
 [<ffffffff812070c9>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81493df5>] schedule_timeout+0x265/0x2d0
 [<ffffffff8104b341>] ? enqueue_task+0x61/0x80
 [<ffffffff81493a25>] wait_for_common+0x115/0x180
 [<ffffffff81057850>] ? default_wake_function+0x0/0x10
 [<ffffffff81493b38>] wait_for_completion+0x18/0x20
 [<ffffffff8115cce2>] writeback_inodes_sb_nr+0x72/0xa0
 [<ffffffff8115cfad>] writeback_inodes_sb+0x4d/0x60
 [<ffffffff81162499>] __sync_filesystem+0x49/0x90
 [<ffffffff81162592>] sync_filesystem+0x32/0x60
 [<ffffffff8116bc59>] fsync_bdev+0x29/0x70
 [<ffffffff8116bcea>] check_disk_size_change+0x4a/0xb0
 [<ffffffff81208e27>] ? kobject_put+0x27/0x60
 [<ffffffff8116bdaf>] revalidate_disk+0x5f/0x90
 [<ffffffffa031155a>] raid5_finish_reshape+0x9a/0x1e0 [raid456]
 [<ffffffff8138a933>] reap_sync_thread+0x63/0x130
 [<ffffffff8138c8a6>] md_check_recovery+0x1f6/0x6f0
 [<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
 [<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
 [<ffffffff81387ee9>] md_thread+0x119/0x150
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81387dd0>] ? md_thread+0x0/0x150
 [<ffffffff8107fb56>] kthread+0x96/0xa0
 [<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
 [<ffffffff8107fac0>] ? kthread+0x0/0xa0
 [<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task flush-9:127:2288 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
flush-9:127     D ffff88011cee30a0  4664  2288      2 0x00000080
 ffff88011b0af6a0 0000000000000046 0000000000000000 0000000000000000
 0000000000014d00 ffff88011cee2b10 ffff88011cee30a0 ffff88011b0affd8
 ffff88011cee30a8 0000000000014d00 ffff88011b0ae010 0000000000014d00
Call Trace:
 [<ffffffff8138bbb5>] md_write_start+0xa5/0x1c0
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0316435>] make_request+0x45/0x6c0 [raid456]
 [<ffffffff811fbfcb>] ? blkiocg_update_dispatch_stats+0x8b/0xd0
 [<ffffffff81385ca3>] md_make_request+0xd3/0x210
 [<ffffffff811ee9da>] generic_make_request+0x2ea/0x5d0
 [<ffffffff810e9cde>] ? mempool_alloc+0x5e/0x140
 [<ffffffff811eed41>] submit_bio+0x81/0x110
 [<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
 [<ffffffff81163ef6>] submit_bh+0xe6/0x140
 [<ffffffff81165ad0>] __block_write_full_page+0x200/0x390
 [<ffffffff811655a0>] ? end_buffer_async_write+0x0/0x1a0
 [<ffffffff8116667e>] block_write_full_page_endio+0xde/0x110
 [<ffffffffa037d3b0>] ? buffer_unmapped+0x0/0x20 [ext3]
 [<ffffffff811666c0>] block_write_full_page+0x10/0x20
 [<ffffffffa037de6d>] ext3_writeback_writepage+0x11d/0x170 [ext3]
 [<ffffffff810f0152>] __writepage+0x12/0x40
 [<ffffffff810f12b4>] write_cache_pages+0x1a4/0x490
 [<ffffffff810f0140>] ? __writepage+0x0/0x40
 [<ffffffff810f15bf>] generic_writepages+0x1f/0x30
 [<ffffffff810f15f5>] do_writepages+0x25/0x30
 [<ffffffff8115d5f0>] writeback_single_inode+0x90/0x220
 [<ffffffff8115d9b6>] writeback_sb_inodes+0xc6/0x170
 [<ffffffff8115dd3f>] wb_writeback+0x17f/0x430
 [<ffffffff8106e217>] ? lock_timer_base+0x37/0x70
 [<ffffffff8115e08d>] wb_do_writeback+0x9d/0x270
 [<ffffffff8106e330>] ? process_timeout+0x0/0x10
 [<ffffffff8115e302>] bdi_writeback_thread+0xa2/0x280
 [<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
 [<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
 [<ffffffff8107fb56>] kthread+0x96/0xa0
 [<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
 [<ffffffff8107fac0>] ? kthread+0x0/0xa0
 [<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task updatedb:2342 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
updatedb        D ffff88011bd77af0  5136  2342   2323 0x00000080
 ffff88011c877cb8 0000000000000086 0000000000000000 ffff88011c1829a8
 0000000000014d00 ffff88011bd77560 ffff88011bd77af0 ffff88011c877fd8
 ffff88011bd77af8 0000000000014d00 ffff88011c876010 0000000000014d00
Call Trace:
 [<ffffffff81165130>] ? sync_buffer+0x0/0x50
 [<ffffffff8149382b>] io_schedule+0x6b/0xb0
 [<ffffffff8116516b>] sync_buffer+0x3b/0x50
 [<ffffffff81494057>] __wait_on_bit+0x57/0x80
 [<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
 [<ffffffff81165130>] ? sync_buffer+0x0/0x50
 [<ffffffff814940f3>] out_of_line_wait_on_bit+0x73/0x90
 [<ffffffff81080210>] ? wake_bit_function+0x0/0x40
 [<ffffffff81165126>] __wait_on_buffer+0x26/0x30
 [<ffffffffa038006c>] ext3_bread+0x5c/0x80 [ext3]
 [<ffffffffa037ba63>] ext3_readdir+0x1f3/0x600 [ext3]
 [<ffffffff8114a650>] ? filldir+0x0/0xe0
 [<ffffffff8114a650>] ? filldir+0x0/0xe0
 [<ffffffff8114a7e0>] vfs_readdir+0xb0/0xd0
 [<ffffffff8114a964>] sys_getdents+0x84/0xf0
 [<ffffffff8100bdd2>] system_call_fastpath+0x16/0x1b


diff --git a/block/genhd.c b/block/genhd.c
index cbf1112..6a5b772 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1355,7 +1355,7 @@ int invalidate_partition(struct gendisk *disk, int partno)
 	struct block_device *bdev = bdget_disk(disk, partno);
 	if (bdev) {
 		fsync_bdev(bdev);
-		res = __invalidate_device(bdev, true);
+		res = __invalidate_device(bdev);
 		bdput(bdev);
 	}
 	return res;
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 77fc76f..b9ba04f 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3281,7 +3281,7 @@ static int set_geometry(unsigned int cmd, struct floppy_struct *g,
 			struct block_device *bdev = opened_bdev[cnt];
 			if (!bdev || ITYPE(drive_state[cnt].fd_device) != type)
 				continue;
-			__invalidate_device(bdev, true);
+			__invalidate_device(bdev);
 		}
 		mutex_unlock(&open_lock);
 	} else {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8892870..5aae241 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -934,9 +934,9 @@ EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
  * when a disk has been changed -- either by a media change or online
  * resize.
  */
-static void flush_disk(struct block_device *bdev, bool kill_dirty)
+static void flush_disk(struct block_device *bdev)
 {
-	if (__invalidate_device(bdev, kill_dirty)) {
+	if (__invalidate_device(bdev)) {
 		char name[BDEVNAME_SIZE] = "";
 
 		if (bdev->bd_disk)
@@ -973,7 +973,7 @@ void check_disk_size_change(struct gendisk *disk, struct block_device *bdev)
 		       "%s: detected capacity change from %lld to %lld\n",
 		       name, bdev_size, disk_size);
 		i_size_write(bdev->bd_inode, disk_size);
-		flush_disk(bdev, false);
+		fsync_bdev(bdev);
 	}
 }
 EXPORT_SYMBOL(check_disk_size_change);
@@ -1026,7 +1026,7 @@ int check_disk_change(struct block_device *bdev)
 	if (!(events & DISK_EVENT_MEDIA_CHANGE))
 		return 0;
 
-	flush_disk(bdev, true);
+	flush_disk(bdev);
 	if (bdops->revalidate_disk)
 		bdops->revalidate_disk(bdev->bd_disk);
 	return 1;
@@ -1607,7 +1607,7 @@ fail:
 }
 EXPORT_SYMBOL(lookup_bdev);
 
-int __invalidate_device(struct block_device *bdev, bool kill_dirty)
+int __invalidate_device(struct block_device *bdev)
 {
 	struct super_block *sb = get_super(bdev);
 	int res = 0;
@@ -1620,7 +1620,7 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 		 * hold).
 		 */
 		shrink_dcache_sb(sb);
-		res = invalidate_inodes(sb, kill_dirty);
+		res = invalidate_inodes(sb);
 		drop_super(sb);
 	}
 	invalidate_bdev(bdev);
diff --git a/fs/inode.c b/fs/inode.c
index 0647d80..9c2b795 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -548,14 +548,11 @@ void evict_inodes(struct super_block *sb)
 /**
  * invalidate_inodes	- attempt to free all inodes on a superblock
  * @sb:		superblock to operate on
- * @kill_dirty: flag to guide handling of dirty inodes
  *
  * Attempts to free all inodes for a given superblock.  If there were any
  * busy inodes return a non-zero value, else zero.
- * If @kill_dirty is set, discard dirty inodes too, otherwise treat
- * them as busy.
  */
-int invalidate_inodes(struct super_block *sb, bool kill_dirty)
+int invalidate_inodes(struct super_block *sb)
 {
 	int busy = 0;
 	struct inode *inode, *next;
@@ -567,10 +564,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
 			continue;
-		if (inode->i_state & I_DIRTY && !kill_dirty) {
-			busy = 1;
-			continue;
-		}
 		if (atomic_read(&inode->i_count)) {
 			busy = 1;
 			continue;
diff --git a/fs/internal.h b/fs/internal.h
index f3d15de..bee95ea 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -125,4 +125,4 @@ extern long do_handle_open(int mountdirfd,
  */
 extern int get_nr_dirty_inodes(void);
 extern void evict_inodes(struct super_block *);
-extern int invalidate_inodes(struct super_block *, bool);
+extern int invalidate_inodes(struct super_block *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13df14e..ff9a159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2155,7 +2155,7 @@ extern void check_disk_size_change(struct gendisk *disk,
 				   struct block_device *bdev);
 extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
-extern int __invalidate_device(struct block_device *, bool);
+extern int __invalidate_device(struct block_device *);
 extern int invalidate_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,


* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
  2011-03-16 20:30                   ` Jeff Moyer
@ 2011-03-17  1:28                     ` NeilBrown
  2011-03-17 17:33                       ` Jeff Moyer
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2011-03-17  1:28 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: James Bottomley, device-mapper development, Jens Axboe,
	linux-raid, linux-kernel, Christoph Hellwig, linux-fsdevel

On Wed, 16 Mar 2011 16:30:22 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:

> NeilBrown <neilb@suse.de> writes:
> 
> >> Synchronous notification of errors.  If we don't try to write everything
> >> back immediately after the size change, we don't see dirty pages in
> >> zapped regions until the writeout/page cache management takes it into
> >> its head to try to clean the pages.
> >> 
> >
> > So if you just want synchronous errors, I think you want:
> >     fsync_bdev()
> >
> > which calls sync_filesystem() if it can find a filesystem, else
> > sync_blockdev();  (sync_filesystem itself calls sync_blockdev too).
> 
> ... which deadlocks md.  ;-)  writeback_inodes_sb_nr is waiting for the
> flusher thread to write back the dirty data.  The flusher thread is
> stuck in md_write_start, here:
> 
>         wait_event(mddev->sb_wait,
>                    !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> 
> This is after reverting your change, and replacing the flush_disk call
> in check_disk_size_change with a call to fsync_bdev.  I'm not familiar
> enough with md to really suggest a way forward.  Neil?

That would be quite easy to avoid.
Just call
   md_write_start()
before revalidate_disk, and
   md_write_end()
afterwards.
You wouldn't have a 'bio' to pass in - but it is rather ugly requiring
one anyway - I should fix that.
For testing, just pass in NULL, and change
	if (bio_data_dir(bi) != WRITE)
		return;
to
	if (bi && bio_data_dir(bi) != WRITE)
		return;

NeilBrown



* Re: [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.
  2011-03-17  1:28                     ` NeilBrown
@ 2011-03-17 17:33                       ` Jeff Moyer
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Moyer @ 2011-03-17 17:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: James Bottomley, device-mapper development, Jens Axboe,
	linux-raid, linux-kernel, Christoph Hellwig, linux-fsdevel

NeilBrown <neilb@suse.de> writes:

> On Wed, 16 Mar 2011 16:30:22 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:
>
>> NeilBrown <neilb@suse.de> writes:
>> 
>> >> Synchronous notification of errors.  If we don't try to write everything
>> >> back immediately after the size change, we don't see dirty pages in
>> >> zapped regions until the writeout/page cache management takes it into
>> >> its head to try to clean the pages.
>> >> 
>> >
>> > So if you just want synchronous errors, I think you want:
>> >     fsync_bdev()
>> >
>> > which calls sync_filesystem() if it can find a filesystem, else
>> > sync_blockdev();  (sync_filesystem itself calls sync_blockdev too).
>> 
>> ... which deadlocks md.  ;-)  writeback_inodes_sb_nr is waiting for the
>> flusher thread to write back the dirty data.  The flusher thread is
>> stuck in md_write_start, here:
>> 
>>         wait_event(mddev->sb_wait,
>>                    !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>> 
>> This is after reverting your change, and replacing the flush_disk call
>> in check_disk_size_change with a call to fsync_bdev.  I'm not familiar
>> enough with md to really suggest a way forward.  Neil?
>
> That would be quite easy to avoid.
> Just call
>    md_write_start()
> before revalidate_disk, and
>    md_write_end()
> afterwards.

That does not avoid the problem (if I understood your suggestion).  You
instead end up with the following:

INFO: task md127_raid5:2282 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md127_raid5     D ffff88011c72d0a0  5688  2282      2 0x00000080
 ffff880118997c20 0000000000000046 ffff880100000000 0000000000000246
 0000000000014d00 ffff88011c72cb10 ffff88011c72d0a0 ffff880118997fd8
 ffff88011c72d0a8 0000000000014d00 ffff880118996010 0000000000014d00
Call Trace:
 [<ffffffff8138bbbd>] md_write_start+0xad/0x1d0
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0311558>] raid5_finish_reshape+0x98/0x1e0 [raid456]
 [<ffffffff8138a933>] reap_sync_thread+0x63/0x130
 [<ffffffff8138c8b6>] md_check_recovery+0x1f6/0x6f0
 [<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
 [<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
 [<ffffffff81387ee9>] md_thread+0x119/0x150
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81387dd0>] ? md_thread+0x0/0x150
 [<ffffffff8107fb56>] kthread+0x96/0xa0
 [<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
 [<ffffffff8107fac0>] ? kthread+0x0/0xa0
 [<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10

I'll leave this to you to work out when you have time.

Cheers,
Jeff


end of thread, other threads:[~2011-03-17 17:33 UTC | newest]

Thread overview: 6+ messages
     [not found] <20110217165057.5c50e566@notabene.brown>
     [not found] ` <20110303143120.GA8134@infradead.org>
     [not found]   ` <20110304111624.4be27aaf@notabene.brown>
     [not found]     ` <1299259506.2118.24.camel@grinch>
     [not found]       ` <20110306174755.49404c8e@notabene.brown>
     [not found]         ` <1299471771.2228.11.camel@grinch>
     [not found]           ` <1299516418.15258.4.camel@mulgrave.site>
2011-03-07 22:44             ` [dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size NeilBrown
2011-03-07 22:56               ` James Bottomley
2011-03-08  0:04                 ` NeilBrown
2011-03-16 20:30                   ` Jeff Moyer
2011-03-17  1:28                     ` NeilBrown
2011-03-17 17:33                       ` Jeff Moyer
