linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Patterson <andrew.patterson@hp.com>
To: NeilBrown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	dm-devel@redhat.com, James.Bottomley@suse.de
Subject: Re: [PATCH] Fix over-zealous flush_disk when changing device size.
Date: Sun, 06 Mar 2011 21:22:51 -0700	[thread overview]
Message-ID: <1299471771.2228.11.camel@grinch> (raw)
In-Reply-To: <20110306174755.49404c8e@notabene.brown>

On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote:
> On Fri, 04 Mar 2011 10:25:06 -0700 Andrew Patterson <andrew.patterson@hp.com>
> wrote:
> 
> > On Fri, 2011-03-04 at 11:16 +1100, NeilBrown wrote:
> > > On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > > On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
> > > > > 
> > > > > Hi Andrew (and others)
> > > > >  I wonder if you would review the following for me and comment.
> > > > 
> > > > Please send think in this area through -fsdevel next time, thanks!
> > > 
> > > Will try to remember - it is sometimes hard to get this sort of patch before
> > > the right audience ... I thought "block layer" rather than "file systems" :-(
> > > 
> > > Thanks for finding it anyway.
> > > 
> > > > 
> > > > > There are two cases when we call flush_disk.
> > > > > In one, the device has disappeared (check_disk_change) so any
> > > > > data will hold becomes irrelevant.
> > > > > In the oter, the device has changed size (check_disk_size_change)
> > > > > so data we hold may be irrelevant.
> > > > > 
> > > > > In both cases it makes sense to discard any 'clean' buffers,
> > > > > so they will be read back from the device if needed.
> > > > 
> > > > Does it?  If the device has disappeared we can't read them back anyway.
> > > 
> > > I think that is the point - return an error rather than stale data.
> > > 
> > > > If the device has resized to a smaller size the same is true about
> > > > those buffers that have gone away, and if it has resized to a larger
> > > > size invalidating anything doesn't make sense at all.  I think this
> > > > area needs more love than a quick kill_dirty hackjob.
> > > 
> > > I tend to agree.  I wasn't entirely convinced by the changelog comments on
> > > the original offending patch, but I couldn't convince myself there was no
> > > justification either, and I wanted to fix the corruption I saw - while close
> > > to the end of a release cycle - without introducing any new regressions.
> > > 
> > > > 
> > > > > In the former case it makes sense to discard 'dirty' buffers
> > > > > as there will never be anywhere safe to write the data.  In the
> > > > > second case it *does*not* make sense to discard dirty buffers
> > > > > as that will lead to file system corruption when you simply enlarge
> > > > > the containing devices.
> > > > 
> > > > Doing anything like this at the buffer cache layer or inode cache layer
> > > > doesn't make any sense.  If a device goes away or shrinks below the
> > > > filesystem size the filesystem simply needs to be shut down and in te
> > > > former size the admin needs to start a manual repair.  Trying to do
> > > > any botch jobs in lower layer never works in practice.
> > > 
> > > Amen.
> > > What I personally would really like to see is an interface for the block
> > > device to say to the filesystem (or more specifically: whatever has bdclaimed
> > > it) "I am about to resize to $X - is that OK?" and also "I have resized -
> > > deal with it".
> > > 
> > > > 
> > > > For now I think the best short term fix is to simply revert commit
> > > > 608aeef17a91747d6303de4df5e2c2e6899a95e8
> > > > 
> > > > 	"Call flush_disk() after detecting an online resize."
> > > 
> > > You may be right, but I suspect that Andrew Patterson had a real issue to
> > > solve which lead to submitting it, and I'd really like to understand that
> > > issue before I would feel confident just reverting it.
> > > 
> > > Andrew:  are you out there?  Can you provide some background for your patch?
> > 
> > I put in the flush disk stuff at the suggestion of James Bottomley. In
> > fact the text for the justification in 608aeef17a91747d6303 is mostly
> > his.  The idea is to get errors reported immediately rather than waiting
> > around for them to eventually get flushed and to make sure stale data is
> > not kept around.  Certainly, at a minimum, not keeping stale data around
> > seems valuable to me. 
> > 
> > What parts of the original justification did you think are unconvincing?
> > Note that the flush for growing the device is really only there for the
> > degenerate case where someone might shrink then grow a device
> > (admittedly, the user probably deserves to get data corruption/security
> > holes in such a case).
> 
> hi Andrew.
> 
>  One of the things that I didn't like about the change log is that it didn't
>  give clear context - what exactly is the problem it is trying to fix?

Well true, but the rest of the patch set might have given the context.

>  When you talk about "disks" changing size (reduced radius:-?) I think first
>  of 'dm' and  'md' - yet it clearly isn't dm related as dm doesn't even use
>  that code, and if it was 'md' related I would have thought I would have
>  heard about it....
>  So presumably these are some SCSI-attached devices that do internal volume
>  management and can change the size of .... targets?  LUNs?  something like
>  that.

Yes.  The idea is that you want to increase/decrease the size of the
underlying block device (usually a FC or iSCSI LUN).  Short of a reboot,
the OS will not detect the size changed.  I added some code that rescans
the LUN and gets the new size and report the new size to the block
layer.

>  So can these things change size without the SCSI layer immediately knowing
>  about it??  Don't you get some sort of "unit attention" or something?
> 

In some cases with some hardware yes.  In most, no.

>  The idea that the device might reduce in size and then grow again seems just
>  plain wrong. 

This would usually be due to some user error.  The user may have shrunk
the device too much, realized there error and then corrected it. I admit
that it is pretty far-fetched.

>  If that is possible it could return to it's original size and
>  you would never notice? 

Yes.

>  And if there are cases that you know you will never
>  notice, then it seems like it is the wrong solution.
>  It would have been more credible if it *only* tried to flush when the size
>  was reduced ... which you do suggest as a possibility above I think.
> 
> 
>  The "potential security hole" seems completely bogus.
>   If you give me permission to read something, and I do, then you remove that
>   permission, the fact that I still remember what I read is not a security
>   hole.  It is a natural fact of life

I think the case here might be where you add a LUN onto the space where
the previous LUN space existed. Again, not likely.

> 
>  As for the two cases: I would describe them:
>    1. planned.  The fs is already shrunk to within the new boundary so there
>       is nothing to be gained by flushing, and we could have to reload a
>       pile of data from storage which would be pointless

This is the security case.

>    2. unplanned.  The fs is probably toast, so whether we invalidate or not
>       isn't going to make a whole lot of difference.

Agreed.

> 
>  So I would vote for not flushing.
> 
>  The "not keeping stale data around" argument ... the only data that is
>  clearly stale is data that is in the block-device cache and beyond the new
>  EOF.  Most filesystems keep most data in the page cache rather than the
>  block device cache so this would just be some metadata - maybe inode tables
>  or block usage bitmaps or similar.  And caches regularly keep data around
>  that isn't actually going to be used - so keeping a bit more on the rare
>  occasion of a block-device-resize doesn't seem like an important cost.
> 
>  Getting errors a little bit earlier in the case of an unplanned shrink is
>  possibly a credible argument.  This would be read errors only of course -
>  write errors would still arrive at the same time.  I'm not sure it would
>  really be very much earlier...maybe a bit in some cases.
> 
>  On the whole, the arguments both for and against this change - in principle
>  - seem rather weak:  "maybe" and "possibly" rather than something concrete.
> 
>  So given that (due to an oversight) it actually causes filesystem
>  corruption, I tend to agree with Christoph that the best approach at this
>  stage is the revert the original patch, and then review all the related code
>  and come up with a "correct" approach.
> 
> 
>  And just to clarify: when I first found that description "unconvincing" I
>  hadn't thought through all of these issues - I think it was just that the
>  justification seems vague rather than concrete, so it was hard to reason
>  about it.
> 
> 
>  Would you be uncomfortable if I asked Linus to revert both my fix and your
>  original patch??

James Bottomley wanted me to put this functionality in. I have no
problem with reverting it myself, especially if it is causing other
problems.  I would have to say that you need to ask him (or rather, I am
not qualified to render an opinion here).

Andrew

  reply	other threads:[~2011-03-07  4:22 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-17  5:50 [PATCH] Fix over-zealous flush_disk when changing device size NeilBrown
2011-02-17 17:03 ` Jeff Moyer
2011-02-23  8:48   ` Kwolek, Adam
2011-02-23 10:01     ` NeilBrown
2011-02-21 19:36 ` Jeff Moyer
2011-02-21 21:14   ` NeilBrown
2011-03-03 14:31 ` Christoph Hellwig
2011-03-04  0:16   ` NeilBrown
2011-03-04 17:25     ` Andrew Patterson
2011-03-06  6:47       ` NeilBrown
2011-03-07  4:22         ` Andrew Patterson [this message]
2011-03-07 16:46           ` [dm-devel] " James Bottomley
2011-03-07 22:44             ` NeilBrown
2011-03-07 22:56               ` James Bottomley
2011-03-08  0:04                 ` NeilBrown
2011-03-16 20:30                   ` Jeff Moyer
2011-03-17  1:28                     ` NeilBrown
2011-03-17 17:33                       ` Jeff Moyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1299471771.2228.11.camel@grinch \
    --to=andrew.patterson@hp.com \
    --cc=James.Bottomley@suse.de \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).