From: Jan Kara <jack@suse.cz>
To: Chris Mason <chris.mason@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>,
	Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH] Add ext3 data=guarded mode
Date: Mon, 20 Apr 2009 17:50:11 +0200
Message-ID: <20090420155010.GG14699@duck.suse.cz>
In-Reply-To: <1240239490.16213.57.camel@think.oraclecorp.com>

On Mon 20-04-09 10:58:10, Chris Mason wrote:
> > > >   3) Currently truncate() does filemap_write_and_wait() - is it really
> > > > needed? Each guarded bh could carry with itself i_disksize it should update
> > > > to when IO is finished. Extending truncate will just update this i_disksize
> > > > at the last member of the list (or update i_disksize when the list is
> > > > empty). 
> > > >
> > > > Shortening truncate will walk the list of guarded bh's, removing from
> > > > the list those beyond new i_size, then it will behave like the extending
> > > > truncate (it works even if current i_disksize is larger than new i_size).
> > > > Note, that before we get to ext3_truncate() mm walks all the pages beyond
> > > > i_size and waits for page writeback so by the time ext3_truncate() is
> > > > called, all the IO is finished and dirty pages are canceled.
> > > 
> > > The problem here was the disk i_size being updated by ext3_setattr
> > > > before vmtruncate calls ext3_truncate().  So the guarded IO
> > > might wander in and change the i_disksize update done by setattr.
> > > 
> > > It all made me a bit dizzy and I just tossed the write_and_wait in
> > > instead.
> > > 
> > > At the end of the day, we're waiting for guarded writes only, and we
> > > probably would have ended up waiting on those exact same pages in
> > > vmtruncate anyway.  So, I do agree we could avoid the write with more
> > > code, but is this really a performance critical section?
> >   Well, not really critical but also not negligible - mainly because with
> > your approach we end up *submitting* new writes that could just be canceled
> > otherwise. Without fdatawrite(), data of short-lived files need not ever
> > reach the disk similarly as in writeback mode (OK, this is connected with
> > the fact that you actually don't have fdatawrite() before ext3_truncate()
> > in ext3_delete_inode() and that's what initially puzzled me).
> 
> When we're going down to zero, we don't need it.  The i_disksize gets
> updated again by ext3_truncate.  I'll toss in a special case for that
> before the write_and_wait.
  I'm sorry, but why does truncate to zero not need it? If we assume that
IO completion can still happen while ext3_truncate() is running, which is
what you're afraid of, then I don't see a big difference between truncate
to zero, truncate to i_disksize (which is where you do fdatawrite from), or
truncate to anything else.
  Also two other comments:
...
@@ -915,14 +1042,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
         * i_disksize growing is protected by truncate_mutex.  Don't forget to
         * protect it if you're about to implement concurrent
         * ext3_get_block() -bzzz
+        *
+        * FIXME, I think this only needs to extend the disk i_size when
+        * we're filling holes that came from using ftruncate to increase
+        * i_size.  Need to verify.
        */
-       if (!err && extend_disksize && inode->i_size > ei->i_disksize)
-               ei->i_disksize = inode->i_size;
+       if (!ext3_should_guard_data(inode) && !err && extend_disksize)
+               maybe_update_disk_isize(inode, inode->i_size);

This is kind of confusing. extend_disksize is used only from ext3_getblk(),
which is called only for directories, so the first condition is always true.
And if that ever changed in the future, you'd have a hard time tracking down
why i_disksize is not updated. So I'd rather add something like
  WARN_ON(extend_disksize && ext3_should_guard_data(inode));
if you wish to keep the check.
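
The WARN_ON suggestion above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct warn_inode,
warn_on_model() and finish_get_blocks_model() are hypothetical stand-ins
for the inode, the kernel's WARN_ON() and the tail of
ext3_get_blocks_handle(); none of them is kernel code or part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the inode fields discussed here. */
struct warn_inode {
    long long i_size;     /* in-memory file size */
    long long i_disksize; /* file size as recorded on disk */
    bool guarded;         /* stands in for ext3_should_guard_data() */
};

/* Counts how often the "should never happen" combination was seen. */
static int warn_count;

/* Stand-in for WARN_ON(): report the condition loudly, but do not abort. */
static bool warn_on_model(bool cond, const char *what)
{
    if (cond) {
        warn_count++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/*
 * Models the end of a block-mapping call. The guarded check is kept, but a
 * guarded inode arriving with extend_disksize set is no longer silent.
 */
static void finish_get_blocks_model(struct warn_inode *inode, int err,
                                    bool extend_disksize)
{
    assert(inode != NULL);
    if (warn_on_model(extend_disksize && inode->guarded,
                      "extend_disksize set on a guarded inode"))
        return; /* same effect as the !ext3_should_guard_data() check */
    if (!err && extend_disksize && inode->i_size > inode->i_disksize)
        inode->i_disksize = inode->i_size;
}
```

In this model a directory-like (unguarded) inode gets its i_disksize
extended as before, while a guarded inode is left untouched and the
skipped update shows up as a warning instead of going unnoticed.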

  Also, I think you can have races between direct IO writing to the file
(updating i_disksize) and your completion handler updating i_disksize:
direct IO plays tricks with i_disksize to truncate allocated blocks in case
of a failed write... It's all nasty ;( We should probably set clear
rules about i_disksize updates and clean up the code to obey them.
Otherwise we'll be hunting nasty races for another two years...
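
As an illustration of what "clear rules" might look like, here is a
minimal userspace sketch. The two rules it encodes are assumptions made
for the example, not something this thread agreed on: (1) IO completion
may only grow i_disksize, and never past i_size; (2) only truncate may
shrink it. struct rule_inode, rule_grow_disksize() and rule_truncate()
are hypothetical names, and the locking a real implementation would need
is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the two sizes discussed in this thread. */
struct rule_inode {
    long long i_size;     /* in-memory file size */
    long long i_disksize; /* file size as recorded on disk */
};

/*
 * Assumed rule 1: completion handlers only grow i_disksize, clamped to
 * i_size. Returns 1 if i_disksize changed, 0 otherwise.
 */
static int rule_grow_disksize(struct rule_inode *inode, long long new_size)
{
    assert(inode != NULL);
    if (new_size > inode->i_size)
        new_size = inode->i_size; /* a stale completion cannot pass i_size */
    if (new_size <= inode->i_disksize)
        return 0;                 /* never move backwards here */
    inode->i_disksize = new_size;
    return 1;
}

/*
 * Assumed rule 2: truncate is the only path allowed to shrink i_disksize.
 * An extending truncate raises i_size and leaves i_disksize alone.
 */
static void rule_truncate(struct rule_inode *inode, long long new_size)
{
    assert(inode != NULL && new_size >= 0);
    inode->i_size = new_size;
    if (inode->i_disksize > new_size)
        inode->i_disksize = new_size;
}
```

With these rules, a completion that arrives after a shortening truncate
is clamped to the new i_size and so cannot undo the truncate's update.
Direct IO error handling would need its own rule, which this sketch does
not attempt.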

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
