From: Jan Kara <jack@suse.cz>
To: Chris Mason <chris.mason@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Theodore Ts'o <tytso@mit.edu>,
Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH] Add ext3 data=guarded mode
Date: Mon, 20 Apr 2009 17:50:11 +0200
Message-ID: <20090420155010.GG14699@duck.suse.cz>
In-Reply-To: <1240239490.16213.57.camel@think.oraclecorp.com>

On Mon 20-04-09 10:58:10, Chris Mason wrote:
> > > > 3) Currently truncate() does filemap_write_and_wait() - is it really
> > > > needed? Each guarded bh could carry with itself i_disksize it should update
> > > > to when IO is finished. Extending truncate will just update this i_disksize
> > > > at the last member of the list (or update i_disksize when the list is
> > > > empty).
> > > >
> > > > Shortening truncate will walk the list of guarded bh's, removing from
> > > > the list those beyond new i_size, then it will behave like the extending
> > > > truncate (it works even if current i_disksize is larger than new i_size).
> > > > Note, that before we get to ext3_truncate() mm walks all the pages beyond
> > > > i_size and waits for page writeback so by the time ext3_truncate() is
> > > > called, all the IO is finished and dirty pages are canceled.
> > >
> > > The problem here was the disk i_size being updated by ext3_setattr
> > > before vmtruncate calls ext3_truncate(). So the guarded IO
> > > might wander in and change the i_disksize update done by setattr.
> > >
> > > It all made me a bit dizzy and I just tossed the write_and_wait in
> > > instead.
> > >
> > > At the end of the day, we're waiting for guarded writes only, and we
> > > probably would have ended up waiting on those exact same pages in
> > > vmtruncate anyway. So, I do agree we could avoid the write with more
> > > code, but is this really a performance critical section?
> >
> > Well, not really critical but also not negligible - mainly because with
> > your approach we end up *submitting* new writes which could otherwise just
> > be canceled. Without fdatawrite(), data of short-lived files need never
> > reach the disk, similarly to writeback mode (OK, this is connected with
> > the fact that you actually don't have fdatawrite() before ext3_truncate()
> > in ext3_delete_inode() and that's what initially puzzled me).
>
> When we're going down to zero, we don't need it. The i_disksize gets
> updated again by ext3_truncate. I'll toss in a special case for that
> before the write_and_wait.

I'm sorry, but why doesn't truncate to zero need it? If we assume that
IO completion can still happen while ext3_truncate() is running, which is
what you're afraid of, then I don't see a big difference between truncate
to zero, truncate to i_disksize (which is where you do fdatawrite from) or
truncate to anything else.
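To make point 3) quoted above a bit more concrete, here is a rough userspace model of the list-based scheme. It is only a sketch of one possible reading, not code from the patch: struct guarded_io, struct model_inode, guarded_add(), guarded_complete() and model_truncate() are all made-up names, IO is assumed to complete in list order, and there is no locking at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in ext3. */
struct guarded_io {
	long long start;	/* file offset where this write begins */
	long long disksize;	/* i_disksize to publish once the IO is done */
	struct guarded_io *next;
};

struct model_inode {
	long long i_size;
	long long i_disksize;
	struct guarded_io *head;	/* pending guarded IO, oldest first */
};

/* Queue a guarded write covering [start, end). */
static int guarded_add(struct model_inode *inode, long long start, long long end)
{
	struct guarded_io **p = &inode->head;
	struct guarded_io *io;

	assert(start < end);
	io = malloc(sizeof(*io));
	if (!io)
		return -1;
	io->start = start;
	io->disksize = end;
	io->next = NULL;
	while (*p)
		p = &(*p)->next;
	*p = io;
	if (end > inode->i_size)
		inode->i_size = end;
	return 0;
}

/* Completion of the oldest pending IO: i_disksize only ever grows here. */
static void guarded_complete(struct model_inode *inode)
{
	struct guarded_io *io = inode->head;

	if (!io)
		return;
	inode->head = io->next;
	if (io->disksize > inode->i_disksize)
		inode->i_disksize = io->disksize;
	free(io);
}

/* Truncate without waiting for, or submitting, any IO. */
static void model_truncate(struct model_inode *inode, long long new_size)
{
	struct guarded_io **p = &inode->head;
	struct guarded_io *last = NULL;

	while (*p) {
		struct guarded_io *io = *p;

		if (io->start >= new_size) {
			/* Entirely beyond the new size: drop it. */
			*p = io->next;
			free(io);
		} else {
			/* Assumption of this sketch: clamp survivors. */
			if (io->disksize > new_size)
				io->disksize = new_size;
			last = io;
			p = &io->next;
		}
	}
	inode->i_size = new_size;
	if (inode->i_disksize > new_size)
		inode->i_disksize = new_size;
	if (last)
		last->disksize = new_size;	/* published when it completes */
	else
		inode->i_disksize = new_size;	/* nothing pending */
}
```

In this model a truncate to zero is not special: it drops every pending entry and sets i_disksize directly, exactly like a truncate to any other size with nothing pending below it.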
Also two other comments:
...

@@ -915,14 +1042,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
 	 * protect it if you're about to implement concurrent
 	 * ext3_get_block() -bzzz
+	 *
+	 * FIXME, I think this only needs to extend the disk i_size when
+	 * we're filling holes that came from using ftruncate to increase
+	 * i_size.  Need to verify.
 	 */
-	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
-		ei->i_disksize = inode->i_size;
+	if (!ext3_should_guard_data(inode) && !err && extend_disksize)
+		maybe_update_disk_isize(inode, inode->i_size);

This is kind of confusing. extend_disksize is used only from ext3_getblk(),
which is called only for directories, so the first condition is always true
- and if that ever stopped being the case, you'd have a hard time tracking
down why i_disksize is not updated. So I'd rather add something like
	WARN_ON(extend_disksize && ext3_should_guard_data(inode));
if you wish to keep the check.
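Roughly what I have in mind, as a standalone userspace sketch rather than a patch against ext3: MODEL_WARN_ON, struct model_inode, model_should_guard_data() and model_finish_get_blocks() are stand-ins invented for this illustration, and the grow-only update is just my assumption about what maybe_update_disk_isize() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's WARN_ON(): count and report, then carry on. */
static int warn_count;
#define MODEL_WARN_ON(cond)						\
	do {								\
		if (cond) {						\
			warn_count++;					\
			fprintf(stderr, "WARN_ON(%s)\n", #cond);	\
		}							\
	} while (0)

/* Hypothetical model of the inode fields the check looks at. */
struct model_inode {
	long long i_size;
	long long i_disksize;
	bool guarded;		/* models ext3_should_guard_data(inode) */
};

static bool model_should_guard_data(const struct model_inode *inode)
{
	return inode->guarded;
}

/* Tail of the block allocation path, reduced to the i_disksize update. */
static void model_finish_get_blocks(struct model_inode *inode, int err,
				    bool extend_disksize)
{
	assert(inode != NULL);

	/* Make the "directories only" assumption loud if it ever breaks. */
	MODEL_WARN_ON(extend_disksize && model_should_guard_data(inode));

	/* The check from the patch is kept; the update only grows the size. */
	if (!model_should_guard_data(inode) && !err && extend_disksize &&
	    inode->i_size > inode->i_disksize)
		inode->i_disksize = inode->i_size;
}
```

With the warning in place, a guarded inode reaching this path with extend_disksize set still skips the update, but it no longer does so silently.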

Also I think you can have races between direct IO writing to the file
(updating i_disksize) and your completion handler updating i_disksize -
direct IO plays tricks with i_disksize to truncate allocated blocks in case
of a failed write... It's all nasty ;( Probably we should somehow set clear
rules about i_disksize updates and clean up the code to obey them.
Otherwise we'll be hunting nasty races for another two years...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR