From: Dmitry Monakhov <dmonakhov@openvz.org>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, tytso@mit.edu, jack@suse.cz,
wenqing.lz@taobao.com
Subject: Re: [PATCH 4/7] ext4: fsync should wait for DIO writers
Date: Mon, 10 Sep 2012 14:56:04 +0400
Message-ID: <87bohegn97.fsf@openvz.org>
In-Reply-To: <20120910095135.GF22903@quack.suse.cz>
On Mon, 10 Sep 2012 11:51:35 +0200, Jan Kara <jack@suse.cz> wrote:
> On Sun 09-09-12 21:27:11, Dmitry Monakhov wrote:
> > fsync and punch_hole are the places where we have to wait for all
> > existing writers (writeback, aio, dio), but currently we simply
> > flush pending end_io requests, which is not sufficient.
> Why not? I guess you mean the fact that there can be DIO in flight for
> which end_io() was not called so it is not queued in the queue? But that is
> OK - we have not yet called aio_complete() for that IO so for userspace the
> write has not happened yet. Thus there's no need to flush it to disk -
> fsync() does not say anything about writes in progress while fsync is
> called.
From the POSIX point of view (which is a good one), waiting for in-flight
AIO-DIO may indeed be an excessive guarantee.
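To make the distinction concrete, here is a toy, single-threaded userspace
model. It is not kernel code, and every name in it (toy_inode,
toy_flush_completed_io, toy_dio_wait, toy_fsync_demo) is made up for
illustration. It only shows that flushing queued end_io work leaves a
request that is still in flight untouched, while waiting first drains it:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQ 8

/* Lifecycle of one direct-I/O request in this toy model. */
enum dio_state { DIO_IN_FLIGHT, DIO_END_IO_QUEUED, DIO_DONE };

/* Hypothetical stand-in for the per-inode bookkeeping. */
struct toy_inode {
    enum dio_state req[TOY_MAX_REQ];
    size_t nr;                      /* number of valid entries in req[] */
};

/*
 * Models "flush completed I/O": finishes only requests whose end_io
 * was already queued. Returns how many requests are still not done.
 */
static size_t toy_flush_completed_io(struct toy_inode *ino)
{
    size_t i, pending = 0;

    assert(ino->nr <= TOY_MAX_REQ);
    for (i = 0; i < ino->nr; i++) {
        if (ino->req[i] == DIO_END_IO_QUEUED)
            ino->req[i] = DIO_DONE;
        if (ino->req[i] != DIO_DONE)
            pending++;
    }
    return pending;
}

/*
 * Models "wait for DIO": every in-flight request completes on the
 * "device", which queues its end_io work.
 */
static void toy_dio_wait(struct toy_inode *ino)
{
    size_t i;

    assert(ino->nr <= TOY_MAX_REQ);
    for (i = 0; i < ino->nr; i++)
        if (ino->req[i] == DIO_IN_FLIGHT)
            ino->req[i] = DIO_END_IO_QUEUED;
}

/* Runs a toy "fsync" and returns the number of requests left pending. */
static size_t toy_fsync_demo(int wait_first)
{
    struct toy_inode ino = {
        { DIO_IN_FLIGHT, DIO_END_IO_QUEUED, DIO_DONE }, 3
    };

    if (wait_first)
        toy_dio_wait(&ino);
    return toy_flush_completed_io(&ino);
}
```

With wait_first == 0 one request is left pending; with wait_first != 0 none
are. Whether fsync must give that stronger guarantee is exactly the open
question above; for punch_hole it is clearly needed.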
> > Moreover, i_mutex is not held during punch_hole, which obviously
> > results in dangerous data corruption due to write-after-free.
> Yes, that's a bug. I also noticed that but didn't get to fixing it (I'm
> actually working on a more long term fix using range locking but that's
> more of a research project so having somehow fixed at least the most
> blatant locking problems is good).
Yes, you are right. In order to do things right we should block:
1) direct I/O
2) pagecache/mmap users (writeback, readpage)
I assume I've fixed (1), but (2) still exists.
My current idea is to do something similar to writeback:
    down_write(&EXT4_I(inode)->i_data_sem);
    while (index <= end && pagevec_lookup(&pvec, mapping, index, ...)) {
        lock_page(pvec[i]);
        zero_user_page(pvec[i], 0, PAGE_SIZE);
        ret = try_to_release_page(pvec[i]);
    }
    /* At this point we know that we have locked all pages in the range.
     * NOTE: currently ext4_ext_remove_space() may drop i_data_sem
     * internally, so it should be modified to exit once i_mutex was dropped
     */
    ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK);
    while (pvec_num) {
        unlock_page(pvec[i]);
    }
    up_write(&EXT4_I(inode)->i_data_sem);
The number of locked pages should not be too large.
Or, going further, instead of locking all pages at once we can lock the
pages one by one and simulate a fake writeback, so all new writers will
wait on that bit, but readers will see zeroes:
    down_write(&EXT4_I(inode)->i_data_sem);
    while (index <= end && pagevec_lookup(&pvec, mapping, index, ...)) {
        lock_page(pvec[i]);
        zero_user_page(pvec[i], 0, PAGE_SIZE);
        ret = try_to_release_page(pvec[i]);
        set_page_writeback(pvec[i]);
        unlock_page(pvec[i]);
    }
    ret = ext4_ext_remove_space(inode, from, to, NO_RELOCK);
    while (pvec_num) {
        end_page_writeback(pvec[i]);
    }
    up_write(&EXT4_I(inode)->i_data_sem);
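The intended behaviour of the fake-writeback variant can be sketched in
userspace. This is only a toy model under my own assumptions: the toy_*
names, the page size and the "mapped" flag are all hypothetical, and a
writer that would sleep on the writeback bit is modelled by returning
false instead of blocking:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_NR_PAGES  4

struct toy_page {
    bool writeback;                     /* fake "under writeback" bit */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_mapping {
    struct toy_page pages[TOY_NR_PAGES];
    bool mapped[TOY_NR_PAGES];          /* page still has a block behind it */
};

static void toy_mapping_init(struct toy_mapping *m, unsigned char fill)
{
    for (int i = 0; i < TOY_NR_PAGES; i++) {
        m->pages[i].writeback = false;
        memset(m->pages[i].data, fill, TOY_PAGE_SIZE);
        m->mapped[i] = true;
    }
}

/* Phase 1: zero each page in [first, last] and mark it under writeback. */
static void toy_punch_begin(struct toy_mapping *m, int first, int last)
{
    assert(first >= 0 && last < TOY_NR_PAGES);
    for (int i = first; i <= last; i++) {
        memset(m->pages[i].data, 0, TOY_PAGE_SIZE);
        m->pages[i].writeback = true;
    }
}

/* Phase 2: stands in for removing the blocks behind the range. */
static void toy_remove_space(struct toy_mapping *m, int first, int last)
{
    assert(first >= 0 && last < TOY_NR_PAGES);
    for (int i = first; i <= last; i++)
        m->mapped[i] = false;
}

/* Phase 3: clear the fake writeback bit so writers may proceed again. */
static void toy_punch_end(struct toy_mapping *m, int first, int last)
{
    assert(first >= 0 && last < TOY_NR_PAGES);
    for (int i = first; i <= last; i++)
        m->pages[i].writeback = false;
}

/* A writer must wait for writeback; here "would wait" is reported as false. */
static bool toy_try_write(struct toy_mapping *m, int idx, unsigned char byte)
{
    if (m->pages[idx].writeback)
        return false;
    memset(m->pages[idx].data, byte, TOY_PAGE_SIZE);
    m->mapped[idx] = true;              /* a write maps the page again */
    return true;
}

/* Readers never wait in this model; they just see the current contents. */
static unsigned char toy_read(const struct toy_mapping *m, int idx)
{
    return m->pages[idx].data[0];
}

/* Returns 1 if the model behaves as described in the text, 0 otherwise. */
static int toy_punch_demo(void)
{
    struct toy_mapping m;
    int ok = 1;

    toy_mapping_init(&m, 0xAA);
    toy_punch_begin(&m, 1, 2);
    ok &= !toy_try_write(&m, 1, 0x55);  /* writer in range must wait */
    ok &= toy_read(&m, 1) == 0x00;      /* reader in range sees zeroes */
    ok &= toy_try_write(&m, 0, 0x55);   /* page outside range is unaffected */
    toy_remove_space(&m, 1, 2);
    toy_punch_end(&m, 1, 2);
    ok &= !m.mapped[1];                 /* the hole is really there */
    ok &= toy_try_write(&m, 1, 0x55);   /* writers proceed afterwards */
    return ok;
}
```

The point of the model is only the ordering: no writer can touch a page in
the range between the zeroing and the end of space removal, which is what
closes the write-after-free window.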
>
> Honza
>
> >
> > This patch performs following changes:
> >
> > - Guard punch_hole with i_mutex
> > - fsync and punch_hole now wait for all writers in flight
> > NOTE: XXX write-after-free race is still possible because
> > truncate_pagecache_range() is not completely reliable and there
> > is no easy way to stop writeback while punch_hole is in progress.
> >
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> > fs/ext4/extents.c | 10 ++++++++--
> > fs/ext4/fsync.c | 1 +
> > 2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index e993879..8252651 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4845,6 +4845,7 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> > return err;
> > }
> >
> > + mutex_lock(&inode->i_mutex);
> > /* Now release the pages */
> > if (last_page_offset > first_page_offset) {
> > truncate_pagecache_range(inode, first_page_offset,
> > @@ -4852,12 +4853,15 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
> > }
> >
> > /* finish any pending end_io work */
> > + inode_dio_wait(inode);
> > ext4_flush_completed_IO(inode);
> >
> > credits = ext4_writepage_trans_blocks(inode);
> > handle = ext4_journal_start(inode, credits);
> > - if (IS_ERR(handle))
> > - return PTR_ERR(handle);
> > + if (IS_ERR(handle)) {
> > + err = PTR_ERR(handle);
> > + goto out_mutex;
> > + }
> >
> > err = ext4_orphan_add(handle, inode);
> > if (err)
> > @@ -4951,6 +4955,8 @@ out:
> > inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> > ext4_mark_inode_dirty(handle, inode);
> > ext4_journal_stop(handle);
> > +out_mutex:
> > + mutex_unlock(&inode->i_mutex);
> > return err;
> > }
> > int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> > index 24f3719..290c5cf 100644
> > --- a/fs/ext4/fsync.c
> > +++ b/fs/ext4/fsync.c
> > @@ -204,6 +204,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> > if (inode->i_sb->s_flags & MS_RDONLY)
> > goto out;
> >
> > + inode_dio_wait(inode);
> > ret = ext4_flush_completed_IO(inode);
> > if (ret < 0)
> > goto out;
> > --
> > 1.7.7.6
> >
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
Thread overview: 30+ messages
2012-09-09 17:27 [PATCH 0/7] ext4: Bunch of DIO/AIO fixes Dmitry Monakhov
2012-09-09 17:27 ` [PATCH 1/7] ext4: ext4_inode_info diet Dmitry Monakhov
2012-09-13 10:50 ` Zheng Liu
2012-09-13 11:15 ` Dmitry Monakhov
2012-09-15 15:53 ` Theodore Ts'o
2012-09-09 17:27 ` [PATCH 2/7] ext4: completed_io locking cleanup Dmitry Monakhov
2012-09-10 9:23 ` Jan Kara
2012-09-10 10:19 ` Dmitry Monakhov
2012-09-13 10:48 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 3/7] ext4: serialize dio nolocked reads with defrag workers V2 Dmitry Monakhov
2012-09-10 9:31 ` Jan Kara
2012-09-10 10:00 ` Jan Kara
2012-09-09 17:27 ` [PATCH 4/7] ext4: fsync should wait for DIO writers Dmitry Monakhov
2012-09-10 9:51 ` Jan Kara
2012-09-10 10:56 ` Dmitry Monakhov [this message]
2012-09-12 14:02 ` Jan Kara
2012-09-12 5:40 ` Zheng Liu
2012-09-13 10:46 ` Zheng Liu
2012-09-13 11:01 ` Dmitry Monakhov
2012-09-13 12:36 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 5/7] ext4: serialize unlocked dio reads with truncate Dmitry Monakhov
2012-09-10 9:54 ` Jan Kara
2012-09-09 17:27 ` [PATCH 6/7] ext4: endless truncate due to nonlocked dio readers V2 Dmitry Monakhov
2012-09-13 10:41 ` Zheng Liu
2012-09-13 12:07 ` Jan Kara
2012-09-13 12:57 ` Zheng Liu
2012-09-13 14:34 ` Jan Kara
2012-09-13 23:31 ` Zheng Liu
2012-09-09 17:27 ` [PATCH 7/7] ext4: serialize truncate with owerwrite DIO workers V2 Dmitry Monakhov
2012-09-13 10:37 ` Zheng Liu