From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, Ted Tso <tytso@mit.edu>,
linux-ext4@vger.kernel.org
Subject: Re: [PATCH] ext4: Fix data corruption in inodes with journalled data
Date: Mon, 25 Jul 2011 16:26:14 +0200
Message-ID: <20110725142614.GA6107@quack.suse.cz>
In-Reply-To: <CAOQ4uxjef-LrZvJkhw=2HvUN6UGtteW30gNUi2yU3LPP_oQhzw@mail.gmail.com>

  Hello Amir,

On Sat 23-07-11 16:21:55, Amir Goldstein wrote:
> On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara <jack@suse.cz> wrote:
> > When journalling data for an inode (either because it is a symlink or
> > because the filesystem is mounted in data=journal mode),
> > ext4_evict_inode() can discard unwritten data by calling
> > truncate_inode_pages(). This is because we don't mark the buffer / page
> > dirty when journalling data but only add the buffer to the running
> > transaction and thus mm does not know there are still unwritten data.
> >
> > Fix the problem by carefully tracking transaction containing inode's
> > data, committing this transaction, and writing uncheckpointed buffers
> > when inode should be reaped.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/inode.c |   29 +++++++++++++++++++++++++++++
> >  1 files changed, 29 insertions(+), 0 deletions(-)
> >
> > This is the ext4 version of an ext3 fix I sent a while ago. It received
> > only light testing, but I figured you might want to get the patch earlier
> > rather than later given the merge window is open.
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e3126c0..019995b 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -190,6 +190,33 @@ void ext4_evict_inode(struct inode *inode)
> >  
> >  	trace_ext4_evict_inode(inode);
> >  	if (inode->i_nlink) {
> > +		/*
> > +		 * When journalling data dirty buffers are tracked only in the
> > +		 * journal. So although mm thinks everything is clean and
> > +		 * ready for reaping the inode might still have some pages to
> > +		 * write in the running transaction or waiting to be
> > +		 * checkpointed. Thus calling jbd2_journal_invalidatepage()
> > +		 * (via truncate_inode_pages()) to discard these buffers can
> > +		 * cause data loss. Also even if we did not discard these
> > +		 * buffers, we would have no way to find them after the inode
> > +		 * is reaped and thus user could see stale data if he tries to
> > +		 * read them before the transaction is checkpointed. So be
> > +		 * careful and force everything to disk here... We use
> > +		 * ei->i_datasync_tid to store the newest transaction
> > +		 * containing inode's data.
> > +		 *
> > +		 * Note that directories do not have this problem because they
> > +		 * don't use page cache.
> > +		 */
> > +		if (ext4_should_journal_data(inode) &&
> > +		    (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
> > +			journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > +			tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;
> > +
> > +			jbd2_log_start_commit(journal, commit_tid);
> > +			jbd2_log_wait_commit(journal, commit_tid);
> > +			filemap_write_and_wait(&inode->i_data);
> > +		}
> >  		truncate_inode_pages(&inode->i_data, 0);
> >  		goto no_delete;
> >  	}
> > @@ -1863,6 +1890,7 @@ static int ext4_journalled_write_end(struct file *file,
> >  	if (new_i_size > inode->i_size)
> >  		i_size_write(inode, pos+copied);
> >  	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> > +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> >  	if (new_i_size > EXT4_I(inode)->i_disksize) {
> >  		ext4_update_i_disksize(inode, new_i_size);
> >  		ret2 = ext4_mark_inode_dirty(handle, inode);
> > @@ -2571,6 +2599,7 @@ static int __ext4_journalled_writepage(struct page *page,
> >  				write_end_fn);
> >  	if (ret == 0)
> >  		ret = err;
> > +	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> >  	err = ext4_journal_stop(handle);
> >  	if (!ret)
> >  		ret = err;
> > --
> > 1.7.1
> >
> Patch looks correct to me, but I am uncomfortable with i_datasync_tid
> being treated differently in journalled write - that is, being set on
> different places in the write paths.
>
> How about setting i_datasync_tid in a more generic place like
> ext4_{,da_}write_begin()? I know it's a bit redundant to setting dirty
> pages, but at least this way i_datasync_tid can be checked in all journal
> modes and have a consistent meaning.

  Well, I kept the meaning that i_datasync_tid is the ID of the transaction
that must be committed for an inode's data to be safely on disk. It is true
that in data=journal mode we need to update this number differently than
in other journaling modes, but I don't think that's important. Currently, we
just force a commit in data=journal mode in every case and thus we do not
really care about the value of i_datasync_tid for fsync. In the future we
could be more clever and avoid transaction commits for fsync in data=journal
mode in some cases. So in fact I'd say the code is now *more* consistent than
it used to be. The only thing that isn't quite consistent is that I didn't
bother with updating i_sync_tid because we currently do not use it. If
people want, that might be a useful cleanup which I can do.

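To illustrate the meaning of i_datasync_tid described above, here is a
rough userspace sketch. It is only a toy model under stated assumptions:
every toy_* name is hypothetical and not a kernel interface, a "commit" is
just a counter update, and checkpointing and the filemap_write_and_wait()
step of the patch are left out. The tid comparison is written in the spirit
of jbd2's wraparound-safe tid_gt().

```c
/* Toy userspace model; none of these types or helpers are the kernel's. */
#include <assert.h>
#include <stdbool.h>

typedef unsigned int tid_t;

struct toy_journal {
	tid_t running_tid;	/* transaction currently accepting updates */
	tid_t committed_tid;	/* newest transaction already committed */
	unsigned int commits;	/* number of commits we had to force */
};

struct toy_inode {
	bool journals_data;	/* data=journal mode, or a symlink */
	bool has_data;		/* at least one journalled write happened */
	tid_t datasync_tid;	/* newest transaction holding this inode's data */
};

/* Wraparound-safe "a is newer than b". */
static bool toy_tid_gt(tid_t a, tid_t b)
{
	return (int)(a - b) > 0;
}

/* Journalled write: remember which transaction carries the data. */
static void toy_journalled_write(struct toy_journal *j, struct toy_inode *inode)
{
	inode->datasync_tid = j->running_tid;
	inode->has_data = true;
}

/* Commit everything up to and including tid, unless already committed. */
static void toy_commit_up_to(struct toy_journal *j, tid_t tid)
{
	if (!toy_tid_gt(tid, j->committed_tid))
		return;				/* nothing to do */
	assert(!toy_tid_gt(tid, j->running_tid));	/* tid must exist */
	j->committed_tid = tid;
	if (!toy_tid_gt(j->running_tid, tid))
		j->running_tid = tid + 1;	/* open a new running transaction */
	j->commits++;
}

/*
 * Eviction. With the fix, the transaction named by datasync_tid is
 * committed before the page cache would be dropped. Returns true if the
 * inode's data was committed at the point the pages would go away.
 */
static bool toy_evict(struct toy_journal *j, struct toy_inode *inode,
		      bool with_fix)
{
	if (with_fix && inode->journals_data && inode->has_data)
		toy_commit_up_to(j, inode->datasync_tid);
	/* truncate_inode_pages() would run here in the real code */
	return !inode->has_data ||
	       !toy_tid_gt(inode->datasync_tid, j->committed_tid);
}
```

In this model, evicting an inode right after a journalled write reports the
data as unsafe without the fix and as safe with it, and a second eviction
does not force another commit because the recorded tid is already committed.
That last property is the kind of check a smarter fsync in data=journal mode
could rely on.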
> Perhaps we can even use i_datasync_tid to optimize away things like
> fiemap checks for dirty pages.

  Umm, I'm not sure which checks you mean...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR