From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH] ext4: Fix data corruption in inodes with journalled data Date: Mon, 25 Jul 2011 16:26:14 +0200 Message-ID: <20110725142614.GA6107@quack.suse.cz> References: <1311381577-31406-1-git-send-email-jack@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Jan Kara , Ted Tso , linux-ext4@vger.kernel.org To: Amir Goldstein Return-path: Received: from cantor2.suse.de ([195.135.220.15]:55875 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752205Ab1GYO0Q (ORCPT ); Mon, 25 Jul 2011 10:26:16 -0400 Content-Disposition: inline In-Reply-To: Sender: linux-ext4-owner@vger.kernel.org List-ID: Hello Amir, On Sat 23-07-11 16:21:55, Amir Goldstein wrote: > On Sat, Jul 23, 2011 at 3:39 AM, Jan Kara wrote: > > When journalling data for an inode (either because it is a symlink = or > > because the filesystem is mounted in data=3Djournal mode), > > ext4_evict_inode() can discard unwritten data by calling > > truncate_inode_pages(). This is because we don't mark the buffer / = page > > dirty when journalling data but only add the buffer to the running > > transaction and thus mm does not know there are still unwritten dat= a. > > > > Fix the problem by carefully tracking transaction containing inode'= s > > data, committing this transaction, and writing uncheckpointed buffe= rs > > when inode should be reaped. > > > > Signed-off-by: Jan Kara --- =A0fs/ext4/inode.c | =A0= 29 > > +++++++++++++++++++++++++++++ =A01 files changed, 29 insertions(+),= 0 > > deletions(-) > > > > =A0This is ext4 version of an ext3 fix I sent a while ago. It recei= ved > > only light testing but I figured you might want get the patch earli= er > > rather than later given the merge window is open. > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e3126c0..01999= 5b > > 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -190,6 +190,3= 3 @@ > > void ext4_evict_inode(struct inode *inode) > > > > =A0 =A0 =A0 =A0trace_ext4_evict_inode(inode); =A0 =A0 =A0 =A0if (in= ode->i_nlink) { + =A0 > > =A0 =A0 =A0 =A0 =A0 =A0 /* + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* When = journalling data dirty buffers > > are tracked only in the + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* journal.= So although mm > > thinks everything is clean and + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* r= eady for reaping the > > inode might still have some pages to + =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0* write in the > > running transaction or waiting to be + =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0* checkpointed. > > Thus calling jbd2_journal_invalidatepage() + =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0* (via > > truncate_inode_pages()) to discard these buffers can + =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0* > > cause data loss. Also even if we did not discard these + =A0 =A0 =A0= =A0 =A0 =A0 =A0 > > =A0* buffers, we would have no way to find them after the inode + =A0= =A0 =A0 =A0 > > =A0 =A0 =A0 =A0* is reaped and thus user could see stale data if he= tries to + > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* read them before the transaction i= s checkpointed. So > > be + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* careful and force everything = to disk here... We > > use + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* ei->i_datasync_tid to store = the newest > > transaction + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* containing inode's d= ata. + =A0 =A0 =A0 =A0 =A0 =A0 > > =A0 =A0* + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* Note that directories d= o not have this problem > > because they + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* don't use page cach= e. + =A0 =A0 =A0 =A0 =A0 =A0 =A0 > > =A0*/ + =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (ext4_should_journal_data(in= ode) && + =A0 =A0 =A0 =A0 =A0 > > =A0 =A0 =A0 =A0 (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode)))= { + =A0 =A0 =A0 =A0 > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 journal_t *journal =3D EXT4_SB(inode->i= _sb)->s_journal; + =A0 > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 tid_t commit_tid =3D EXT4_I= (inode)->i_datasync_tid; + > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jbd2_log_start_commit= (journal, commit_tid); + =A0 > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jbd2_log_wait_commit(journa= l, commit_tid); + =A0 =A0 =A0 > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 filemap_write_and_wait(&inode->i_da= ta); + =A0 =A0 =A0 =A0 =A0 =A0 =A0 > > } =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0truncate_inode_pages(&inode->i_dat= a, 0); =A0 =A0 =A0 =A0 =A0 =A0 =A0 > > =A0goto no_delete; =A0 =A0 =A0 =A0} @@ -1863,6 +1890,7 @@ static in= t > > ext4_journalled_write_end(struct file *file, =A0 =A0 =A0 =A0if (new= _i_size > > > inode->i_size) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0i_size_write(inode, p= os+copied); =A0 =A0 =A0 > > =A0ext4_set_inode_state(inode, EXT4_STATE_JDATA); + =A0 =A0 =A0 > > EXT4_I(inode)->i_datasync_tid =3D handle->h_transaction->t_tid; =A0= =A0 =A0 =A0if > > (new_i_size > EXT4_I(inode)->i_disksize) { =A0 =A0 =A0 =A0 =A0 =A0 = =A0 > > =A0ext4_update_i_disksize(inode, new_i_size); =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0ret2 =3D > > ext4_mark_inode_dirty(handle, inode); @@ -2571,6 +2599,7 @@ static = int > > __ext4_journalled_writepage(struct page *page, =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 > > =A0 =A0 =A0 =A0write_end_fn); =A0 =A0 =A0 =A0if (ret =3D=3D 0) =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0ret =3D err; + > > =A0 =A0 =A0 EXT4_I(inode)->i_datasync_tid =3D handle->h_transaction= ->t_tid; =A0 =A0 > > =A0 =A0err =3D ext4_journal_stop(handle); =A0 =A0 =A0 =A0if (!ret) = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret > > =3D err; -- 1.7.1 > > > Patch looks correct to me, but I am uncomfortable with i_datasync_tid > being treated differently in journalled write - that is, being set on > different places in the write paths. >=20 > How about setting i_datasync_tid in a more generic place like > ext4_{,da_}write_begin()? I know it's a bit redundant to setting dir= ty > pages, but at least this way i_datasync_tid can be checked in all jou= rnal > modes and have a consistent meaning. Well, I kept the meaning that i_datasync_tid is ID of a transaction t= hat must be committed for a data of an inode to be safely on disk. It is tr= ue that in data=3Djournal mode, we need to update this number differently = than in other journaling modes but that's not important I think. Currently, = we just force commit in data=3Djournal mode in every case and thus we do n= ot really care about the value of i_datasync_tid for fsync. In future we c= ould be more clever and avoid transaction commits for fsync in data=3Djourna= l mode in some cases. So in fact I'd say the code is now *more* consistent th= an it used to be. The only thing that isn't quite consistent is that I di= dn't bother with updating i_sync_tid because we currently do not use it. If people want, that might be a useful cleanup which I can do. > Perhaps we can even use i_datasync_tid to optimize away things like > fiemap checks for dirty pages. Umm, I'm not sure which checks do you mean... Honza --=20 Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html