From: Liu Bo <bo.li.liu@oracle.com>
To: Filipe David Manana <fdmanana@gmail.com>
Cc: Filipe Manana <fdmanana@suse.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
stable@vger.kernel.org
Subject: Re: [PATCH v2] Btrfs: fix data loss in the fast fsync path
Date: Tue, 3 Mar 2015 21:31:52 +0800 [thread overview]
Message-ID: <20150303133151.GD18292@localhost.localdomain> (raw)
In-Reply-To: <CAL3q7H5eETLdSmvsdgazkND2oR+NrA3LPheZUjdQvg1ghayObg@mail.gmail.com>
On Tue, Mar 03, 2015 at 11:15:17AM +0000, Filipe David Manana wrote:
> On Tue, Mar 3, 2015 at 12:41 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Sun, Mar 01, 2015 at 09:08:38AM +0000, Filipe Manana wrote:
> >> When using the fast file fsync code path we can miss the fact that new
> >> writes happened since the last file fsync and therefore return without
> >> waiting for the IO to finish and write the new extents to the fsync log.
> >>
> >> Here's an example scenario where the fsync will miss the fact that new
> >> file data exists that wasn't yet durably persisted:
> >>
> >> 1. fs_info->last_trans_committed == N - 1 and current transaction is
> >> transaction N (fs_info->generation == N);
> >>
> >> 2. do a buffered write;
> >>
> >> 3. fsync our inode, this clears our inode's full sync flag, starts
> >> an ordered extent and waits for it to complete - when it completes
> >> at btrfs_finish_ordered_io(), the inode's last_trans is set to the
> >> value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
> >> btrfs_set_inode_last_trans);
> >>
> >> 4. transaction N is committed, so fs_info->last_trans_committed is now
> >> set to the value N and fs_info->generation remains with the value N;
> >>
> >> 5. do another buffered write, when this happens btrfs_file_write_iter
> >> sets our inode's last_trans to the value N + 1 (that is
> >> fs_info->generation + 1 == N + 1);
> >>
> >> 6. transaction N + 1 is started and fs_info->generation now has the
> >> value N + 1;
> >>
> >> 7. transaction N + 1 is committed, so fs_info->last_trans_committed
> >> is set to the value N + 1;
> >>
> >> 8. fsync our inode - because it doesn't have the full sync flag set,
> >> we only start the ordered extent, we don't wait for it to complete
> >> (only in a later phase) therefore its last_trans field has the
> >> value N + 1 set previously by btrfs_file_write_iter(), and so we
> >> have:
> >>
> >> inode->last_trans <= fs_info->last_trans_committed
> >> (N + 1) (N + 1)
> >>
> >> Which made us not log the last buffered write and exit the fsync
> >> handler immediately, returning success (0) to user space and resulting
> >> in data loss after a crash.
> >>
> >> This can actually be triggered deterministically and the following excerpt
> >> from a testcase I made for xfstests triggers the issue. It moves a dummy
> >> file across directories and then fsyncs the old parent directory - this
> >> is just to trigger a transaction commit, so moving files around isn't
> >> directly related to the issue but it was chosen because running 'sync' for
> >> example does more than just committing the current transaction, as it
> >> flushes/waits for all file data to be persisted. The issue can also happen
> >> at random periods, since the transaction kthread periodicaly commits the
> >> current transaction (about every 30 seconds by default).
> >> The body of the test is:
> >>
> >> _scratch_mkfs >> $seqres.full 2>&1
> >> _init_flakey
> >> _mount_flakey
> >>
> >> # Create our main test file 'foo', the one we check for data loss.
> >> # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
> >> # bit from its flags (btrfs inode specific flags).
> >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
> >> -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
> >>
> >> # Now create one other file and 2 directories. We will move this second file
> >> # from one directory to the other later because it forces btrfs to commit its
> >> # currently open transaction if we fsync the old parent directory. This is
> >> # necessary to trigger the data loss bug that affected btrfs.
> >> mkdir $SCRATCH_MNT/testdir_1
> >> touch $SCRATCH_MNT/testdir_1/bar
> >> mkdir $SCRATCH_MNT/testdir_2
> >>
> >> # Make sure everything is durably persisted.
> >> sync
> >>
> >> # Write more 8Kb of data to our file.
> >> $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
> >>
> >> # Move our 'bar' file into a new directory.
> >> mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
> >>
> >> # Fsync our first directory. Because it had a file moved into some other
> >> # directory, this made btrfs commit the currently open transaction. This is
> >> # a condition necessary to trigger the data loss bug.
> >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
> >>
> >> # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
> >> # data we wrote previously to be persisted and available if a crash happens.
> >> # This did not happen with btrfs, because of the transaction commit that
> >> # happened when we fsynced the parent directory.
> >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
> >>
> >> # Simulate a crash/power loss.
> >> _load_flakey_table $FLAKEY_DROP_WRITES
> >> _unmount_flakey
> >>
> >> _load_flakey_table $FLAKEY_ALLOW_WRITES
> >> _mount_flakey
> >>
> >> # Now check that all data we wrote before are available.
> >> echo "File content after log replay:"
> >> od -t x1 $SCRATCH_MNT/foo
> >>
> >> status=0
> >> exit
> >>
> >> The expected golden output for the test, which is what we get with this
> >> fix applied (or when running against ext3/4 and xfs), is:
> >>
> >> wrote 8192/8192 bytes at offset 0
> >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >> wrote 8192/8192 bytes at offset 8192
> >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >> File content after log replay:
> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> *
> >> 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> >> *
> >> 0040000
> >>
> >> Without this fix applied, the output shows the test file does not have
> >> the second 8Kb extent that we successfully fsynced:
> >>
> >> wrote 8192/8192 bytes at offset 0
> >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >> wrote 8192/8192 bytes at offset 8192
> >> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >> File content after log replay:
> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> *
> >> 0020000
> >>
> >> So fix this by skipping the fsync only if we're doing a full sync and
> >> if the inode's last_trans is <= fs_info->last_trans_committed, or if
> >> the inode is already in the log. Also remove setting the inode's
> >> last_trans in btrfs_file_write_iter since it's useless/unreliable.
> >>
> >> A test case for xfstests will be sent soon.
> >>
> >> CC: <stable@vger.kernel.org>
> >> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>
> >> V2: Removed dead assignment of inode->last_trans in btrfs_file_write_iter
> >> (and the respective comment) since it's useless now. Added stable to
> >> cc because it's a data loss fix.
> >>
> >> fs/btrfs/file.c | 45 ++++++++++++++++++++++++++++-----------------
> >> 1 file changed, 28 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> >> index 2bd72cd..b7334c9 100644
> >> --- a/fs/btrfs/file.c
> >> +++ b/fs/btrfs/file.c
> >> @@ -1811,22 +1811,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >> mutex_unlock(&inode->i_mutex);
> >>
> >> /*
> >> - * we want to make sure fsync finds this change
> >> - * but we haven't joined a transaction running right now.
> >> - *
> >> - * Later on, someone is sure to update the inode and get the
> >> - * real transid recorded.
> >> - *
> >> - * We set last_trans now to the fs_info generation + 1,
> >> - * this will either be one more than the running transaction
> >> - * or the generation used for the next transaction if there isn't
> >> - * one running right now.
> >> - *
> >> * We also have to set last_sub_trans to the current log transid,
> >> * otherwise subsequent syncs to a file that's been synced in this
> >> * transaction will appear to have already occured.
> >> */
> >> - BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
> >
> > By thinking twice about it, how about setting ->last_trans with (-1ULL)?
> >
> > So the benefit is that if new writes have already finished its endio where
> > calling btrfs_set_inode_last_trans() to set ->last_trans with a transid
> > at that age, we may get a win for skipping log part if someone else has
> > updated ->last_trans_committed.
> >
> > By limiting it to 'full_sync' case we lose the above opportunity.
>
> That still won't work.
>
> Imagine the following the scenario:
>
> 1) do 2 buffered writes to 2 different ranges of the inode - the
> inode's last_trans is set to (u64)-1;
>
> 2) writepages is called against the first range only (either the VM
> called it due to memory pressure or a ranged fsync like msync for
> example);
>
> 3) the ordered extent started by the previous writepages calls
> completes and sets inode->last_trans to N (N == current transaction
> id/generation);
>
> 4) transaction N commits;
>
> 5) fsync the file (either whole range or a range covering only the
> second dirty range) - this will bail out since last_trans ==
> last_trans_committed, not logging the second dirty range.
Good explanation, it's true.
Thanks,
-liubo
>
> thanks
>
>
> >
> > Thanks,
> >
> > -liubo
> >> BTRFS_I(inode)->last_sub_trans = root->log_transid;
> >> if (num_written > 0) {
> >> err = generic_write_sync(file, pos, num_written);
> >> @@ -1971,14 +1959,37 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >> }
> >>
> >> /*
> >> - * if the last transaction that changed this file was before
> >> - * the current transaction, we can bail out now without any
> >> - * syncing
> >> + * If the last transaction that changed this file was before the current
> >> + * transaction and we have the full sync flag set in our inode, we can
> >> + * bail out now without any syncing.
> >> + *
> >> + * Note that we can't bail out if the full sync flag isn't set. This is
> >> + * because when the full sync flag is set we start all ordered extents
> >> + * and wait for them to fully complete - when they complete they update
> >> + * the inode's last_trans field through:
> >> + *
> >> + * btrfs_finish_ordered_io() ->
> >> + * btrfs_update_inode_fallback() ->
> >> + * btrfs_update_inode() ->
> >> + * btrfs_set_inode_last_trans()
> >> + *
> >> + * So we are sure that last_trans is up to date and can do this check to
> >> + * bail out safely. For the fast path, when the full sync flag is not
> >> + * set in our inode, we can not do it because we start only our ordered
> >> + * extents and don't wait for them to complete (that is when
> >> + * btrfs_finish_ordered_io runs), so here at this point their last_trans
> >> + * value might be less than or equals to fs_info->last_trans_committed,
> >> + * and setting a speculative last_trans for an inode when a buffered
> >> + * write is made (such as fs_info->generation + 1 for example) would not
> >> + * be reliable since after setting the value and before fsync is called
> >> + * any number of transactions can start and commit (transaction kthread
> >> + * commits the current transaction periodically), and a transaction
> >> + * commit does not start nor waits for ordered extents to complete.
> >> */
> >> smp_mb();
> >> if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
> >> - BTRFS_I(inode)->last_trans <=
> >> - root->fs_info->last_trans_committed) {
> >> + (full_sync && BTRFS_I(inode)->last_trans <=
> >> + root->fs_info->last_trans_committed)) {
> >> BTRFS_I(inode)->last_trans = 0;
> >>
> >> /*
> >> --
> >> 2.1.3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."
next prev parent reply other threads:[~2015-03-03 13:32 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-02-26 19:52 [PATCH] Btrfs: fix data loss in the fast fsync path Filipe Manana
2015-02-28 15:04 ` Liu Bo
2015-03-01 9:08 ` [PATCH v2] " Filipe Manana
2015-03-03 0:41 ` Liu Bo
2015-03-03 11:15 ` Filipe David Manana
2015-03-03 13:31 ` Liu Bo [this message]
2015-03-01 20:36 ` [PATCH v3] " Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150303133151.GD18292@localhost.localdomain \
--to=bo.li.liu@oracle.com \
--cc=fdmanana@gmail.com \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).