From: Chris Mason <clm@fb.com>
To: Filipe Manana <fdmanana@suse.com>
Cc: <linux-btrfs@vger.kernel.org>, Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
Date: Mon, 27 Oct 2014 15:57:19 -0400 [thread overview]
Message-ID: <1414439839.2166.2@mail.thefacebook.com> (raw)
In-Reply-To: <1413886363-12442-1-git-send-email-fdmanana@suse.com>
On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana <fdmanana@suse.com>
wrote:
> If right after starting the snapshot creation ioctl we perform a
> write against a
> file followed by a truncate, with both operations increasing the
> file's size, we
> can get a snapshot tree that reflects a state of the source
> subvolume's tree where
> the file truncation happened but the write operation didn't. This
> leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck
> complain about it.
>
> For example, if we perform the following file operations:
>
> $ mkfs.btrfs -f /dev/vdd
> $ mount /dev/vdd /mnt
> $ xfs_io -f \
> -c "pwrite -S 0xaa -b 32K 0 32K" \
> -c "fsync" \
> -c "pwrite -S 0xbb -b 32770 16K 32770" \
> -c "truncate 90123" \
> /mnt/foobar
>
> and the snapshot creation ioctl was just called before the second
> write, we often
> can get the following inode items in the snapshot's btree:
>
> item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
> inode generation 146 transid 7 size 90123 block group
> 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
> item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
> inode ref index 282 namelen 10 name: foobar
> item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
> extent data disk byte 1104855040 nr 32768
> extent data offset 0 nr 32768 ram 32768
> extent compression 0
> item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
> extent data disk byte 0 nr 0
> extent data offset 0 nr 40960 ram 40960
> extent compression 0
>
> There's a file range, corresponding to the interval [32K; ALIGN(16K +
> 32770, 4096)[
> for which there's no file extent item covering it. This is because
> the file write
> and file truncate operations happened both right after the snapshot
> creation ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and
> wait for the
> ordered extent that matches the write and, in btrfs_setsize(), we
> were able to call
> btrfs_cont_expand() before being able to commit the current
> transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole
> file extent
> item in the source subvolume (which represents the region added by
> the truncate)
> right before the transaction commit from the snapshot creation ioctl.
>
> Btrfs' fsck tool complains about such cases with a message like the
> following:
>
> "root 331 inode 257 errors 100, file extent discount"
>
> From a user perspective, the expectation when a snapshot is created
> while those
> file operations are being performed is that the snapshot will have a
> file that
> either:
>
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
>
> But never capture a state where only the first write and the
> truncation were
> captured (since the second write was performed before the truncation).
>
> A test case for xfstests follows.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/inode.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0d41741..c28b78f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode,
> struct iattr *attr)
> }
>
> if (newsize > oldsize) {
> + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> + if (ret)
> + return ret;
Expanding truncates aren't my favorite operation, but we don't want
them to imply fsync. I'm holding off on this one while I work out the
rest of the vacation backlog ;)
-chris
next prev parent reply other threads:[~2014-10-27 19:57 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
2014-10-27 19:57 ` Chris Mason [this message]
2014-10-28 10:10 ` Filipe David Manana
2014-10-29 0:35 ` [PATCH v2] " Filipe Manana
2014-10-29 8:21 ` [PATCH v3] " Filipe Manana
2014-10-29 9:13 ` Miao Xie
2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
2014-10-29 12:19 ` Chris Mason
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1414439839.2166.2@mail.thefacebook.com \
--to=clm@fb.com \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox