All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Mason <clm@fb.com>
To: Filipe Manana <fdmanana@suse.com>
Cc: <linux-btrfs@vger.kernel.org>, Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
Date: Mon, 27 Oct 2014 15:57:19 -0400	[thread overview]
Message-ID: <1414439839.2166.2@mail.thefacebook.com> (raw)
In-Reply-To: <1413886363-12442-1-git-send-email-fdmanana@suse.com>

On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana <fdmanana@suse.com> 
wrote:
> If right after starting the snapshot creation ioctl we perform a 
> write against a
> file followed by a truncate, with both operations increasing the 
> file's size, we
> can get a snapshot tree that reflects a state of the source 
> subvolume's tree where
> the file truncation happened but the write operation didn't. This 
> leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck 
> complain about it.
> 
> For example, if we perform the following file operations:
> 
>     $ mkfs.btrfs -f /dev/vdd
>     $ mount /dev/vdd /mnt
>     $ xfs_io -f \
>           -c "pwrite -S 0xaa -b 32K 0 32K" \
>           -c "fsync" \
>           -c "pwrite -S 0xbb -b 32770 16K 32770" \
>           -c "truncate 90123" \
>           /mnt/foobar
> 
> and the snapshot creation ioctl was just called before the second 
> write, we often
> can get the following inode items in the snapshot's btree:
> 
>         item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
>                 inode generation 146 transid 7 size 90123 block group 
> 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
>         item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
>                 inode ref index 282 namelen 10 name: foobar
>         item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
>                 extent data disk byte 1104855040 nr 32768
>                 extent data offset 0 nr 32768 ram 32768
>                 extent compression 0
>         item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
>                 extent data disk byte 0 nr 0
>                 extent data offset 0 nr 40960 ram 40960
>                 extent compression 0
> 
> There's a file range, corresponding to the interval [32K; ALIGN(16K + 
> 32770, 4096)[
> for which there's no file extent item covering it. This is because 
> the file write
> and file truncate operations happened both right after the snapshot 
> creation ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and 
> wait for the
> ordered extent that matches the write and, in btrfs_setsize(), we 
> were able to call
> btrfs_cont_expand() before being able to commit the current 
> transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole 
> file extent
> item in the source subvolume (which represents the region added by 
> the truncate)
> right before the transaction commit from the snapshot creation ioctl.
> 
> Btrfs' fsck tool complains about such cases with a message like the 
> following:
> 
>     "root 331 inode 257 errors 100, file extent discount"
> 
> From a user perspective, the expectation when a snapshot is created 
> while those
> file operations are being performed is that the snapshot will have a 
> file that
> either:
> 
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
> 
> But never capture a state where only the first write and the 
> truncation were
> captured (since the second write was performed before the truncation).
> 
> A test case for xfstests follows.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0d41741..c28b78f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode, 
> struct iattr *attr)
>  	}
> 
>  	if (newsize > oldsize) {
> +		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> +		if (ret)
> +			return ret;

Expanding truncates aren't my favorite operation, but we don't want 
them to imply fsync.  I'm holding off on this one while I work out the 
rest of the vacation backlog ;)

-chris




  reply	other threads:[~2014-10-27 19:57 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
2014-10-27 19:57 ` Chris Mason [this message]
2014-10-28 10:10   ` Filipe David Manana
2014-10-29  0:35 ` [PATCH v2] " Filipe Manana
2014-10-29  8:21 ` [PATCH v3] " Filipe Manana
2014-10-29  9:13   ` Miao Xie
2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
2014-10-29 12:19   ` Chris Mason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1414439839.2166.2@mail.thefacebook.com \
    --to=clm@fb.com \
    --cc=fdmanana@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.