[PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
@ 2014-10-21 10:12 Filipe Manana
  2014-10-27 19:57 ` Chris Mason
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Filipe Manana @ 2014-10-21 10:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0d41741..c28b78f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 	}

 	if (newsize > oldsize) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		if (ret)
+			return ret;
 		truncate_pagecache(inode, newsize);
 		ret = btrfs_cont_expand(inode, oldsize, newsize);
 		if (ret)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
@ 2014-10-27 19:57 ` Chris Mason
  2014-10-28 10:10   ` Filipe David Manana
  2014-10-29  0:35 ` [PATCH v2] " Filipe Manana
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2014-10-27 19:57 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Filipe Manana

On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana <fdmanana@suse.com> 
wrote:
> If right after starting the snapshot creation ioctl we perform a 
> write against a
> file followed by a truncate, with both operations increasing the 
> file's size, we
> can get a snapshot tree that reflects a state of the source 
> subvolume's tree where
> the file truncation happened but the write operation didn't. This 
> leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck 
> complain about it.
> 
> For example, if we perform the following file operations:
> 
>     $ mkfs.btrfs -f /dev/vdd
>     $ mount /dev/vdd /mnt
>     $ xfs_io -f \
>           -c "pwrite -S 0xaa -b 32K 0 32K" \
>           -c "fsync" \
>           -c "pwrite -S 0xbb -b 32770 16K 32770" \
>           -c "truncate 90123" \
>           /mnt/foobar
> 
> and the snapshot creation ioctl was just called before the second 
> write, we often
> can get the following inode items in the snapshot's btree:
> 
>         item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
>                 inode generation 146 transid 7 size 90123 block group 
> 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
>         item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
>                 inode ref index 282 namelen 10 name: foobar
>         item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
>                 extent data disk byte 1104855040 nr 32768
>                 extent data offset 0 nr 32768 ram 32768
>                 extent compression 0
>         item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
>                 extent data disk byte 0 nr 0
>                 extent data offset 0 nr 40960 ram 40960
>                 extent compression 0
> 
> There's a file range, corresponding to the interval [32K; ALIGN(16K + 
> 32770, 4096)[
> for which there's no file extent item covering it. This is because 
> the file write
> and file truncate operations happened both right after the snapshot 
> creation ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and 
> wait for the
> ordered extent that matches the write and, in btrfs_setsize(), we 
> were able to call
> btrfs_cont_expand() before being able to commit the current 
> transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole 
> file extent
> item in the source subvolume (which represents the region added by 
> the truncate)
> right before the transaction commit from the snapshot creation ioctl.
> 
> Btrfs' fsck tool complains about such cases with a message like the 
> following:
> 
>     "root 331 inode 257 errors 100, file extent discount"
> 
> From a user perspective, the expectation when a snapshot is created 
> while those
> file operations are being performed is that the snapshot will have a 
> file that
> either:
> 
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
> 
> But never capture a state where only the first write and the 
> truncation were
> captured (since the second write was performed before the truncation).
> 
> A test case for xfstests follows.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0d41741..c28b78f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode, 
> struct iattr *attr)
>  	}
> 
>  	if (newsize > oldsize) {
> +		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> +		if (ret)
> +			return ret;

Expanding truncates aren't my favorite operation, but we don't want 
them to imply fsync.  I'm holding off on this one while I work out the 
rest of the vacation backlog ;)

-chris




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-27 19:57 ` Chris Mason
@ 2014-10-28 10:10   ` Filipe David Manana
  0 siblings, 0 replies; 8+ messages in thread
From: Filipe David Manana @ 2014-10-28 10:10 UTC (permalink / raw)
  To: Chris Mason; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org

On Mon, Oct 27, 2014 at 7:57 PM, Chris Mason <clm@fb.com> wrote:
> On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana <fdmanana@suse.com> wrote:
>>
>> If right after starting the snapshot creation ioctl we perform a write
>> against a
>> file followed by a truncate, with both operations increasing the file's
>> size, we
>> can get a snapshot tree that reflects a state of the source subvolume's
>> tree where
>> the file truncation happened but the write operation didn't. This leaves a
>> gap
>> between 2 file extent items of the inode, which makes btrfs' fsck complain
>> about it.
>>
>> For example, if we perform the following file operations:
>>
>>     $ mkfs.btrfs -f /dev/vdd
>>     $ mount /dev/vdd /mnt
>>     $ xfs_io -f \
>>           -c "pwrite -S 0xaa -b 32K 0 32K" \
>>           -c "fsync" \
>>           -c "pwrite -S 0xbb -b 32770 16K 32770" \
>>           -c "truncate 90123" \
>>           /mnt/foobar
>>
>> and the snapshot creation ioctl was just called before the second write,
>> we often
>> can get the following inode items in the snapshot's btree:
>>
>>         item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
>>                 inode generation 146 transid 7 size 90123 block group 0
>> mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
>>         item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
>>                 inode ref index 282 namelen 10 name: foobar
>>         item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
>>                 extent data disk byte 1104855040 nr 32768
>>                 extent data offset 0 nr 32768 ram 32768
>>                 extent compression 0
>>         item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
>>                 extent data disk byte 0 nr 0
>>                 extent data offset 0 nr 40960 ram 40960
>>                 extent compression 0
>>
>> There's a file range, corresponding to the interval [32K; ALIGN(16K +
>> 32770, 4096)[
>> for which there's no file extent item covering it. This is because the
>> file write
>> and file truncate operations happened both right after the snapshot
>> creation ioctl
>> called btrfs_start_delalloc_inodes(), which means we didn't start and wait
>> for the
>> ordered extent that matches the write and, in btrfs_setsize(), we were
>> able to call
>> btrfs_cont_expand() before being able to commit the current transaction in
>> the
>> snapshot creation ioctl. So this made it possibe to insert the hole file
>> extent
>> item in the source subvolume (which represents the region added by the
>> truncate)
>> right before the transaction commit from the snapshot creation ioctl.
>>
>> Btrfs' fsck tool complains about such cases with a message like the
>> following:
>>
>>     "root 331 inode 257 errors 100, file extent discount"
>>
>> From a user perspective, the expectation when a snapshot is created while
>> those
>> file operations are being performed is that the snapshot will have a file
>> that
>> either:
>>
>> 1) is empty
>> 2) only the first write was captured
>> 3) only the 2 writes were captured
>> 4) both writes and the truncation were captured
>>
>> But never capture a state where only the first write and the truncation
>> were
>> captured (since the second write was performed before the truncation).
>>
>> A test case for xfstests follows.
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 0d41741..c28b78f 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode, struct
>> iattr *attr)
>>         }
>>
>>         if (newsize > oldsize) {
>> +               ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
>> +               if (ret)
>> +                       return ret;
>
>
> Expanding truncates aren't my favorite operation, but we don't want them to
> imply fsync.  I'm holding off on this one while I work out the rest of the
> vacation backlog ;)

Yes, it's very close to an fsync.
Any suggestion for a more lightweight approach?

thanks

>
> -chris
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
  2014-10-27 19:57 ` Chris Mason
@ 2014-10-29  0:35 ` Filipe Manana
  2014-10-29  8:21 ` [PATCH v3] " Filipe Manana
  2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
  3 siblings, 0 replies; 8+ messages in thread
From: Filipe Manana @ 2014-10-29  0:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Use different approach to solve the problem. Don't start and wait for all
    dellaloc to finish after every expanding truncate, instead add an additional
    flush at transaction commit time if we're doing a transaction commit that
    creates snapshots.

 fs/btrfs/transaction.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 396ae8b..18c356e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 		btrfs_wait_ordered_roots(fs_info, -1);
 }
 
+static int
+start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
+				      struct list_head *splice)
+{
+	struct btrfs_pending_snapshot *pending_snapshot;
+	int ret = 0;
+
+	if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
+		return 0;
+
+	spin_lock(&trans->root->fs_info->trans_lock);
+	list_splice_init(&trans->transaction->pending_snapshots, splice);
+	spin_unlock(&trans->root->fs_info->trans_lock);
+
+	/*
+	 * Start again delalloc for the roots our pending snapshots are made
+	 * from. We did it before starting/joining a transaction and we do it
+	 * here again because new inode operations might have happened since
+	 * then and we want to make sure the snapshot captures a fully
+	 * consistent state of the source root tree. For example, if after the
+	 * first delalloc flush a write is made against an inode followed by
+	 * an expanding truncate, we want to make sure the snapshot captured
+	 * both the write and the truncation, and not just the truncation.
+	 * Here we shouldn't have much delalloc work to do, as the bulk of it
+	 * was done before and outside the transaction.
+	 */
+	list_for_each_entry(pending_snapshot, splice, list) {
+		ret = btrfs_start_delalloc_inodes(pending_snapshot->root, 1);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void
+wait_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
+				     struct list_head *splice)
+{
+	struct btrfs_pending_snapshot *pending_snapshot;
+
+	if (list_empty(splice) || btrfs_test_opt(trans->root, FLUSHONCOMMIT))
+		return;
+
+	list_for_each_entry(pending_snapshot, splice, list)
+		btrfs_wait_ordered_extents(pending_snapshot->root, -1);
+
+	spin_lock(&trans->root->fs_info->trans_lock);
+	list_splice_init(splice, &trans->transaction->pending_snapshots);
+	spin_unlock(&trans->root->fs_info->trans_lock);
+}
+
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
 	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	LIST_HEAD(pending_snapshots);
 	int ret;
 
 	/* Stop the commit early if ->aborted is set */
@@ -1802,6 +1855,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto cleanup_transaction;
 
+	ret = start_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
+	if (ret)
+		goto cleanup_transaction;
+
 	ret = btrfs_run_delayed_items(trans, root);
 	if (ret)
 		goto cleanup_transaction;
@@ -1816,6 +1873,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	btrfs_wait_delalloc_flush(root->fs_info);
 
+	wait_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
+
 	btrfs_scrub_pause(root);
 	/*
 	 * Ok now we need to make sure to block out any other joins while we
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
  2014-10-27 19:57 ` Chris Mason
  2014-10-29  0:35 ` [PATCH v2] " Filipe Manana
@ 2014-10-29  8:21 ` Filipe Manana
  2014-10-29  9:13   ` Miao Xie
  2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
  3 siblings, 1 reply; 8+ messages in thread
From: Filipe Manana @ 2014-10-29  8:21 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Use different approach to solve the problem. Don't start and wait for all
    dellaloc to finish after every expanding truncate, instead add an additional
    flush at transaction commit time if we're doing a transaction commit that
    creates snapshots.

V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc().

 fs/btrfs/transaction.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 396ae8b..5e7f004 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 		btrfs_wait_ordered_roots(fs_info, -1);
 }
 
+static int
+start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
+				      struct list_head *splice)
+{
+	struct btrfs_pending_snapshot *pending_snapshot;
+	int ret = 0;
+
+	if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
+		return 0;
+
+	spin_lock(&trans->root->fs_info->trans_lock);
+	list_splice_init(&trans->transaction->pending_snapshots, splice);
+	spin_unlock(&trans->root->fs_info->trans_lock);
+
+	/*
+	 * Start again delalloc for the roots our pending snapshots are made
+	 * from. We did it before starting/joining a transaction and we do it
+	 * here again because new inode operations might have happened since
+	 * then and we want to make sure the snapshot captures a fully
+	 * consistent state of the source root tree. For example, if after the
+	 * first delalloc flush a write is made against an inode followed by
+	 * an expanding truncate, we want to make sure the snapshot captured
+	 * both the write and the truncation, and not just the truncation.
+	 * Here we shouldn't have much delalloc work to do, as the bulk of it
+	 * was done before and outside the transaction.
+	 */
+	list_for_each_entry(pending_snapshot, splice, list) {
+		ret = btrfs_start_delalloc_inodes(pending_snapshot->root, 1);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void
+wait_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
+				     struct list_head *splice)
+{
+	struct btrfs_pending_snapshot *pending_snapshot;
+
+	if (list_empty(splice))
+		return;
+
+	list_for_each_entry(pending_snapshot, splice, list)
+		btrfs_wait_ordered_extents(pending_snapshot->root, -1);
+
+	spin_lock(&trans->root->fs_info->trans_lock);
+	list_splice_init(splice, &trans->transaction->pending_snapshots);
+	spin_unlock(&trans->root->fs_info->trans_lock);
+}
+
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
 	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
+	LIST_HEAD(pending_snapshots);
 	int ret;
 
 	/* Stop the commit early if ->aborted is set */
@@ -1802,6 +1855,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto cleanup_transaction;
 
+	ret = start_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
+	if (ret)
+		goto cleanup_transaction;
+
 	ret = btrfs_run_delayed_items(trans, root);
 	if (ret)
 		goto cleanup_transaction;
@@ -1816,6 +1873,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	btrfs_wait_delalloc_flush(root->fs_info);
 
+	wait_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
+
 	btrfs_scrub_pause(root);
 	/*
 	 * Ok now we need to make sure to block out any other joins while we
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-29  8:21 ` [PATCH v3] " Filipe Manana
@ 2014-10-29  9:13   ` Miao Xie
  0 siblings, 0 replies; 8+ messages in thread
From: Miao Xie @ 2014-10-29  9:13 UTC (permalink / raw)
  To: Filipe Manana, linux-btrfs

On Wed, 29 Oct 2014 08:21:12 +0000, Filipe Manana wrote:
> If right after starting the snapshot creation ioctl we perform a write against a
> file followed by a truncate, with both operations increasing the file's size, we
> can get a snapshot tree that reflects a state of the source subvolume's tree where
> the file truncation happened but the write operation didn't. This leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
> 
> For example, if we perform the following file operations:
> 
>     $ mkfs.btrfs -f /dev/vdd
>     $ mount /dev/vdd /mnt
>     $ xfs_io -f \
>           -c "pwrite -S 0xaa -b 32K 0 32K" \
>           -c "fsync" \
>           -c "pwrite -S 0xbb -b 32770 16K 32770" \
>           -c "truncate 90123" \
>           /mnt/foobar
> 
> and the snapshot creation ioctl was just called before the second write, we often
> can get the following inode items in the snapshot's btree:
> 
>         item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
>                 inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
>         item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
>                 inode ref index 282 namelen 10 name: foobar
>         item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
>                 extent data disk byte 1104855040 nr 32768
>                 extent data offset 0 nr 32768 ram 32768
>                 extent compression 0
>         item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
>                 extent data disk byte 0 nr 0
>                 extent data offset 0 nr 40960 ram 40960
>                 extent compression 0
> 
> There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
> for which there's no file extent item covering it. This is because the file write
> and file truncate operations happened both right after the snapshot creation ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
> ordered extent that matches the write and, in btrfs_setsize(), we were able to call
> btrfs_cont_expand() before being able to commit the current transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole file extent
> item in the source subvolume (which represents the region added by the truncate)
> right before the transaction commit from the snapshot creation ioctl.
> 
> Btrfs' fsck tool complains about such cases with a message like the following:
> 
>     "root 331 inode 257 errors 100, file extent discount"
> 
>>From a user perspective, the expectation when a snapshot is created while those
> file operations are being performed is that the snapshot will have a file that
> either:
> 
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
> 
> But never capture a state where only the first write and the truncation were
> captured (since the second write was performed before the truncation).
> 
> A test case for xfstests follows.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Use different approach to solve the problem. Don't start and wait for all
>     dellaloc to finish after every expanding truncate, instead add an additional
>     flush at transaction commit time if we're doing a transaction commit that
>     creates snapshots.

This method will make the transaction commit spend more time, why not use
i_disk_size to expand the file size in btrfs_setsize()? Or we might rename
btrfs_{start, end}_nocow_write(), and use them in btrfs_setsize()?

Thanks
Miao

> 
> V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc().
> 
>  fs/btrfs/transaction.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 396ae8b..5e7f004 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
>  		btrfs_wait_ordered_roots(fs_info, -1);
>  }
>  
> +static int
> +start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
> +				      struct list_head *splice)
> +{
> +	struct btrfs_pending_snapshot *pending_snapshot;
> +	int ret = 0;
> +
> +	if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
> +		return 0;
> +
> +	spin_lock(&trans->root->fs_info->trans_lock);
> +	list_splice_init(&trans->transaction->pending_snapshots, splice);
> +	spin_unlock(&trans->root->fs_info->trans_lock);
> +
> +	/*
> +	 * Start again delalloc for the roots our pending snapshots are made
> +	 * from. We did it before starting/joining a transaction and we do it
> +	 * here again because new inode operations might have happened since
> +	 * then and we want to make sure the snapshot captures a fully
> +	 * consistent state of the source root tree. For example, if after the
> +	 * first delalloc flush a write is made against an inode followed by
> +	 * an expanding truncate, we want to make sure the snapshot captured
> +	 * both the write and the truncation, and not just the truncation.
> +	 * Here we shouldn't have much delalloc work to do, as the bulk of it
> +	 * was done before and outside the transaction.
> +	 */
> +	list_for_each_entry(pending_snapshot, splice, list) {
> +		ret = btrfs_start_delalloc_inodes(pending_snapshot->root, 1);
> +		if (ret)
> +			break;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +wait_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
> +				     struct list_head *splice)
> +{
> +	struct btrfs_pending_snapshot *pending_snapshot;
> +
> +	if (list_empty(splice))
> +		return;
> +
> +	list_for_each_entry(pending_snapshot, splice, list)
> +		btrfs_wait_ordered_extents(pending_snapshot->root, -1);
> +
> +	spin_lock(&trans->root->fs_info->trans_lock);
> +	list_splice_init(splice, &trans->transaction->pending_snapshots);
> +	spin_unlock(&trans->root->fs_info->trans_lock);
> +}
> +
>  int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>  			     struct btrfs_root *root)
>  {
>  	struct btrfs_transaction *cur_trans = trans->transaction;
>  	struct btrfs_transaction *prev_trans = NULL;
>  	struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode);
> +	LIST_HEAD(pending_snapshots);
>  	int ret;
>  
>  	/* Stop the commit early if ->aborted is set */
> @@ -1802,6 +1855,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>  	if (ret)
>  		goto cleanup_transaction;
>  
> +	ret = start_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
> +	if (ret)
> +		goto cleanup_transaction;
> +
>  	ret = btrfs_run_delayed_items(trans, root);
>  	if (ret)
>  		goto cleanup_transaction;
> @@ -1816,6 +1873,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>  
>  	btrfs_wait_delalloc_flush(root->fs_info);
>  
> +	wait_pending_snapshot_roots_delalloc(trans, &pending_snapshots);
> +
>  	btrfs_scrub_pause(root);
>  	/*
>  	 * Ok now we need to make sure to block out any other joins while we
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
                   ` (2 preceding siblings ...)
  2014-10-29  8:21 ` [PATCH v3] " Filipe Manana
@ 2014-10-29 11:57 ` Filipe Manana
  2014-10-29 12:19   ` Chris Mason
  3 siblings, 1 reply; 8+ messages in thread
From: Filipe Manana @ 2014-10-29 11:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Use different approach to solve the problem. Don't start and wait for all
    dellaloc to finish after every expanding truncate, instead add an additional
    flush at transaction commit time if we're doing a transaction commit that
    creates snapshots.

V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc().

V4: Use another approach that doesn't imply starting delalloc work and wait
    for it to finish at transaction commit time.

 fs/btrfs/ctree.h       |  4 ++--
 fs/btrfs/extent-tree.c | 16 +++++++++-------
 fs/btrfs/file.c        | 10 +++++-----
 fs/btrfs/inode.c       | 47 ++++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/ioctl.c       |  7 ++++---
 5 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b72b358..36f82ba 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3427,8 +3427,8 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info);
 int __get_raid_index(u64 flags);
-int btrfs_start_nocow_write(struct btrfs_root *root);
-void btrfs_end_nocow_write(struct btrfs_root *root);
+int btrfs_start_write_no_snapshoting(struct btrfs_root *root);
+void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
 		     int level, int *slot);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a84e00d..9ba886c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9657,12 +9657,14 @@ int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
 }
 
 /*
- * btrfs_{start,end}_write() is similar to mnt_{want, drop}_write(),
- * they are used to prevent the some tasks writing data into the page cache
- * by nocow before the subvolume is snapshoted, but flush the data into
- * the disk after the snapshot creation.
+ * btrfs_{start,end}_write_no_snapshoting() are similar to
+ * mnt_{want,drop}_write(), they are used to prevent some tasks from writing
+ * data into the page cache through nocow before the subvolume is snapshoted,
+ * but flush the data into disk after the snapshot creation, or to prevent
+ * operations while snapshoting is ongoing and that cause the snapshot to be
+ * inconsistent (writes followed by expanding truncates for example).
  */
-void btrfs_end_nocow_write(struct btrfs_root *root)
+void btrfs_end_write_no_snapshoting(struct btrfs_root *root)
 {
 	percpu_counter_dec(&root->subv_writers->counter);
 	/*
@@ -9674,7 +9676,7 @@ void btrfs_end_nocow_write(struct btrfs_root *root)
 		wake_up(&root->subv_writers->wait);
 }
 
-int btrfs_start_nocow_write(struct btrfs_root *root)
+int btrfs_start_write_no_snapshoting(struct btrfs_root *root)
 {
 	if (atomic_read(&root->will_be_snapshoted))
 		return 0;
@@ -9685,7 +9687,7 @@ int btrfs_start_nocow_write(struct btrfs_root *root)
 	 */
 	smp_mb();
 	if (atomic_read(&root->will_be_snapshoted)) {
-		btrfs_end_nocow_write(root);
+		btrfs_end_write_no_snapshoting(root);
 		return 0;
 	}
 	return 1;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0fbf0e7..e409025 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1428,7 +1428,7 @@ static noinline int check_can_nocow(struct inode *inode, loff_t pos,
 	u64 num_bytes;
 	int ret;
 
-	ret = btrfs_start_nocow_write(root);
+	ret = btrfs_start_write_no_snapshoting(root);
 	if (!ret)
 		return -ENOSPC;
 
@@ -1451,7 +1451,7 @@ static noinline int check_can_nocow(struct inode *inode, loff_t pos,
 	ret = can_nocow_extent(inode, lockstart, &num_bytes, NULL, NULL, NULL);
 	if (ret <= 0) {
 		ret = 0;
-		btrfs_end_nocow_write(root);
+		btrfs_end_write_no_snapshoting(root);
 	} else {
 		*write_bytes = min_t(size_t, *write_bytes ,
 				     num_bytes - pos + lockstart);
@@ -1543,7 +1543,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				btrfs_free_reserved_data_space(inode,
 							       reserve_bytes);
 			else
-				btrfs_end_nocow_write(root);
+				btrfs_end_write_no_snapshoting(root);
 			break;
 		}
 
@@ -1632,7 +1632,7 @@ again:
 
 		release_bytes = 0;
 		if (only_release_metadata)
-			btrfs_end_nocow_write(root);
+			btrfs_end_write_no_snapshoting(root);
 
 		if (only_release_metadata && copied > 0) {
 			u64 lockstart = round_down(pos, root->sectorsize);
@@ -1661,7 +1661,7 @@ again:
 
 	if (release_bytes) {
 		if (only_release_metadata) {
-			btrfs_end_nocow_write(root);
+			btrfs_end_write_no_snapshoting(root);
 			btrfs_delalloc_release_metadata(inode, release_bytes);
 		} else {
 			btrfs_delalloc_release_space(inode, release_bytes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0d41741..455cfdf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1337,7 +1337,7 @@ next_slot:
 			 * we fall into common COW way.
 			 */
 			if (!nolock) {
-				err = btrfs_start_nocow_write(root);
+				err = btrfs_start_write_no_snapshoting(root);
 				if (!err)
 					goto out_check;
 			}
@@ -1361,7 +1361,7 @@ out_check:
 		if (extent_end <= start) {
 			path->slots[0]++;
 			if (!nolock && nocow)
-				btrfs_end_nocow_write(root);
+				btrfs_end_write_no_snapshoting(root);
 			goto next_slot;
 		}
 		if (!nocow) {
@@ -1381,7 +1381,7 @@ out_check:
 					     page_started, nr_written, 1);
 			if (ret) {
 				if (!nolock && nocow)
-					btrfs_end_nocow_write(root);
+					btrfs_end_write_no_snapshoting(root);
 				goto error;
 			}
 			cow_start = (u64)-1;
@@ -1432,7 +1432,7 @@ out_check:
 						      num_bytes);
 			if (ret) {
 				if (!nolock && nocow)
-					btrfs_end_nocow_write(root);
+					btrfs_end_write_no_snapshoting(root);
 				goto error;
 			}
 		}
@@ -1443,7 +1443,7 @@ out_check:
 					     EXTENT_DELALLOC, PAGE_UNLOCK |
 					     PAGE_SET_PRIVATE2);
 		if (!nolock && nocow)
-			btrfs_end_nocow_write(root);
+			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
 		if (cur_offset > end)
 			break;
@@ -4599,6 +4599,26 @@ next:
 	return err;
 }
 
+static int wait_snapshoting_atomic_t(atomic_t *a)
+{
+	schedule();
+	return 0;
+}
+
+static void wait_for_snapshot_creation(struct btrfs_root *root)
+{
+	while (true) {
+		int ret;
+
+		ret = btrfs_start_write_no_snapshoting(root);
+		if (ret)
+			break;
+		wait_on_atomic_t(&root->will_be_snapshoted,
+				 wait_snapshoting_atomic_t,
+				 TASK_UNINTERRUPTIBLE);
+	}
+}
+
 static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -4623,17 +4643,30 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 
 	if (newsize > oldsize) {
 		truncate_pagecache(inode, newsize);
+		/*
+		 * Don't do an expanding truncate while snapshoting is ongoing.
+		 * This is to ensure the snapshot captures a fully consistent
+		 * state of this file - if the snapshot captures this expanding
+		 * truncation, it must capture all writes that happened before
+		 * this truncation.
+		 */
+		wait_for_snapshot_creation(root);
 		ret = btrfs_cont_expand(inode, oldsize, newsize);
-		if (ret)
+		if (ret) {
+			btrfs_end_write_no_snapshoting(root);
 			return ret;
+		}
 
 		trans = btrfs_start_transaction(root, 1);
-		if (IS_ERR(trans))
+		if (IS_ERR(trans)) {
+			btrfs_end_write_no_snapshoting(root);
 			return PTR_ERR(trans);
+		}
 
 		i_size_write(inode, newsize);
 		btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL);
 		ret = btrfs_update_inode(trans, root, inode);
+		btrfs_end_write_no_snapshoting(root);
 		btrfs_end_transaction(trans, root);
 	} else {
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 994c573..eb7e65f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -617,7 +617,7 @@ fail:
 	return ret;
 }
 
-static void btrfs_wait_nocow_write(struct btrfs_root *root)
+static void btrfs_wait_for_no_snapshoting_writes(struct btrfs_root *root)
 {
 	s64 writers;
 	DEFINE_WAIT(wait);
@@ -649,7 +649,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 
 	atomic_inc(&root->will_be_snapshoted);
 	smp_mb__after_atomic();
-	btrfs_wait_nocow_write(root);
+	btrfs_wait_for_no_snapshoting_writes(root);
 
 	ret = btrfs_start_delalloc_inodes(root, 0);
 	if (ret)
@@ -732,7 +732,8 @@ fail:
 free:
 	kfree(pending_snapshot);
 out:
-	atomic_dec(&root->will_be_snapshoted);
+	if (atomic_dec_and_test(&root->will_be_snapshoted))
+		wake_up_atomic_t(&root->will_be_snapshoted);
 	return ret;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] Btrfs: fix snapshot inconsistency after a file write followed by truncate
  2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
@ 2014-10-29 12:19   ` Chris Mason
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2014-10-29 12:19 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Filipe Manana



On Wed, Oct 29, 2014 at 7:57 AM, Filipe Manana <fdmanana@suse.com> 
wrote:
> If right after starting the snapshot creation ioctl we perform a 
> write against a
> file followed by a truncate, with both operations increasing the 
> file's size, we
> can get a snapshot tree that reflects a state of the source 
> subvolume's tree where
> the file truncation happened but the write operation didn't. This 
> leaves a gap
> between 2 file extent items of the inode, which makes btrfs' fsck 
> complain about it.
> 
> For example, if we perform the following file operations:
> 
>     $ mkfs.btrfs -f /dev/vdd
>     $ mount /dev/vdd /mnt
>     $ xfs_io -f \
>           -c "pwrite -S 0xaa -b 32K 0 32K" \
>           -c "fsync" \
>           -c "pwrite -S 0xbb -b 32770 16K 32770" \
>           -c "truncate 90123" \
>           /mnt/foobar
> 
> and the snapshot creation ioctl was just called before the second 
> write, we often
> can get the following inode items in the snapshot's btree:
> 
>         item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
>                 inode generation 146 transid 7 size 90123 block group 
> 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
>         item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
>                 inode ref index 282 namelen 10 name: foobar
>         item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
>                 extent data disk byte 1104855040 nr 32768
>                 extent data offset 0 nr 32768 ram 32768
>                 extent compression 0
>         item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
>                 extent data disk byte 0 nr 0
>                 extent data offset 0 nr 40960 ram 40960
>                 extent compression 0
> 
> There's a file range, corresponding to the interval [32K; ALIGN(16K + 
> 32770, 4096)[
> for which there's no file extent item covering it. This is because 
> the file write
> and file truncate operations happened both right after the snapshot 
> creation ioctl
> called btrfs_start_delalloc_inodes(), which means we didn't start and 
> wait for the
> ordered extent that matches the write and, in btrfs_setsize(), we 
> were able to call
> btrfs_cont_expand() before being able to commit the current 
> transaction in the
> snapshot creation ioctl. So this made it possibe to insert the hole 
> file extent
> item in the source subvolume (which represents the region added by 
> the truncate)
> right before the transaction commit from the snapshot creation ioctl.
> 
> Btrfs' fsck tool complains about such cases with a message like the 
> following:
> 
>     "root 331 inode 257 errors 100, file extent discount"
> 
> From a user perspective, the expectation when a snapshot is created 
> while those
> file operations are being performed is that the snapshot will have a 
> file that
> either:
> 
> 1) is empty
> 2) only the first write was captured
> 3) only the 2 writes were captured
> 4) both writes and the truncation were captured
> 
> But never capture a state where only the first write and the 
> truncation were
> captured (since the second write was performed before the truncation).
> 
> A test case for xfstests follows.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Use different approach to solve the problem. Don't start and wait 
> for all
>     dellaloc to finish after every expanding truncate, instead add an 
> additional
>     flush at transaction commit time if we're doing a transaction 
> commit that
>     creates snapshots.
> 
> V3: Removed useless test condition in 
> +wait_pending_snapshot_roots_delalloc().
> 
> V4: Use another approach that doesn't imply starting delalloc work 
> and wait
>     for it to finish at transaction commit time.

I like this one better ;)  Taking it for a spin here.

-chris



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-10-29 12:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-21 10:12 [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate Filipe Manana
2014-10-27 19:57 ` Chris Mason
2014-10-28 10:10   ` Filipe David Manana
2014-10-29  0:35 ` [PATCH v2] " Filipe Manana
2014-10-29  8:21 ` [PATCH v3] " Filipe Manana
2014-10-29  9:13   ` Miao Xie
2014-10-29 11:57 ` [PATCH v4] " Filipe Manana
2014-10-29 12:19   ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox