* [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
@ 2015-06-25 3:17 fdmanana
2015-06-25 14:20 ` Liu Bo
0 siblings, 1 reply; 6+ messages in thread
From: fdmanana @ 2015-06-25 3:17 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
When we have the no_holes feature enabled, if a we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
Which means if the fsync log is replayed, the file will remain with the
state it had before both truncate operations.
Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.
Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.
The issue is easy to reproduce with the following test case for fstests:
. ./common/rc
. ./common/filter
. ./common/dmflakey
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test files and make sure everything is durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
-c "pwrite -S 0xbb 64K 61K" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
-c "pwrite -S 0xff 64K 61K" \
$SCRATCH_MNT/bar | _filter_xfs_io
sync
# Now truncate our file foo to a smaller size (64Kb) and then truncate
# it to the size it had before the shrinking truncate (125Kb). Then
# fsync our file. If a power failure happens after the fsync, we expect
# our file to have a size of 125Kb, with the first 64Kb of data having
# the value 0xaa and the second 61Kb of data having the value 0x00.
$XFS_IO_PROG -c "truncate 64K" \
-c "truncate 125K" \
-c "fsync" \
$SCRATCH_MNT/foo
# Do something similar to our file bar, but the first truncation sets
# the file size to 0 and the second truncation expands the size to the
# double of what it was initially.
$XFS_IO_PROG -c "truncate 0" \
-c "truncate 253K" \
-c "fsync" \
$SCRATCH_MNT/bar
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# We expect foo to have a size of 125Kb, the first 64Kb of data all
# having the value 0xaa and the remaining 61Kb to be a hole (all bytes
# with value 0x00).
echo "File foo content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# We expect bar to have a size of 253Kb and no extents (any byte read
# from bar has the value 0x00).
echo "File bar content after log replay:"
od -t x1 $SCRATCH_MNT/bar
status=0
exit
The expected file contents in the golden output are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0372000
File bar content after log replay:
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
Without this fix, their contents are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
File bar content after log replay:
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
A test case submission for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7ac45cf..ac90336 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
return 0;
}
+/*
+ * If the no holes feature is enabled we need to make sure any hole between the
+ * last extent and the i_size of our inode is explicitly marked in the log. This
+ * is to make sure that doing something like:
+ *
+ * 1) create file with 128Kb of data
+ * 2) truncate file to 64Kb
+ * 3) truncate file to 256Kb
+ * 4) fsync file
+ * 5) <crash/power failure>
+ * 6) mount fs and trigger log replay
+ *
+ * Will give us a file with a size of 256Kb, the first 64Kb of data match what
+ * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
+ * file correspond to a hole. The presence of explicit holes in a log tree is
+ * what guarantees that log replay will remove/adjust file extent items in the
+ * fs/subvol tree.
+ *
+ * Here we do not need to care about holes between extents, that is already done
+ * by copy_items(). We also only need to do this in the full sync path, where we
+ * lookup for extents from the fs/subvol tree only. In the fast path case, we
+ * lookup the list of modified extent maps and if any represents a hole, we
+ * insert a corresponding extent representing a hole in the log tree.
+ */
+static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct inode *inode,
+ struct btrfs_path *path)
+{
+ int ret;
+ struct btrfs_key key;
+ u64 hole_start;
+ u64 hole_size;
+ struct extent_buffer *leaf;
+ struct btrfs_root *log = root->log_root;
+ const u64 ino = btrfs_ino(inode);
+ const u64 i_size = i_size_read(inode);
+
+ if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
+ return 0;
+
+ key.objectid = ino;
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ key.offset = (u64)-1;
+
+ ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ ASSERT(ret != 0);
+ if (ret < 0)
+ return ret;
+
+ ASSERT(path->slots[0] > 0);
+ path->slots[0]--;
+ leaf = path->nodes[0];
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+ if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+ /* inode does not have any extents */
+ hole_start = 0;
+ hole_size = i_size;
+ } else {
+ struct btrfs_file_extent_item *extent;
+ u64 len;
+
+ /*
+ * If there's an extent beyond i_size, an explicit hole was
+ * already inserted by copy_items().
+ */
+ if (key.offset >= i_size)
+ return 0;
+
+ extent = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, extent) ==
+ BTRFS_FILE_EXTENT_INLINE) {
+ len = btrfs_file_extent_inline_len(leaf,
+ path->slots[0],
+ extent);
+ ASSERT(len == i_size);
+ return 0;
+ }
+
+ len = btrfs_file_extent_num_bytes(leaf, extent);
+ /* Last extent goes beyond i_size, no need to log a hole. */
+ if (key.offset + len > i_size)
+ return 0;
+ hole_start = key.offset + len;
+ hole_size = i_size - hole_start;
+ }
+ btrfs_release_path(path);
+
+ /* Last extent ends at i_size. */
+ if (hole_size == 0)
+ return 0;
+
+ hole_size = ALIGN(hole_size, root->sectorsize);
+ ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
+ hole_size, 0, hole_size, 0, 0, 0);
+ return ret;
+}
+
/* log a single inode in the tree log.
* At least one parent directory for this inode must exist in the tree
* or be logged already.
@@ -4466,6 +4567,13 @@ next_slot:
err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
if (err)
goto out_unlock;
+ if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
+ btrfs_release_path(path);
+ btrfs_release_path(dst_path);
+ err = btrfs_log_trailing_hole(trans, root, inode, path);
+ if (err)
+ goto out_unlock;
+ }
log_extents:
btrfs_release_path(path);
btrfs_release_path(dst_path);
--
2.1.3
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
2015-06-25 3:17 [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled fdmanana
@ 2015-06-25 14:20 ` Liu Bo
2015-06-25 14:23 ` Filipe Manana
0 siblings, 1 reply; 6+ messages in thread
From: Liu Bo @ 2015-06-25 14:20 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs, Filipe Manana
On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we have the no_holes feature enabled, if a we truncate a file to a
> smaller size, truncate it again but to a size greater than or equals to
> its original size and fsync it, the log tree will not have any information
> about the hole covering the range [truncate_1_offset, new_file_size[.
> Which means if the fsync log is replayed, the file will remain with the
> state it had before both truncate operations.
Does the fs/subvol tree get updated to the right information at this
time?
Thanks,
-liubo
>
> Without the no_holes feature this does not happen, since when the inode
> is logged (full sync flag is set) it will find in the fs/subvol tree a
> leaf with a generation matching the current transaction id that has an
> explicit extent item representing the hole.
>
> Fix this by adding an explicit extent item representing a hole between
> the last extent and the inode's i_size if we are doing a full sync.
>
> The issue is easy to reproduce with the following test case for fstests:
>
> . ./common/rc
> . ./common/filter
> . ./common/dmflakey
>
> _need_to_be_root
> _supported_fs generic
> _supported_os Linux
> _require_scratch
> _require_dm_flakey
>
> # This test was motivated by an issue found in btrfs when the btrfs
> # no-holes feature is enabled (introduced in kernel 3.14). So enable
> # the feature if the fs being tested is btrfs.
> if [ $FSTYP == "btrfs" ]; then
> _require_btrfs_fs_feature "no_holes"
> _require_btrfs_mkfs_feature "no-holes"
> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
> fi
>
> rm -f $seqres.full
>
> _scratch_mkfs >>$seqres.full 2>&1
> _init_flakey
> _mount_flakey
>
> # Create our test files and make sure everything is durably persisted.
> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
> -c "pwrite -S 0xbb 64K 61K" \
> $SCRATCH_MNT/foo | _filter_xfs_io
> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
> -c "pwrite -S 0xff 64K 61K" \
> $SCRATCH_MNT/bar | _filter_xfs_io
> sync
>
> # Now truncate our file foo to a smaller size (64Kb) and then truncate
> # it to the size it had before the shrinking truncate (125Kb). Then
> # fsync our file. If a power failure happens after the fsync, we expect
> # our file to have a size of 125Kb, with the first 64Kb of data having
> # the value 0xaa and the second 61Kb of data having the value 0x00.
> $XFS_IO_PROG -c "truncate 64K" \
> -c "truncate 125K" \
> -c "fsync" \
> $SCRATCH_MNT/foo
>
> # Do something similar to our file bar, but the first truncation sets
> # the file size to 0 and the second truncation expands the size to the
> # double of what it was initially.
> $XFS_IO_PROG -c "truncate 0" \
> -c "truncate 253K" \
> -c "fsync" \
> $SCRATCH_MNT/bar
>
> _load_flakey_table $FLAKEY_DROP_WRITES
> _unmount_flakey
>
> # Allow writes again, mount to trigger log replay and validate file
> # contents.
> _load_flakey_table $FLAKEY_ALLOW_WRITES
> _mount_flakey
>
> # We expect foo to have a size of 125Kb, the first 64Kb of data all
> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
> # with value 0x00).
> echo "File foo content after log replay:"
> od -t x1 $SCRATCH_MNT/foo
>
> # We expect bar to have a size of 253Kb and no extents (any byte read
> # from bar has the value 0x00).
> echo "File bar content after log replay:"
> od -t x1 $SCRATCH_MNT/bar
>
> status=0
> exit
>
> The expected file contents in the golden output are:
>
> File foo content after log replay:
> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> *
> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0372000
> File bar content after log replay:
> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0772000
>
> Without this fix, their contents are:
>
> File foo content after log replay:
> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> *
> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> *
> 0372000
> File bar content after log replay:
> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> *
> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> *
> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 0772000
>
> A test case submission for fstests follows soon.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 108 insertions(+)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 7ac45cf..ac90336 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
> return 0;
> }
>
> +/*
> + * If the no holes feature is enabled we need to make sure any hole between the
> + * last extent and the i_size of our inode is explicitly marked in the log. This
> + * is to make sure that doing something like:
> + *
> + * 1) create file with 128Kb of data
> + * 2) truncate file to 64Kb
> + * 3) truncate file to 256Kb
> + * 4) fsync file
> + * 5) <crash/power failure>
> + * 6) mount fs and trigger log replay
> + *
> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
> + * file correspond to a hole. The presence of explicit holes in a log tree is
> + * what guarantees that log replay will remove/adjust file extent items in the
> + * fs/subvol tree.
> + *
> + * Here we do not need to care about holes between extents, that is already done
> + * by copy_items(). We also only need to do this in the full sync path, where we
> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
> + * lookup the list of modified extent maps and if any represents a hole, we
> + * insert a corresponding extent representing a hole in the log tree.
> + */
> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
> + struct btrfs_root *root,
> + struct inode *inode,
> + struct btrfs_path *path)
> +{
> + int ret;
> + struct btrfs_key key;
> + u64 hole_start;
> + u64 hole_size;
> + struct extent_buffer *leaf;
> + struct btrfs_root *log = root->log_root;
> + const u64 ino = btrfs_ino(inode);
> + const u64 i_size = i_size_read(inode);
> +
> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
> + return 0;
> +
> + key.objectid = ino;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> + key.offset = (u64)-1;
> +
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + ASSERT(ret != 0);
> + if (ret < 0)
> + return ret;
> +
> + ASSERT(path->slots[0] > 0);
> + path->slots[0]--;
> + leaf = path->nodes[0];
> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +
> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> + /* inode does not have any extents */
> + hole_start = 0;
> + hole_size = i_size;
> + } else {
> + struct btrfs_file_extent_item *extent;
> + u64 len;
> +
> + /*
> + * If there's an extent beyond i_size, an explicit hole was
> + * already inserted by copy_items().
> + */
> + if (key.offset >= i_size)
> + return 0;
> +
> + extent = btrfs_item_ptr(leaf, path->slots[0],
> + struct btrfs_file_extent_item);
> +
> + if (btrfs_file_extent_type(leaf, extent) ==
> + BTRFS_FILE_EXTENT_INLINE) {
> + len = btrfs_file_extent_inline_len(leaf,
> + path->slots[0],
> + extent);
> + ASSERT(len == i_size);
> + return 0;
> + }
> +
> + len = btrfs_file_extent_num_bytes(leaf, extent);
> + /* Last extent goes beyond i_size, no need to log a hole. */
> + if (key.offset + len > i_size)
> + return 0;
> + hole_start = key.offset + len;
> + hole_size = i_size - hole_start;
> + }
> + btrfs_release_path(path);
> +
> + /* Last extent ends at i_size. */
> + if (hole_size == 0)
> + return 0;
> +
> + hole_size = ALIGN(hole_size, root->sectorsize);
> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
> + hole_size, 0, hole_size, 0, 0, 0);
> + return ret;
> +}
> +
> /* log a single inode in the tree log.
> * At least one parent directory for this inode must exist in the tree
> * or be logged already.
> @@ -4466,6 +4567,13 @@ next_slot:
> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
> if (err)
> goto out_unlock;
> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
> + btrfs_release_path(path);
> + btrfs_release_path(dst_path);
> + err = btrfs_log_trailing_hole(trans, root, inode, path);
> + if (err)
> + goto out_unlock;
> + }
> log_extents:
> btrfs_release_path(path);
> btrfs_release_path(dst_path);
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
2015-06-25 14:20 ` Liu Bo
@ 2015-06-25 14:23 ` Filipe Manana
2015-06-26 3:21 ` Liu Bo
0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2015-06-25 14:23 UTC (permalink / raw)
To: Liu Bo; +Cc: linux-btrfs@vger.kernel.org, Filipe Manana
On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> When we have the no_holes feature enabled, if a we truncate a file to a
>> smaller size, truncate it again but to a size greater than or equals to
>> its original size and fsync it, the log tree will not have any information
>> about the hole covering the range [truncate_1_offset, new_file_size[.
>> Which means if the fsync log is replayed, the file will remain with the
>> state it had before both truncate operations.
>
> Does the fs/subvol tree get updated to the right information at this
> time?
No, and that's the problem. Because no file extent items are stored in
the log tree.
The inode item is updated with the new i_size however (as expected).
thanks
>
> Thanks,
>
> -liubo
>>
>> Without the no_holes feature this does not happen, since when the inode
>> is logged (full sync flag is set) it will find in the fs/subvol tree a
>> leaf with a generation matching the current transaction id that has an
>> explicit extent item representing the hole.
>>
>> Fix this by adding an explicit extent item representing a hole between
>> the last extent and the inode's i_size if we are doing a full sync.
>>
>> The issue is easy to reproduce with the following test case for fstests:
>>
>> . ./common/rc
>> . ./common/filter
>> . ./common/dmflakey
>>
>> _need_to_be_root
>> _supported_fs generic
>> _supported_os Linux
>> _require_scratch
>> _require_dm_flakey
>>
>> # This test was motivated by an issue found in btrfs when the btrfs
>> # no-holes feature is enabled (introduced in kernel 3.14). So enable
>> # the feature if the fs being tested is btrfs.
>> if [ $FSTYP == "btrfs" ]; then
>> _require_btrfs_fs_feature "no_holes"
>> _require_btrfs_mkfs_feature "no-holes"
>> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
>> fi
>>
>> rm -f $seqres.full
>>
>> _scratch_mkfs >>$seqres.full 2>&1
>> _init_flakey
>> _mount_flakey
>>
>> # Create our test files and make sure everything is durably persisted.
>> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
>> -c "pwrite -S 0xbb 64K 61K" \
>> $SCRATCH_MNT/foo | _filter_xfs_io
>> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
>> -c "pwrite -S 0xff 64K 61K" \
>> $SCRATCH_MNT/bar | _filter_xfs_io
>> sync
>>
>> # Now truncate our file foo to a smaller size (64Kb) and then truncate
>> # it to the size it had before the shrinking truncate (125Kb). Then
>> # fsync our file. If a power failure happens after the fsync, we expect
>> # our file to have a size of 125Kb, with the first 64Kb of data having
>> # the value 0xaa and the second 61Kb of data having the value 0x00.
>> $XFS_IO_PROG -c "truncate 64K" \
>> -c "truncate 125K" \
>> -c "fsync" \
>> $SCRATCH_MNT/foo
>>
>> # Do something similar to our file bar, but the first truncation sets
>> # the file size to 0 and the second truncation expands the size to the
>> # double of what it was initially.
>> $XFS_IO_PROG -c "truncate 0" \
>> -c "truncate 253K" \
>> -c "fsync" \
>> $SCRATCH_MNT/bar
>>
>> _load_flakey_table $FLAKEY_DROP_WRITES
>> _unmount_flakey
>>
>> # Allow writes again, mount to trigger log replay and validate file
>> # contents.
>> _load_flakey_table $FLAKEY_ALLOW_WRITES
>> _mount_flakey
>>
>> # We expect foo to have a size of 125Kb, the first 64Kb of data all
>> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
>> # with value 0x00).
>> echo "File foo content after log replay:"
>> od -t x1 $SCRATCH_MNT/foo
>>
>> # We expect bar to have a size of 253Kb and no extents (any byte read
>> # from bar has the value 0x00).
>> echo "File bar content after log replay:"
>> od -t x1 $SCRATCH_MNT/bar
>>
>> status=0
>> exit
>>
>> The expected file contents in the golden output are:
>>
>> File foo content after log replay:
>> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> *
>> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0372000
>> File bar content after log replay:
>> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0772000
>>
>> Without this fix, their contents are:
>>
>> File foo content after log replay:
>> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> *
>> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>> *
>> 0372000
>> File bar content after log replay:
>> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>> *
>> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> *
>> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 0772000
>>
>> A test case submission for fstests follows soon.
>>
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 108 insertions(+)
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index 7ac45cf..ac90336 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
>> return 0;
>> }
>>
>> +/*
>> + * If the no holes feature is enabled we need to make sure any hole between the
>> + * last extent and the i_size of our inode is explicitly marked in the log. This
>> + * is to make sure that doing something like:
>> + *
>> + * 1) create file with 128Kb of data
>> + * 2) truncate file to 64Kb
>> + * 3) truncate file to 256Kb
>> + * 4) fsync file
>> + * 5) <crash/power failure>
>> + * 6) mount fs and trigger log replay
>> + *
>> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
>> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
>> + * file correspond to a hole. The presence of explicit holes in a log tree is
>> + * what guarantees that log replay will remove/adjust file extent items in the
>> + * fs/subvol tree.
>> + *
>> + * Here we do not need to care about holes between extents, that is already done
>> + * by copy_items(). We also only need to do this in the full sync path, where we
>> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
>> + * lookup the list of modified extent maps and if any represents a hole, we
>> + * insert a corresponding extent representing a hole in the log tree.
>> + */
>> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
>> + struct btrfs_root *root,
>> + struct inode *inode,
>> + struct btrfs_path *path)
>> +{
>> + int ret;
>> + struct btrfs_key key;
>> + u64 hole_start;
>> + u64 hole_size;
>> + struct extent_buffer *leaf;
>> + struct btrfs_root *log = root->log_root;
>> + const u64 ino = btrfs_ino(inode);
>> + const u64 i_size = i_size_read(inode);
>> +
>> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
>> + return 0;
>> +
>> + key.objectid = ino;
>> + key.type = BTRFS_EXTENT_DATA_KEY;
>> + key.offset = (u64)-1;
>> +
>> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> + ASSERT(ret != 0);
>> + if (ret < 0)
>> + return ret;
>> +
>> + ASSERT(path->slots[0] > 0);
>> + path->slots[0]--;
>> + leaf = path->nodes[0];
>> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> +
>> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>> + /* inode does not have any extents */
>> + hole_start = 0;
>> + hole_size = i_size;
>> + } else {
>> + struct btrfs_file_extent_item *extent;
>> + u64 len;
>> +
>> + /*
>> + * If there's an extent beyond i_size, an explicit hole was
>> + * already inserted by copy_items().
>> + */
>> + if (key.offset >= i_size)
>> + return 0;
>> +
>> + extent = btrfs_item_ptr(leaf, path->slots[0],
>> + struct btrfs_file_extent_item);
>> +
>> + if (btrfs_file_extent_type(leaf, extent) ==
>> + BTRFS_FILE_EXTENT_INLINE) {
>> + len = btrfs_file_extent_inline_len(leaf,
>> + path->slots[0],
>> + extent);
>> + ASSERT(len == i_size);
>> + return 0;
>> + }
>> +
>> + len = btrfs_file_extent_num_bytes(leaf, extent);
>> + /* Last extent goes beyond i_size, no need to log a hole. */
>> + if (key.offset + len > i_size)
>> + return 0;
>> + hole_start = key.offset + len;
>> + hole_size = i_size - hole_start;
>> + }
>> + btrfs_release_path(path);
>> +
>> + /* Last extent ends at i_size. */
>> + if (hole_size == 0)
>> + return 0;
>> +
>> + hole_size = ALIGN(hole_size, root->sectorsize);
>> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
>> + hole_size, 0, hole_size, 0, 0, 0);
>> + return ret;
>> +}
>> +
>> /* log a single inode in the tree log.
>> * At least one parent directory for this inode must exist in the tree
>> * or be logged already.
>> @@ -4466,6 +4567,13 @@ next_slot:
>> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
>> if (err)
>> goto out_unlock;
>> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
>> + btrfs_release_path(path);
>> + btrfs_release_path(dst_path);
>> + err = btrfs_log_trailing_hole(trans, root, inode, path);
>> + if (err)
>> + goto out_unlock;
>> + }
>> log_extents:
>> btrfs_release_path(path);
>> btrfs_release_path(dst_path);
>> --
>> 2.1.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
2015-06-25 14:23 ` Filipe Manana
@ 2015-06-26 3:21 ` Liu Bo
2015-06-26 8:48 ` Filipe Manana
0 siblings, 1 reply; 6+ messages in thread
From: Liu Bo @ 2015-06-26 3:21 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org, Filipe Manana
On Thu, Jun 25, 2015 at 03:23:59PM +0100, Filipe Manana wrote:
> On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@kernel.org wrote:
> >> From: Filipe Manana <fdmanana@suse.com>
> >>
> >> When we have the no_holes feature enabled, if a we truncate a file to a
> >> smaller size, truncate it again but to a size greater than or equals to
> >> its original size and fsync it, the log tree will not have any information
> >> about the hole covering the range [truncate_1_offset, new_file_size[.
> >> Which means if the fsync log is replayed, the file will remain with the
> >> state it had before both truncate operations.
> >
> > Does the fs/subvol tree get updated to the right information at this
> > time?
>
> No, and that's the problem. Because no file extent items are stored in
> the log tree.
> The inode item is updated with the new i_size however (as expected).
Yeap, that's right and the patch looks right.
I do appreciate your great work on fixing these corner cases,
but as of my understanding, they really can be taken by a force commit
transaction, do they deserve these complex stuff?
After all, like punch_hole, remove xattr, they're rare cases.
Thanks,
-liubo
>
> thanks
>
> >
> > Thanks,
> >
> > -liubo
> >>
> >> Without the no_holes feature this does not happen, since when the inode
> >> is logged (full sync flag is set) it will find in the fs/subvol tree a
> >> leaf with a generation matching the current transaction id that has an
> >> explicit extent item representing the hole.
> >>
> >> Fix this by adding an explicit extent item representing a hole between
> >> the last extent and the inode's i_size if we are doing a full sync.
> >>
> >> The issue is easy to reproduce with the following test case for fstests:
> >>
> >> . ./common/rc
> >> . ./common/filter
> >> . ./common/dmflakey
> >>
> >> _need_to_be_root
> >> _supported_fs generic
> >> _supported_os Linux
> >> _require_scratch
> >> _require_dm_flakey
> >>
> >> # This test was motivated by an issue found in btrfs when the btrfs
> >> # no-holes feature is enabled (introduced in kernel 3.14). So enable
> >> # the feature if the fs being tested is btrfs.
> >> if [ $FSTYP == "btrfs" ]; then
> >> _require_btrfs_fs_feature "no_holes"
> >> _require_btrfs_mkfs_feature "no-holes"
> >> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
> >> fi
> >>
> >> rm -f $seqres.full
> >>
> >> _scratch_mkfs >>$seqres.full 2>&1
> >> _init_flakey
> >> _mount_flakey
> >>
> >> # Create our test files and make sure everything is durably persisted.
> >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
> >> -c "pwrite -S 0xbb 64K 61K" \
> >> $SCRATCH_MNT/foo | _filter_xfs_io
> >> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
> >> -c "pwrite -S 0xff 64K 61K" \
> >> $SCRATCH_MNT/bar | _filter_xfs_io
> >> sync
> >>
> >> # Now truncate our file foo to a smaller size (64Kb) and then truncate
> >> # it to the size it had before the shrinking truncate (125Kb). Then
> >> # fsync our file. If a power failure happens after the fsync, we expect
> >> # our file to have a size of 125Kb, with the first 64Kb of data having
> >> # the value 0xaa and the second 61Kb of data having the value 0x00.
> >> $XFS_IO_PROG -c "truncate 64K" \
> >> -c "truncate 125K" \
> >> -c "fsync" \
> >> $SCRATCH_MNT/foo
> >>
> >> # Do something similar to our file bar, but the first truncation sets
> >> # the file size to 0 and the second truncation expands the size to the
> >> # double of what it was initially.
> >> $XFS_IO_PROG -c "truncate 0" \
> >> -c "truncate 253K" \
> >> -c "fsync" \
> >> $SCRATCH_MNT/bar
> >>
> >> _load_flakey_table $FLAKEY_DROP_WRITES
> >> _unmount_flakey
> >>
> >> # Allow writes again, mount to trigger log replay and validate file
> >> # contents.
> >> _load_flakey_table $FLAKEY_ALLOW_WRITES
> >> _mount_flakey
> >>
> >> # We expect foo to have a size of 125Kb, the first 64Kb of data all
> >> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
> >> # with value 0x00).
> >> echo "File foo content after log replay:"
> >> od -t x1 $SCRATCH_MNT/foo
> >>
> >> # We expect bar to have a size of 253Kb and no extents (any byte read
> >> # from bar has the value 0x00).
> >> echo "File bar content after log replay:"
> >> od -t x1 $SCRATCH_MNT/bar
> >>
> >> status=0
> >> exit
> >>
> >> The expected file contents in the golden output are:
> >>
> >> File foo content after log replay:
> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> *
> >> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> *
> >> 0372000
> >> File bar content after log replay:
> >> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> *
> >> 0772000
> >>
> >> Without this fix, their contents are:
> >>
> >> File foo content after log replay:
> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> *
> >> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> >> *
> >> 0372000
> >> File bar content after log replay:
> >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> >> *
> >> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >> *
> >> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> *
> >> 0772000
> >>
> >> A test case submission for fstests follows soon.
> >>
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 108 insertions(+)
> >>
> >> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> >> index 7ac45cf..ac90336 100644
> >> --- a/fs/btrfs/tree-log.c
> >> +++ b/fs/btrfs/tree-log.c
> >> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
> >> return 0;
> >> }
> >>
> >> +/*
> >> + * If the no holes feature is enabled we need to make sure any hole between the
> >> + * last extent and the i_size of our inode is explicitly marked in the log. This
> >> + * is to make sure that doing something like:
> >> + *
> >> + * 1) create file with 128Kb of data
> >> + * 2) truncate file to 64Kb
> >> + * 3) truncate file to 256Kb
> >> + * 4) fsync file
> >> + * 5) <crash/power failure>
> >> + * 6) mount fs and trigger log replay
> >> + *
> >> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
> >> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
> >> + * file correspond to a hole. The presence of explicit holes in a log tree is
> >> + * what guarantees that log replay will remove/adjust file extent items in the
> >> + * fs/subvol tree.
> >> + *
> >> + * Here we do not need to care about holes between extents, that is already done
> >> + * by copy_items(). We also only need to do this in the full sync path, where we
> >> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
> >> + * lookup the list of modified extent maps and if any represents a hole, we
> >> + * insert a corresponding extent representing a hole in the log tree.
> >> + */
> >> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
> >> + struct btrfs_root *root,
> >> + struct inode *inode,
> >> + struct btrfs_path *path)
> >> +{
> >> + int ret;
> >> + struct btrfs_key key;
> >> + u64 hole_start;
> >> + u64 hole_size;
> >> + struct extent_buffer *leaf;
> >> + struct btrfs_root *log = root->log_root;
> >> + const u64 ino = btrfs_ino(inode);
> >> + const u64 i_size = i_size_read(inode);
> >> +
> >> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
> >> + return 0;
> >> +
> >> + key.objectid = ino;
> >> + key.type = BTRFS_EXTENT_DATA_KEY;
> >> + key.offset = (u64)-1;
> >> +
> >> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >> + ASSERT(ret != 0);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + ASSERT(path->slots[0] > 0);
> >> + path->slots[0]--;
> >> + leaf = path->nodes[0];
> >> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >> +
> >> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> >> + /* inode does not have any extents */
> >> + hole_start = 0;
> >> + hole_size = i_size;
> >> + } else {
> >> + struct btrfs_file_extent_item *extent;
> >> + u64 len;
> >> +
> >> + /*
> >> + * If there's an extent beyond i_size, an explicit hole was
> >> + * already inserted by copy_items().
> >> + */
> >> + if (key.offset >= i_size)
> >> + return 0;
> >> +
> >> + extent = btrfs_item_ptr(leaf, path->slots[0],
> >> + struct btrfs_file_extent_item);
> >> +
> >> + if (btrfs_file_extent_type(leaf, extent) ==
> >> + BTRFS_FILE_EXTENT_INLINE) {
> >> + len = btrfs_file_extent_inline_len(leaf,
> >> + path->slots[0],
> >> + extent);
> >> + ASSERT(len == i_size);
> >> + return 0;
> >> + }
> >> +
> >> + len = btrfs_file_extent_num_bytes(leaf, extent);
> >> + /* Last extent goes beyond i_size, no need to log a hole. */
> >> + if (key.offset + len > i_size)
> >> + return 0;
> >> + hole_start = key.offset + len;
> >> + hole_size = i_size - hole_start;
> >> + }
> >> + btrfs_release_path(path);
> >> +
> >> + /* Last extent ends at i_size. */
> >> + if (hole_size == 0)
> >> + return 0;
> >> +
> >> + hole_size = ALIGN(hole_size, root->sectorsize);
> >> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
> >> + hole_size, 0, hole_size, 0, 0, 0);
> >> + return ret;
> >> +}
> >> +
> >> /* log a single inode in the tree log.
> >> * At least one parent directory for this inode must exist in the tree
> >> * or be logged already.
> >> @@ -4466,6 +4567,13 @@ next_slot:
> >> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
> >> if (err)
> >> goto out_unlock;
> >> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
> >> + btrfs_release_path(path);
> >> + btrfs_release_path(dst_path);
> >> + err = btrfs_log_trailing_hole(trans, root, inode, path);
> >> + if (err)
> >> + goto out_unlock;
> >> + }
> >> log_extents:
> >> btrfs_release_path(path);
> >> btrfs_release_path(dst_path);
> >> --
> >> 2.1.3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
2015-06-26 3:21 ` Liu Bo
@ 2015-06-26 8:48 ` Filipe Manana
2015-06-26 10:00 ` Liu Bo
0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2015-06-26 8:48 UTC (permalink / raw)
To: Liu Bo; +Cc: linux-btrfs@vger.kernel.org, Filipe Manana
On Fri, Jun 26, 2015 at 4:21 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Thu, Jun 25, 2015 at 03:23:59PM +0100, Filipe Manana wrote:
>> On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>> > On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@kernel.org wrote:
>> >> From: Filipe Manana <fdmanana@suse.com>
>> >>
>> >> When we have the no_holes feature enabled, if a we truncate a file to a
>> >> smaller size, truncate it again but to a size greater than or equals to
>> >> its original size and fsync it, the log tree will not have any information
>> >> about the hole covering the range [truncate_1_offset, new_file_size[.
>> >> Which means if the fsync log is replayed, the file will remain with the
>> >> state it had before both truncate operations.
>> >
>> > Does the fs/subvol tree get updated to the right information at this
>> > time?
>>
>> No, and that's the problem. Because no file extent items are stored in
>> the log tree.
>> The inode item is updated with the new i_size however (as expected).
>
> Yeap, that's right and the patch looks right.
>
> I do appreciate your great work on fixing these corner cases,
> but as of my understanding, they really can be taken by a force commit
> transaction, do they deserve these complex stuff?
All the "complexity" is not about avoiding a transaction commit and do
something more efficient instead, but rather about detecting such
case. The simpler way of for example always force a commit if the file
has the full sync flag (or a new flag set on truncate) and was created
in a past generation, would be quite ironic given all the
optimizations effort put into fsync, no-holes feature, etc.
>
> After all, like punch_hole, remove xattr, they're rare cases.
Maybe not as much as one would think.
Having worked on databases before, hole punching, truncations and
other "rare" cases aren't that rare and correct fsync behaviour is a
must or the very least desirable.
thanks
>
> Thanks,
>
> -liubo
>
>>
>> thanks
>>
>> >
>> > Thanks,
>> >
>> > -liubo
>> >>
>> >> Without the no_holes feature this does not happen, since when the inode
>> >> is logged (full sync flag is set) it will find in the fs/subvol tree a
>> >> leaf with a generation matching the current transaction id that has an
>> >> explicit extent item representing the hole.
>> >>
>> >> Fix this by adding an explicit extent item representing a hole between
>> >> the last extent and the inode's i_size if we are doing a full sync.
>> >>
>> >> The issue is easy to reproduce with the following test case for fstests:
>> >>
>> >> . ./common/rc
>> >> . ./common/filter
>> >> . ./common/dmflakey
>> >>
>> >> _need_to_be_root
>> >> _supported_fs generic
>> >> _supported_os Linux
>> >> _require_scratch
>> >> _require_dm_flakey
>> >>
>> >> # This test was motivated by an issue found in btrfs when the btrfs
>> >> # no-holes feature is enabled (introduced in kernel 3.14). So enable
>> >> # the feature if the fs being tested is btrfs.
>> >> if [ $FSTYP == "btrfs" ]; then
>> >> _require_btrfs_fs_feature "no_holes"
>> >> _require_btrfs_mkfs_feature "no-holes"
>> >> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
>> >> fi
>> >>
>> >> rm -f $seqres.full
>> >>
>> >> _scratch_mkfs >>$seqres.full 2>&1
>> >> _init_flakey
>> >> _mount_flakey
>> >>
>> >> # Create our test files and make sure everything is durably persisted.
>> >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
>> >> -c "pwrite -S 0xbb 64K 61K" \
>> >> $SCRATCH_MNT/foo | _filter_xfs_io
>> >> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
>> >> -c "pwrite -S 0xff 64K 61K" \
>> >> $SCRATCH_MNT/bar | _filter_xfs_io
>> >> sync
>> >>
>> >> # Now truncate our file foo to a smaller size (64Kb) and then truncate
>> >> # it to the size it had before the shrinking truncate (125Kb). Then
>> >> # fsync our file. If a power failure happens after the fsync, we expect
>> >> # our file to have a size of 125Kb, with the first 64Kb of data having
>> >> # the value 0xaa and the second 61Kb of data having the value 0x00.
>> >> $XFS_IO_PROG -c "truncate 64K" \
>> >> -c "truncate 125K" \
>> >> -c "fsync" \
>> >> $SCRATCH_MNT/foo
>> >>
>> >> # Do something similar to our file bar, but the first truncation sets
>> >> # the file size to 0 and the second truncation expands the size to the
>> >> # double of what it was initially.
>> >> $XFS_IO_PROG -c "truncate 0" \
>> >> -c "truncate 253K" \
>> >> -c "fsync" \
>> >> $SCRATCH_MNT/bar
>> >>
>> >> _load_flakey_table $FLAKEY_DROP_WRITES
>> >> _unmount_flakey
>> >>
>> >> # Allow writes again, mount to trigger log replay and validate file
>> >> # contents.
>> >> _load_flakey_table $FLAKEY_ALLOW_WRITES
>> >> _mount_flakey
>> >>
>> >> # We expect foo to have a size of 125Kb, the first 64Kb of data all
>> >> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
>> >> # with value 0x00).
>> >> echo "File foo content after log replay:"
>> >> od -t x1 $SCRATCH_MNT/foo
>> >>
>> >> # We expect bar to have a size of 253Kb and no extents (any byte read
>> >> # from bar has the value 0x00).
>> >> echo "File bar content after log replay:"
>> >> od -t x1 $SCRATCH_MNT/bar
>> >>
>> >> status=0
>> >> exit
>> >>
>> >> The expected file contents in the golden output are:
>> >>
>> >> File foo content after log replay:
>> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> >> *
>> >> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >> *
>> >> 0372000
>> >> File bar content after log replay:
>> >> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >> *
>> >> 0772000
>> >>
>> >> Without this fix, their contents are:
>> >>
>> >> File foo content after log replay:
>> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>> >> *
>> >> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>> >> *
>> >> 0372000
>> >> File bar content after log replay:
>> >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
>> >> *
>> >> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> >> *
>> >> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >> *
>> >> 0772000
>> >>
>> >> A test case submission for fstests follows soon.
>> >>
>> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> >> ---
>> >> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> 1 file changed, 108 insertions(+)
>> >>
>> >> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> >> index 7ac45cf..ac90336 100644
>> >> --- a/fs/btrfs/tree-log.c
>> >> +++ b/fs/btrfs/tree-log.c
>> >> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
>> >> return 0;
>> >> }
>> >>
>> >> +/*
>> >> + * If the no holes feature is enabled we need to make sure any hole between the
>> >> + * last extent and the i_size of our inode is explicitly marked in the log. This
>> >> + * is to make sure that doing something like:
>> >> + *
>> >> + * 1) create file with 128Kb of data
>> >> + * 2) truncate file to 64Kb
>> >> + * 3) truncate file to 256Kb
>> >> + * 4) fsync file
>> >> + * 5) <crash/power failure>
>> >> + * 6) mount fs and trigger log replay
>> >> + *
>> >> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
>> >> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
>> >> + * file correspond to a hole. The presence of explicit holes in a log tree is
>> >> + * what guarantees that log replay will remove/adjust file extent items in the
>> >> + * fs/subvol tree.
>> >> + *
>> >> + * Here we do not need to care about holes between extents, that is already done
>> >> + * by copy_items(). We also only need to do this in the full sync path, where we
>> >> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
>> >> + * lookup the list of modified extent maps and if any represents a hole, we
>> >> + * insert a corresponding extent representing a hole in the log tree.
>> >> + */
>> >> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
>> >> + struct btrfs_root *root,
>> >> + struct inode *inode,
>> >> + struct btrfs_path *path)
>> >> +{
>> >> + int ret;
>> >> + struct btrfs_key key;
>> >> + u64 hole_start;
>> >> + u64 hole_size;
>> >> + struct extent_buffer *leaf;
>> >> + struct btrfs_root *log = root->log_root;
>> >> + const u64 ino = btrfs_ino(inode);
>> >> + const u64 i_size = i_size_read(inode);
>> >> +
>> >> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
>> >> + return 0;
>> >> +
>> >> + key.objectid = ino;
>> >> + key.type = BTRFS_EXTENT_DATA_KEY;
>> >> + key.offset = (u64)-1;
>> >> +
>> >> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> >> + ASSERT(ret != 0);
>> >> + if (ret < 0)
>> >> + return ret;
>> >> +
>> >> + ASSERT(path->slots[0] > 0);
>> >> + path->slots[0]--;
>> >> + leaf = path->nodes[0];
>> >> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>> >> +
>> >> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
>> >> + /* inode does not have any extents */
>> >> + hole_start = 0;
>> >> + hole_size = i_size;
>> >> + } else {
>> >> + struct btrfs_file_extent_item *extent;
>> >> + u64 len;
>> >> +
>> >> + /*
>> >> + * If there's an extent beyond i_size, an explicit hole was
>> >> + * already inserted by copy_items().
>> >> + */
>> >> + if (key.offset >= i_size)
>> >> + return 0;
>> >> +
>> >> + extent = btrfs_item_ptr(leaf, path->slots[0],
>> >> + struct btrfs_file_extent_item);
>> >> +
>> >> + if (btrfs_file_extent_type(leaf, extent) ==
>> >> + BTRFS_FILE_EXTENT_INLINE) {
>> >> + len = btrfs_file_extent_inline_len(leaf,
>> >> + path->slots[0],
>> >> + extent);
>> >> + ASSERT(len == i_size);
>> >> + return 0;
>> >> + }
>> >> +
>> >> + len = btrfs_file_extent_num_bytes(leaf, extent);
>> >> + /* Last extent goes beyond i_size, no need to log a hole. */
>> >> + if (key.offset + len > i_size)
>> >> + return 0;
>> >> + hole_start = key.offset + len;
>> >> + hole_size = i_size - hole_start;
>> >> + }
>> >> + btrfs_release_path(path);
>> >> +
>> >> + /* Last extent ends at i_size. */
>> >> + if (hole_size == 0)
>> >> + return 0;
>> >> +
>> >> + hole_size = ALIGN(hole_size, root->sectorsize);
>> >> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
>> >> + hole_size, 0, hole_size, 0, 0, 0);
>> >> + return ret;
>> >> +}
>> >> +
>> >> /* log a single inode in the tree log.
>> >> * At least one parent directory for this inode must exist in the tree
>> >> * or be logged already.
>> >> @@ -4466,6 +4567,13 @@ next_slot:
>> >> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
>> >> if (err)
>> >> goto out_unlock;
>> >> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
>> >> + btrfs_release_path(path);
>> >> + btrfs_release_path(dst_path);
>> >> + err = btrfs_log_trailing_hole(trans, root, inode, path);
>> >> + if (err)
>> >> + goto out_unlock;
>> >> + }
>> >> log_extents:
>> >> btrfs_release_path(path);
>> >> btrfs_release_path(dst_path);
>> >> --
>> >> 2.1.3
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled
2015-06-26 8:48 ` Filipe Manana
@ 2015-06-26 10:00 ` Liu Bo
0 siblings, 0 replies; 6+ messages in thread
From: Liu Bo @ 2015-06-26 10:00 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org, Filipe Manana
On Fri, Jun 26, 2015 at 09:48:05AM +0100, Filipe Manana wrote:
> On Fri, Jun 26, 2015 at 4:21 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Thu, Jun 25, 2015 at 03:23:59PM +0100, Filipe Manana wrote:
> >> On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> >> > On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdmanana@kernel.org wrote:
> >> >> From: Filipe Manana <fdmanana@suse.com>
> >> >>
> >> >> When we have the no_holes feature enabled, if a we truncate a file to a
> >> >> smaller size, truncate it again but to a size greater than or equals to
> >> >> its original size and fsync it, the log tree will not have any information
> >> >> about the hole covering the range [truncate_1_offset, new_file_size[.
> >> >> Which means if the fsync log is replayed, the file will remain with the
> >> >> state it had before both truncate operations.
> >> >
> >> > Does the fs/subvol tree get updated to the right information at this
> >> > time?
> >>
> >> No, and that's the problem. Because no file extent items are stored in
> >> the log tree.
> >> The inode item is updated with the new i_size however (as expected).
> >
> > Yeap, that's right and the patch looks right.
> >
> > I do appreciate your great work on fixing these corner cases,
> > but as of my understanding, they really can be taken by a force commit
> > transaction, do they deserve these complex stuff?
>
> All the "complexity" is not about avoiding a transaction commit and do
> something more efficient instead, but rather about detecting such
> case. The simpler way of for example always force a commit if the file
> has the full sync flag (or a new flag set on truncate) and was created
> in a past generation, would be quite ironic given all the
> optimizations effort put into fsync, no-holes feature, etc.
I agree.
>
> >
> > After all, like punch_hole, remove xattr, they're rare cases.
>
> Maybe not as much as one would think.
> Having worked on databases before, hole punching, truncations and
> other "rare" cases aren't that rare and correct fsync behaviour is a
> must or the very least desirable.
That makes sense.
Forgot to give
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Thanks,
-liubo
>
> thanks
>
> >
> > Thanks,
> >
> > -liubo
> >
> >>
> >> thanks
> >>
> >> >
> >> > Thanks,
> >> >
> >> > -liubo
> >> >>
> >> >> Without the no_holes feature this does not happen, since when the inode
> >> >> is logged (full sync flag is set) it will find in the fs/subvol tree a
> >> >> leaf with a generation matching the current transaction id that has an
> >> >> explicit extent item representing the hole.
> >> >>
> >> >> Fix this by adding an explicit extent item representing a hole between
> >> >> the last extent and the inode's i_size if we are doing a full sync.
> >> >>
> >> >> The issue is easy to reproduce with the following test case for fstests:
> >> >>
> >> >> . ./common/rc
> >> >> . ./common/filter
> >> >> . ./common/dmflakey
> >> >>
> >> >> _need_to_be_root
> >> >> _supported_fs generic
> >> >> _supported_os Linux
> >> >> _require_scratch
> >> >> _require_dm_flakey
> >> >>
> >> >> # This test was motivated by an issue found in btrfs when the btrfs
> >> >> # no-holes feature is enabled (introduced in kernel 3.14). So enable
> >> >> # the feature if the fs being tested is btrfs.
> >> >> if [ $FSTYP == "btrfs" ]; then
> >> >> _require_btrfs_fs_feature "no_holes"
> >> >> _require_btrfs_mkfs_feature "no-holes"
> >> >> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
> >> >> fi
> >> >>
> >> >> rm -f $seqres.full
> >> >>
> >> >> _scratch_mkfs >>$seqres.full 2>&1
> >> >> _init_flakey
> >> >> _mount_flakey
> >> >>
> >> >> # Create our test files and make sure everything is durably persisted.
> >> >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
> >> >> -c "pwrite -S 0xbb 64K 61K" \
> >> >> $SCRATCH_MNT/foo | _filter_xfs_io
> >> >> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
> >> >> -c "pwrite -S 0xff 64K 61K" \
> >> >> $SCRATCH_MNT/bar | _filter_xfs_io
> >> >> sync
> >> >>
> >> >> # Now truncate our file foo to a smaller size (64Kb) and then truncate
> >> >> # it to the size it had before the shrinking truncate (125Kb). Then
> >> >> # fsync our file. If a power failure happens after the fsync, we expect
> >> >> # our file to have a size of 125Kb, with the first 64Kb of data having
> >> >> # the value 0xaa and the second 61Kb of data having the value 0x00.
> >> >> $XFS_IO_PROG -c "truncate 64K" \
> >> >> -c "truncate 125K" \
> >> >> -c "fsync" \
> >> >> $SCRATCH_MNT/foo
> >> >>
> >> >> # Do something similar to our file bar, but the first truncation sets
> >> >> # the file size to 0 and the second truncation expands the size to the
> >> >> # double of what it was initially.
> >> >> $XFS_IO_PROG -c "truncate 0" \
> >> >> -c "truncate 253K" \
> >> >> -c "fsync" \
> >> >> $SCRATCH_MNT/bar
> >> >>
> >> >> _load_flakey_table $FLAKEY_DROP_WRITES
> >> >> _unmount_flakey
> >> >>
> >> >> # Allow writes again, mount to trigger log replay and validate file
> >> >> # contents.
> >> >> _load_flakey_table $FLAKEY_ALLOW_WRITES
> >> >> _mount_flakey
> >> >>
> >> >> # We expect foo to have a size of 125Kb, the first 64Kb of data all
> >> >> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
> >> >> # with value 0x00).
> >> >> echo "File foo content after log replay:"
> >> >> od -t x1 $SCRATCH_MNT/foo
> >> >>
> >> >> # We expect bar to have a size of 253Kb and no extents (any byte read
> >> >> # from bar has the value 0x00).
> >> >> echo "File bar content after log replay:"
> >> >> od -t x1 $SCRATCH_MNT/bar
> >> >>
> >> >> status=0
> >> >> exit
> >> >>
> >> >> The expected file contents in the golden output are:
> >> >>
> >> >> File foo content after log replay:
> >> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> >> *
> >> >> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> >> *
> >> >> 0372000
> >> >> File bar content after log replay:
> >> >> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> >> *
> >> >> 0772000
> >> >>
> >> >> Without this fix, their contents are:
> >> >>
> >> >> File foo content after log replay:
> >> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >> >> *
> >> >> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
> >> >> *
> >> >> 0372000
> >> >> File bar content after log replay:
> >> >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
> >> >> *
> >> >> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >> >> *
> >> >> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> >> *
> >> >> 0772000
> >> >>
> >> >> A test case submission for fstests follows soon.
> >> >>
> >> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> >> ---
> >> >> fs/btrfs/tree-log.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> 1 file changed, 108 insertions(+)
> >> >>
> >> >> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> >> >> index 7ac45cf..ac90336 100644
> >> >> --- a/fs/btrfs/tree-log.c
> >> >> +++ b/fs/btrfs/tree-log.c
> >> >> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct btrfs_trans_handle *trans,
> >> >> return 0;
> >> >> }
> >> >>
> >> >> +/*
> >> >> + * If the no holes feature is enabled we need to make sure any hole between the
> >> >> + * last extent and the i_size of our inode is explicitly marked in the log. This
> >> >> + * is to make sure that doing something like:
> >> >> + *
> >> >> + * 1) create file with 128Kb of data
> >> >> + * 2) truncate file to 64Kb
> >> >> + * 3) truncate file to 256Kb
> >> >> + * 4) fsync file
> >> >> + * 5) <crash/power failure>
> >> >> + * 6) mount fs and trigger log replay
> >> >> + *
> >> >> + * Will give us a file with a size of 256Kb, the first 64Kb of data match what
> >> >> + * the file had in its first 64Kb of data at step 1 and the last 192Kb of the
> >> >> + * file correspond to a hole. The presence of explicit holes in a log tree is
> >> >> + * what guarantees that log replay will remove/adjust file extent items in the
> >> >> + * fs/subvol tree.
> >> >> + *
> >> >> + * Here we do not need to care about holes between extents, that is already done
> >> >> + * by copy_items(). We also only need to do this in the full sync path, where we
> >> >> + * lookup for extents from the fs/subvol tree only. In the fast path case, we
> >> >> + * lookup the list of modified extent maps and if any represents a hole, we
> >> >> + * insert a corresponding extent representing a hole in the log tree.
> >> >> + */
> >> >> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
> >> >> + struct btrfs_root *root,
> >> >> + struct inode *inode,
> >> >> + struct btrfs_path *path)
> >> >> +{
> >> >> + int ret;
> >> >> + struct btrfs_key key;
> >> >> + u64 hole_start;
> >> >> + u64 hole_size;
> >> >> + struct extent_buffer *leaf;
> >> >> + struct btrfs_root *log = root->log_root;
> >> >> + const u64 ino = btrfs_ino(inode);
> >> >> + const u64 i_size = i_size_read(inode);
> >> >> +
> >> >> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
> >> >> + return 0;
> >> >> +
> >> >> + key.objectid = ino;
> >> >> + key.type = BTRFS_EXTENT_DATA_KEY;
> >> >> + key.offset = (u64)-1;
> >> >> +
> >> >> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> >> >> + ASSERT(ret != 0);
> >> >> + if (ret < 0)
> >> >> + return ret;
> >> >> +
> >> >> + ASSERT(path->slots[0] > 0);
> >> >> + path->slots[0]--;
> >> >> + leaf = path->nodes[0];
> >> >> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >> >> +
> >> >> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
> >> >> + /* inode does not have any extents */
> >> >> + hole_start = 0;
> >> >> + hole_size = i_size;
> >> >> + } else {
> >> >> + struct btrfs_file_extent_item *extent;
> >> >> + u64 len;
> >> >> +
> >> >> + /*
> >> >> + * If there's an extent beyond i_size, an explicit hole was
> >> >> + * already inserted by copy_items().
> >> >> + */
> >> >> + if (key.offset >= i_size)
> >> >> + return 0;
> >> >> +
> >> >> + extent = btrfs_item_ptr(leaf, path->slots[0],
> >> >> + struct btrfs_file_extent_item);
> >> >> +
> >> >> + if (btrfs_file_extent_type(leaf, extent) ==
> >> >> + BTRFS_FILE_EXTENT_INLINE) {
> >> >> + len = btrfs_file_extent_inline_len(leaf,
> >> >> + path->slots[0],
> >> >> + extent);
> >> >> + ASSERT(len == i_size);
> >> >> + return 0;
> >> >> + }
> >> >> +
> >> >> + len = btrfs_file_extent_num_bytes(leaf, extent);
> >> >> + /* Last extent goes beyond i_size, no need to log a hole. */
> >> >> + if (key.offset + len > i_size)
> >> >> + return 0;
> >> >> + hole_start = key.offset + len;
> >> >> + hole_size = i_size - hole_start;
> >> >> + }
> >> >> + btrfs_release_path(path);
> >> >> +
> >> >> + /* Last extent ends at i_size. */
> >> >> + if (hole_size == 0)
> >> >> + return 0;
> >> >> +
> >> >> + hole_size = ALIGN(hole_size, root->sectorsize);
> >> >> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0,
> >> >> + hole_size, 0, hole_size, 0, 0, 0);
> >> >> + return ret;
> >> >> +}
> >> >> +
> >> >> /* log a single inode in the tree log.
> >> >> * At least one parent directory for this inode must exist in the tree
> >> >> * or be logged already.
> >> >> @@ -4466,6 +4567,13 @@ next_slot:
> >> >> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
> >> >> if (err)
> >> >> goto out_unlock;
> >> >> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
> >> >> + btrfs_release_path(path);
> >> >> + btrfs_release_path(dst_path);
> >> >> + err = btrfs_log_trailing_hole(trans, root, inode, path);
> >> >> + if (err)
> >> >> + goto out_unlock;
> >> >> + }
> >> >> log_extents:
> >> >> btrfs_release_path(path);
> >> >> btrfs_release_path(dst_path);
> >> >> --
> >> >> 2.1.3
> >> >>
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2015-06-26 10:00 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-06-25 3:17 [PATCH] Btrfs: fix fsync after truncate when no_holes feature is enabled fdmanana
2015-06-25 14:20 ` Liu Bo
2015-06-25 14:23 ` Filipe Manana
2015-06-26 3:21 ` Liu Bo
2015-06-26 8:48 ` Filipe Manana
2015-06-26 10:00 ` Liu Bo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox