* [PATCH] Btrfs: fix partial loss of prealloc extent past i_size after fsync
@ 2020-04-21 10:25 fdmanana
2020-04-23 0:38 ` David Sterba
2020-04-23 15:30 ` [PATCH v2] " fdmanana
0 siblings, 2 replies; 4+ messages in thread
From: fdmanana @ 2020-04-21 10:25 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we have an inode with a prealloc extent that starts at an offset
lower than the i_size and there is another prealloc extent that starts at
an offset beyond i_size, we can end up losing part of the first prealloc
extent (the part that starts at i_size) and have an implicit hole if we
fsync the file and then have a power failure.
Consider the following example with comments explaining how and why it
happens.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our test file with 2 consecutive prealloc extents, each with a
# size of 128Kb, and covering the range from 0 to 256Kb, with a file
# size of 0.
$ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
$ xfs_io -c "falloc -k 128K 128K" /mnt/foo
# Fsync the file to record both extents in the log tree.
$ xfs_io -c "fsync" /mnt/foo
# Now do a redudant extent allocation for the range from 0 to 64Kb.
# This will merely increase the file size from 0 to 64Kb. Instead we
# could also do a truncate to set the file size to 64Kb.
$ xfs_io -c "falloc 0 64K" /mnt/foo
# Fsync the file, so we update the inode item in the log tree with the
# new file size (64Kb). This also ends up setting the number of bytes
# for the first prealloc extent to 64Kb. This is done by the truncation
# at btrfs_log_prealloc_extents().
# This means that if a power failure happens after this, a write into
# the file range 64Kb to 128Kb will not use the prealloc extent and
# will result in allocation of a new extent.
$ xfs_io -c "fsync" /mnt/foo
# Now set the file size to 256K with a truncate and then fsync the file.
# Since no changes happened to the extents, the fsync only updates the
# i_size in the inode item at the log tree. This results in an implicit
# hole for the file range from 64Kb to 128Kb, something which fsck will
# complain when not using the NO_HOLES feature if we replay the log
# after a power failure.
$ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
So instead of always truncating the log to the inode's current i_size at
btrfs_log_prealloc_extents(), check first if there's a prealloc extent
that starts at an offset lower than the i_size and with a length that
crosses the i_size - if there is one, just make sure we truncate to a
size that corresponds to the end offset of that prealloc extent, so
that we don't lose the part of that extent that starts at i_size if a
power failure happens.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 43 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4272610d7472..9761e6d538fb 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4226,6 +4226,9 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
const u64 ino = btrfs_ino(inode);
struct btrfs_path *dst_path = NULL;
bool dropped_extents = false;
+ u64 truncate_offset = i_size;
+ struct extent_buffer *leaf;
+ int slot;
int ins_nr = 0;
int start_slot;
int ret;
@@ -4240,9 +4243,43 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
if (ret < 0)
goto out;
+ /*
+ * We must check if there is a prealloc extent that starts before the
+ * i_size and crosses the i_size boundary. This is to ensure later we
+ * truncate down to the end of that extent and not to the i_size, as
+ * otherwise we end up losing part of the prealloc extent after a log
+ * replay and with an implicit hole if there is another prealloc extent
+ * that starts at an offset beyond i_size.
+ */
+ ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
+ if (ret < 0)
+ goto out;
+
+ if (ret == 0) {
+ struct btrfs_file_extent_item *ei;
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) ==
+ BTRFS_FILE_EXTENT_PREALLOC) {
+ u64 extent_end;
+
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ extent_end = key.offset +
+ btrfs_file_extent_disk_num_bytes(leaf, ei);
+
+ if (extent_end > i_size)
+ truncate_offset = extent_end;
+ }
+ } else {
+ ret = 0;
+ }
+
while (true) {
- struct extent_buffer *leaf = path->nodes[0];
- int slot = path->slots[0];
+ leaf = path->nodes[0];
+ slot = path->slots[0];
if (slot >= btrfs_header_nritems(leaf)) {
if (ins_nr > 0) {
@@ -4280,7 +4317,7 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
ret = btrfs_truncate_inode_items(trans,
root->log_root,
&inode->vfs_inode,
- i_size,
+ truncate_offset,
BTRFS_EXTENT_DATA_KEY);
} while (ret == -EAGAIN);
if (ret)
--
2.11.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] Btrfs: fix partial loss of prealloc extent past i_size after fsync
2020-04-21 10:25 [PATCH] Btrfs: fix partial loss of prealloc extent past i_size after fsync fdmanana
@ 2020-04-23 0:38 ` David Sterba
2020-04-23 15:30 ` [PATCH v2] " fdmanana
1 sibling, 0 replies; 4+ messages in thread
From: David Sterba @ 2020-04-23 0:38 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Tue, Apr 21, 2020 at 11:25:20AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we have an inode with a prealloc extent that starts at an offset
> lower than the i_size and there is another prealloc extent that starts at
> an offset beyond i_size, we can end up losing part of the first prealloc
> extent (the part that starts at i_size) and have an implicit hole if we
> fsync the file and then have a power failure.
>
> Consider the following example with comments explaining how and why it
> happens.
>
> $ mkfs.btrfs -f /dev/sdb
> $ mount /dev/sdb /mnt
>
> # Create our test file with 2 consecutive prealloc extents, each with a
> # size of 128Kb, and covering the range from 0 to 256Kb, with a file
> # size of 0.
> $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
> $ xfs_io -c "falloc -k 128K 128K" /mnt/foo
>
> # Fsync the file to record both extents in the log tree.
> $ xfs_io -c "fsync" /mnt/foo
>
> # Now do a redudant extent allocation for the range from 0 to 64Kb.
> # This will merely increase the file size from 0 to 64Kb. Instead we
> # could also do a truncate to set the file size to 64Kb.
> $ xfs_io -c "falloc 0 64K" /mnt/foo
>
> # Fsync the file, so we update the inode item in the log tree with the
> # new file size (64Kb). This also ends up setting the number of bytes
> # for the first prealloc extent to 64Kb. This is done by the truncation
> # at btrfs_log_prealloc_extents().
> # This means that if a power failure happens after this, a write into
> # the file range 64Kb to 128Kb will not use the prealloc extent and
> # will result in allocation of a new extent.
> $ xfs_io -c "fsync" /mnt/foo
>
> # Now set the file size to 256K with a truncate and then fsync the file.
> # Since no changes happened to the extents, the fsync only updates the
> # i_size in the inode item at the log tree. This results in an implicit
> # hole for the file range from 64Kb to 128Kb, something which fsck will
> # complain when not using the NO_HOLES feature if we replay the log
> # after a power failure.
> $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
>
> So instead of always truncating the log to the inode's current i_size at
> btrfs_log_prealloc_extents(), check first if there's a prealloc extent
> that starts at an offset lower than the i_size and with a length that
> crosses the i_size - if there is one, just make sure we truncate to a
> size that corresponds to the end offset of that prealloc extent, so
> that we don't lose the part of that extent that starts at i_size if a
> power failure happens.
>
> A test case for fstests follows soon.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Added to misc-next, thanks.
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v2] Btrfs: fix partial loss of prealloc extent past i_size after fsync
2020-04-21 10:25 [PATCH] Btrfs: fix partial loss of prealloc extent past i_size after fsync fdmanana
2020-04-23 0:38 ` David Sterba
@ 2020-04-23 15:30 ` fdmanana
2020-04-27 14:24 ` David Sterba
1 sibling, 1 reply; 4+ messages in thread
From: fdmanana @ 2020-04-23 15:30 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we have an inode with a prealloc extent that starts at an offset
lower than the i_size and there is another prealloc extent that starts at
an offset beyond i_size, we can end up losing part of the first prealloc
extent (the part that starts at i_size) and have an implicit hole if we
fsync the file and then have a power failure.
Consider the following example with comments explaining how and why it
happens.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our test file with 2 consecutive prealloc extents, each with a
# size of 128Kb, and covering the range from 0 to 256Kb, with a file
# size of 0.
$ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
$ xfs_io -c "falloc -k 128K 128K" /mnt/foo
# Fsync the file to record both extents in the log tree.
$ xfs_io -c "fsync" /mnt/foo
# Now do a redudant extent allocation for the range from 0 to 64Kb.
# This will merely increase the file size from 0 to 64Kb. Instead we
# could also do a truncate to set the file size to 64Kb.
$ xfs_io -c "falloc 0 64K" /mnt/foo
# Fsync the file, so we update the inode item in the log tree with the
# new file size (64Kb). This also ends up setting the number of bytes
# for the first prealloc extent to 64Kb. This is done by the truncation
# at btrfs_log_prealloc_extents().
# This means that if a power failure happens after this, a write into
# the file range 64Kb to 128Kb will not use the prealloc extent and
# will result in allocation of a new extent.
$ xfs_io -c "fsync" /mnt/foo
# Now set the file size to 256K with a truncate and then fsync the file.
# Since no changes happened to the extents, the fsync only updates the
# i_size in the inode item at the log tree. This results in an implicit
# hole for the file range from 64Kb to 128Kb, something which fsck will
# complain when not using the NO_HOLES feature if we replay the log
# after a power failure.
$ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
So instead of always truncating the log to the inode's current i_size at
btrfs_log_prealloc_extents(), check first if there's a prealloc extent
that starts at an offset lower than the i_size and with a length that
crosses the i_size - if there is one, just make sure we truncate to a
size that corresponds to the end offset of that prealloc extent, so
that we don't lose the part of that extent that starts at i_size if a
power failure happens.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: Fixed a bug where we could end up with overlapping extents due to use
of btrfs_file_extent_disk_num_bytes() instead of btrfs_file_extent_num_bytes().
Fixed that to use btrfs_file_extent_num_bytes(), which was the original intention.
Detected with fsstress.
fs/btrfs/tree-log.c | 43 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4272610d7472..8dd4ed5b9e17 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4226,6 +4226,9 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
const u64 ino = btrfs_ino(inode);
struct btrfs_path *dst_path = NULL;
bool dropped_extents = false;
+ u64 truncate_offset = i_size;
+ struct extent_buffer *leaf;
+ int slot;
int ins_nr = 0;
int start_slot;
int ret;
@@ -4240,9 +4243,43 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
if (ret < 0)
goto out;
+ /*
+ * We must check if there is a prealloc extent that starts before the
+ * i_size and crosses the i_size boundary. This is to ensure later we
+ * truncate down to the end of that extent and not to the i_size, as
+ * otherwise we end up losing part of the prealloc extent after a log
+ * replay and with an implicit hole if there is another prealloc extent
+ * that starts at an offset beyond i_size.
+ */
+ ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
+ if (ret < 0)
+ goto out;
+
+ if (ret == 0) {
+ struct btrfs_file_extent_item *ei;
+
+ leaf = path->nodes[0];
+ slot = path->slots[0];
+ ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+
+ if (btrfs_file_extent_type(leaf, ei) ==
+ BTRFS_FILE_EXTENT_PREALLOC) {
+ u64 extent_end;
+
+ btrfs_item_key_to_cpu(leaf, &key, slot);
+ extent_end = key.offset +
+ btrfs_file_extent_num_bytes(leaf, ei);
+
+ if (extent_end > i_size)
+ truncate_offset = extent_end;
+ }
+ } else {
+ ret = 0;
+ }
+
while (true) {
- struct extent_buffer *leaf = path->nodes[0];
- int slot = path->slots[0];
+ leaf = path->nodes[0];
+ slot = path->slots[0];
if (slot >= btrfs_header_nritems(leaf)) {
if (ins_nr > 0) {
@@ -4280,7 +4317,7 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
ret = btrfs_truncate_inode_items(trans,
root->log_root,
&inode->vfs_inode,
- i_size,
+ truncate_offset,
BTRFS_EXTENT_DATA_KEY);
} while (ret == -EAGAIN);
if (ret)
--
2.11.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v2] Btrfs: fix partial loss of prealloc extent past i_size after fsync
2020-04-23 15:30 ` [PATCH v2] " fdmanana
@ 2020-04-27 14:24 ` David Sterba
0 siblings, 0 replies; 4+ messages in thread
From: David Sterba @ 2020-04-27 14:24 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Thu, Apr 23, 2020 at 04:30:53PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Updated in misc-next, thanks.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-04-27 14:25 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-21 10:25 [PATCH] Btrfs: fix partial loss of prealloc extent past i_size after fsync fdmanana
2020-04-23 0:38 ` David Sterba
2020-04-23 15:30 ` [PATCH v2] " fdmanana
2020-04-27 14:24 ` David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).