* [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc
@ 2024-05-24 20:58 Omar Sandoval
2024-05-24 20:58 ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
2024-05-26 11:41 ` [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Filipe Manana
0 siblings, 2 replies; 9+ messages in thread
From: Omar Sandoval @ 2024-05-24 20:58 UTC (permalink / raw)
To: linux-btrfs; +Cc: kernel-team
From: Omar Sandoval <osandov@fb.com>
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():
BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:2620!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
With the following stack trace:
#0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
#1 btrfs_drop_extents (fs/btrfs/file.c:411:4)
#2 log_one_extent (fs/btrfs/tree-log.c:4732:9)
#3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
#4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
#5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
#6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
#7 btrfs_sync_file (fs/btrfs/file.c:1933:8)
#8 vfs_fsync_range (fs/sync.c:188:9)
#9 vfs_fsync (fs/sync.c:202:9)
#10 do_fsync (fs/sync.c:212:9)
#11 __do_sys_fdatasync (fs/sync.c:225:9)
#12 __se_sys_fdatasync (fs/sync.c:223:1)
#13 __x64_sys_fdatasync (fs/sync.c:223:1)
#14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
#15 do_syscall_64 (arch/x86/entry/common.c:83:7)
#16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().
This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:
>>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
leaf 33439744 flags 0x100000000000000
fs uuid e5bd3946-400c-4223-8923-190ef1f18677
chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 7 transid 9 size 8192 nbytes 8473563889606862198
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 204 flags 0x10(PREALLOC)
atime 1716417703.220000000 (2024-05-22 15:41:43)
ctime 1716417704.983333333 (2024-05-22 15:41:44)
mtime 1716417704.983333333 (2024-05-22 15:41:44)
otime 17592186044416.000000000 (559444-03-08 01:40:16)
item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
index 195 namelen 3 name: 193
item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
location key (0 UNKNOWN.0 0) type XATTR
transid 7 data_len 1 name_len 6
name: user.a
data a
item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 303144960 nr 12288
extent data offset 0 nr 4096 ram 12288
extent compression 0 (none)
item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 4096 nr 8192
item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 8192 nr 4096
...
So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.
Here is the state of the filesystem tree at the time of the crash:
>>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
>>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
>>> print_extent_buffer(nodes[0])
leaf 30425088 level 0 items 184 generation 9 owner 5
leaf 30425088 flags 0x100000000000000
fs uuid e5bd3946-400c-4223-8923-190ef1f18677
chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
...
item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
generation 7 transid 7 size 4096 nbytes 12288
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 6 flags 0x10(PREALLOC)
atime 1716417703.220000000 (2024-05-22 15:41:43)
ctime 1716417703.220000000 (2024-05-22 15:41:43)
mtime 1716417703.220000000 (2024-05-22 15:41:43)
otime 1716417703.220000000 (2024-05-22 15:41:43)
item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
index 195 namelen 3 name: 193
item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
location key (0 UNKNOWN.0 0) type XATTR
transid 7 data_len 1 name_len 6
name: user.a
data a
item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 303144960 nr 12288
extent data offset 0 nr 8192 ram 12288
extent compression 0 (none)
item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 8192 nr 4096
Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.
btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.
If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.
This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:
- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
to the log tree.
- An xattr is set on the file, which sets the
BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
calls copy_inode_items_to_log(), which calls
btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
filesystem tree. Since it starts before i_size, it skips it. Since it
is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
the prealloc extent to written and inserts the remaining prealloc part
from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
the log tree. Note that it overlaps with the 4k-12k prealloc extent
that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
adjusting the start of the 4k-12k prealloc extent in the log tree to
8k.
- btrfs_set_item_key_safe() sees that there is already an extent
starting at 8k in the log tree and calls BUG().
Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1 [1]:
- Change commit subject to not mention direct I/O since this can also
happen with buffered I/O.
- Reformat min() call to be on one line.
- Use btrfs_file_extent_end().
- Rebase on for-next.
1: https://lore.kernel.org/linux-btrfs/101430650a35b55b7a32d895fd292226d13346eb.1716486455.git.osandov@fb.com/
fs/btrfs/tree-log.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5146387b416b..26a2e5aa08e9 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4860,18 +4860,23 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
path->slots[0]++;
continue;
}
- if (!dropped_extents) {
- /*
- * Avoid logging extent items logged in past fsync calls
- * and leading to duplicate keys in the log tree.
- */
+ /*
+ * Avoid overlapping items in the log tree. The first time we
+ * get here, get rid of everything from a past fsync. After
+ * that, if the current extent starts before the end of the last
+ * extent we copied, truncate the last one. This can happen if
+ * an ordered extent completion modifies the subvolume tree
+ * while btrfs_next_leaf() has the tree unlocked.
+ */
+ if (!dropped_extents || key.offset < truncate_offset) {
ret = truncate_inode_items(trans, root->log_root, inode,
- truncate_offset,
+ min(key.offset, truncate_offset),
BTRFS_EXTENT_DATA_KEY);
if (ret)
goto out;
dropped_extents = true;
}
+ truncate_offset = btrfs_file_extent_end(path);
if (ins_nr == 0)
start_slot = slot;
ins_nr++;
--
2.45.1
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-05-24 20:58 [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Omar Sandoval
@ 2024-05-24 20:58 ` Omar Sandoval
2024-05-26 11:47 ` Filipe Manana
2024-05-26 11:41 ` [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Filipe Manana
1 sibling, 1 reply; 9+ messages in thread
From: Omar Sandoval @ 2024-05-24 20:58 UTC (permalink / raw)
To: fstests, linux-btrfs; +Cc: kernel-team
From: Omar Sandoval <osandov@fb.com>
This is a regression test for a Btrfs bug, but there's nothing
Btrfs-specific about it. Since it's a race, we just try to make the race
happen in a loop and pass if it doesn't crash after all of our attempts.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1 [1]:
- Added missing groups and requires.
- Simplified $XFS_IO_PROG calls.
- Removed -i flag from $XFS_IO_PROG to make race reproduce more
reliably.
- Removed all of the file creation and dump-tree parsing since the only
file on a fresh filesystem is guaranteed to be at the end of a leaf
anyways.
- Rewrote to be a generic test.
1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/745.out | 2 ++
2 files changed, 46 insertions(+)
create mode 100755 tests/generic/745
create mode 100644 tests/generic/745.out
diff --git a/tests/generic/745 b/tests/generic/745
new file mode 100755
index 00000000..925adba9
--- /dev/null
+++ b/tests/generic/745
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+# FS QA Test 745
+#
+# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
+# prealloc while extending i_size, then fdatasync. This is a regression test
+# for a Btrfs crash.
+#
+. ./common/preamble
+. ./common/attr
+_begin_fstest auto quick log preallocrw dangerous
+
+_supported_fs generic
+_require_scratch
+_require_attrs
+_require_xfs_io_command falloc -k
+_fixed_by_kernel_commit XXXXXXXXXXXX \
+ "btrfs: fix crash on racing fsync and size-extending write into prealloc"
+
+# -i slows down xfs_io startup and makes the race much less reliable.
+export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount
+
+blksz=$(_get_block_size "$SCRATCH_MNT")
+
+# On Btrfs, since this is the only file on the filesystem, its metadata is at
+# the end of a B-tree leaf. We want an ordered extent completion to add an
+# extent item at the end of the leaf while we're logging prealloc extents
+# beyond i_size after an xattr was set.
+for ((i = 0; i < 5000; i++)); do
+ $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
+ $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
+ $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
+done
+
+# If it didn't crash, we're good.
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/745.out b/tests/generic/745.out
new file mode 100644
index 00000000..fce6b7f5
--- /dev/null
+++ b/tests/generic/745.out
@@ -0,0 +1,2 @@
+QA output created by 745
+Silence is golden
--
2.45.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-05-24 20:58 ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
@ 2024-05-26 11:47 ` Filipe Manana
2024-06-07 5:12 ` Zorro Lang
0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2024-05-26 11:47 UTC (permalink / raw)
To: Omar Sandoval; +Cc: fstests, linux-btrfs, kernel-team
On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This is a regression test for a Btrfs bug, but there's nothing
> Btrfs-specific about it. Since it's a race, we just try to make the race
> happen in a loop and pass if it doesn't crash after all of our attempts.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Changes from v1 [1]:
>
> - Added missing groups and requires.
> - Simplified $XFS_IO_PROG calls.
> - Removed -i flag from $XFS_IO_PROG to make race reproduce more
> reliably.
> - Removed all of the file creation and dump-tree parsing since the only
> file on a fresh filesystem is guaranteed to be at the end of a leaf
> anyways.
> - Rewrote to be a generic test.
>
> 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
>
> tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/745.out | 2 ++
> 2 files changed, 46 insertions(+)
> create mode 100755 tests/generic/745
> create mode 100644 tests/generic/745.out
>
> diff --git a/tests/generic/745 b/tests/generic/745
> new file mode 100755
> index 00000000..925adba9
> --- /dev/null
> +++ b/tests/generic/745
Btw, generic/745 already exists in the for-next branch (development is
based against that branch nowadays).
> @@ -0,0 +1,44 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) Meta Platforms, Inc. and affiliates.
> +#
> +# FS QA Test 745
> +#
> +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> +# prealloc while extending i_size, then fdatasync. This is a regression test
> +# for a Btrfs crash.
> +#
> +. ./common/preamble
> +. ./common/attr
> +_begin_fstest auto quick log preallocrw dangerous
> +
> +_supported_fs generic
> +_require_scratch
> +_require_attrs
> +_require_xfs_io_command falloc -k
Since this is now a generic test and we're using direct IO, also:
_require_odirect
> +_fixed_by_kernel_commit XXXXXXXXXXXX \
> + "btrfs: fix crash on racing fsync and size-extending write into prealloc"
Because it's now a generic test, it should be:
[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....
Otherwise it looks good to me, so with that:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> +
> +# -i slows down xfs_io startup and makes the race much less reliable.
> +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> +
> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +_scratch_mount
> +
> +blksz=$(_get_block_size "$SCRATCH_MNT")
> +
> +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> +# the end of a B-tree leaf. We want an ordered extent completion to add an
> +# extent item at the end of the leaf while we're logging prealloc extents
> +# beyond i_size after an xattr was set.
> +for ((i = 0; i < 5000; i++)); do
> + $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> + $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> + $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> +done
> +
> +# If it didn't crash, we're good.
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/generic/745.out b/tests/generic/745.out
> new file mode 100644
> index 00000000..fce6b7f5
> --- /dev/null
> +++ b/tests/generic/745.out
> @@ -0,0 +1,2 @@
> +QA output created by 745
> +Silence is golden
> --
> 2.45.1
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-05-26 11:47 ` Filipe Manana
@ 2024-06-07 5:12 ` Zorro Lang
2024-06-07 9:43 ` Filipe Manana
0 siblings, 1 reply; 9+ messages in thread
From: Zorro Lang @ 2024-06-07 5:12 UTC (permalink / raw)
To: Filipe Manana; +Cc: Omar Sandoval, fstests, linux-btrfs, kernel-team
On Sun, May 26, 2024 at 12:47:49PM +0100, Filipe Manana wrote:
> On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > This is a regression test for a Btrfs bug, but there's nothing
> > Btrfs-specific about it. Since it's a race, we just try to make the race
> > happen in a loop and pass if it doesn't crash after all of our attempts.
> >
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > Changes from v1 [1]:
> >
> > - Added missing groups and requires.
> > - Simplified $XFS_IO_PROG calls.
> > - Removed -i flag from $XFS_IO_PROG to make race reproduce more
> > reliably.
> > - Removed all of the file creation and dump-tree parsing since the only
> > file on a fresh filesystem is guaranteed to be at the end of a leaf
> > anyways.
> > - Rewrote to be a generic test.
> >
> > 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
> >
> > tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
> > tests/generic/745.out | 2 ++
> > 2 files changed, 46 insertions(+)
> > create mode 100755 tests/generic/745
> > create mode 100644 tests/generic/745.out
> >
> > diff --git a/tests/generic/745 b/tests/generic/745
> > new file mode 100755
> > index 00000000..925adba9
> > --- /dev/null
> > +++ b/tests/generic/745
>
> Btw, generic/745 already exists in the for-next branch (development is
> based against that branch nowadays).
>
> > @@ -0,0 +1,44 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) Meta Platforms, Inc. and affiliates.
> > +#
> > +# FS QA Test 745
> > +#
> > +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> > +# prealloc while extending i_size, then fdatasync. This is a regression test
> > +# for a Btrfs crash.
> > +#
> > +. ./common/preamble
> > +. ./common/attr
> > +_begin_fstest auto quick log preallocrw dangerous
> > +
> > +_supported_fs generic
> > +_require_scratch
> > +_require_attrs
> > +_require_xfs_io_command falloc -k
>
> Since this is now a generic test and we're using direct IO, also:
>
> _require_odirect
>
> > +_fixed_by_kernel_commit XXXXXXXXXXXX \
> > + "btrfs: fix crash on racing fsync and size-extending write into prealloc"
>
> Because it's now a generic test, it should be:
>
> [ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....
>
> Otherwise it looks good to me, so with that:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks Filipe, merged this patch with above review points, more details refer to
"patches-in-queue" branch. Feel free to have more review points before I push
to for-next :)
Thanks,
Zorro
>
> Thanks.
>
> > +
> > +# -i slows down xfs_io startup and makes the race much less reliable.
> > +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> > +
> > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > +_scratch_mount
> > +
> > +blksz=$(_get_block_size "$SCRATCH_MNT")
> > +
> > +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> > +# the end of a B-tree leaf. We want an ordered extent completion to add an
> > +# extent item at the end of the leaf while we're logging prealloc extents
> > +# beyond i_size after an xattr was set.
> > +for ((i = 0; i < 5000; i++)); do
> > + $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> > + $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> > + $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> > +done
> > +
> > +# If it didn't crash, we're good.
> > +
> > +echo "Silence is golden"
> > +status=0
> > +exit
> > diff --git a/tests/generic/745.out b/tests/generic/745.out
> > new file mode 100644
> > index 00000000..fce6b7f5
> > --- /dev/null
> > +++ b/tests/generic/745.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 745
> > +Silence is golden
> > --
> > 2.45.1
> >
> >
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-06-07 5:12 ` Zorro Lang
@ 2024-06-07 9:43 ` Filipe Manana
2024-06-07 10:04 ` Zorro Lang
0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2024-06-07 9:43 UTC (permalink / raw)
To: Zorro Lang; +Cc: Omar Sandoval, fstests, linux-btrfs, kernel-team
On Fri, Jun 7, 2024 at 6:12 AM Zorro Lang <zlang@redhat.com> wrote:
>
> On Sun, May 26, 2024 at 12:47:49PM +0100, Filipe Manana wrote:
> > On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > This is a regression test for a Btrfs bug, but there's nothing
> > > Btrfs-specific about it. Since it's a race, we just try to make the race
> > > happen in a loop and pass if it doesn't crash after all of our attempts.
> > >
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > > Changes from v1 [1]:
> > >
> > > - Added missing groups and requires.
> > > - Simplified $XFS_IO_PROG calls.
> > > - Removed -i flag from $XFS_IO_PROG to make race reproduce more
> > > reliably.
> > > - Removed all of the file creation and dump-tree parsing since the only
> > > file on a fresh filesystem is guaranteed to be at the end of a leaf
> > > anyways.
> > > - Rewrote to be a generic test.
> > >
> > > 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
> > >
> > > tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
> > > tests/generic/745.out | 2 ++
> > > 2 files changed, 46 insertions(+)
> > > create mode 100755 tests/generic/745
> > > create mode 100644 tests/generic/745.out
> > >
> > > diff --git a/tests/generic/745 b/tests/generic/745
> > > new file mode 100755
> > > index 00000000..925adba9
> > > --- /dev/null
> > > +++ b/tests/generic/745
> >
> > Btw, generic/745 already exists in the for-next branch (development is
> > based against that branch nowadays).
> >
> > > @@ -0,0 +1,44 @@
> > > +#! /bin/bash
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +# Copyright (c) Meta Platforms, Inc. and affiliates.
> > > +#
> > > +# FS QA Test 745
> > > +#
> > > +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> > > +# prealloc while extending i_size, then fdatasync. This is a regression test
> > > +# for a Btrfs crash.
> > > +#
> > > +. ./common/preamble
> > > +. ./common/attr
> > > +_begin_fstest auto quick log preallocrw dangerous
> > > +
> > > +_supported_fs generic
> > > +_require_scratch
> > > +_require_attrs
> > > +_require_xfs_io_command falloc -k
> >
> > Since this is now a generic test and we're using direct IO, also:
> >
> > _require_odirect
> >
> > > +_fixed_by_kernel_commit XXXXXXXXXXXX \
> > > + "btrfs: fix crash on racing fsync and size-extending write into prealloc"
> >
> > Because it's now a generic test, it should be:
> >
> > [ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....
> >
> > Otherwise it looks good to me, so with that:
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks Filipe, merged this patch with above review points, more details refer to
> "patches-in-queue" branch. Feel free to have more review points before I push
> to for-next :)
Btw, there's a v3 with all that addressed:
https://lore.kernel.org/fstests/8c91247dd109bb94e8df36f2812274b5de2a7183.1716916346.git.osandov@osandov.com/
Also, looking at patches-in-queue, the test was added twice, once as
generic/748 and once as generic/749, in two different commits.
Also, unrelated, but this commit:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?h=patches-in-queue&id=b4c4ba99435aa7fd4f8a6e3c02938e357e137ec9
As a Signed-off-by tag for David Disseldorp instead of Reviewed-by.
Thanks.
>
> Thanks,
> Zorro
>
> >
> > Thanks.
> >
> > > +
> > > +# -i slows down xfs_io startup and makes the race much less reliable.
> > > +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> > > +
> > > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > > +_scratch_mount
> > > +
> > > +blksz=$(_get_block_size "$SCRATCH_MNT")
> > > +
> > > +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> > > +# the end of a B-tree leaf. We want an ordered extent completion to add an
> > > +# extent item at the end of the leaf while we're logging prealloc extents
> > > +# beyond i_size after an xattr was set.
> > > +for ((i = 0; i < 5000; i++)); do
> > > + $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> > > + $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> > > + $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> > > +done
> > > +
> > > +# If it didn't crash, we're good.
> > > +
> > > +echo "Silence is golden"
> > > +status=0
> > > +exit
> > > diff --git a/tests/generic/745.out b/tests/generic/745.out
> > > new file mode 100644
> > > index 00000000..fce6b7f5
> > > --- /dev/null
> > > +++ b/tests/generic/745.out
> > > @@ -0,0 +1,2 @@
> > > +QA output created by 745
> > > +Silence is golden
> > > --
> > > 2.45.1
> > >
> > >
> >
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-06-07 9:43 ` Filipe Manana
@ 2024-06-07 10:04 ` Zorro Lang
2024-06-07 10:11 ` Filipe Manana
0 siblings, 1 reply; 9+ messages in thread
From: Zorro Lang @ 2024-06-07 10:04 UTC (permalink / raw)
To: Filipe Manana; +Cc: Omar Sandoval, fstests, linux-btrfs, kernel-team
On Fri, Jun 07, 2024 at 10:43:47AM +0100, Filipe Manana wrote:
> On Fri, Jun 7, 2024 at 6:12 AM Zorro Lang <zlang@redhat.com> wrote:
> >
> > On Sun, May 26, 2024 at 12:47:49PM +0100, Filipe Manana wrote:
> > > On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > >
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > This is a regression test for a Btrfs bug, but there's nothing
> > > > Btrfs-specific about it. Since it's a race, we just try to make the race
> > > > happen in a loop and pass if it doesn't crash after all of our attempts.
> > > >
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > > Changes from v1 [1]:
> > > >
> > > > - Added missing groups and requires.
> > > > - Simplified $XFS_IO_PROG calls.
> > > > - Removed -i flag from $XFS_IO_PROG to make race reproduce more
> > > > reliably.
> > > > - Removed all of the file creation and dump-tree parsing since the only
> > > > file on a fresh filesystem is guaranteed to be at the end of a leaf
> > > > anyways.
> > > > - Rewrote to be a generic test.
> > > >
> > > > 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
> > > >
> > > > tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
> > > > tests/generic/745.out | 2 ++
> > > > 2 files changed, 46 insertions(+)
> > > > create mode 100755 tests/generic/745
> > > > create mode 100644 tests/generic/745.out
> > > >
> > > > diff --git a/tests/generic/745 b/tests/generic/745
> > > > new file mode 100755
> > > > index 00000000..925adba9
> > > > --- /dev/null
> > > > +++ b/tests/generic/745
> > >
> > > Btw, generic/745 already exists in the for-next branch (development is
> > > based against that branch nowadays).
> > >
> > > > @@ -0,0 +1,44 @@
> > > > +#! /bin/bash
> > > > +# SPDX-License-Identifier: GPL-2.0
> > > > +# Copyright (c) Meta Platforms, Inc. and affiliates.
> > > > +#
> > > > +# FS QA Test 745
> > > > +#
> > > > +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> > > > +# prealloc while extending i_size, then fdatasync. This is a regression test
> > > > +# for a Btrfs crash.
> > > > +#
> > > > +. ./common/preamble
> > > > +. ./common/attr
> > > > +_begin_fstest auto quick log preallocrw dangerous
> > > > +
> > > > +_supported_fs generic
> > > > +_require_scratch
> > > > +_require_attrs
> > > > +_require_xfs_io_command falloc -k
> > >
> > > Since this is now a generic test and we're using direct IO, also:
> > >
> > > _require_odirect
> > >
> > > > +_fixed_by_kernel_commit XXXXXXXXXXXX \
> > > > + "btrfs: fix crash on racing fsync and size-extending write into prealloc"
> > >
> > > Because it's now a generic test, it should be:
> > >
> > > [ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....
> > >
> > > Otherwise it looks good to me, so with that:
> > >
> > > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > Thanks Filipe, merged this patch with above review points, more details refer to
> > "patches-in-queue" branch. Feel free to have more review points before I push
> > to for-next :)
>
> Btw, there's a v3 with all that addressed:
>
> https://lore.kernel.org/fstests/8c91247dd109bb94e8df36f2812274b5de2a7183.1716916346.git.osandov@osandov.com/
>
> Also, looking at patches-in-queue, the test was added twice, once as
> generic/748 and once as generic/749, in two different commits.
Oh, I've merged this test case last week... sorry I forgot that. I'll keep
the g/748, and remove the g/749.
>
> Also, unrelated, but this commit:
>
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?h=patches-in-queue&id=b4c4ba99435aa7fd4f8a6e3c02938e357e137ec9
>
> As a Signed-off-by tag for David Disseldorp instead of Reviewed-by.
Oh, this Signed-off-by tag is generated automatically by:
https://lore.kernel.org/fstests/c9e54af5-4370-4d45-a8ed-4098b06b2629@suse.com/T/#m8fc24d233b2cf3a323c94c2b8039c0f043e09023
if it's a mistake, I'll change it to Reviewed-by:
Thanks,
Zorro
>
> Thanks.
>
> >
> > Thanks,
> > Zorro
> >
> > >
> > > Thanks.
> > >
> > > > +
> > > > +# -i slows down xfs_io startup and makes the race much less reliable.
> > > > +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> > > > +
> > > > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > > > +_scratch_mount
> > > > +
> > > > +blksz=$(_get_block_size "$SCRATCH_MNT")
> > > > +
> > > > +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> > > > +# the end of a B-tree leaf. We want an ordered extent completion to add an
> > > > +# extent item at the end of the leaf while we're logging prealloc extents
> > > > +# beyond i_size after an xattr was set.
> > > > +for ((i = 0; i < 5000; i++)); do
> > > > + $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> > > > + $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> > > > + $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> > > > +done
> > > > +
> > > > +# If it didn't crash, we're good.
> > > > +
> > > > +echo "Silence is golden"
> > > > +status=0
> > > > +exit
> > > > diff --git a/tests/generic/745.out b/tests/generic/745.out
> > > > new file mode 100644
> > > > index 00000000..fce6b7f5
> > > > --- /dev/null
> > > > +++ b/tests/generic/745.out
> > > > @@ -0,0 +1,2 @@
> > > > +QA output created by 745
> > > > +Silence is golden
> > > > --
> > > > 2.45.1
> > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-06-07 10:04 ` Zorro Lang
@ 2024-06-07 10:11 ` Filipe Manana
2024-06-07 10:21 ` David Disseldorp
0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2024-06-07 10:11 UTC (permalink / raw)
To: Zorro Lang
Cc: Omar Sandoval, fstests, linux-btrfs, kernel-team,
David Disseldorp
On Fri, Jun 7, 2024 at 11:04 AM Zorro Lang <zlang@redhat.com> wrote:
>
> On Fri, Jun 07, 2024 at 10:43:47AM +0100, Filipe Manana wrote:
> > On Fri, Jun 7, 2024 at 6:12 AM Zorro Lang <zlang@redhat.com> wrote:
> > >
> > > On Sun, May 26, 2024 at 12:47:49PM +0100, Filipe Manana wrote:
> > > > On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
> > > > >
> > > > > From: Omar Sandoval <osandov@fb.com>
> > > > >
> > > > > This is a regression test for a Btrfs bug, but there's nothing
> > > > > Btrfs-specific about it. Since it's a race, we just try to make the race
> > > > > happen in a loop and pass if it doesn't crash after all of our attempts.
> > > > >
> > > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > > ---
> > > > > Changes from v1 [1]:
> > > > >
> > > > > - Added missing groups and requires.
> > > > > - Simplified $XFS_IO_PROG calls.
> > > > > - Removed -i flag from $XFS_IO_PROG to make race reproduce more
> > > > > reliably.
> > > > > - Removed all of the file creation and dump-tree parsing since the only
> > > > > file on a fresh filesystem is guaranteed to be at the end of a leaf
> > > > > anyways.
> > > > > - Rewrote to be a generic test.
> > > > >
> > > > > 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
> > > > >
> > > > > tests/generic/745 | 44 +++++++++++++++++++++++++++++++++++++++++++
> > > > > tests/generic/745.out | 2 ++
> > > > > 2 files changed, 46 insertions(+)
> > > > > create mode 100755 tests/generic/745
> > > > > create mode 100644 tests/generic/745.out
> > > > >
> > > > > diff --git a/tests/generic/745 b/tests/generic/745
> > > > > new file mode 100755
> > > > > index 00000000..925adba9
> > > > > --- /dev/null
> > > > > +++ b/tests/generic/745
> > > >
> > > > Btw, generic/745 already exists in the for-next branch (development is
> > > > based against that branch nowadays).
> > > >
> > > > > @@ -0,0 +1,44 @@
> > > > > +#! /bin/bash
> > > > > +# SPDX-License-Identifier: GPL-2.0
> > > > > +# Copyright (c) Meta Platforms, Inc. and affiliates.
> > > > > +#
> > > > > +# FS QA Test 745
> > > > > +#
> > > > > +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> > > > > +# prealloc while extending i_size, then fdatasync. This is a regression test
> > > > > +# for a Btrfs crash.
> > > > > +#
> > > > > +. ./common/preamble
> > > > > +. ./common/attr
> > > > > +_begin_fstest auto quick log preallocrw dangerous
> > > > > +
> > > > > +_supported_fs generic
> > > > > +_require_scratch
> > > > > +_require_attrs
> > > > > +_require_xfs_io_command falloc -k
> > > >
> > > > Since this is now a generic test and we're using direct IO, also:
> > > >
> > > > _require_odirect
> > > >
> > > > > +_fixed_by_kernel_commit XXXXXXXXXXXX \
> > > > > + "btrfs: fix crash on racing fsync and size-extending write into prealloc"
> > > >
> > > > Because it's now a generic test, it should be:
> > > >
> > > > [ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....
> > > >
> > > > Otherwise it looks good to me, so with that:
> > > >
> > > > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> > >
> > > Thanks Filipe, merged this patch with above review points, more details refer to
> > > "patches-in-queue" branch. Feel free to have more review points before I push
> > > to for-next :)
> >
> > Btw, there's a v3 with all that addressed:
> >
> > https://lore.kernel.org/fstests/8c91247dd109bb94e8df36f2812274b5de2a7183.1716916346.git.osandov@osandov.com/
> >
> > Also, looking at patches-in-queue, the test was added twice, once as
> > generic/748 and once as generic/749, in two different commits.
>
> Oh, I've merged this test case last week... sorry I forgot that. I'll keep
> the g/748, and remove the g/749.
>
> >
> > Also, unrelated, but this commit:
> >
> > https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?h=patches-in-queue&id=b4c4ba99435aa7fd4f8a6e3c02938e357e137ec9
> >
> > As a Signed-off-by tag for David Disseldorp instead of Reviewed-by.
>
> Oh, this Signed-off-by tag is generated automatically by:
>
> https://lore.kernel.org/fstests/c9e54af5-4370-4d45-a8ed-4098b06b2629@suse.com/T/#m8fc24d233b2cf3a323c94c2b8039c0f043e09023
>
> if it's a mistake, I'll change it to Reviewed-by:
That I didn't notice, odd that David replied with Signed-off-by and
not Reviewed-by.
(cc'ing him)
Thanks.
>
> Thanks,
> Zorro
>
> >
> > Thanks.
> >
> > >
> > > Thanks,
> > > Zorro
> > >
> > > >
> > > > Thanks.
> > > >
> > > > > +
> > > > > +# -i slows down xfs_io startup and makes the race much less reliable.
> > > > > +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> > > > > +
> > > > > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > > > > +_scratch_mount
> > > > > +
> > > > > +blksz=$(_get_block_size "$SCRATCH_MNT")
> > > > > +
> > > > > +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> > > > > +# the end of a B-tree leaf. We want an ordered extent completion to add an
> > > > > +# extent item at the end of the leaf while we're logging prealloc extents
> > > > > +# beyond i_size after an xattr was set.
> > > > > +for ((i = 0; i < 5000; i++)); do
> > > > > + $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> > > > > + $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> > > > > + $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> > > > > +done
> > > > > +
> > > > > +# If it didn't crash, we're good.
> > > > > +
> > > > > +echo "Silence is golden"
> > > > > +status=0
> > > > > +exit
> > > > > diff --git a/tests/generic/745.out b/tests/generic/745.out
> > > > > new file mode 100644
> > > > > index 00000000..fce6b7f5
> > > > > --- /dev/null
> > > > > +++ b/tests/generic/745.out
> > > > > @@ -0,0 +1,2 @@
> > > > > +QA output created by 745
> > > > > +Silence is golden
> > > > > --
> > > > > 2.45.1
> > > > >
> > > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
2024-06-07 10:11 ` Filipe Manana
@ 2024-06-07 10:21 ` David Disseldorp
0 siblings, 0 replies; 9+ messages in thread
From: David Disseldorp @ 2024-06-07 10:21 UTC (permalink / raw)
To: Filipe Manana
Cc: Zorro Lang, Omar Sandoval, fstests, linux-btrfs, kernel-team
On Fri, 7 Jun 2024 11:11:34 +0100, Filipe Manana wrote:
...
> > > As a Signed-off-by tag for David Disseldorp instead of Reviewed-by.
> >
> > Oh, this Signed-off-by tag is generated automatically by:
> >
> > https://lore.kernel.org/fstests/c9e54af5-4370-4d45-a8ed-4098b06b2629@suse.com/T/#m8fc24d233b2cf3a323c94c2b8039c0f043e09023
> >
> > if it's a mistake, I'll change it to Reviewed-by:
>
> That I didn't notice, odd that David replied with Signed-off-by and
> not Reviewed-by.
> (cc'ing him)
Oops, yes that should have been a Reviewed-by, sorry.
Please change it.
Thanks, David
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc
2024-05-24 20:58 [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Omar Sandoval
2024-05-24 20:58 ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
@ 2024-05-26 11:41 ` Filipe Manana
1 sibling, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2024-05-26 11:41 UTC (permalink / raw)
To: Omar Sandoval; +Cc: linux-btrfs, kernel-team
On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We have been seeing crashes on duplicate keys in
> btrfs_set_item_key_safe():
>
> BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
> ------------[ cut here ]------------
> kernel BUG at fs/btrfs/ctree.c:2620!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
> RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
>
> With the following stack trace:
>
> #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
> #1 btrfs_drop_extents (fs/btrfs/file.c:411:4)
> #2 log_one_extent (fs/btrfs/tree-log.c:4732:9)
> #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
> #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
> #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
> #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
> #7 btrfs_sync_file (fs/btrfs/file.c:1933:8)
> #8 vfs_fsync_range (fs/sync.c:188:9)
> #9 vfs_fsync (fs/sync.c:202:9)
> #10 do_fsync (fs/sync.c:212:9)
> #11 __do_sys_fdatasync (fs/sync.c:225:9)
> #12 __se_sys_fdatasync (fs/sync.c:223:1)
> #13 __x64_sys_fdatasync (fs/sync.c:223:1)
> #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
> #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
> #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
>
> So we're logging a changed extent from fsync, which is splitting an
> extent in the log tree. But this split part already exists in the tree,
> triggering the BUG().
>
> This is the state of the log tree at the time of the crash, dumped with
> drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
> to get more details than btrfs_print_leaf() gives us:
>
> >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
> leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
> leaf 33439744 flags 0x100000000000000
> fs uuid e5bd3946-400c-4223-8923-190ef1f18677
> chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
> item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
> generation 7 transid 9 size 8192 nbytes 8473563889606862198
> block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> sequence 204 flags 0x10(PREALLOC)
> atime 1716417703.220000000 (2024-05-22 15:41:43)
> ctime 1716417704.983333333 (2024-05-22 15:41:44)
> mtime 1716417704.983333333 (2024-05-22 15:41:44)
> otime 17592186044416.000000000 (559444-03-08 01:40:16)
> item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
> index 195 namelen 3 name: 193
> item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
> location key (0 UNKNOWN.0 0) type XATTR
> transid 7 data_len 1 name_len 6
> name: user.a
> data a
> item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
> generation 9 type 1 (regular)
> extent data disk byte 303144960 nr 12288
> extent data offset 0 nr 4096 ram 12288
> extent compression 0 (none)
> item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
> generation 9 type 2 (prealloc)
> prealloc data disk byte 303144960 nr 12288
> prealloc data offset 4096 nr 8192
> item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
> generation 9 type 2 (prealloc)
> prealloc data disk byte 303144960 nr 12288
> prealloc data offset 8192 nr 4096
> ...
>
> So the real problem happened earlier: notice that items 4 (4k-12k) and 5
> (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
> item 5 starts at i_size.
>
> Here is the state of the filesystem tree at the time of the crash:
>
> >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
> >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
> >>> print_extent_buffer(nodes[0])
> leaf 30425088 level 0 items 184 generation 9 owner 5
> leaf 30425088 flags 0x100000000000000
> fs uuid e5bd3946-400c-4223-8923-190ef1f18677
> chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
> ...
> item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
> generation 7 transid 7 size 4096 nbytes 12288
> block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> sequence 6 flags 0x10(PREALLOC)
> atime 1716417703.220000000 (2024-05-22 15:41:43)
> ctime 1716417703.220000000 (2024-05-22 15:41:43)
> mtime 1716417703.220000000 (2024-05-22 15:41:43)
> otime 1716417703.220000000 (2024-05-22 15:41:43)
> item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
> index 195 namelen 3 name: 193
> item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
> location key (0 UNKNOWN.0 0) type XATTR
> transid 7 data_len 1 name_len 6
> name: user.a
> data a
> item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
> generation 9 type 1 (regular)
> extent data disk byte 303144960 nr 12288
> extent data offset 0 nr 8192 ram 12288
> extent compression 0 (none)
> item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
> generation 9 type 2 (prealloc)
> prealloc data disk byte 303144960 nr 12288
> prealloc data offset 8192 nr 4096
>
> Item 5 in the log tree corresponds to item 183 in the filesystem tree,
> but nothing matches item 4. Furthermore, item 183 is the last item in
> the leaf.
>
> btrfs_log_prealloc_extents() is responsible for logging prealloc extents
> beyond i_size. It first truncates any previously logged prealloc extents
> that start beyond i_size. Then, it walks the filesystem tree and copies
> the prealloc extent items to the log tree.
>
> If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
> unlocks the tree and does another search. However, while the filesystem
> tree is unlocked, an ordered extent completion may modify the tree. In
> particular, it may insert an extent item that overlaps with an extent
> item that was already copied to the log tree.
>
> This may manifest in several ways depending on the exact scenario,
> including an EEXIST error that is silently translated to a full sync,
> overlapping items in the log tree, or this crash. This particular crash
> is triggered by the following sequence of events:
>
> - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
> prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
> the last item in its B-tree leaf.
> - The file is fsync'd, which copies its inode item and both extent items
> to the log tree.
> - An xattr is set on the file, which sets the
> BTRFS_INODE_COPY_EVERYTHING flag.
> - The range 4k-8k in the file is written using direct I/O. i_size is
> extended to 8k, but the ordered extent is still in flight.
> - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
> calls copy_inode_items_to_log(), which calls
> btrfs_log_prealloc_extents().
> - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
> filesystem tree. Since it starts before i_size, it skips it. Since it
> is the last item in its B-tree leaf, it calls btrfs_next_leaf().
> - btrfs_next_leaf() unlocks the path.
> - The ordered extent completion runs, which converts the 4k-8k part of
> the prealloc extent to written and inserts the remaining prealloc part
> from 8k-12k.
> - btrfs_next_leaf() does a search and finds the new prealloc extent
> 8k-12k.
> - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
> the log tree. Note that it overlaps with the 4k-12k prealloc extent
> that was copied to the log tree by the first fsync.
> - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
> extent that was written.
> - This tries to drop the range 4k-8k in the log tree, which requires
> adjusting the start of the 4k-12k prealloc extent in the log tree to
> 8k.
> - btrfs_set_item_key_safe() sees that there is already an extent
> starting at 8k in the log tree and calls BUG().
>
> Fix this by detecting when we're about to insert an overlapping file
> extent item in the log tree and truncating the part that would overlap.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
Perfect, thanks!
Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
> Changes from v1 [1]:
>
> - Change commit subject to not mention direct I/O since this can also
> happen with buffered I/O.
> - Reformat min() call to be on one line.
> - Use btrfs_file_extent_end().
> - Rebase on for-next.
>
> 1: https://lore.kernel.org/linux-btrfs/101430650a35b55b7a32d895fd292226d13346eb.1716486455.git.osandov@fb.com/
>
> fs/btrfs/tree-log.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 5146387b416b..26a2e5aa08e9 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4860,18 +4860,23 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
> path->slots[0]++;
> continue;
> }
> - if (!dropped_extents) {
> - /*
> - * Avoid logging extent items logged in past fsync calls
> - * and leading to duplicate keys in the log tree.
> - */
> + /*
> + * Avoid overlapping items in the log tree. The first time we
> + * get here, get rid of everything from a past fsync. After
> + * that, if the current extent starts before the end of the last
> + * extent we copied, truncate the last one. This can happen if
> + * an ordered extent completion modifies the subvolume tree
> + * while btrfs_next_leaf() has the tree unlocked.
> + */
> + if (!dropped_extents || key.offset < truncate_offset) {
> ret = truncate_inode_items(trans, root->log_root, inode,
> - truncate_offset,
> + min(key.offset, truncate_offset),
> BTRFS_EXTENT_DATA_KEY);
> if (ret)
> goto out;
> dropped_extents = true;
> }
> + truncate_offset = btrfs_file_extent_end(path);
> if (ins_nr == 0)
> start_slot = slot;
> ins_nr++;
> --
> 2.45.1
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-06-07 10:21 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-24 20:58 [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Omar Sandoval
2024-05-24 20:58 ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
2024-05-26 11:47 ` Filipe Manana
2024-06-07 5:12 ` Zorro Lang
2024-06-07 9:43 ` Filipe Manana
2024-06-07 10:04 ` Zorro Lang
2024-06-07 10:11 ` Filipe Manana
2024-06-07 10:21 ` David Disseldorp
2024-05-26 11:41 ` [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Filipe Manana
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox