* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-15 15:39 [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol Mark Harmstone
@ 2024-10-16 11:09 ` Filipe Manana
2024-10-18 17:36 ` Mark Harmstone
2024-10-23 3:43 ` Zorro Lang
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2024-10-16 11:09 UTC (permalink / raw)
To: Mark Harmstone; +Cc: fstests, linux-btrfs
On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
>
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect log this tree corruption are in btrfs-progs 6.11.
This shouldn't be needed right?
Because after log replay the csums are missing and 'btrfs check'
detects (IIRC) missing csums for extents referred by file extent items
in a subvolume tree - if it doesn't then it should be improved.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/757.out | 2 ++
> 2 files changed, 73 insertions(+)
> create mode 100755 tests/generic/757
> create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> + "btrfs: determine synchronous writers from bio or writeback control"
For generic tests what we do is:
[ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....
As long as the failure has not been observed and fixed on other filesystems.
In case one day a regression happens in another fs, we just add a
corresponding line using the same logic.
Otherwise if the test one days fails on another fs and fstests
suggests that that commit is missing, it would be odd.
Everything else looks good, so with that fixed (maybe Zorro can change
that when picking the patch):
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> + _check_scratch_fs
> +
> + prev=$cur
> + cur=$(_log_writes_find_next_fua $(($cur + 1)))
> + [ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-16 11:09 ` Filipe Manana
@ 2024-10-18 17:36 ` Mark Harmstone
2024-10-23 3:31 ` Zorro Lang
0 siblings, 1 reply; 10+ messages in thread
From: Mark Harmstone @ 2024-10-18 17:36 UTC (permalink / raw)
To: Filipe Manana, Mark Harmstone, zlang@kernel.org
Cc: fstests@vger.kernel.org, linux-btrfs@vger.kernel.org
On 16/10/24 12:09, Filipe Manana wrote:
> >
> On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>
>> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
>> race could mean that csums weren't getting written to the log tree,
>> leading to corruption when it was replayed.
>>
>> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>
> This shouldn't be needed right?
> Because after log replay the csums are missing and 'btrfs check'
> detects (IIRC) missing csums for extents referred by file extent items
> in a subvolume tree - if it doesn't then it should be improved.
Yes, but we're not mounting it in the tests between the log_writes
calls, so the log isn't getting replayed. The patches to btrfs check
make it so that it identifies filesystems that would get corrupted as
soon as they're next mounted.
>
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>> This is a genericized version of the test I originally proposed as
>> btrfs/333.
>>
>> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
>> tests/generic/757.out | 2 ++
>> 2 files changed, 73 insertions(+)
>> create mode 100755 tests/generic/757
>> create mode 100644 tests/generic/757.out
>>
>> diff --git a/tests/generic/757 b/tests/generic/757
>> new file mode 100755
>> index 00000000..6ad3d01e
>> --- /dev/null
>> +++ b/tests/generic/757
>> @@ -0,0 +1,71 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +#
>> +# FS QA Test 757
>> +#
>> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
>> +# weren't getting written to the log tree, causing corruptions on remount.
>> +# This can be seen on subpage FSes on Linux 6.4.
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick metadata log recoveryloop
>> +
>> +_fixed_by_kernel_commit e917ff56c8e7 \
>> + "btrfs: determine synchronous writers from bio or writeback control"
>
> For generic tests what we do is:
>
> [ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....
>
> As long as the failure has not been observed and fixed on other filesystems.
> In case one day a regression happens in another fs, we just add a
> corresponding line using the same logic.
>
> Otherwise if the test one days fails on another fs and fstests
> suggests that that commit is missing, it would be odd.
>
> Everything else looks good, so with that fixed (maybe Zorro can change
> that when picking the patch):
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
Thanks Filipe. Zorro, let me know if you're happy making this change, or
otherwise I'll resubmit.
Mark
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-18 17:36 ` Mark Harmstone
@ 2024-10-23 3:31 ` Zorro Lang
0 siblings, 0 replies; 10+ messages in thread
From: Zorro Lang @ 2024-10-23 3:31 UTC (permalink / raw)
To: Mark Harmstone
Cc: Filipe Manana, zlang@kernel.org, fstests@vger.kernel.org,
linux-btrfs@vger.kernel.org
On Fri, Oct 18, 2024 at 05:36:26PM +0000, Mark Harmstone wrote:
> On 16/10/24 12:09, Filipe Manana wrote:
> > >
> > On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
> >>
> >> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> >> race could mean that csums weren't getting written to the log tree,
> >> leading to corruption when it was replayed.
> >>
> >> The patches to detect log this tree corruption are in btrfs-progs 6.11.
> >
> > This shouldn't be needed right?
> > Because after log replay the csums are missing and 'btrfs check'
> > detects (IIRC) missing csums for extents referred by file extent items
> > in a subvolume tree - if it doesn't then it should be improved.
>
> Yes, but we're not mounting it in the tests between the log_writes
> calls, so the log isn't getting replayed. The patches to btrfs check
> make it so that it identifies filesystems that would get corrupted as
> soon as they're next mounted.
>
> >
> >>
> >> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> >> ---
> >> This is a genericized version of the test I originally proposed as
> >> btrfs/333.
> >>
> >> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> >> tests/generic/757.out | 2 ++
> >> 2 files changed, 73 insertions(+)
> >> create mode 100755 tests/generic/757
> >> create mode 100644 tests/generic/757.out
> >>
> >> diff --git a/tests/generic/757 b/tests/generic/757
> >> new file mode 100755
> >> index 00000000..6ad3d01e
> >> --- /dev/null
> >> +++ b/tests/generic/757
> >> @@ -0,0 +1,71 @@
> >> +#! /bin/bash
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +#
> >> +# FS QA Test 757
> >> +#
> >> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> >> +# weren't getting written to the log tree, causing corruptions on remount.
> >> +# This can be seen on subpage FSes on Linux 6.4.
> >> +#
> >> +. ./common/preamble
> >> +_begin_fstest auto quick metadata log recoveryloop
> >> +
> >> +_fixed_by_kernel_commit e917ff56c8e7 \
> >> + "btrfs: determine synchronous writers from bio or writeback control"
> >
> > For generic tests what we do is:
> >
> > [ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....
> >
> > As long as the failure has not been observed and fixed on other filesystems.
> > In case one day a regression happens in another fs, we just add a
> > corresponding line using the same logic.
> >
> > Otherwise if the test one days fails on another fs and fstests
> > suggests that that commit is missing, it would be odd.
> >
> > Everything else looks good, so with that fixed (maybe Zorro can change
> > that when picking the patch):
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > Thanks.
> >
>
> Thanks Filipe. Zorro, let me know if you're happy making this change, or
> otherwise I'll resubmit.
I can help to change that when I merge it. Thanks you and Filipe.
Thanks,
Zorro
>
> Mark
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-15 15:39 [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol Mark Harmstone
2024-10-16 11:09 ` Filipe Manana
@ 2024-10-23 3:43 ` Zorro Lang
2024-10-23 3:53 ` Zorro Lang
2024-10-28 21:57 ` Darrick J. Wong
3 siblings, 0 replies; 10+ messages in thread
From: Zorro Lang @ 2024-10-23 3:43 UTC (permalink / raw)
To: Mark Harmstone; +Cc: fstests, linux-btrfs
On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/757.out | 2 ++
> 2 files changed, 73 insertions(+)
> create mode 100755 tests/generic/757
> create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> + "btrfs: determine synchronous writers from bio or writeback control"
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
For dmlogwrites test, we generally calls _log_writes_cleanup in _cleanup, to
recover the SCRATCH_DEV anyway, even if this test is killed at the middle of
its testing phase, to avoid later tests failed.
I'll add below code in this case, when I merge it. If there's not objection
from you.
_cleanup()
{
cd /
_log_writes_cleanup
rm -f $tmp.*
}
Thanks,
Zorro
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> + _check_scratch_fs
> +
> + prev=$cur
> + cur=$(_log_writes_find_next_fua $(($cur + 1)))
> + [ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-15 15:39 [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol Mark Harmstone
2024-10-16 11:09 ` Filipe Manana
2024-10-23 3:43 ` Zorro Lang
@ 2024-10-23 3:53 ` Zorro Lang
2024-10-23 10:34 ` Mark Harmstone
2024-10-28 21:57 ` Darrick J. Wong
3 siblings, 1 reply; 10+ messages in thread
From: Zorro Lang @ 2024-10-23 3:53 UTC (permalink / raw)
To: Mark Harmstone; +Cc: fstests, linux-btrfs
On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
Sorry, more review points below. I can help to change these if you say "yes"
to all :)
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/757.out | 2 ++
> 2 files changed, 73 insertions(+)
> create mode 100755 tests/generic/757
> create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
^^^
aio
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> + "btrfs: determine synchronous writers from bio or writeback control"
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
_require_aiodio ?
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
Don't you care the output of fio running anymore? Maybe use > $seqres.full ?
And just make sure, do you want to ignore failures of fio, as you do "2>&1"?
Thanks,
Zorro
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> + _check_scratch_fs
> +
> + prev=$cur
> + cur=$(_log_writes_find_next_fua $(($cur + 1)))
> + [ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-23 3:53 ` Zorro Lang
@ 2024-10-23 10:34 ` Mark Harmstone
0 siblings, 0 replies; 10+ messages in thread
From: Mark Harmstone @ 2024-10-23 10:34 UTC (permalink / raw)
To: Zorro Lang; +Cc: fstests@vger.kernel.org, linux-btrfs@vger.kernel.org
These look good to me. Thank you for your help
On 23/10/24 04:53, Zorro Lang wrote:
> >
> On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
>> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
>> race could mean that csums weren't getting written to the log tree,
>> leading to corruption when it was replayed.
>>
>> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>
> Sorry, more review points below. I can help to change these if you say "yes"
> to all :)
>
>> This is a genericized version of the test I originally proposed as
>> btrfs/333.
>>
>> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
>> tests/generic/757.out | 2 ++
>> 2 files changed, 73 insertions(+)
>> create mode 100755 tests/generic/757
>> create mode 100644 tests/generic/757.out
>>
>> diff --git a/tests/generic/757 b/tests/generic/757
>> new file mode 100755
>> index 00000000..6ad3d01e
>> --- /dev/null
>> +++ b/tests/generic/757
>> @@ -0,0 +1,71 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +#
>> +# FS QA Test 757
>> +#
>> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
>> +# weren't getting written to the log tree, causing corruptions on remount.
>> +# This can be seen on subpage FSes on Linux 6.4.
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick metadata log recoveryloop
> ^^^
> aio
>
>> +
>> +_fixed_by_kernel_commit e917ff56c8e7 \
>> + "btrfs: determine synchronous writers from bio or writeback control"
>> +
>> +fio_config=$tmp.fio
>> +
>> +. ./common/dmlogwrites
>> +
>> +_require_scratch
>> +_require_log_writes
>> +
>> +cat >$fio_config <<EOF
>> +[global]
>> +iodepth=128
>> +direct=1
>> +ioengine=libaio
>
> _require_aiodio ?
>
>> +rw=randwrite
>> +runtime=1s
>> +[job0]
>> +rw=randwrite
>> +filename=$SCRATCH_MNT/file
>> +size=1g
>> +fdatasync=1
>> +EOF
>> +
>> +_require_fio $fio_config
>> +
>> +cat $fio_config >> $seqres.full
>> +
>> +_log_writes_init $SCRATCH_DEV
>> +_log_writes_mkfs >> $seqres.full 2>&1
>> +_log_writes_mark mkfs
>> +
>> +_log_writes_mount
>> +
>> +$FIO_PROG $fio_config > /dev/null 2>&1
>
> Don't you care the output of fio running anymore? Maybe use > $seqres.full ?
>
> And just make sure, do you want to ignore failures of fio, as you do "2>&1"?
>
> Thanks,
> Zorro
>
>> +_log_writes_unmount
>> +
>> +_log_writes_remove
>> +
>> +prev=$(_log_writes_mark_to_entry_number mkfs)
>> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
>> +cur=$(_log_writes_find_next_fua $prev)
>> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
>> +
>> +while [ ! -z "$cur" ]; do
>> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
>> +
>> + _check_scratch_fs
>> +
>> + prev=$cur
>> + cur=$(_log_writes_find_next_fua $(($cur + 1)))
>> + [ -z "$cur" ] && break
>> +done
>> +
>> +echo "Silence is golden"
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/generic/757.out b/tests/generic/757.out
>> new file mode 100644
>> index 00000000..dfbc8094
>> --- /dev/null
>> +++ b/tests/generic/757.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 757
>> +Silence is golden
>> --
>> 2.44.2
>>
>>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-15 15:39 [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol Mark Harmstone
` (2 preceding siblings ...)
2024-10-23 3:53 ` Zorro Lang
@ 2024-10-28 21:57 ` Darrick J. Wong
2024-10-29 5:11 ` Zorro Lang
3 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2024-10-28 21:57 UTC (permalink / raw)
To: Mark Harmstone; +Cc: fstests, linux-btrfs
On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> tests/generic/757.out | 2 ++
> 2 files changed, 73 insertions(+)
> create mode 100755 tests/generic/757
> create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> + "btrfs: determine synchronous writers from bio or writeback control"
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> + _check_scratch_fs
This test fails on xfs because (afaict) replaying the log to $cur
results in $SCRATCH_DEV being a filesystem with a dirty log; and
xfs_repair fails when it is given a filesystem with a dirty log.
I then fixed the test to mount and unmount the filesystem to recovery
the dirty log before invoking xfs_repair:
# xfs_repair won't run if the log is dirty
if [ $FSTYP = "xfs" ]; then
_scratch_mount
_scratch_unmount
fi
_check_scratch_fs
But now the test takes a very long time to run because (on my system
anyway) the fio run can initiate 17,000 FUAs, which means that this loop
runs that many times. 100 iterations takes about 45 seconds, which is
about two hours.
Is it necessary to iterate the loop that many times to reproduce
whatever issue btrfs had?
--D
> +
> + prev=$cur
> + cur=$(_log_writes_find_next_fua $(($cur + 1)))
> + [ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-28 21:57 ` Darrick J. Wong
@ 2024-10-29 5:11 ` Zorro Lang
2024-10-29 10:03 ` Mark Harmstone
0 siblings, 1 reply; 10+ messages in thread
From: Zorro Lang @ 2024-10-29 5:11 UTC (permalink / raw)
To: Darrick J. Wong, Mark Harmstone; +Cc: fstests, linux-btrfs
On Mon, Oct 28, 2024 at 02:57:28PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> > Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> > race could mean that csums weren't getting written to the log tree,
> > leading to corruption when it was replayed.
> >
> > The patches to detect log this tree corruption are in btrfs-progs 6.11.
> >
> > Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> > ---
> > This is a genericized version of the test I originally proposed as
> > btrfs/333.
> >
> > tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
> > tests/generic/757.out | 2 ++
> > 2 files changed, 73 insertions(+)
> > create mode 100755 tests/generic/757
> > create mode 100644 tests/generic/757.out
> >
> > diff --git a/tests/generic/757 b/tests/generic/757
> > new file mode 100755
> > index 00000000..6ad3d01e
> > --- /dev/null
> > +++ b/tests/generic/757
> > @@ -0,0 +1,71 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# FS QA Test 757
> > +#
> > +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> > +# weren't getting written to the log tree, causing corruptions on remount.
> > +# This can be seen on subpage FSes on Linux 6.4.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto quick metadata log recoveryloop
> > +
> > +_fixed_by_kernel_commit e917ff56c8e7 \
> > + "btrfs: determine synchronous writers from bio or writeback control"
> > +
> > +fio_config=$tmp.fio
> > +
> > +. ./common/dmlogwrites
> > +
> > +_require_scratch
> > +_require_log_writes
> > +
> > +cat >$fio_config <<EOF
> > +[global]
> > +iodepth=128
> > +direct=1
> > +ioengine=libaio
> > +rw=randwrite
> > +runtime=1s
> > +[job0]
> > +rw=randwrite
> > +filename=$SCRATCH_MNT/file
> > +size=1g
> > +fdatasync=1
> > +EOF
> > +
> > +_require_fio $fio_config
> > +
> > +cat $fio_config >> $seqres.full
> > +
> > +_log_writes_init $SCRATCH_DEV
> > +_log_writes_mkfs >> $seqres.full 2>&1
> > +_log_writes_mark mkfs
> > +
> > +_log_writes_mount
> > +
> > +$FIO_PROG $fio_config > /dev/null 2>&1
> > +_log_writes_unmount
> > +
> > +_log_writes_remove
> > +
> > +prev=$(_log_writes_mark_to_entry_number mkfs)
> > +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> > +cur=$(_log_writes_find_next_fua $prev)
> > +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> > +
> > +while [ ! -z "$cur" ]; do
> > + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> > +
> > + _check_scratch_fs
>
> This test fails on xfs because (afaict) replaying the log to $cur
> results in $SCRATCH_DEV being a filesystem with a dirty log; and
> xfs_repair fails when it is given a filesystem with a dirty log.
>
> I then fixed the test to mount and unmount the filesystem to recovery
> the dirty log before invoking xfs_repair:
>
> # xfs_repair won't run if the log is dirty
> if [ $FSTYP = "xfs" ]; then
> _scratch_mount
> _scratch_unmount
> fi
Thanks Darrick, you're right.
I'm wondering can we always do a mount&unmount at here, no matter the
$FSTYP, if that doesn't affect the testing of other filesystems?
> _check_scratch_fs
>
> But now the test takes a very long time to run because (on my system
> anyway) the fio run can initiate 17,000 FUAs, which means that this loop
> runs that many times. 100 iterations takes about 45 seconds, which is
> about two hours.
>
> Is it necessary to iterate the loop that many times to reproduce
> whatever issue btrfs had?
Yes, it takes much long time on my side too:
FSTYP -- ext4
PLATFORM -- Linux/x86_64 dell-per750-47 6.12.0-rc4+ #1 SMP PREEMPT_DYNAMIC Fri Oct 25 14:25:45 EDT 2024
MKFS_OPTIONS -- -F /dev/sda4
MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 /dev/sda4 /mnt/xfstests/scratch
generic/757 4247s
Ran: generic/757
Passed all 1 tests
So better to reduce the testing time as much as possible, and remove it
from the "quick" group. (Maybe we can have a tag to mark those cases need
much long time too).
This patch has been merged into for-next branch, as:
cf97fa373 generic: add test for missing btrfs csums in log when doing async on subpage vol
Please send another (or other two) patch to fix above 2 problems.
Thanks,
Zorro
>
> --D
>
> > +
> > + prev=$cur
> > + cur=$(_log_writes_find_next_fua $(($cur + 1)))
> > + [ -z "$cur" ] && break
> > +done
> > +
> > +echo "Silence is golden"
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/generic/757.out b/tests/generic/757.out
> > new file mode 100644
> > index 00000000..dfbc8094
> > --- /dev/null
> > +++ b/tests/generic/757.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 757
> > +Silence is golden
> > --
> > 2.44.2
> >
> >
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] generic: add test for missing btrfs csums in log when doing async on subpage vol
2024-10-29 5:11 ` Zorro Lang
@ 2024-10-29 10:03 ` Mark Harmstone
0 siblings, 0 replies; 10+ messages in thread
From: Mark Harmstone @ 2024-10-29 10:03 UTC (permalink / raw)
To: Zorro Lang, Darrick J. Wong, Mark Harmstone, Filipe Manana
Cc: fstests@vger.kernel.org, linux-btrfs@vger.kernel.org
On 29/10/24 05:11, Zorro Lang wrote:
> On Mon, Oct 28, 2024 at 02:57:28PM -0700, Darrick J. Wong wrote:
>> On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
>>> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
>>> race could mean that csums weren't getting written to the log tree,
>>> leading to corruption when it was replayed.
>>>
>>> The patches to detect log this tree corruption are in btrfs-progs 6.11.
>>>
>>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>>> ---
>>> This is a genericized version of the test I originally proposed as
>>> btrfs/333.
>>>
>>> tests/generic/757 | 71 +++++++++++++++++++++++++++++++++++++++++++
>>> tests/generic/757.out | 2 ++
>>> 2 files changed, 73 insertions(+)
>>> create mode 100755 tests/generic/757
>>> create mode 100644 tests/generic/757.out
>>>
>>> diff --git a/tests/generic/757 b/tests/generic/757
>>> new file mode 100755
>>> index 00000000..6ad3d01e
>>> --- /dev/null
>>> +++ b/tests/generic/757
>>> @@ -0,0 +1,71 @@
>>> +#! /bin/bash
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +#
>>> +# FS QA Test 757
>>> +#
>>> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
>>> +# weren't getting written to the log tree, causing corruptions on remount.
>>> +# This can be seen on subpage FSes on Linux 6.4.
>>> +#
>>> +. ./common/preamble
>>> +_begin_fstest auto quick metadata log recoveryloop
>>> +
>>> +_fixed_by_kernel_commit e917ff56c8e7 \
>>> + "btrfs: determine synchronous writers from bio or writeback control"
>>> +
>>> +fio_config=$tmp.fio
>>> +
>>> +. ./common/dmlogwrites
>>> +
>>> +_require_scratch
>>> +_require_log_writes
>>> +
>>> +cat >$fio_config <<EOF
>>> +[global]
>>> +iodepth=128
>>> +direct=1
>>> +ioengine=libaio
>>> +rw=randwrite
>>> +runtime=1s
>>> +[job0]
>>> +rw=randwrite
>>> +filename=$SCRATCH_MNT/file
>>> +size=1g
>>> +fdatasync=1
>>> +EOF
>>> +
>>> +_require_fio $fio_config
>>> +
>>> +cat $fio_config >> $seqres.full
>>> +
>>> +_log_writes_init $SCRATCH_DEV
>>> +_log_writes_mkfs >> $seqres.full 2>&1
>>> +_log_writes_mark mkfs
>>> +
>>> +_log_writes_mount
>>> +
>>> +$FIO_PROG $fio_config > /dev/null 2>&1
>>> +_log_writes_unmount
>>> +
>>> +_log_writes_remove
>>> +
>>> +prev=$(_log_writes_mark_to_entry_number mkfs)
>>> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
>>> +cur=$(_log_writes_find_next_fua $prev)
>>> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
>>> +
>>> +while [ ! -z "$cur" ]; do
>>> + _log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
>>> +
>>> + _check_scratch_fs
>>
>> This test fails on xfs because (afaict) replaying the log to $cur
>> results in $SCRATCH_DEV being a filesystem with a dirty log; and
>> xfs_repair fails when it is given a filesystem with a dirty log.
>>
>> I then fixed the test to mount and unmount the filesystem to recovery
>> the dirty log before invoking xfs_repair:
>>
>> # xfs_repair won't run if the log is dirty
>> if [ $FSTYP = "xfs" ]; then
>> _scratch_mount
>> _scratch_unmount
>> fi
>
> Thanks Darrick, you're right.
> I'm wondering can we always do a mount&unmount at here, no matter the
> $FSTYP, if that doesn't affect the testing of other filesystems?
>
>> _check_scratch_fs
>>
>> But now the test takes a very long time to run because (on my system
>> anyway) the fio run can initiate 17,000 FUAs, which means that this loop
>> runs that many times. 100 iterations takes about 45 seconds, which is
>> about two hours.
>>
>> Is it necessary to iterate the loop that many times to reproduce
>> whatever issue btrfs had?
>
> Yes, it takes much long time on my side too:
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 dell-per750-47 6.12.0-rc4+ #1 SMP PREEMPT_DYNAMIC Fri Oct 25 14:25:45 EDT 2024
> MKFS_OPTIONS -- -F /dev/sda4
> MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 /dev/sda4 /mnt/xfstests/scratch
>
> generic/757 4247s
> Ran: generic/757
> Passed all 1 tests
>
> So better to reduce the testing time as much as possible, and remove it
> from the "quick" group. (Maybe we can have a tag to mark those cases need
> much long time too).
Or maybe it should be made btrfs-specific again? That's how I originally
wrote it, but Filipe Manana thought it ought to be genericized.
An advantage of making it FS-specific is that you can use the --fsck
option in log-writes, which is about 10 times quicker than doing the
loop in bash.
Mark
^ permalink raw reply [flat|nested] 10+ messages in thread