public inbox for fstests@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem
@ 2019-03-21 10:30 Edwin Török
  2019-03-21 20:23 ` Darrick J. Wong
  2019-03-21 21:26 ` Dave Chinner
  0 siblings, 2 replies; 4+ messages in thread
From: Edwin Török @ 2019-03-21 10:30 UTC (permalink / raw)
  To: fstests; +Cc: Edwin Török, Mark Syms, Tim Smith, Ross Lagerwall

Based on tests/generic/347.

In our lab we've found that if multiple iSCSI connection errors are
detected (without completely loosing the iSCSI connection) then the GFS2
filesystem becomes corrupt due to differences in filesystem and device blocksizes.
Add a test that explicitly checks for this by simulating I/O errors
deterministically with dm-thin.
Changing the blocksize to 512 would make this test pass on GFS as well.

This test causes I/O errors on purpose, and thus it is expected that a filesystem
might withdraw itself, or remount readonly.
However when mounting the filesystem again it should be usable,
without corruption (i.e. immediately unmount itself again).

Tested that this passes on ext4 and fails on GFS2.

CC: Mark Syms <Mark.Syms@citrix.com>
CC: Tim Smith <Tim.Smith@citrix.com>
CC: Ross Lagerwall <Ross.Lagerwall@citrix.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
---
 common/dmthin         |  6 ++++
 tests/generic/536     | 80 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/536.out |  2 ++
 tests/generic/group   |  1 +
 4 files changed, 89 insertions(+)
 create mode 100755 tests/generic/536
 create mode 100644 tests/generic/536.out

diff --git a/common/dmthin b/common/dmthin
index 7946e9a7..28f2ef9c 100644
--- a/common/dmthin
+++ b/common/dmthin
@@ -41,6 +41,12 @@ _dmthin_check_fs()
 	_check_scratch_fs $DMTHIN_VOL_DEV
 }
 
+_dmthin_cycle_mount()
+{
+	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_dmthin_mount
+}
+
 # Set up a dm-thin device on $SCRATCH_DEV
 #
 # All arguments are optional, and in this order; defaults follows:
diff --git a/tests/generic/536 b/tests/generic/536
new file mode 100755
index 00000000..0e8d137e
--- /dev/null
+++ b/tests/generic/536
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2015 Red Hat, Inc.  All Rights Reserved.
+#
+# FS QA Test No. generic/536
+#
+# Test that intermittent IO errors during pwrite do not cause filesystem corruption
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+BACKING_SIZE=$((500 * 1024 * 1024 / 512))	# 500M
+VIRTUAL_SIZE=$((10 * $BACKING_SIZE))		# 5000M
+GROW_SIZE=$((100 * 1024 * 1024 / 512))		# 100M
+
+_cleanup()
+{
+	_dmthin_cleanup
+	rm -f $tmp.*
+}
+
+_setup_thin()
+{
+	_dmthin_init $BACKING_SIZE $VIRTUAL_SIZE
+	_dmthin_set_queue
+	_mkfs_dev $DMTHIN_VOL_DEV
+	_dmthin_mount
+}
+
+_workout()
+{
+	# Overfill it by a bit
+	for I in `seq 1 500`; do
+		$XFS_IO_PROG -f -c "pwrite -W 0 1M" $SCRATCH_MNT/file$I &>/dev/null
+	done
+
+	sync
+
+	_dmthin_grow  $GROW_SIZE
+
+	# Write a little more, but don't fill
+	for I in `seq 501 510`; do
+		$XFS_IO_PROG -f -c "pwrite 0 1M" $SCRATCH_MNT/file$I &>/dev/null
+	done
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/dmthin
+
+_supported_fs generic
+_supported_os Linux
+_require_scratch_nocheck
+_require_dm_target thin-pool
+
+_setup_thin
+
+# trigger IO errors, the filesystem may be remounted RO or withdrawn, this is expected
+_workout
+
+# now remount the filesystem without triggering IO errors,
+# and check that the filesystem is not corrupt
+_dmthin_cycle_mount
+# ls --color makes ls stat each file, which finds the corruption
+ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
+ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
+ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
+_dmthin_cleanup
+
+echo "=== completed"
+
+status=0
+exit
diff --git a/tests/generic/536.out b/tests/generic/536.out
new file mode 100644
index 00000000..5140d261
--- /dev/null
+++ b/tests/generic/536.out
@@ -0,0 +1,2 @@
+QA output created by 536
+=== completed
diff --git a/tests/generic/group b/tests/generic/group
index 78b9b45d..a346dfab 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -538,3 +538,4 @@
 533 auto quick attr
 534 auto quick log
 535 auto quick log
+536 auto quick rw thin
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem
  2019-03-21 10:30 [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem Edwin Török
@ 2019-03-21 20:23 ` Darrick J. Wong
  2019-03-22 14:42   ` Edwin Török
  2019-03-21 21:26 ` Dave Chinner
  1 sibling, 1 reply; 4+ messages in thread
From: Darrick J. Wong @ 2019-03-21 20:23 UTC (permalink / raw)
  To: Edwin Török; +Cc: fstests, Mark Syms, Tim Smith, Ross Lagerwall

On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
> Based on tests/generic/347.
> 
> In our lab we've found that if multiple iSCSI connection errors are
> detected (without completely loosing the iSCSI connection) then the GFS2
> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
> Add a test that explicitly checks for this by simulating I/O errors
> deterministically with dm-thin.

How is this different from generic/475?  Is there something specific to
thin pools here (vs. using dm-error to simulate the errors)?

--D

> Changing the blocksize to 512 would make this test pass on GFS as well.
> 
> This test causes I/O errors on purpose, and thus it is expected that a filesystem
> might withdraw itself, or remount readonly.
> However when mounting the filesystem again it should be usable,
> without corruption (i.e. immediately unmount itself again).
> 
> Tested that this passes on ext4 and fails on GFS2.

> CC: Mark Syms <Mark.Syms@citrix.com>
> CC: Tim Smith <Tim.Smith@citrix.com>
> CC: Ross Lagerwall <Ross.Lagerwall@citrix.com>
> Signed-off-by: Edwin Török <edvin.torok@citrix.com>
> ---
>  common/dmthin         |  6 ++++
>  tests/generic/536     | 80 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/536.out |  2 ++
>  tests/generic/group   |  1 +
>  4 files changed, 89 insertions(+)
>  create mode 100755 tests/generic/536
>  create mode 100644 tests/generic/536.out
> 
> diff --git a/common/dmthin b/common/dmthin
> index 7946e9a7..28f2ef9c 100644
> --- a/common/dmthin
> +++ b/common/dmthin
> @@ -41,6 +41,12 @@ _dmthin_check_fs()
>  	_check_scratch_fs $DMTHIN_VOL_DEV
>  }
>  
> +_dmthin_cycle_mount()
> +{
> +	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
> +	_dmthin_mount
> +}
> +
>  # Set up a dm-thin device on $SCRATCH_DEV
>  #
>  # All arguments are optional, and in this order; defaults follows:
> diff --git a/tests/generic/536 b/tests/generic/536
> new file mode 100755
> index 00000000..0e8d137e
> --- /dev/null
> +++ b/tests/generic/536
> @@ -0,0 +1,80 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2015 Red Hat, Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. generic/536
> +#
> +# Test that intermittent IO errors during pwrite do not cause filesystem corruption
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +BACKING_SIZE=$((500 * 1024 * 1024 / 512))	# 500M
> +VIRTUAL_SIZE=$((10 * $BACKING_SIZE))		# 5000M
> +GROW_SIZE=$((100 * 1024 * 1024 / 512))		# 100M
> +
> +_cleanup()
> +{
> +	_dmthin_cleanup
> +	rm -f $tmp.*
> +}
> +
> +_setup_thin()
> +{
> +	_dmthin_init $BACKING_SIZE $VIRTUAL_SIZE
> +	_dmthin_set_queue
> +	_mkfs_dev $DMTHIN_VOL_DEV
> +	_dmthin_mount
> +}
> +
> +_workout()
> +{
> +	# Overfill it by a bit
> +	for I in `seq 1 500`; do
> +		$XFS_IO_PROG -f -c "pwrite -W 0 1M" $SCRATCH_MNT/file$I &>/dev/null
> +	done
> +
> +	sync
> +
> +	_dmthin_grow  $GROW_SIZE
> +
> +	# Write a little more, but don't fill
> +	for I in `seq 501 510`; do
> +		$XFS_IO_PROG -f -c "pwrite 0 1M" $SCRATCH_MNT/file$I &>/dev/null
> +	done
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/dmthin
> +
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_nocheck
> +_require_dm_target thin-pool
> +
> +_setup_thin
> +
> +# trigger IO errors, the filesystem may be remounted RO or withdrawn, this is expected
> +_workout
> +
> +# now remount the filesystem without triggering IO errors,
> +# and check that the filesystem is not corrupt
> +_dmthin_cycle_mount
> +# ls --color makes ls stat each file, which finds the corruption
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +_dmthin_cleanup
> +
> +echo "=== completed"
> +
> +status=0
> +exit
> diff --git a/tests/generic/536.out b/tests/generic/536.out
> new file mode 100644
> index 00000000..5140d261
> --- /dev/null
> +++ b/tests/generic/536.out
> @@ -0,0 +1,2 @@
> +QA output created by 536
> +=== completed
> diff --git a/tests/generic/group b/tests/generic/group
> index 78b9b45d..a346dfab 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -538,3 +538,4 @@
>  533 auto quick attr
>  534 auto quick log
>  535 auto quick log
> +536 auto quick rw thin
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem
  2019-03-21 10:30 [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem Edwin Török
  2019-03-21 20:23 ` Darrick J. Wong
@ 2019-03-21 21:26 ` Dave Chinner
  1 sibling, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2019-03-21 21:26 UTC (permalink / raw)
  To: Edwin Török; +Cc: fstests, Mark Syms, Tim Smith, Ross Lagerwall

On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
> Based on tests/generic/347.
> 
> In our lab we've found that if multiple iSCSI connection errors are
> detected (without completely loosing the iSCSI connection) then the GFS2
> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
> Add a test that explicitly checks for this by simulating I/O errors
> deterministically with dm-thin.

Exactly what IO errors is dm-thinp generating here? If you run it
out of space, then it triggers ENOSPC, not EIO. That's very, very
different to iSCSI throwing random EIO errors..

.....
> +# now remount the filesystem without triggering IO errors,
> +# and check that the filesystem is not corrupt
> +_dmthin_cycle_mount
> +# ls --color makes ls stat each file, which finds the corruption

Not sure it always does - ISTR that in the past if the dtype
returned indicated the type of file, then it ls would omit the stat
just for the purposes of coloring....

And, realistically, the way we find /filesystem/ corruption is to
run fsck/repair, not iterate the directory structure. If we are
looking for missing files, then we dump the directory structure to
the golden output file or dump it before/after errors and compare
that they are the same.

> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"

If corruption is not found on the first pass, why would the next 2
passes find anything different?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem
  2019-03-21 20:23 ` Darrick J. Wong
@ 2019-03-22 14:42   ` Edwin Török
  0 siblings, 0 replies; 4+ messages in thread
From: Edwin Török @ 2019-03-22 14:42 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner
  Cc: fstests, Mark Syms, Tim Smith, Ross Lagerwall

On 21/03/2019 20:23, Darrick J. Wong wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely loosing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
> 
> How is this different from generic/475?  Is there something specific to
> thin pools here (vs. using dm-error to simulate the errors)?

When I tried generic/475 it hanged in unmount and never reached the data corruption part.
Thanks for the suggestion, dm-error would be better than dm-thin, see below.

On 21/03/2019 21:26, Dave Chinner wrote:> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely loosing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
> 
> Exactly what IO errors is dm-thinp generating here? If you run it
> out of space, then it triggers ENOSPC, not EIO. That's very, very
> different to iSCSI throwing random EIO errors..

I agree that dm-error would be a better starting place than dm-thin for this test,
I'll try to modify it and see if I can get it to finish running without hanging, and reproduce the corruption issue.


On 21/03/2019 21:26, Dave Chinner wrote:> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> +# now remount the filesystem without triggering IO errors,
>> +# and check that the filesystem is not corrupt
>> +_dmthin_cycle_mount
>> +# ls --color makes ls stat each file, which finds the corruption
> 
> Not sure it always does - ISTR that in the past if the dtype
> returned indicated the type of file, then it ls would omit the stat
> just for the purposes of coloring....
> 
> And, realistically, the way we find /filesystem/ corruption is to
> run fsck/repair, not iterate the directory structure.

I don't disagree, however GFS2's fsck is very noisy and complains about inconsistencies
even on a filesystem where I can otherwise list and read each entry correctly.
I wanted to make a clear distinction between that and actual corruption observed, so that the 2 bugs
can be fixed independently.

Perhaps the test should first do an 'ls/stat', and if that is fine then unmount and run the filesystem check as usual.

> If we are
> looking for missing files, then we dump the directory structure to
> the golden output file or dump it before/after errors and compare
> that they are the same.
> 
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> 
> If corruption is not found on the first pass, why would the next 2
> passes find anything different?

Indeed, I'll drop them.

Thanks,
--Edwin

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-03-22 14:42 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-03-21 10:30 [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem Edwin Török
2019-03-21 20:23 ` Darrick J. Wong
2019-03-22 14:42   ` Edwin Török
2019-03-21 21:26 ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox