* [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-09 18:18 [PATCHSET 0/3] fstests: random fixes for v2023.10.08 Darrick J. Wong
@ 2023-10-09 18:18 ` Darrick J. Wong
2023-10-10 7:01 ` Christoph Hellwig
` (2 more replies)
2023-10-09 18:18 ` [PATCH 2/3] generic/465: only complain about stale disk contents when racing directio Darrick J. Wong
2023-10-09 18:18 ` [PATCH 3/3] generic/269,xfs/051: don't drop fsstress failures to stdout Darrick J. Wong
2 siblings, 3 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-09 18:18 UTC (permalink / raw)
To: djwong, zlang; +Cc: linux-xfs, fstests, guan
From: Darrick J. Wong <djwong@kernel.org>
When I added an fstests config for "RAID" striping (aka MKFS_OPTIONS='-d
su=128k,sw=4'), I suddenly started seeing this test fail sporadically
with:
--- /tmp/fstests/tests/xfs/178.out 2023-07-11 12:18:21.714970364 -0700
+++ /var/tmp/fstests/xfs/178.out.bad 2023-07-25 22:05:39.756000000 -0700
@@ -10,6 +10,20 @@ bad primary superblock - bad magic numbe
attempting to find secondary superblock...
found candidate secondary superblock...
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
Eventually I tracked this down to a mis-interaction between the test,
xfs_repair, and the storage device.
The storage advertises SCSI UNMAP support, but it is of the variety
where the UNMAP command returns immediately but takes its time to unmap
in the background. Subsequent rereads are allowed to return stale
contents, per DISCARD semantics.
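This sort of background unmap can be probed with a sketch like the
following (the device path, offsets, and timings are hypothetical; the
device part requires root and a disposable scratch disk):

```shell
# Helper: summarize the distinct byte values in a stream, so stale
# contents show up as anything other than a lone "00" line.
unique_bytes() {
    od -An -tx1 | tr -s ' ' '\n' | sed '/^$/d' | sort -u
}

# Illustration only -- requires root and a scratch device you can
# destroy (the path is a placeholder):
#   blkdiscard -o 0 -l 1048576 /dev/sdX
#   for i in 1 2 3; do
#       dd if=/dev/sdX bs=1M count=1 2>/dev/null | unique_bytes
#       sleep 5
#   done
```

On storage that completes unmap in the background, the loop keeps
reporting the old byte mix for a while before it settles to "00".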
When the fstests cloud is not busy, the old contents disappear in a few
seconds. However, at peak utilization, there are ~75 VMs running, and
the storage backend can take several minutes to commit these background
requests.
When we zero the primary super and start xfs_repair on SCRATCH_DEV, it
will walk the device looking for secondary supers. Most of the time it
finds the actual AG 1 secondary super, but sometimes it finds ghosts
from previous formats. When that happens, xfs_repair will talk quite a
bit about those failed secondaries, even if it eventually finds an
acceptable secondary sb and completes the repair.
Filter out the messages about secondary supers.
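For illustration, here is a hypothetical sample of the noisy repair
output run through the same sed expressions the patch adds (the sample
lines are taken from the failure above):

```shell
# Sketch: the extra sed expressions drop the ghost-superblock chatter
# while leaving the rest of the repair transcript intact.
filter_ghosts() {
    sed -e '/unable to verify superblock, continuing/d' \
        -e '/found candidate secondary superblock/d' \
        -e '/error reading superblock.*-- seek to offset/d'
}

filter_ghosts <<'EOF'
attempting to find secondary superblock...
found candidate secondary superblock...
unable to verify superblock, continuing...
error reading superblock 1 -- seek to offset 584115421184 failed
verified secondary superblock...
writing modified primary superblock
EOF
```

Only the "attempting", "verified", and "writing" lines survive, which
is why the golden output also drops its "found candidate" line.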
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
tests/xfs/178 | 9 ++++++++-
tests/xfs/178.out | 2 --
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/tests/xfs/178 b/tests/xfs/178
index a65197cde3..fee1e92bf3 100755
--- a/tests/xfs/178
+++ b/tests/xfs/178
@@ -10,13 +10,20 @@
. ./common/preamble
_begin_fstest mkfs other auto
+filter_repair() {
+ _filter_repair | sed \
+ -e '/unable to verify superblock, continuing/d' \
+ -e '/found candidate secondary superblock/d' \
+ -e '/error reading superblock.*-- seek to offset/d'
+}
+
# dd the 1st sector then repair
_dd_repair_check()
{
#dd first sector
dd if=/dev/zero of=$1 bs=$2 count=1 2>&1 | _filter_dd
#xfs_repair
- _scratch_xfs_repair 2>&1 | _filter_repair
+ _scratch_xfs_repair 2>&1 | filter_repair
#check repair
if _check_scratch_fs; then
echo "repair passed"
diff --git a/tests/xfs/178.out b/tests/xfs/178.out
index 0bebe553eb..711e90cc26 100644
--- a/tests/xfs/178.out
+++ b/tests/xfs/178.out
@@ -9,7 +9,6 @@ Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
-found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
@@ -45,7 +44,6 @@ Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
-found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-09 18:18 ` [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks Darrick J. Wong
@ 2023-10-10 7:01 ` Christoph Hellwig
2023-10-11 19:10 ` Darrick J. Wong
2023-10-12 8:30 ` Christoph Hellwig
2023-10-12 15:09 ` [PATCH v2 " Darrick J. Wong
2 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2023-10-10 7:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: zlang, linux-xfs, fstests, guan, linux-block, linux-scsi
On Mon, Oct 09, 2023 at 11:18:33AM -0700, Darrick J. Wong wrote:
> The storage advertises SCSI UNMAP support, but it is of the variety
> where the UNMAP command returns immediately but takes its time to unmap
> in the background. Subsequent rereads are allowed to return stale
> contents, per DISCARD semantics.
>
> When the fstests cloud is not busy, the old contents disappear in a few
> seconds. However, at peak utilization, there are ~75 VMs running, and
> the storage backend can take several minutes to commit these background
> requests.
Umm, that is not valid behavior of SCSI UNMAP or any other command
that Linux discard maps to. All of them can do one of the two options
on a per-block basis:
- return the unmap pattern (usually but not always 0) for any read
following the unmap/trim/discard
- always return the previous pattern until it is overwritten or
discarded again
Changing the pattern some time after unmap is a grave bug, and we need
to blacklist the device.
* Re: [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-10 7:01 ` Christoph Hellwig
@ 2023-10-11 19:10 ` Darrick J. Wong
2023-10-12 4:26 ` Christoph Hellwig
0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-11 19:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: zlang, linux-xfs, fstests, guan, linux-block, linux-scsi
On Tue, Oct 10, 2023 at 12:01:33AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 09, 2023 at 11:18:33AM -0700, Darrick J. Wong wrote:
> > The storage advertises SCSI UNMAP support, but it is of the variety
> > where the UNMAP command returns immediately but takes its time to unmap
> > in the background. Subsequent rereads are allowed to return stale
> > contents, per DISCARD semantics.
> >
> > When the fstests cloud is not busy, the old contents disappear in a few
> > seconds. However, at peak utilization, there are ~75 VMs running, and
> > the storage backend can take several minutes to commit these background
> > requests.
>
> Umm, that is not valid behavior of SCSI UNMAP or any other command
> that Linux discard maps to. All of them can do one of the two options
> on a per-block basis:
>
> - return the unmap pattern (usually but not always 0) for any read
> following the unmap/trim/discard
> - always return the previous pattern until it is overwritten or
> discarded again
>
> Changing the pattern some time after unmap is a grave bug, and we need
> to blacklist the device.
Ok, I'll go pester them about fixing that, if they haven't already.
Apparently discard support is somewhat new.
I'm pretty sure I've seen some NVMe SSDs where you can issue devicewide
DISCARDs and slowly watch the namespace utilization go down over tens of
minutes; and reads will only eventually start returning zeroes.
(Note that *writes* during the slow-discard period are persisted
correctly.)
However, that's orthogonal to this patch -- if the device doesn't
support discard, _scratch_mkfs won't zero the entire disk to remove old
dead superblocks that might have been written by previous tests. After
we shatter the primary super, the xfs_repair scanning code can still
trip over those old supers and break the golden output.
--D
* Re: [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-11 19:10 ` Darrick J. Wong
@ 2023-10-12 4:26 ` Christoph Hellwig
2023-10-12 5:03 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2023-10-12 4:26 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, zlang, linux-xfs, fstests, guan, linux-block,
linux-scsi
On Wed, Oct 11, 2023 at 12:10:25PM -0700, Darrick J. Wong wrote:
> I'm pretty sure I've seen some NVMe SSDs where you can issue devicewide
> DISCARDs and slowly watch the namespace utilization go down over tens of
> minutes; and reads will only eventually start returning zeroes.
Well, the second part is broken. The first part is fine, and I've briefly
consulted with a firmware team implementing such a feature. It just needs
to make sure to return zeroes right after the return of the discard
even if the blocks aren't erased yet, including after a powerfail.
(anyone who knows the XFS truncate / hole punch code will have a vague
idea of how that could work).
> However, that's orthogonal to this patch -- if the device doesn't
> support discard, _scratch_mkfs won't zero the entire disk to remove old
> dead superblocks that might have been written by previous tests. After
> we shatter the primary super, the xfs_repair scanning code can still
> trip over those old supers and break the golden output.
True. I have to admit I stopped reading the patch after the unmap
description. I'll take another look.
* Re: [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-12 4:26 ` Christoph Hellwig
@ 2023-10-12 5:03 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-12 5:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: zlang, linux-xfs, fstests, guan, linux-block, linux-scsi
On Wed, Oct 11, 2023 at 09:26:51PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 11, 2023 at 12:10:25PM -0700, Darrick J. Wong wrote:
> > I'm pretty sure I've seen some NVMe SSDs where you can issue devicewide
> > DISCARDs and slowly watch the namespace utilization go down over tens of
> > minutes; and reads will only eventually start returning zeroes.
>
> Well, the second part is broken. The first part is fine, and I've briefly
> consulted with a firmware team implementing such a feature. It just needs
> to make sure to return zeroes right after the return of the discard
> even if the blocks aren't erased yet, including after a powerfail.
> (anyone who knows the XFS truncate / hole punch code will have a vague
> idea of how that could work).
>
> > However, that's orthogonal to this patch -- if the device doesn't
> > support discard, _scratch_mkfs won't zero the entire disk to remove old
> > dead superblocks that might have been written by previous tests. After
> > we shatter the primary super, the xfs_repair scanning code can still
> > trip over those old supers and break the golden output.
>
> True. I have to admit I stopped reading the patch after the unmap
> description. I'll take another look.
<nod> I think I'll update the next version of this patch to substitute
the paragraph that I wrote above for all the misleading ramblings about
DISCARD. Today's revisit of that cloud block device doesn't show the
weird stale-reads-that-disappear behavior, so it's entirely possible
that they've fixed it already.
--D
* Re: [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-09 18:18 ` [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks Darrick J. Wong
2023-10-10 7:01 ` Christoph Hellwig
@ 2023-10-12 8:30 ` Christoph Hellwig
2023-10-12 15:09 ` [PATCH v2 " Darrick J. Wong
2 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2023-10-12 8:30 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: zlang, linux-xfs, fstests, guan
Unlike the commit log, the changes look good to me.
* [PATCH v2 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks
2023-10-09 18:18 ` [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks Darrick J. Wong
2023-10-10 7:01 ` Christoph Hellwig
2023-10-12 8:30 ` Christoph Hellwig
@ 2023-10-12 15:09 ` Darrick J. Wong
2023-10-12 15:10 ` Christoph Hellwig
2 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-12 15:09 UTC (permalink / raw)
To: zlang, Christoph Hellwig; +Cc: linux-xfs, fstests, guan
From: Darrick J. Wong <djwong@kernel.org>
When I added an fstests config for "RAID" striping (aka MKFS_OPTIONS='-d
su=128k,sw=4'), I suddenly started seeing this test fail sporadically
with:
--- /tmp/fstests/tests/xfs/178.out 2023-07-11 12:18:21.714970364 -0700
+++ /var/tmp/fstests/xfs/178.out.bad 2023-07-25 22:05:39.756000000 -0700
@@ -10,6 +10,20 @@ bad primary superblock - bad magic numbe
attempting to find secondary superblock...
found candidate secondary superblock...
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
Eventually I tracked this down to a mis-interaction between the test,
xfs_repair, and the storage device.
If the device doesn't support discard, _scratch_mkfs won't zero the
entire disk to remove old dead superblocks that might have been written
by previous tests. After we shatter the primary super, the xfs_repair
scanning code can still trip over those old supers when it goes looking
for secondary supers.
Most of the time it finds the actual AG 1 secondary super, but sometimes
it finds ghosts from previous formats. When that happens, xfs_repair
will talk quite a bit about those failed secondaries, even if it
eventually finds an acceptable secondary sb and completes the repair.
Filter out the messages about secondary supers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v2: fix commit message to identify the problem in fstests, drop the
irrelevant mumbling about SCSI UNMAP
---
tests/xfs/178 | 9 ++++++++-
tests/xfs/178.out | 2 --
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/tests/xfs/178 b/tests/xfs/178
index a65197cde3..fee1e92bf3 100755
--- a/tests/xfs/178
+++ b/tests/xfs/178
@@ -10,13 +10,20 @@
. ./common/preamble
_begin_fstest mkfs other auto
+filter_repair() {
+ _filter_repair | sed \
+ -e '/unable to verify superblock, continuing/d' \
+ -e '/found candidate secondary superblock/d' \
+ -e '/error reading superblock.*-- seek to offset/d'
+}
+
# dd the 1st sector then repair
_dd_repair_check()
{
#dd first sector
dd if=/dev/zero of=$1 bs=$2 count=1 2>&1 | _filter_dd
#xfs_repair
- _scratch_xfs_repair 2>&1 | _filter_repair
+ _scratch_xfs_repair 2>&1 | filter_repair
#check repair
if _check_scratch_fs; then
echo "repair passed"
diff --git a/tests/xfs/178.out b/tests/xfs/178.out
index 0bebe553eb..711e90cc26 100644
--- a/tests/xfs/178.out
+++ b/tests/xfs/178.out
@@ -9,7 +9,6 @@ Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
-found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
@@ -45,7 +44,6 @@ Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
-found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode INO inconsistent with calculated value INO
* [PATCH 2/3] generic/465: only complain about stale disk contents when racing directio
2023-10-09 18:18 [PATCHSET 0/3] fstests: random fixes for v2023.10.08 Darrick J. Wong
2023-10-09 18:18 ` [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks Darrick J. Wong
@ 2023-10-09 18:18 ` Darrick J. Wong
2023-10-10 7:03 ` Christoph Hellwig
2023-10-09 18:18 ` [PATCH 3/3] generic/269,xfs/051: don't drop fsstress failures to stdout Darrick J. Wong
2 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-09 18:18 UTC (permalink / raw)
To: djwong, zlang; +Cc: tytso, jack, linux-xfs, fstests, guan
From: Darrick J. Wong <djwong@kernel.org>
This test does a strange thing with directio -- it races a reader thread
with an appending aio writer thread and checks that the reader thread
only ever sees a (probably short) buffer containing the same contents
that are being read.
However, this has never worked correctly on XFS, which supports
concurrent readers and writers for directio. Say you already have a
file with a single written mapping A:
AAAAAAAAAA
0 EOF
Then one thread initiates an aligned appending write:
AAAAAAAAAA---------
0 EOF new_EOF
However, the free space is fragmented, so the file range maps to
multiple extents b and c (lowercase means unwritten here):
AAAAAAAAAAbbbbccccc
0 EOF new_EOF
This implies separate bios for b and c. Both bios are issued, but c
completes first. The ioend for c will extend i_size all the way to
new_EOF. Extent b is still marked unwritten because it hasn't completed
yet.
Next, the test reader slips in and tries to read the range between the
old EOF and the new EOF. The file looks like this now:
AAAAAAAAAAbbbbCCCCC
0 EOF new_EOF
So the reader sees "bbbbCCCCC" in the mapping, and the buffer returned
contains a range of zeroes followed by whatever was written to C.
For pagecache IO I would say that i_size should not be extended until
the extending write is fully complete, but the pagecache also
coordinates access so that reads and writes cannot conflict.
However, this is directio. Reads and writes to the storage device can
be issued and acknowledged in any order. I asked Ted and Jan about this
point, and they echoed that for directio, application software is
expected to coordinate access itself.
In other words, the only thing that the reader can check here is that
the filesystem is not returning stale disk contents. Amend the test so
that null bytes in the reader buffer are acceptable.
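The amended acceptance rule can be sketched outside the C harness too;
this hypothetical shell rendition treats a read buffer as valid when it
contains only 'a' bytes (written data) and NUL bytes (a torn, but not
stale, read):

```shell
# Sketch of the relaxed check: delete every 'a' and every NUL byte;
# any byte left over is stale disk content and must fail the test.
check_buf() {
    [ "$(tr -d 'a\000' < "$1" | wc -c)" -eq 0 ]
}

printf 'aaaa\0\0\0\0' > /tmp/ok_buf      # torn append read: acceptable
printf 'aaaaXXXX'     > /tmp/stale_buf   # stale contents: must fail
check_buf /tmp/ok_buf && echo "pass"
check_buf /tmp/stale_buf || echo "stale bytes detected"
```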
Cc: tytso@mit.edu, jack@suse.cz
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
.../aio-dio-append-write-read-race.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/aio-dio-regress/aio-dio-append-write-read-race.c b/src/aio-dio-regress/aio-dio-append-write-read-race.c
index 911f27230b..d9f8982f00 100644
--- a/src/aio-dio-regress/aio-dio-append-write-read-race.c
+++ b/src/aio-dio-regress/aio-dio-append-write-read-race.c
@@ -191,7 +191,7 @@ int main(int argc, char *argv[])
}
for (j = 0; j < rdata.read_sz; j++) {
- if (rdata.buf[j] != 'a') {
+ if (rdata.buf[j] != 'a' && rdata.buf[j] != 0) {
fail("encounter an error: "
"block %d offset %d, content %x\n",
i, j, rbuf[j]);
* [PATCH 3/3] generic/269,xfs/051: don't drop fsstress failures to stdout
2023-10-09 18:18 [PATCHSET 0/3] fstests: random fixes for v2023.10.08 Darrick J. Wong
2023-10-09 18:18 ` [PATCH 1/3] xfs/178: don't fail when SCRATCH_DEV contains random xfs superblocks Darrick J. Wong
2023-10-09 18:18 ` [PATCH 2/3] generic/465: only complain about stale disk contents when racing directio Darrick J. Wong
@ 2023-10-09 18:18 ` Darrick J. Wong
2023-10-10 7:03 ` Christoph Hellwig
2 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2023-10-09 18:18 UTC (permalink / raw)
To: djwong, zlang; +Cc: linux-xfs, fstests, guan
From: Darrick J. Wong <djwong@kernel.org>
Prior to commit f55e46d629, these two tests would run fsstress until it
hit a failure -- ENOSPC in the case of generic/269, and EIO in the case
of xfs/051. These errors are expected, which was why stderr was also
redirected to /dev/null. Commit f55e46d629 removed the stderr
redirection, which has resulted in a 100% failure rate.
Fix this regression by redirecting the stderr stream to $seqres.full.
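The difference between the two redirections can be sketched with a toy
command (the names are hypothetical; note that '&>>' is bash syntax,
not POSIX sh):

```shell
# With '>>' only stdout goes to the log, so expected ENOSPC/EIO
# complaints on stderr leak into the test's output and break the
# golden image; '&>>' captures both streams.
log=$(mktemp)

noisy() {
    echo "progress"          # stdout: belongs in the log
    echo "write failed" >&2  # stderr: expected failure chatter
}

noisy >> "$log"    # "write failed" still escapes to the caller's stderr
noisy &>> "$log"   # both streams land in the log
```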
Fixes: f55e46d629 ("fstests: redirect fsstress' stdout to $seqres.full instead of /dev/null")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
tests/generic/269 | 2 +-
tests/xfs/051 | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tests/generic/269 b/tests/generic/269
index b852f6bf7e..b7cdecd94f 100755
--- a/tests/generic/269
+++ b/tests/generic/269
@@ -23,7 +23,7 @@ _workout()
out=$SCRATCH_MNT/fsstress.$$
args=`_scale_fsstress_args -p128 -n999999999 -f setattr=1 $FSSTRESS_AVOID -d $out`
echo "fsstress $args" >> $seqres.full
- $FSSTRESS_PROG $args >> $seqres.full &
+ $FSSTRESS_PROG $args &>> $seqres.full &
pid=$!
echo "Run dd writers in parallel"
for ((i=0; i < num_iterations; i++))
diff --git a/tests/xfs/051 b/tests/xfs/051
index 1c6709648d..aca867c940 100755
--- a/tests/xfs/051
+++ b/tests/xfs/051
@@ -38,7 +38,7 @@ _scratch_mount
# Start a workload and shutdown the fs. The subsequent mount will require log
# recovery.
-$FSSTRESS_PROG -n 9999 -p 2 -w -d $SCRATCH_MNT >> $seqres.full &
+$FSSTRESS_PROG -n 9999 -p 2 -w -d $SCRATCH_MNT &>> $seqres.full &
sleep 5
_scratch_shutdown -f
$KILLALL_PROG -q $FSSTRESS_PROG