* mkfs is broken due to platform_zero_range
@ 2021-05-04 0:20 Darrick J. Wong
2021-05-04 0:51 ` Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Darrick J. Wong @ 2021-05-04 0:20 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs
So... I have a machine with an nvme drive manufactured by a certain
manufacturer who isn't known for the quality of their firmware
implementation. I'm pretty sure that this is a result of the use of
fallocate(FALLOC_FL_ZERO_RANGE) to zero the log during format.
If I format a device, mounting and repair both fail because the primary
superblock UUID doesn't match the log UUID:
[root@abacus654 ~]# btrace /dev/nvme0n1 > btrace.txt &
[root@abacus654 ~]# strace -s99 -o /tmp/a mkfs.xfs /dev/nvme0n1 -f
meta-data=/dev/nvme0n1 isize=512 agcount=6, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=1542990848, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
[root@abacus654 ~]# fg
btrace /dev/nvme0n1 > btrace.txt^C
[root@abacus654 ~]# xfs_repair -n /dev/nvme0n1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
* ERROR: mismatched uuid in log
* SB : 1f0e74b7-c3af-4da1-9f15-3aa3605faed7
* log: 83a2835b-58f4-4e2d-bc3f-e063617363dd
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
So then I looked at the log locations via xfs_db:
[root@abacus654 ~]# xfs_db -c 'sb' -c 'print logstart logblocks blocksize' -c 'convert fsb 805306374 daddr' /dev/nvme0n1
logstart = 805306374
logblocks = 521728
blocksize = 4096
[root@abacus654 ~]# xfs_db -c 'sb' -c 'convert fsb 805306374 daddr' /dev/nvme0n1
0x180000018 (6442450968)
Ok, so the log starts at (512b sector) 6442450968, which is byte offset
3298534895616 in /dev/nvme0n1. Checking the strace log reveals:
[root@abacus654 ~]# grep 3298534895616 /tmp/a
fallocate(4, FALLOC_FL_ZERO_RANGE, 3298534895616, 2136997888) = 0
pwrite64(4, <buffer>, 1024, 3298534895616) = 1024
Which shows that first we instructed the device to zero the entire log,
and then we wrote the log head to the first 1024 bytes of the log. If
we dump the log contents, we get:
[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog iflag=direct
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog
0000000 fe ed ba be 00 00 00 01 00 00 00 02 00 00 02 00
376 355 272 276 \0 \0 \0 001 \0 \0 \0 002 \0 \0 002 \0
<snip>
0000304 1f 0e 74 b7 c3 af 4d a1 9f 15 3a a3 60 5f ae d7
037 016 t 267 303 257 M 241 237 025 : 243 ` _ 256 327
Ok, so this is the 1024 bytes that we wrote out, and it matches the rest
of the filesystem. Note the 1f 0e 74 b7 c3 af... sequence near byte
304, which is the fs uuid according to xlog_rec_header. This looks good
so far. But then we get to the next 1024 bytes:
0001024 fe ed ba be 00 00 00 01 00 00 00 02 00 00 7e 00
376 355 272 276 \0 \0 \0 001 \0 \0 \0 002 \0 \0 ~ \0
<snip>
0001328 83 a2 83 5b 58 f4 4e 2d bc 3f e0 63 61 73 63 dd
203 242 203 [ X 364 N - 274 ? 340 c a s c 335
Notice the UUID here doesn't match -- it's the 83 a2 83 5b... sequence
that repair complained about above. Clearly, the log zeroing didn't
work, and mkfs didn't even try a trivial check that the accelerated
zeroing worked properly.
Indeed, I can even re-zero the entire log from the command line:
[root@abacus654 ~]# btrace /dev/nvme0n1 > btrace2.txt &
[root@abacus654 ~]# strace -s99 -e fallocate xfs_io -c 'fzero 3298534895616 2136997888' /dev/nvme0n1
fallocate(3, FALLOC_FL_ZERO_RANGE, 3298534895616, 2136997888) = 0
[root@abacus654 ~]# fg
btrace /dev/nvme0n1 > btrace.txt^C
And then re-excerpt the log:
[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog2 iflag=direct
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog2 | less
0000000 fe ed ba be 00 00 00 01 00 00 00 02 00 00 02 00
376 355 272 276 \0 \0 \0 001 \0 \0 \0 002 \0 \0 002 \0
0000016 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00
\0 \0 \0 001 \0 \0 \0 \0 \0 \0 \0 001 \0 \0 \0 \0
0000032 00 00 00 00 ff ff ff ff 00 00 00 01 b0 c0 d0 d0
\0 \0 \0 \0 377 377 377 377 \0 \0 \0 001 260 300 320 320
0000048 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
0000288 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 001
0000304 1f 0e 74 b7 c3 af 4d a1 9f 15 3a a3 60 5f ae d7
037 016 t 267 303 257 M 241 237 025 : 243 ` _ 256 327
Which should have zeroed the log completely. Looking at the block trace
from the mkfs run, I see:
[root@abacus654 ~]# grep 6442450968 btrace.txt
259,0 4 5907 0.118525480 26004 Q WS 6442450968 + 4173824 [mkfs.xfs]
259,0 4 5908 0.118541800 0 C WS 6442450968 + 4173824 [0]
259,0 4 5909 0.118572240 26004 Q WS 6442450968 + 2 [mkfs.xfs]
259,0 4 5910 0.118588160 0 C WS 6442450968 + 2 [0]
So we did in fact issue a write to the device, but we can still read
the previous contents after the write completes! Hooray, the device
firmware is broken!
If I look at the block trace from the xfs_io zero-range invocation, I
see:
[root@abacus654 ~]# grep 6442450968 btrace2.txt
259,0 6 1 1266874889.701085336 26385 Q WS 6442450968 + 4173824 [xfs_io]
259,0 6 2 1266874889.701172696 0 C WS 6442450968 + 4173824 [0]
So the kernel indeed sent the write request, but a subsequent re-read of
the log contents still shows the old log contents, which is why repair
and mount get mad.
Note that a standard overwite of the log causes a subsequent re-read
to produce zeroes:
[root@abacus654 ~]# dd if=/dev/zero of=/dev/nvme0n1 seek=805306371 count=521728 bs=4096 oflag=direct,sync
521728+0 records in
521728+0 records out
2136997888 bytes (2.1 GB, 2.0 GiB) copied, 9.75987 s, 219 MB/s
[root@abacus654 ~]# dd if=/dev/nvme0n1 skip=805306371 count=521728 bs=4096 of=/tmp/badlog3 iflag=direct
521728+0 records in
521728+0 records out
2136997888 bytes (2.1 GB, 2.0 GiB) copied, 8.55764 s, 250 MB/s
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog3 | less
[root@abacus654 ~]# ^C
[root@abacus654 ~]# od -tx1 -Ad -c /tmp/badlog3 | head -n15
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
2136997888
And the format works this time too:
[root@abacus654 ~]# strace -s99 -o /tmp/a mkfs.xfs /dev/nvme0n1 -f
meta-data=/dev/nvme0n1 isize=512 agcount=6, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=1542990848, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
(reverse-i-search)`-n': od -tx1 -Ad -c /tmp/badlog3 | head ^C15
[root@abacus654 ~]# xfs_repair -n /dev/nvme0n1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
In conclusion, the drive firmware is broken.
Question: Should we be doing /some/ kind of re-read after a zeroing the
log to detect these sh*tty firmwares and fall back to a pwrite()?
--D
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: mkfs is broken due to platform_zero_range
2021-05-04 0:20 mkfs is broken due to platform_zero_range Darrick J. Wong
@ 2021-05-04 0:51 ` Dave Chinner
2021-05-04 6:57 ` Christoph Hellwig
2021-05-04 7:04 ` Chaitanya Kulkarni
2 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2021-05-04 0:51 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Eric Sandeen, xfs
On Mon, May 03, 2021 at 05:20:53PM -0700, Darrick J. Wong wrote:
> So... I have a machine with an nvme drive manufactured by a certain
> manufacturer who isn't known for the quality of their firmware
> implementation. I'm pretty sure that this is a result of the use of
> fallocate(FALLOC_FL_ZERO_RANGE) to zero the log during format.
>
> If I format a device, mounting and repair both fail because the primary
> superblock UUID doesn't match the log UUID:
.....
> And the format works this time too:
>
> [root@abacus654 ~]# strace -s99 -o /tmp/a mkfs.xfs /dev/nvme0n1 -f
> meta-data=/dev/nvme0n1 isize=512 agcount=6, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=0
> = reflink=1
> data = bsize=4096 blocks=1542990848, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> Discarding blocks...Done.
> (reverse-i-search)`-n': od -tx1 -Ad -c /tmp/badlog3 | head ^C15
> [root@abacus654 ~]# xfs_repair -n /dev/nvme0n1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
>
> In conclusion, the drive firmware is broken.
>
> Question: Should we be doing /some/ kind of re-read after a zeroing the
> log to detect these sh*tty firmwares and fall back to a pwrite()?
No, userspace should not have to wrok around broken hardware. The
kernel needs to blacklist/quirk this device so that it will do
either:
a) redirect to a zeroing mechanism that actually works on that
device; or
b) fail the fallocate() call with -EOPNOTSUPP so that the
application can fall back to manual zeroing.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: mkfs is broken due to platform_zero_range
2021-05-04 0:20 mkfs is broken due to platform_zero_range Darrick J. Wong
2021-05-04 0:51 ` Dave Chinner
@ 2021-05-04 6:57 ` Christoph Hellwig
2021-05-04 7:04 ` Chaitanya Kulkarni
2 siblings, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2021-05-04 6:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Eric Sandeen, xfs
We'll need to quirk the device in the kernel. Can you send the report
including the output of 'nvme id-ctrl /dev/nvme0' to the linux-nvme
list?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: mkfs is broken due to platform_zero_range
2021-05-04 0:20 mkfs is broken due to platform_zero_range Darrick J. Wong
2021-05-04 0:51 ` Dave Chinner
2021-05-04 6:57 ` Christoph Hellwig
@ 2021-05-04 7:04 ` Chaitanya Kulkarni
2 siblings, 0 replies; 4+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-04 7:04 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Eric Sandeen, xfs
Darrick,
> On May 3, 2021, at 5:21 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>
> In conclusion, the drive firmware is broken.
>
> Question: Should we be doing /some/ kind of re-read after a zeroing the
> log to detect these sh*tty firmwares and fall back to a pwrite()?
As pointed out by Christoph and Dave, this maybe a broken write zeroes issues for NVMe device(not entirely sure).
We have seen such controllers and we add a quirk so that kernel can fallback to issuing writes instead of offloading write zeroes to controller so that usrspace don't have to deal with it.
Please send the report to Linux NVMe list that is mentioned in the Christoph's reply.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-05-04 7:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-05-04 0:20 mkfs is broken due to platform_zero_range Darrick J. Wong
2021-05-04 0:51 ` Dave Chinner
2021-05-04 6:57 ` Christoph Hellwig
2021-05-04 7:04 ` Chaitanya Kulkarni
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox