* Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
@ 2022-03-13 15:47 Manfred Spraul
  2022-03-13 22:46 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-13 15:47 UTC (permalink / raw)
  To: linux-xfs; +Cc: Spraul Manfred (XC/QMM21-CT)

Hello everyone,

after a simulated power failure, I have observed:

>>>
Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
[14768.047531] XFS (loop0): Unmount and run xfs_repair
[14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
<<<

Is this a known issue?

The image file is here:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz

A first question: are 512-byte sectors supported, or does xfs assume that
4096-byte writes are atomic?

How were the power failures simulated:

I added support to nbd to log all write operations, including the written
data. This got merged into nbd-3.24.

I've used that to create a log of running dbench (+ a few tar/rm/manual
tests) on a 500 MB image file.

In total, 2.9 million 512-byte sector writes. The datalog is ~1.5 GB long.

If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802
blocks, the above listed error message is shown. After 1,821,799 or
1,821,803 sectors, everything is ok.

(block numbers are 0-based)

> H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
> O=0000000010deb000 L=00001000
> block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000),
> len 512 (0x200).
> block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200),
> len 512 (0x200).
> block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400),
> len 512 (0x200).
<< OK
> block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600),
> len 512 (0x200).
FAIL
> block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800),
> len 512 (0x200).
FAIL
> block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00),
> len 512 (0x200).
FAIL
> block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00),
> len 512 (0x200).
FAIL
> block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00),
> len 512 (0x200).
<< OK

The output from xfs_repair is below.

kernel: 5.16.12-200.fc35.x86_64
nbd: nbd-3.24-1.fc37.x86_64
mkfs options: mkfs.xfs /dev/nbd0 -m bigtime=1 -m finobt=1 -m rmapbt=1
mount options: mount -t xfs -o uqnoenforce /dev/nbd0 $tmpmnt

Generator script:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/generator/maketr

Further log files are also on github:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/tree/main/xfs-02/result

<<<
/dev/loop0: [0037]:17060 (/tmp/data-341131.img)
Phase 1 - find and verify superblock...
        - block cache size set to 759616 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 734 tail block 734
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
Metadata CRC error detected at 0x563aa27804c3, xfs_dir3_block block 0x86f58/0x1000
corrupt block 0 in directory inode 551205
        would junk block
no . entry for directory 551205
no .. entry for directory 551205
problem with directory contents in inode 551205
would have cleared inode 551205
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 0
corrupt block 0 in directory inode 551205
        would junk block
no . entry for directory 551205
no ..
entry for directory 551205
problem with directory contents in inode 551205
would have cleared inode 551205
entry "COREL" in shortform directory 789069 references free inode 551205
would have junked entry "COREL" in directory inode 789069
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "COREL" in shortform directory inode 789069 points to free inode 551205
would junk entry
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 551174, would move to lost+found
disconnected inode 551176, would move to lost+found
disconnected inode 551178, would move to lost+found
disconnected inode 551180, would move to lost+found
disconnected inode 551206, would move to lost+found
disconnected inode 551207, would move to lost+found
disconnected inode 551208, would move to lost+found
disconnected inode 551209, would move to lost+found
disconnected inode 551210, would move to lost+found
disconnected inode 551211, would move to lost+found
disconnected inode 551212, would move to lost+found
disconnected inode 551213, would move to lost+found
disconnected inode 551214, would move to lost+found
disconnected inode 551215, would move to lost+found
disconnected inode 551217, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 789069 nlinks from 11 to 10
No modify flag set, skipping filesystem flush and exiting.
<<<<

^ permalink raw reply	[flat|nested] 11+ messages in thread
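[Editor's note] The replay approach described in this report can be sketched in a few lines. This is a hypothetical illustration, not the actual nbd datalog tooling: the record format (a list of (offset, data) pairs) stands in for whatever nbd-3.24 actually logs.

```python
# Hypothetical sketch of the replay idea: apply the first n recorded
# 512-byte sector writes to an image, then stop, modelling a power
# failure at an arbitrary point in the write stream.  The (offset,
# data) record format is illustrative, not the real nbd datalog format.

SECTOR = 512

def replay(image, writes, n):
    """Apply the first n (offset, data) sector writes to a bytearray image."""
    for offset, data in writes[:n]:
        assert len(data) == SECTOR and offset % SECTOR == 0
        image[offset:offset + SECTOR] = data
    return image
```

Replaying n and then n+1 sectors and mounting the resulting images is what bisects the failure down to a single sector write, as in the block numbers quoted in the report.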
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-13 15:47 Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 Manfred Spraul
@ 2022-03-13 22:46 ` Dave Chinner
  2022-03-14 15:18   ` Manfred Spraul
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2022-03-13 22:46 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
> Hello together,
>
> after a simulated power failure, I have observed:
>
> >>>
> Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs],
> xfs_dir3_block block 0x86f58
> [14768.047531] XFS (loop0): Unmount and run xfs_repair
> [14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX

For future reference, please paste the entire log message, from the time
that the fs was mounted to the end of the hexdump output. You might not
think the hexdump output is important, but as you'll see later....

> <<<
>
> Is this a known issue?

Is what a known issue? All this is XFS finding a corrupt metadata block
because a CRC is invalid, which is exactly what it's supposed to do.

As it is, CRC errors are indicative of storage problems such as bit errors
and torn writes, because what has been read from disk does not match what
XFS wrote when it calculated the CRC.

> The image file is here: https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
>
> As first question:
>
> Are 512 byte sectors supported, or does xfs assume that 4096 byte writes are
> atomic?

512 byte *IO* is supported on devices that have 512 byte sector support,
but there are other rules that XFS sets for metadata. e.g.
that metadata writes are expected to be written completely or replayed
completely as a whole unit regardless of their length. This is bookended by
the use of cache flushes and FUAs to ensure that multi-sector writes are
wholly completed before the recovery information is tossed away.

If a cache flush has not been issued, then the metadata block recovery
information is whole in the journal, and so if we crash or lose power then
journal recovery replays the changes and overwrites whatever is on the disk
with the correct, consistent metadata.

Log recovery will also issue large writes and cache flushes will occur as
part of the process so that the recovered metadata is whole on stable
storage before it is removed from the journal.

IOWs, if the storage ends up doing a partial write as a result of a power
failure, log recovery fixes that up if it is still in the journal. If it is
not in the journal then a cache flush *must* have happened, and hence the
metadata is complete on disk.

So....

> How were the power failures simulated:
>
> I added support to nbd to log all write operations, including the written
> data. This got merged into nbd-3.24
>
> I've used that to create a log of running dbench (+ a few tar/rm/manual
> tests) on a 500 MB image file.
>
> In total, 2.9 mio 512-byte sector writes. The datalog is ~1.5 GB long.
>
> If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802
> blocks, the above listed error message is shown.
>
> After 1,821,799 or 1,821,803 sectors, everything is ok.
>
> (block numbers are 0-based)
>
> > H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
> > O=0000000010deb000 L=00001000
> > block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len
> > 512 (0x200).
> > block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len
> > 512 (0x200).
> > block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len
> > 512 (0x200).
<< OK
> > block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len
> > 512 (0x200).
FAIL
> > block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len
> > 512 (0x200).
FAIL
> > block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len
> > 512 (0x200).
FAIL
> > block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len
> > 512 (0x200).
FAIL
> > block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len
> > 512 (0x200).
<< OK

OK, this test is explicitly tearing writes at the storage level. When there
is an update to multiple sectors of the metadata block, the metadata will
be inconsistent on disk while those individual sector writes are replayed.

For example, the problem here is likely the LSN that this write stamps into
the header along with the updated CRC. Log recovery doesn't actually check
the incoming CRC because it might be invalid (say, due to a torn write) but
it does check the magic number and then the LSN that is stamped into the
metadata block to determine if it should be replayed or not (i.e. we have
metadata version checks in recovery).

If the LSN that is stamped into the header is more recent than the object
version that log recovery is trying to replay, it will skip replay because
that can result in unnecessary transient corruption of the metadata on disk
that doesn't get corrected until later in the recovery process. This is
bad - if log recovery then fails before we recover the more recent changes,
we've created new on-disk corruption and made things worse, not better....
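[Editor's note] The recovery decision described above can be reduced to an ordering rule. A minimal sketch (not the kernel implementation; the function and tuple representation are illustrative): an LSN orders as (cycle, block), and a logged change is replayed only if it is strictly newer than the LSN already stamped in the on-disk header.

```python
# Toy model of the LSN comparison described above; the real logic lives
# in the kernel's log recovery code, this only shows the ordering.

def should_replay(disk_lsn, checkpoint_lsn):
    """LSNs are (cycle, block) tuples; tuple comparison orders by cycle
    first, then block.  Replay only if the on-disk metadata is strictly
    older than the checkpoint being recovered."""
    return disk_lsn < checkpoint_lsn
```

With the values that appear later in this mail, a block stamped (0xe, 0x5829) is replayed by a checkpoint at (0xf, 0x25c), but a block already stamped (0xf, 0x25c) is skipped.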
So, let's find the log recovery lsn (same in all images) via logprint - it's
last logged as part of this transaction:

LOG REC AT LSN cycle 15 block 604 (0xf, 0x25c)
============================================================================
TRANS: tid:0x6d1b8e4f #items:201 trans:0x6d1b8e4f q:0x5608eeb23bd0

And there are 3 data regions in it:

BUF: cnt:4 total:4 a:0x5608eeb23f60 len:24 a:0x5608eeb20f70 len:128 a:0x5608eeb23970 len:384 a:0x5608eeb22130 len:256
BUF: #regs:4 start blkno:0x86f58 len:8 bmap size:1 flags:0x5000
BUF DATA
BUF DATA
BUF DATA

The three regions are 128 bytes, 384 bytes and 256 bytes long. The first
chunk is clearly the first 128 bytes of the sector:

40 69 4123c 85000 86f58 0 1 c000001d 4f8e1b6d
        buf item        daddr                TID
48 80000000 69 33424458 b21e33d9 0 586f0800 e000000 29580000
   ophdr flags ID  XDB3     CRC      daddr   CYCLE    BLCK
50 23355a53 f14c2c57 b07cac8d b7eca938 0 25690800 400d2802 30006801
58 3000c801 0 0 25690800 22e01 40000000 0 4d0a0c00
60 22e2e02 50000000 0 26690800 5244430b 534c4f52 4746432e 60000001
68 0 27690800 4f8e1b6d

So, when this item was logged, the LSN in the in-memory buffer was
(0xe,0x5829), and it is being replayed at (0xf,0x25c). That's good, it
indicates what is in the journal is valid. But what is in the block on disk?

xfs_db> daddr 0x86f58
xfs_db> p
000: 58444233 9fabd7f4 00000000 00086f58 0000000f 0000025c 535a3523 572c4cf1
     magic    crc      daddr             block    cycle    ....

Oh, I didn't need to get it off disk like this - it's in the second line of
the hexdump output in the corruption reports:

[15063.024355] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
[15063.024466] XFS (loop0): Unmount and run xfs_repair
[15063.024468] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[15063.024471] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
[15063.024474] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1  .......\SZ5#W,L.
               CYCLE       BLOCK

Yup, there we go. The LSN is (0xf,0x25c), which tells log recovery not to
recover it because it's the same LSN as the last journal checkpoint that
recorded changes to the block.

So the write that the test is tearing up is the in-place metadata
overwrite, so it's creating physical metadata corruption in the storage.
That corruption persists until all the sectors in the metadata block have
been updated, at which point your test failures go away again.

Hence to answer your original question: Yes, XFS is behaving exactly as it
was designed to behave. The metadata verifiers have correctly detected
corruption that has resulted from the storage tearing all its writes to
little pieces and that journal recovery couldn't automatically repair after
the fact. We failed to repair it automatically because the nature of the
torn write told log recovery "don't recover this metadata from the journal
because it is already up to date". Instead, the problem was detected on
first access to the torn up metadata, and by xfs_repair.

IOWs, there are no problems with XFS here. If there is any issue at all, it
is with the assumption that filesystems can always cleanly recover from
massively (or randomly) torn writes. The fact is that they can't, and
that's why we have things like CRCs and self-describing metadata to detect
when unexpected or unrecoverable torn or misplaced writes occur deep down
in the storage layers...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
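[Editor's note] The cycle/block values read off the hexdump above can also be decoded mechanically. A sketch, assuming only what the annotated dumps show: a v5 directory block header stores a big-endian 64-bit LSN at byte offset 0x10, high 32 bits cycle, low 32 bits block.

```python
import struct

def lsn_from_header(buf):
    """Extract (cycle, block) from the big-endian 64-bit LSN at offset
    0x10 of a v5 metadata buffer, per the annotated hexdump above."""
    (lsn,) = struct.unpack_from(">Q", buf, 0x10)
    return lsn >> 32, lsn & 0xFFFFFFFF

# First 0x18 bytes of the corrupted buffer from the dmesg output above:
hdr = bytes.fromhex(
    "584442339fabd7f40000000000086f58"   # "XDB3", crc, daddr
    "0000000f0000025c"                   # LSN: cycle 0xf, block 0x25c
)
print(lsn_from_header(hdr))  # (15, 604), i.e. (0xf, 0x25c)
```

This matches the logprint output above: "LOG REC AT LSN cycle 15 block 604 (0xf, 0x25c)".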
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-13 22:46 ` Dave Chinner
@ 2022-03-14 15:18   ` Manfred Spraul
  2022-03-16  8:55     ` Manfred Spraul
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-14 15:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

Hi Dave,

On 3/13/22 23:46, Dave Chinner wrote:
> On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
>> Hello together,
>>
>> after a simulated power failure, I have observed:
>>
>> Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs],
>> xfs_dir3_block block 0x86f58
>> [14768.047531] XFS (loop0): Unmount and run xfs_repair
>> [14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
>> [14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58
>> XDB3..........oX
> For future reference, please paste the entire log message, from
> the time that the fs was mounted to the end of the hexdump output.
> You might not think the hexdump output is important, but as you'll
> see later....

Noted. I had to choose what to include in the mail; there was too much
information.

>> <<<
>>
>> Is this a known issue?
> Is what a known issue? All this is XFS finding a corrupt metadata
> block because a CRC is invalid, which is exactly what it's supposed
> to do.
>
> As it is, CRC errors are indicative of storage problem such as bit
> errors and torn writes, because what has been read from disk does
> not match what XFS wrote when it calculated the CRC.
>
>> The image file is here: https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
>>
>> As first question:
>>
>> Are 512 byte sectors supported, or does xfs assume that 4096 byte writes are
>> atomic?
> 512 byte *IO* is supported on devices that have 512 byte sector
> support, but there are other rules that XFS sets for metadata. e.g.
> that metadata writes are expected to be written completely or > replayed completely as a whole unit regardless of their length. > This > is bookended by the use of cache flushes and FUAs to ensure that > multi-sector writes are wholly completed before the recovery > information is tossed away. [...] >> How were the power failures simulated: >> >> I added support to nbd to log all write operations, including the written >> data. This got merged into nbd-3.24 >> >> I've used that to create a log of running dbench (+ a few tar/rm/manual >> tests) on a 500 MB image file. >> >> In total, 2.9 mio 512-byte sector writes. The datalog is ~1.5 GB long. >> >> If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802 >> blocks, the above listed error message is shown. >> >> After 1,821,799 or 1,821,803 sectors, everything is ok. (Correcting my own typo:) 1,821,798 or 1,821,803 are ok. >> >> (block numbers are 0-based) >> >>>> H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE) >>> O=0000000010deb000 L=00001000 >>> block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len >>> 512 (0x200). >>> block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len >>> 512 (0x200). >>> block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len >>> 512 (0x200). << OK >>> block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len >>> 512 (0x200). FAIL >>> block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len >>> 512 (0x200). FAIL >>> block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len >>> 512 (0x200). FAIL >>> block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len >>> 512 (0x200). FAIL >>> block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len >>> 512 (0x200). << OK > OK, this test is explicitly tearing writes at the storage level. 
> When there is an update to multiple sectors of the metadata block,
> the metadata will be inconsistent on disk while those individual
> sector writes are replayed.

Thanks for the clarification.

I'll modify the test application to never tear write operations, and retry.

If there are any findings, I'll share them.

--
Manfred

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-14 15:18 ` Manfred Spraul
@ 2022-03-16  8:55   ` Manfred Spraul
  2022-03-17  2:47     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-16 8:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

Hi Dave,

On 3/14/22 16:18, Manfred Spraul wrote:
> Hi Dave,
>
> On 3/13/22 23:46, Dave Chinner wrote:
>> OK, this test is explicitly tearing writes at the storage level.
>> When there is an update to multiple sectors of the metadata block,
>> the metadata will be inconsistent on disk while those individual
>> sector writes are replayed.
>
> Thanks for the clarification.
>
> I'll modify the test application to never tear write operations and
> retry.
>
> If there are findings, then I'll distribute them.
>

I've modified the test app, and with 4000 simulated power failures I have
not seen any corruptions.

Thus:

- With torn write operations: 2 corruptions from ~800 simulated power
  failures

- Without torn write operations: no corruptions from ~4000 simulated power
  failures.

But: I've checked the eMMC specification, and the spec allows torn writes
to happen. JESD84-B51A, chapter 6.6.8.1:

> All of the sectors being modified by the write operation that was
> interrupted may be in one of the following states: all sectors contain new
> data, all sectors contain old data or some sectors contain new data and
> some sectors contain old data.

"some sectors contain new data and some sectors contain old data".
NVMe also appears to allow tearing for writes larger than a certain size
(the size is 2 kB in the example in the spec, and one observed corruption
happened when tearing a 20 kB write that crosses a 32 kB boundary).
NVMe-NVM-Command-Set-Specification-1.0a-2021.07.26-Ratified, chapter
2.1.4.2 (AWUPF/NAWUPF):

> If a write command is submitted with size greater than the
> AWUPF/NAWUPF value or crosses an atomic
> boundary, then there is no guarantee of the data returned on
> subsequent reads of the associated logical
> blocks.

Is my understanding correct that XFS supports neither eMMC nor NVMe
devices? (unless there is a battery backup that exceeds the guarantees
from the spec)

--
Manfred

^ permalink raw reply	[flat|nested] 11+ messages in thread
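[Editor's note] The quoted NVMe rule can be expressed directly. A sketch with illustrative names and parameter values (nothing here is from the spec text itself): a write has no atomicity guarantee if its length exceeds the atomic write unit, or if it crosses an atomic boundary.

```python
def may_tear(start_lba, nlb, atomic_units, boundary_units):
    """True if the spec gives no atomicity guarantee for this write.
    All sizes are in logical blocks; boundary_units == 0 means the
    device reports no atomic boundary."""
    if nlb > atomic_units:
        return True              # larger than the atomic write unit
    if boundary_units:
        first = start_lba // boundary_units
        last = (start_lba + nlb - 1) // boundary_units
        if first != last:
            return True          # crosses an atomic boundary
    return False

# With 512-byte logical blocks: a 20 kB write (40 blocks) that crosses
# a 32 kB boundary (64 blocks) has no atomicity guarantee, matching the
# observed corruption described above.
```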
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-16 8:55 ` Manfred Spraul @ 2022-03-17 2:47 ` Dave Chinner 2022-03-17 3:08 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 2:47 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > Hi Dave, > > On 3/14/22 16:18, Manfred Spraul wrote: > > Hi Dave, > > > > On 3/13/22 23:46, Dave Chinner wrote: > > > OK, this test is explicitly tearing writes at the storage level. > > > When there is an update to multiple sectors of the metadata block, > > > the metadata will be inconsistent on disk while those individual > > > sector writes are replayed. > > > > Thanks for the clarification. > > > > I'll modify the test application to never tear write operations and > > retry. > > > > If there are findings, then I'll distribute them. > > > I've modified the test app, and with 4000 simulated power failures I have > not seen any corruptions. > > > Thus: > > - With teared write operations: 2 corruptions from ~800 simulated power > failures > > - Without teared write operations: no corruptions from ~4000 simulated power > failures. Good to hear. > But: > > I've checked the eMMC specification, and the spec allows that teared write > happen: Yes, most storage only guarantees that sector writes are atomic and so multi-sector writes have no guarantees of being written atomically. IOWs, all storage technologies that currently exist are allowed to tear multi-sector writes. However, FUA writes are guaranteed to be whole on persistent storage regardless of size when the hardware signals completion. And any write that the hardware has signalled as complete before a cache flush is received is also guaranteed to be whole on persistent storage when the cache flush is signalled as complete by the hardware. 
These mechanisms provide protection against torn writes.

IOWs, it's up to filesystems to guarantee data is on stable storage before
they trust it fully. Filesystems are pretty good at using REQ_FLUSH,
REQ_FUA and write completion ordering to ensure that anything they need
whole and complete on stable storage is actually whole and complete.

In the cases where torn writes occur that haven't been covered by a FUA or
cache flush guarantee (such as your test), filesystems need mechanisms in
their metadata to detect such events. CRCs are the prime mechanism for
this - that's what XFS uses, and it was XFS reporting a CRC failure when
reading torn metadata that started this whole thread.

> Is my understanding correct that XFS support neither eMMC nor NVM devices?
> (unless there is a battery backup that exceeds the guarantees from the spec)

Incorrect.

They are supported just fine because flush/FUA semantics provide guarantees
against torn writes in normal operation. IOWs, torn writes are something
that almost *never* happen in real life, even when power fails suddenly.
Despite this, XFS can detect that one has occurred (because broken storage
is all too common!), and if it can't recover automatically, it will shut
down and ask the user to correct the problem.

BTRFS and ZFS can also detect torn writes, and if you use the (non-default)
ext4 option "metadata_csum" it will also detect torn writes to metadata via
CRC failures. There are other filesystems that can detect and correct torn
writes, too.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
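[Editor's note] The flush/FUA guarantees described above can be captured in a toy device model. This is entirely illustrative (the class and its behaviour are a sketch of the semantics, not any real driver): plain writes sit in a volatile cache and may be torn per-sector at power loss, while FUA writes, and anything acknowledged before a completed FLUSH, are whole on stable media.

```python
SECTOR = 512

class SimDisk:
    """Toy write-back cache model of flush/FUA semantics."""

    def __init__(self, size):
        self.stable = bytearray(size)   # survives power loss
        self.cache = {}                 # sector -> data, volatile

    def write(self, offset, data, fua=False):
        assert offset % SECTOR == 0 and len(data) % SECTOR == 0
        for i in range(0, len(data), SECTOR):
            sec = (offset + i) // SECTOR
            chunk = data[i:i + SECTOR]
            if fua:
                # FUA: durable on completion, regardless of size.
                self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
            else:
                self.cache[sec] = chunk

    def flush(self):
        # A completed FLUSH makes every previously acknowledged write durable.
        for sec, chunk in self.cache.items():
            self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
        self.cache.clear()

    def power_fail(self, persists):
        # An arbitrary subset of cached sectors reaches the media:
        # exactly the per-sector tearing the eMMC/NVMe specs permit.
        for sec, chunk in self.cache.items():
            if persists(sec):
                self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
        self.cache.clear()
```

A journalling filesystem run on such a model only trusts data it has covered with a FUA write or a completed flush; everything else must be recoverable from the journal or detectable via CRCs, which is the behaviour this thread demonstrates.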
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 2:47 ` Dave Chinner @ 2022-03-17 3:08 ` Dave Chinner 2022-03-17 6:49 ` Manfred Spraul 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 3:08 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: > On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > > Hi Dave, > > > > On 3/14/22 16:18, Manfred Spraul wrote: > > > Hi Dave, > > > > > > On 3/13/22 23:46, Dave Chinner wrote: > > > > OK, this test is explicitly tearing writes at the storage level. > > > > When there is an update to multiple sectors of the metadata block, > > > > the metadata will be inconsistent on disk while those individual > > > > sector writes are replayed. > > > > > > Thanks for the clarification. > > > > > > I'll modify the test application to never tear write operations and > > > retry. > > > > > > If there are findings, then I'll distribute them. > > > > > I've modified the test app, and with 4000 simulated power failures I have > > not seen any corruptions. > > > > > > Thus: > > > > - With teared write operations: 2 corruptions from ~800 simulated power > > failures > > > > - Without teared write operations: no corruptions from ~4000 simulated power > > failures. > > Good to hear. > > > But: > > > > I've checked the eMMC specification, and the spec allows that teared write > > happen: > > Yes, most storage only guarantees that sector writes are atomic and > so multi-sector writes have no guarantees of being written > atomically. IOWs, all storage technologies that currently exist are > allowed to tear multi-sector writes. > > However, FUA writes are guaranteed to be whole on persistent storage > regardless of size when the hardware signals completion. 
And any
> write that the hardware has signalled as complete before a cache
> flush is received is also guaranteed to be whole on persistent
> storage when the cache flush is signalled as complete by the
> hardware. These mechanisms provide protection against torn writes.
>
> IOWs, it's up to filesystems to guarantee data is on stable storage
> before they trust it fully. Filesystems are pretty good at using
> REQ_FLUSH, REQ_FUA and write completion ordering to ensure that
> anything they need whole and complete on stable storage is actually
> whole and complete.
>
> In the cases where torn writes occur because that haven't been
> covered by a FUA or cache flush guarantee (such as your test),
> filesystems need mechanisms in their metadata to detect such events.
> CRCs are the prime mechanism for this - that's what XFS uses, and it
> was XFS reporting a CRC failure when reading torn metadata that
> started this whole thread.
>
> > Is my understanding correct that XFS support neither eMMC nor NVM devices?
> > (unless there is a battery backup that exceeds the guarantees from the spec)
>
> Incorrect.
>
> They are supported just fine because flush/FUA semantics provide
> guarantees against torn writes in normal operation. IOWs, torn
> writes are something that almost *never* happen in real life, even
> when power fails suddenly. Despite this, XFS can detect it has
> occurred (because broken storage is all too common!), and if it
> can't recovery automatically, it will shut down and ask the user to
> correct the problem.
>
> BTRFS and ZFS can also detect torn writes, and if you use the
> (non-default) ext4 option "metadata_csum" it will also detect torn

Correction - metadata_csum is enabled by default, I just ran the wrong mkfs
command when I tested it a few moments ago.

-Dave.

> writes to metadata via CRC failures. There are other filesystems
> that can detect and correct torn writes, too.
>
> Cheers,
>
> Dave.
> -- > Dave Chinner > david@fromorbit.com > -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 3:08 ` Dave Chinner @ 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner 2022-03-17 14:50 ` Theodore Ts'o 0 siblings, 2 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 6:49 UTC (permalink / raw) To: Dave Chinner, Theodore Ts'o; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) [-- Attachment #1: Type: text/plain, Size: 2972 bytes --] Hi Dave, [+Ted as the topic also applies to ext4] On 3/17/22 04:08, Dave Chinner wrote: > On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: >> On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: >>> Hi Dave, >>> >>> On 3/14/22 16:18, Manfred Spraul wrote: >>> >>> But: >>> >>> I've checked the eMMC specification, and the spec allows that teared write >>> happen: >> Yes, most storage only guarantees that sector writes are atomic and >> so multi-sector writes have no guarantees of being written >> atomically. IOWs, all storage technologies that currently exist are >> allowed to tear multi-sector writes. >> >> However, FUA writes are guaranteed to be whole on persistent storage >> regardless of size when the hardware signals completion. And any >> write that the hardware has signalled as complete before a cache >> flush is received is also guaranteed to be whole on persistent >> storage when the cache flush is signalled as complete by the >> hardware. These mechanisms provide protection against torn writes. My plan was to create a replay application that randomly creates disc images allowed by the writeback_cache_control documentation. https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html And then check that the filesystem behaves as expected/defined. The first step was: Implement the framework and just stop at a random location. >>> Is my understanding correct that XFS support neither eMMC nor NVM devices? 
>>> (unless there is a battery backup that exceeds the guarantees from the spec)
>> Incorrect.
>>
>> They are supported just fine because flush/FUA semantics provide
>> guarantees against torn writes in normal operation. IOWs, torn
>> writes are something that almost *never* happen in real life, even
>> when power fails suddenly. Despite this, XFS can detect it has
>> occurred (because broken storage is all too common!), and if it
>> can't recovery automatically, it will shut down and ask the user to
>> correct the problem.

So for xfs the behavior should be:

- without torn writes: mount is always successful, and there are no errors
  when accessing the content.

- with torn writes: there may be errors that are detected only at runtime;
  these errors may eventually cause a filesystem shutdown. (commented dmesg
  is attached)

The applications I have in mind are embedded systems, i.e. there is no user
who can correct anything; the recovery strategy must be part of the design.

>> BTRFS and ZFS can also detect torn writes, and if you use the
>> (non-default) ext4 option "metadata_csum" it will also detect torn
> Correction - metadata_csum is ienabled by default, I just ran the
> wrong mkfs command when I tested it a few moments ago.

For ext4, I have seen so far only corrupted commit blocks that cause mount
failures.
https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/

But Ted hasn't yet confirmed that this is by design :-)

--
Manfred

[-- Attachment #2: dmesg-final.txt --]
[-- Type: text/plain, Size: 10860 bytes --]

1) setup

[ 1591.878832] loop0: detected capacity change from 0 to 1024000

For info: md5sum of the image file:
b7103b519ada7dc5281d7a42c29a4271  /tmp/mount_img-2536.img

2) mount.
Command: mount -t auto /dev/loop0 x [ 1591.911516] XFS (loop0): Mounting V5 Filesystem [ 1591.945058] XFS (loop0): Starting recovery (logdev: internal) [ 1591.949055] XFS (loop0): resetting quota flags [ 1591.949590] XFS (loop0): Ending recovery (logdev: internal) Especially: Corruption not noticed at mount time. 3) find x -type f [ 1741.033535] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1741.033693] XFS (loop0): Unmount and run xfs_repair [ 1741.033696] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1741.033700] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1741.033704] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1741.033706] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1741.033708] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1741.033711] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1741.033713] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1741.033715] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1741.033717] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1741.033761] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1741.033886] XFS (loop0): Unmount and run xfs_repair [ 1741.033889] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1741.033893] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1741.033896] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1741.033899] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1741.033901] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1741.033903] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1741.033905] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1741.033907] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1741.033909] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1741.033920] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 --> corruption noticed at run time. FS tries to continue. 4) manual playing around in the filesystem: rm -Rf mv a ../b [ 1824.642762] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643000] XFS (loop0): Unmount and run xfs_repair [ 1824.643006] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643014] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643020] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.643025] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643030] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1824.643035] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643040] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643044] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643049] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643145] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643361] XFS (loop0): Unmount and run xfs_repair [ 1824.643366] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643371] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643377] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.643381] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643386] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1824.643390] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643395] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643399] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643403] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643433] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1824.643749] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643880] XFS (loop0): Unmount and run xfs_repair [ 1824.643883] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643887] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643891] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. 
[ 1824.643893] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643896] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1824.643898] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643900] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643903] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643905] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643948] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.644074] XFS (loop0): Unmount and run xfs_repair [ 1824.644076] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.644079] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.644081] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.644084] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.644086] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1824.644088] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.644091] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.644093] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.644095] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.644107] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1838.578296] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1838.578452] XFS (loop0): Unmount and run xfs_repair [ 1838.578456] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1838.578460] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1838.578464] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1838.578467] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1838.578470] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1838.578472] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1838.578475] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1838.578477] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1838.578479] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1838.578529] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1838.578699] XFS (loop0): Unmount and run xfs_repair [ 1838.578703] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1838.578707] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1838.578711] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. 
[ 1838.578715] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1838.578719] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1838.578724] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1838.578728] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1838.578732] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1838.578736] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1838.578785] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1876.302019] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1876.302177] XFS (loop0): Unmount and run xfs_repair [ 1876.302180] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1876.302184] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1876.302188] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1876.302191] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1876.302194] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1876.302196] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1876.302199] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1876.302201] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1876.302204] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1876.302221] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1876.302656] XFS (loop0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x12f/0x2b0 [xfs] (fs/xfs/xfs_trans_buf.c:296). Shutting down filesystem. 
[ 1876.302912] XFS (loop0): Please unmount the filesystem and rectify the problem(s) -> Now I broke it :-( 4) umount [ 1931.725585] XFS (loop0): Unmounting Filesystem 5) run xfs_repair. This fails due to a log that first needs to be replayed 6) mount -t auto /dev/loop0 x [ 2041.247530] XFS (loop0): Mounting V5 Filesystem [ 2041.251838] XFS (loop0): Starting recovery (logdev: internal) [ 2041.252877] XFS (loop0): Ending recovery (logdev: internal) 7) umount [ 2047.218062] XFS (loop0): Unmounting Filesystem 8) run xfs_repair This is successful. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 6:49 ` Manfred Spraul @ 2022-03-17 8:24 ` Dave Chinner 2022-03-17 16:09 ` Manfred Spraul 2022-03-17 14:50 ` Theodore Ts'o 1 sibling, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 8:24 UTC (permalink / raw) To: Manfred Spraul; +Cc: Theodore Ts'o, linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: > Hi Dave, > > [+Ted as the topic also applies to ext4] > > On 3/17/22 04:08, Dave Chinner wrote: > > On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: > > > On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > > > > Hi Dave, > > > > > > > > On 3/14/22 16:18, Manfred Spraul wrote: > > > > > > > > But: > > > > > > > > I've checked the eMMC specification, and the spec allows that teared write > > > > happen: > > > Yes, most storage only guarantees that sector writes are atomic and > > > so multi-sector writes have no guarantees of being written > > > atomically. IOWs, all storage technologies that currently exist are > > > allowed to tear multi-sector writes. > > > > > > However, FUA writes are guaranteed to be whole on persistent storage > > > regardless of size when the hardware signals completion. And any > > > write that the hardware has signalled as complete before a cache > > > flush is received is also guaranteed to be whole on persistent > > > storage when the cache flush is signalled as complete by the > > > hardware. These mechanisms provide protection against torn writes. > > My plan was to create a replay application that randomly creates disc images > allowed by the writeback_cache_control documentation. > > https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html > > And then check that the filesystem behaves as expected/defined. 
We already have that tool that exercises stepwise flush/fua aware write recovery for filesystem testing: dm-logwrites was written and integrated into fstests years ago (2016?) by Josef Bacik for testing btrfs recovery, but it was a generic solution that all filesystems can use to test failure recovery.... See, for example, common/dmlogwrites and tests/generic/482 - g/482 uses fsstress to randomly modify the filesystem while dm-logwrites records all the writes made by the filesystem. It then replays them one flush/fua at a time, mounting the filesystem to ensure that it can recover the filesystem, then runs filesystem checkers to ensure that the filesystem does not have any corrupt metadata. Then it replays to the next flush/fua and repeats. tools/dm-logwrite-replay provides a script and documents the methodology to run step by step through replay of g/482 failures to be able to reliably reproduce and diagnose the cause of the failure. There's no need to re-invent the wheel if we've already got a perfectly good one... > > > > Is my understanding correct that XFS support neither eMMC nor NVM devices? > > > > (unless there is a battery backup that exceeds the guarantees from the spec) > > > Incorrect. > > > > > > They are supported just fine because flush/FUA semantics provide > > > guarantees against torn writes in normal operation. IOWs, torn > > > writes are something that almost *never* happen in real life, even > > > when power fails suddenly. Despite this, XFS can detect it has > > > occurred (because broken storage is all too common!), and if it > > > can't recovery automatically, it will shut down and ask the user to > > > correct the problem. > > So for xfs the behavior should be: > > - without torn writes: Mount always successful, no errors when accessing the > content. Yes. Of course, there are software bugs, so mounts, recovery and subsequent repair testing can still fail. > - with torn writes: There may be error that will be detected only at > runtime. 
The errors may at the end cause a file system shutdown. Yes, and they may even prevent the filesystem from being mounted because recovery trips over them (e.g. processing pending unlinked inodes or replaying incomplete intents). > (commented dmesg is attached) > > The application I have in mind are embedded systems. > I.e. there is no user that can correct something, the recovery strategy must > be included in the design. Good luck with that - storage hardware fails in ways that no existing filesystem can recover automatically from 100% of the time. And very few even attempt to do so because it is largely an impossible requirement to fulfil. Torn writes are just the tip of the iceberg.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
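The dm-logwrites workflow Dave describes above can be sketched roughly as follows. Device names (/dev/sdb as the device under test, /dev/sdc as the log device) and mark names are placeholders, and the exact `replay-log` options should be checked against the fstests sources and the kernel's dm-log-writes documentation:

```shell
# Stack a log-writes target on the device under test: every write to
# /dev/mapper/log is recorded (with FLUSH/FUA markers) on the log device.
dmsetup create log --table "0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"

# Run the workload against the logged device, dropping named marks at
# interesting points so a later replay can stop there.
mkfs.xfs /dev/mapper/log
dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt && fsstress -d /mnt/test -n 1000; umount /mnt
dmsetup message log 0 mark test

dmsetup remove log

# Replay the recorded writes onto the raw device up to a mark (the fstests
# harness instead steps one FLUSH/FUA point at a time), then attempt
# mount/recovery and run the filesystem checker on the result.
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
```

This is essentially what generic/482 automates: replay to each flush/fua point, mount, unmount, and fsck in a loop.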
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 8:24 ` Dave Chinner @ 2022-03-17 16:09 ` Manfred Spraul 0 siblings, 0 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 16:09 UTC (permalink / raw) To: Dave Chinner; +Cc: Theodore Ts'o, linux-xfs, Spraul Manfred (XC/QMM21-CT) Hi Dave, On 3/17/22 09:24, Dave Chinner wrote: > On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: >> Hi Dave, >> >> [+Ted as the topic also applies to ext4] >> >> On 3/17/22 04:08, Dave Chinner wrote: >>> On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: >>>> On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: >>>>> Hi Dave, >>>>> >>>>> On 3/14/22 16:18, Manfred Spraul wrote: >>>>> >>>>> But: >>>>> >>>>> I've checked the eMMC specification, and the spec allows that teared write >>>>> happen: >>>> Yes, most storage only guarantees that sector writes are atomic and >>>> so multi-sector writes have no guarantees of being written >>>> atomically. IOWs, all storage technologies that currently exist are >>>> allowed to tear multi-sector writes. >>>> >>>> However, FUA writes are guaranteed to be whole on persistent storage >>>> regardless of size when the hardware signals completion. And any >>>> write that the hardware has signalled as complete before a cache >>>> flush is received is also guaranteed to be whole on persistent >>>> storage when the cache flush is signalled as complete by the >>>> hardware. These mechanisms provide protection against torn writes. >> My plan was to create a replay application that randomly creates disc images >> allowed by the writeback_cache_control documentation. >> >> https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html >> >> And then check that the filesystem behaves as expected/defined. 
> We already have that tool that exercises stepwise flush/fua aware > write recovery for filesystem testing: dm-logwrites was written and > integrated into fstests years ago (2016?) by Josef Bacik for testing > btrfs recovery, but it was a generic solution that all filesystems > can use to test failure recovery.... > > See, for example, common/dmlogwrites and tests/generic/482 - g/482 > uses fsstress to randomly modify the filesystem while dm-logwrites > records all the writes made by the filesystem. It then replays them > one flush/fua at a time, mounting the filesystem to ensure that it > can recover the filesystem, then runs filesystem checkers to ensure > that the filesystem does not have any corrupt metadata. Then it > replays to the next flush/fua and repeats. > > tools/dm-logwrite-replay provides a script and documents the > methodology to run step by step through replay of g/482 failures to > be able to reliably reproduce and diagnose the cause of the failure. > > There's no need to re-invent the wheel if we've already got a > perfectly good one... Thanks a lot for the hint! I was wondering whether a replay tool might exist and came up with nbd. The feedback was that none existed, so I wrote one. I didn't think of dm. I'll look at dm-log-writes. >>>>> Is my understanding correct that XFS support neither eMMC nor NVM devices? >>>>> (unless there is a battery backup that exceeds the guarantees from the spec) >>>> Incorrect. >>>> >>>> They are supported just fine because flush/FUA semantics provide >>>> guarantees against torn writes in normal operation. IOWs, torn >>>> writes are something that almost *never* happen in real life, even >>>> when power fails suddenly. Despite this, XFS can detect it has >>>> occurred (because broken storage is all too common!), and if it >>>> can't recovery automatically, it will shut down and ask the user to >>>> correct the problem.
>> So for xfs the behavior should be: >> >> - without torn writes: Mount always successful, no errors when accessing the >> content. > Yes. > > Of course, there are software bugs, so mounts, recovery and > subsequent repair testing can still fail. > >> - with torn writes: There may be error that will be detected only at >> runtime. The errors may at the end cause a file system shutdown. > Yes, and they may even prevent the filesystem from being mounted > because recovery trips over them (e.g. processing pending unlinked > inodes or replaying incomplete intents). > >> (commented dmesg is attached) >> >> The application I have in mind are embedded systems. >> I.e. there is no user that can correct something, the recovery strategy must >> be included in the design. > Good luck with that - storage hardware fails in ways that no > existing filesystem can recover automatically from 100% of the time. > And very few even attempt to do so because it is largely an > impossible requirement to fulfil. Torn writes are just the tip of > the iceberg.... Yes :-( -- Manfred ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner @ 2022-03-17 14:50 ` Theodore Ts'o 2022-03-17 16:03 ` Manfred Spraul 1 sibling, 1 reply; 11+ messages in thread From: Theodore Ts'o @ 2022-03-17 14:50 UTC (permalink / raw) To: Manfred Spraul; +Cc: Dave Chinner, linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: > > > > BTRFS and ZFS can also detect torn writes, and if you use the > > > (non-default) ext4 option "metadata_csum" it will also detect torn > > Correction - metadata_csum is ienabled by default, I just ran the > > wrong mkfs command when I tested it a few moments ago. > > For ext4, I have seen so far only corrupted commit blocks that cause mount > failures. > > https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/ Ext4 uses FUA writes (if available) to write out the commit block. If a FUA write can result in torn writes, in my opinion that's a bug with the storage device, or if eMMC devices don't respect FUA writes correctly, then we should just disable FUA writes entirely. In the absence of FUA, ext4 does assume that we can write out the commit block as a 4k write, and then issue a cache flush. If your simulator assumes that the 4k write can be torn, on the assumption that there is a narrow race between the issuance of the 4k write, the device writing 1-3 512 byte sectors, and then due to a power failure, the cache flush doesn't complete and the result is a torn write --- quite frankly, I'm not sure how any system using checksums can deal with that situation. I think we can only assume that that case is in reality quite rare, even if it's technically allowed by the spec. - Ted ^ permalink raw reply [flat|nested] 11+ messages in thread
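The failure mode Ted describes — a 4 KiB commit block submitted as eight 512-byte sector writes, with power failing partway through — can be illustrated with a small standalone sketch. The block layout and plain CRC32 used here are stand-ins for illustration, not ext4's or XFS's actual on-disk format or CRC32c:

```python
import binascii

SECTOR = 512
BLOCK = 4096

def make_block(body: bytes) -> bytes:
    # A toy metadata block: 4-byte CRC32 header followed by the body.
    assert len(body) == BLOCK - 4
    return binascii.crc32(body).to_bytes(4, "little") + body

def verify(block: bytes) -> bool:
    # Recompute the body checksum and compare against the stored header.
    return int.from_bytes(block[:4], "little") == binascii.crc32(block[4:])

def torn_write(old: bytes, new: bytes, sectors_done: int) -> bytes:
    # Power fails after `sectors_done` of the eight sectors reached the
    # medium: the result is a prefix of the new block, suffix of the old.
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]

old = make_block(b"\xaa" * (BLOCK - 4))
new = make_block(b"\x55" * (BLOCK - 4))

# The checksum detects a tear anywhere strictly inside the block ...
for n in range(1, BLOCK // SECTOR):
    assert not verify(torn_write(old, new, n))
# ... while 0 or 8 completed sectors leave a self-consistent (old or new)
# block. Detection is all a CRC can offer; atomicity has to come from the
# device honoring FUA/flush ordering.
assert verify(torn_write(old, new, 0))
assert verify(torn_write(old, new, 8))
```

This mirrors Ted's point: once the tear has happened, any checksum-based filesystem can only refuse the block, which is exactly the mount failure seen with the corrupted ext4 commit blocks.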
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 14:50 ` Theodore Ts'o @ 2022-03-17 16:03 ` Manfred Spraul 0 siblings, 0 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 16:03 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-xfs, Spraul Manfred (XC/QMM21-CT) On 3/17/22 15:50, Theodore Ts'o wrote: > On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: >>>> BTRFS and ZFS can also detect torn writes, and if you use the >>>> (non-default) ext4 option "metadata_csum" it will also detect torn >>> Correction - metadata_csum is ienabled by default, I just ran the >>> wrong mkfs command when I tested it a few moments ago. >> For ext4, I have seen so far only corrupted commit blocks that cause mount >> failures. >> >> https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/ > Ext4 uses FUA writes (if available) to write out the commit block. If > a FUA write can result in torn writes, in my opinion that's a bug with > the storage device, or if eMMC devices don't respect FUA writes > correctly, then we should just disable FUA writes entirely. > > In the absence of FUA, ext4 does assume that we can write out the > commit block as a 4k write, and then issue a cache flush. If your > simulator assumes that the 4k write can be torn, on the assumption > that there is a narrow race between the issuance of the 4k write, the > device writing 1-3 512 byte sectors, and then due to a power failure, > the cache flush doesn't complete and the result is a torn write --- > quite frankly, I'm not sure how any system using checksums can deal > with that situation. I think we can only assume that that case is in > reality quite rare, even if it's technically allowed by the spec. 
Just checking the eMMC spec (JESD 84-B51A), Table 40 (Admitted Data Sector Size, Address Mode and Reliable Write Granularity): native-sector-size 4 kB devices with emulation mode off have a write granularity of 4 kB; otherwise the granularity is 512 bytes. So, to avoid the risk of torn writes for ext4, emulation mode should be disabled. For XFS, the spec provides no solution (e.g. a 20 kB write that crosses a 32 kB boundary). But, obviously: the real issues identified were much simpler, and I have no evidence that torn writes are a real risk. -- Manfred ^ permalink raw reply [flat|nested] 11+ messages in thread
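The problematic case mentioned above — a write that spans more than one reliable-write unit and so may be persisted partially — reduces to a simple alignment check. This is a hypothetical helper, with the 32 kB granule taken from the discussion; a real device reports its own reliable write granularity:

```python
def may_tear(offset: int, length: int, granule: int = 32 * 1024) -> bool:
    """True if [offset, offset+length) spans more than one reliable-write
    unit, i.e. the spec allows the write to be persisted only partially."""
    # The first and last byte of the I/O fall into different granule-sized
    # windows exactly when the write crosses a granule boundary.
    return offset // granule != (offset + length - 1) // granule

# A 20 kB write is safe from tearing only at alignments where it fits
# entirely inside one 32 kB window:
assert not may_tear(offset=0, length=20 * 1024)
assert may_tear(offset=24 * 1024, length=20 * 1024)
```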
end of thread, other threads:[~2022-03-17 16:09 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-03-13 15:47 Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 Manfred Spraul 2022-03-13 22:46 ` Dave Chinner 2022-03-14 15:18 ` Manfred Spraul 2022-03-16 8:55 ` Manfred Spraul 2022-03-17 2:47 ` Dave Chinner 2022-03-17 3:08 ` Dave Chinner 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner 2022-03-17 16:09 ` Manfred Spraul 2022-03-17 14:50 ` Theodore Ts'o 2022-03-17 16:03 ` Manfred Spraul
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox