* Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
@ 2022-03-13 15:47 Manfred Spraul
  2022-03-13 22:46 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-13 15:47 UTC (permalink / raw)
  To: linux-xfs; +Cc: Spraul Manfred (XC/QMM21-CT)

Hello everyone,

after a simulated power failure, I have observed:

>>>
Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
[14768.047531] XFS (loop0): Unmount and run xfs_repair
[14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
<<<

Is this a known issue?

The image file is here:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz

A first question: are 512-byte sectors supported, or does xfs assume that
4096-byte writes are atomic?

How were the power failures simulated:

I added support to nbd to log all write operations, including the written
data. This got merged into nbd-3.24.

I've used that to create a log of running dbench (+ a few tar/rm/manual
tests) on a 500 MB image file.

In total, 2.9 million 512-byte sector writes. The datalog is ~1.5 GB long.

If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802
blocks, the above listed error message is shown. After 1,821,799 or
1,821,803 sectors, everything is ok.

(block numbers are 0-based)

> H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
> O=0000000010deb000 L=00001000
> block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000),
> len 512 (0x200).
> block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200),
> len 512 (0x200).
> block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400),
> len 512 (0x200).
<< OK
> block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600),
> len 512 (0x200).
FAIL
> block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800),
> len 512 (0x200).
FAIL
> block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00),
> len 512 (0x200).
FAIL
> block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00),
> len 512 (0x200).
FAIL
> block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00),
> len 512 (0x200).
<< OK

The output from xfs_repair is below.

kernel: 5.16.12-200.fc35.x86_64
nbd: nbd-3.24-1.fc37.x86_64
mkfs options: mkfs.xfs /dev/nbd0 -m bigtime=1 -m finobt=1 -m rmapbt=1
mount options: mount -t xfs -o uqnoenforce /dev/nbd0 $tmpmnt

Generator script:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/generator/maketr

Further log files are also on github:
https://github.com/manfred-colorfu/nbd-datalog-referencefiles/tree/main/xfs-02/result

<<<
/dev/loop0: [0037]:17060 (/tmp/data-341131.img)
Phase 1 - find and verify superblock...
        - block cache size set to 759616 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 734 tail block 734
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
Metadata CRC error detected at 0x563aa27804c3, xfs_dir3_block block 0x86f58/0x1000
corrupt block 0 in directory inode 551205
        would junk block
no . entry for directory 551205
no .. entry for directory 551205
problem with directory contents in inode 551205
would have cleared inode 551205
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 0
corrupt block 0 in directory inode 551205
        would junk block
no . entry for directory 551205
no ..
entry for directory 551205
problem with directory contents in inode 551205
would have cleared inode 551205
entry "COREL" in shortform directory 789069 references free inode 551205
would have junked entry "COREL" in directory inode 789069
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "COREL" in shortform directory inode 789069 points to free inode 551205
would junk entry
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 551174, would move to lost+found
disconnected inode 551176, would move to lost+found
disconnected inode 551178, would move to lost+found
disconnected inode 551180, would move to lost+found
disconnected inode 551206, would move to lost+found
disconnected inode 551207, would move to lost+found
disconnected inode 551208, would move to lost+found
disconnected inode 551209, would move to lost+found
disconnected inode 551210, would move to lost+found
disconnected inode 551211, would move to lost+found
disconnected inode 551212, would move to lost+found
disconnected inode 551213, would move to lost+found
disconnected inode 551214, would move to lost+found
disconnected inode 551215, would move to lost+found
disconnected inode 551217, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 789069 nlinks from 11 to 10
No modify flag set, skipping filesystem flush and exiting.
<<<<

^ permalink raw reply	[flat|nested] 11+ messages in thread
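[Editor's note] The replay approach described in this report can be sketched in a few lines. This is a hypothetical illustration, not the actual nbd datalog tooling: the record format (a list of (offset, data) pairs) stands in for whatever nbd-3.24 actually logs.

```python
# Hypothetical sketch of the replay idea: apply the first n recorded
# 512-byte sector writes to an image, then stop, modelling a power
# failure at an arbitrary point in the write stream.  The (offset,
# data) record format is illustrative, not the real nbd datalog format.

SECTOR = 512

def replay(image, writes, n):
    """Apply the first n (offset, data) sector writes to a bytearray image."""
    for offset, data in writes[:n]:
        assert len(data) == SECTOR and offset % SECTOR == 0
        image[offset:offset + SECTOR] = data
    return image
```

Replaying n and then n+1 sectors and mounting the resulting images is what bisects the failure down to a single sector write, as in the block numbers quoted in the report.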
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-13 15:47 Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 Manfred Spraul
@ 2022-03-13 22:46 ` Dave Chinner
  2022-03-14 15:18   ` Manfred Spraul
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2022-03-13 22:46 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
> Hello together,
>
> after a simulated power failure, I have observed:
>
> >>>
> Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs],
> xfs_dir3_block block 0x86f58
> [14768.047531] XFS (loop0): Unmount and run xfs_repair
> [14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX

For future reference, please paste the entire log message, from the time
that the fs was mounted to the end of the hexdump output. You might not
think the hexdump output is important, but as you'll see later....

> <<<
>
> Is this a known issue?

Is what a known issue? All this is XFS finding a corrupt metadata block
because a CRC is invalid, which is exactly what it's supposed to do.

As it is, CRC errors are indicative of storage problems such as bit errors
and torn writes, because what has been read from disk does not match what
XFS wrote when it calculated the CRC.

> The image file is here: https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
>
> As first question:
>
> Are 512 byte sectors supported, or does xfs assume that 4096 byte writes are
> atomic?

512 byte *IO* is supported on devices that have 512 byte sector support,
but there are other rules that XFS sets for metadata. e.g.
that metadata writes are expected to be written completely or replayed
completely as a whole unit regardless of their length. This is bookended by
the use of cache flushes and FUAs to ensure that multi-sector writes are
wholly completed before the recovery information is tossed away.

If a cache flush has not been issued, then the metadata block recovery
information is whole in the journal, and so if we crash or lose power then
journal recovery replays the changes and overwrites whatever is on the disk
with the correct, consistent metadata.

Log recovery will also issue large writes and cache flushes will occur as
part of the process so that the recovered metadata is whole on stable
storage before it is removed from the journal.

IOWs, if the storage ends up doing a partial write as a result of a power
failure, log recovery fixes that up if it is still in the journal. If it is
not in the journal then a cache flush *must* have happened, and hence the
metadata is complete on disk.

So....

> How were the power failures simulated:
>
> I added support to nbd to log all write operations, including the written
> data. This got merged into nbd-3.24
>
> I've used that to create a log of running dbench (+ a few tar/rm/manual
> tests) on a 500 MB image file.
>
> In total, 2.9 mio 512-byte sector writes. The datalog is ~1.5 GB long.
>
> If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802
> blocks, the above listed error message is shown.
>
> After 1,821,799 or 1,821,803 sectors, everything is ok.
>
> (block numbers are 0-based)
>
> > H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
> > O=0000000010deb000 L=00001000
> > block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len
> > 512 (0x200).
> > block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len
> > 512 (0x200).
> > block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len
> > 512 (0x200).
<< OK
> > block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len
> > 512 (0x200).
FAIL
> > block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len
> > 512 (0x200).
FAIL
> > block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len
> > 512 (0x200).
FAIL
> > block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len
> > 512 (0x200).
FAIL
> > block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len
> > 512 (0x200).
<< OK

OK, this test is explicitly tearing writes at the storage level. When there
is an update to multiple sectors of the metadata block, the metadata will
be inconsistent on disk while those individual sector writes are replayed.

For example, the problem here is likely the LSN that this write stamps into
the header along with the updated CRC. Log recovery doesn't actually check
the incoming CRC because it might be invalid (say, due to a torn write) but
it does check the magic number and then the LSN that is stamped into the
metadata block to determine if it should be replayed or not (i.e. we have
metadata version checks in recovery).

If the LSN that is stamped into the header is more recent than the object
version that log recovery is trying to replay, it will skip replay because
that can result in unnecessary transient corruption of the metadata on disk
that doesn't get corrected until later in the recovery process. This is
bad - if log recovery then fails before we recover the more recent changes,
we've created new on-disk corruption and made things worse, not better....
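[Editor's note] The recovery decision described above can be reduced to an ordering rule. A minimal sketch (not the kernel implementation; the function and tuple representation are illustrative): an LSN orders as (cycle, block), and a logged change is replayed only if it is strictly newer than the LSN already stamped in the on-disk header.

```python
# Toy model of the LSN comparison described above; the real logic lives
# in the kernel's log recovery code, this only shows the ordering.

def should_replay(disk_lsn, checkpoint_lsn):
    """LSNs are (cycle, block) tuples; tuple comparison orders by cycle
    first, then block.  Replay only if the on-disk metadata is strictly
    older than the checkpoint being recovered."""
    return disk_lsn < checkpoint_lsn
```

With the values that appear later in this mail, a block stamped (0xe, 0x5829) is replayed by a checkpoint at (0xf, 0x25c), but a block already stamped (0xf, 0x25c) is skipped.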
So, let's find the log recovery lsn (same in all images) via logprint - it's
last logged as part of this transaction:

LOG REC AT LSN cycle 15 block 604 (0xf, 0x25c)
============================================================================
TRANS: tid:0x6d1b8e4f #items:201 trans:0x6d1b8e4f q:0x5608eeb23bd0

And there are 3 data regions in it:

BUF: cnt:4 total:4 a:0x5608eeb23f60 len:24 a:0x5608eeb20f70 len:128 a:0x5608eeb23970 len:384 a:0x5608eeb22130 len:256
BUF: #regs:4 start blkno:0x86f58 len:8 bmap size:1 flags:0x5000
BUF DATA
BUF DATA
BUF DATA

The three regions are 128 bytes, 384 bytes and 256 bytes long. The first
chunk is clearly the first 128 bytes of the sector:

40 69 4123c 85000 86f58 0 1 c000001d 4f8e1b6d
        buf item        daddr                TID
48 80000000 69 33424458 b21e33d9 0 586f0800 e000000 29580000
   ophdr flags ID  XDB3     CRC      daddr   CYCLE    BLCK
50 23355a53 f14c2c57 b07cac8d b7eca938 0 25690800 400d2802 30006801
58 3000c801 0 0 25690800 22e01 40000000 0 4d0a0c00
60 22e2e02 50000000 0 26690800 5244430b 534c4f52 4746432e 60000001
68 0 27690800 4f8e1b6d

So, when this item was logged, the LSN in the in-memory buffer was
(0xe,0x5829), and it is being replayed at (0xf,0x25c). That's good, it
indicates what is in the journal is valid. But what is in the block on disk?

xfs_db> daddr 0x86f58
xfs_db> p
000: 58444233 9fabd7f4 00000000 00086f58 0000000f 0000025c 535a3523 572c4cf1
     magic    crc      daddr             block    cycle    ....

Oh, I didn't need to get it off disk like this - it's in the second line of
the hexdump output in the corruption reports:

[15063.024355] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
[15063.024466] XFS (loop0): Unmount and run xfs_repair
[15063.024468] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[15063.024471] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
[15063.024474] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1  .......\SZ5#W,L.
               CYCLE       BLOCK

Yup, there we go. The LSN is (0xf,0x25c), which tells log recovery not to
recover it because it's the same LSN as the last journal checkpoint that
recorded changes to the block.

So the write that the test is tearing up is the in-place metadata
overwrite, so it's creating physical metadata corruption in the storage.
That corruption persists until all the sectors in the metadata block have
been updated, at which point your test failures go away again.

Hence to answer your original question: Yes, XFS is behaving exactly as it
was designed to behave. The metadata verifiers have correctly detected
corruption that has resulted from the storage tearing all its writes to
little pieces and that journal recovery couldn't automatically repair after
the fact. We failed to repair it automatically because the nature of the
torn write told log recovery "don't recover this metadata from the journal
because it is already up to date". Instead, the problem was detected on
first access to the torn up metadata, and by xfs_repair.

IOWs, there are no problems with XFS here. If there is any issue at all, it
is with the assumption that filesystems can always cleanly recover from
massively (or randomly) torn writes. The fact is that they can't, and
that's why we have things like CRCs and self-describing metadata to detect
when unexpected or unrecoverable torn or misplaced writes occur deep down
in the storage layers...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
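[Editor's note] The cycle/block values read off the hexdump above can also be decoded mechanically. A sketch, assuming only what the annotated dumps show: a v5 directory block header stores a big-endian 64-bit LSN at byte offset 0x10, high 32 bits cycle, low 32 bits block.

```python
import struct

def lsn_from_header(buf):
    """Extract (cycle, block) from the big-endian 64-bit LSN at offset
    0x10 of a v5 metadata buffer, per the annotated hexdump above."""
    (lsn,) = struct.unpack_from(">Q", buf, 0x10)
    return lsn >> 32, lsn & 0xFFFFFFFF

# First 0x18 bytes of the corrupted buffer from the dmesg output above:
hdr = bytes.fromhex(
    "584442339fabd7f40000000000086f58"   # "XDB3", crc, daddr
    "0000000f0000025c"                   # LSN: cycle 0xf, block 0x25c
)
print(lsn_from_header(hdr))  # (15, 604), i.e. (0xf, 0x25c)
```

This matches the logprint output above: "LOG REC AT LSN cycle 15 block 604 (0xf, 0x25c)".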
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-13 22:46 ` Dave Chinner
@ 2022-03-14 15:18   ` Manfred Spraul
  2022-03-16  8:55     ` Manfred Spraul
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-14 15:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

Hi Dave,

On 3/13/22 23:46, Dave Chinner wrote:
> On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
>> Hello together,
>>
>> after a simulated power failure, I have observed:
>>
>> Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs],
>> xfs_dir3_block block 0x86f58
>> [14768.047531] XFS (loop0): Unmount and run xfs_repair
>> [14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
>> [14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58
>> XDB3..........oX
> For future reference, please paste the entire log message, from
> the time that the fs was mounted to the end of the hexdump output.
> You might not think the hexdump output is important, but as you'll
> see later....

Noted. I had to choose what to include in the mail; there was too much
information.

>> <<<
>>
>> Is this a known issue?
> Is what a known issue? All this is XFS finding a corrupt metadata
> block because a CRC is invalid, which is exactly what it's supposed
> to do.
>
> As it is, CRC errors are indicative of storage problem such as bit
> errors and torn writes, because what has been read from disk does
> not match what XFS wrote when it calculated the CRC.
>
>> The image file is here: https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
>>
>> As first question:
>>
>> Are 512 byte sectors supported, or does xfs assume that 4096 byte writes are
>> atomic?
> 512 byte *IO* is supported on devices that have 512 byte sector
> support, but there are other rules that XFS sets for metadata. e.g.
> that metadata writes are expected to be written completely or > replayed completely as a whole unit regardless of their length. > This > is bookended by the use of cache flushes and FUAs to ensure that > multi-sector writes are wholly completed before the recovery > information is tossed away. [...] >> How were the power failures simulated: >> >> I added support to nbd to log all write operations, including the written >> data. This got merged into nbd-3.24 >> >> I've used that to create a log of running dbench (+ a few tar/rm/manual >> tests) on a 500 MB image file. >> >> In total, 2.9 mio 512-byte sector writes. The datalog is ~1.5 GB long. >> >> If replaying the initial 1,821,799, 1,821,800, 1,821,801 or 1,821,802 >> blocks, the above listed error message is shown. >> >> After 1,821,799 or 1,821,803 sectors, everything is ok. (Correcting my own typo:) 1,821,798 or 1,821,803 are ok. >> >> (block numbers are 0-based) >> >>>> H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE) >>> O=0000000010deb000 L=00001000 >>> block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len >>> 512 (0x200). >>> block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len >>> 512 (0x200). >>> block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len >>> 512 (0x200). << OK >>> block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len >>> 512 (0x200). FAIL >>> block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len >>> 512 (0x200). FAIL >>> block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len >>> 512 (0x200). FAIL >>> block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len >>> 512 (0x200). FAIL >>> block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len >>> 512 (0x200). << OK > OK, this test is explicitly tearing writes at the storage level. 
> When there is an update to multiple sectors of the metadata block,
> the metadata will be inconsistent on disk while those individual
> sector writes are replayed.

Thanks for the clarification.

I'll modify the test application to never tear write operations, and retry.

If there are any findings, I'll share them.

--
Manfred

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
  2022-03-14 15:18 ` Manfred Spraul
@ 2022-03-16  8:55   ` Manfred Spraul
  2022-03-17  2:47     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Manfred Spraul @ 2022-03-16 8:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT)

Hi Dave,

On 3/14/22 16:18, Manfred Spraul wrote:
> Hi Dave,
>
> On 3/13/22 23:46, Dave Chinner wrote:
>> OK, this test is explicitly tearing writes at the storage level.
>> When there is an update to multiple sectors of the metadata block,
>> the metadata will be inconsistent on disk while those individual
>> sector writes are replayed.
>
> Thanks for the clarification.
>
> I'll modify the test application to never tear write operations and
> retry.
>
> If there are findings, then I'll distribute them.
>

I've modified the test app, and with 4000 simulated power failures I have
not seen any corruptions.

Thus:

- With torn write operations: 2 corruptions from ~800 simulated power
  failures

- Without torn write operations: no corruptions from ~4000 simulated power
  failures.

But: I've checked the eMMC specification, and the spec allows torn writes
to happen. JESD84-B51A, chapter 6.6.8.1:

> All of the sectors being modified by the write operation that was
> interrupted may be in one of the following states: all sectors contain new
> data, all sectors contain old data or some sectors contain new data and
> some sectors contain old data.

"some sectors contain new data and some sectors contain old data".
NVMe also appears to allow tearing for writes larger than a certain size
(the size is 2 kB in the example in the spec, and one observed corruption
happened when tearing a 20 kB write that crosses a 32 kB boundary).
NVMe-NVM-Command-Set-Specification-1.0a-2021.07.26-Ratified, chapter
2.1.4.2 (AWUPF/NAWUPF):

> If a write command is submitted with size greater than the
> AWUPF/NAWUPF value or crosses an atomic
> boundary, then there is no guarantee of the data returned on
> subsequent reads of the associated logical
> blocks.

Is my understanding correct that XFS supports neither eMMC nor NVMe
devices? (unless there is a battery backup that exceeds the guarantees
from the spec)

--
Manfred

^ permalink raw reply	[flat|nested] 11+ messages in thread
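[Editor's note] The quoted NVMe rule can be expressed directly. A sketch with illustrative names and parameter values (nothing here is from the spec text itself): a write has no atomicity guarantee if its length exceeds the atomic write unit, or if it crosses an atomic boundary.

```python
def may_tear(start_lba, nlb, atomic_units, boundary_units):
    """True if the spec gives no atomicity guarantee for this write.
    All sizes are in logical blocks; boundary_units == 0 means the
    device reports no atomic boundary."""
    if nlb > atomic_units:
        return True              # larger than the atomic write unit
    if boundary_units:
        first = start_lba // boundary_units
        last = (start_lba + nlb - 1) // boundary_units
        if first != last:
            return True          # crosses an atomic boundary
    return False

# With 512-byte logical blocks: a 20 kB write (40 blocks) that crosses
# a 32 kB boundary (64 blocks) has no atomicity guarantee, matching the
# observed corruption described above.
```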
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-16 8:55 ` Manfred Spraul @ 2022-03-17 2:47 ` Dave Chinner 2022-03-17 3:08 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 2:47 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > Hi Dave, > > On 3/14/22 16:18, Manfred Spraul wrote: > > Hi Dave, > > > > On 3/13/22 23:46, Dave Chinner wrote: > > > OK, this test is explicitly tearing writes at the storage level. > > > When there is an update to multiple sectors of the metadata block, > > > the metadata will be inconsistent on disk while those individual > > > sector writes are replayed. > > > > Thanks for the clarification. > > > > I'll modify the test application to never tear write operations and > > retry. > > > > If there are findings, then I'll distribute them. > > > I've modified the test app, and with 4000 simulated power failures I have > not seen any corruptions. > > > Thus: > > - With teared write operations: 2 corruptions from ~800 simulated power > failures > > - Without teared write operations: no corruptions from ~4000 simulated power > failures. Good to hear. > But: > > I've checked the eMMC specification, and the spec allows that teared write > happen: Yes, most storage only guarantees that sector writes are atomic and so multi-sector writes have no guarantees of being written atomically. IOWs, all storage technologies that currently exist are allowed to tear multi-sector writes. However, FUA writes are guaranteed to be whole on persistent storage regardless of size when the hardware signals completion. And any write that the hardware has signalled as complete before a cache flush is received is also guaranteed to be whole on persistent storage when the cache flush is signalled as complete by the hardware. 
These mechanisms provide protection against torn writes.

IOWs, it's up to filesystems to guarantee data is on stable storage before
they trust it fully. Filesystems are pretty good at using REQ_FLUSH,
REQ_FUA and write completion ordering to ensure that anything they need
whole and complete on stable storage is actually whole and complete.

In the cases where torn writes occur that haven't been covered by a FUA or
cache flush guarantee (such as your test), filesystems need mechanisms in
their metadata to detect such events. CRCs are the prime mechanism for
this - that's what XFS uses, and it was XFS reporting a CRC failure when
reading torn metadata that started this whole thread.

> Is my understanding correct that XFS support neither eMMC nor NVM devices?
> (unless there is a battery backup that exceeds the guarantees from the spec)

Incorrect.

They are supported just fine because flush/FUA semantics provide guarantees
against torn writes in normal operation. IOWs, torn writes are something
that almost *never* happen in real life, even when power fails suddenly.
Despite this, XFS can detect that one has occurred (because broken storage
is all too common!), and if it can't recover automatically, it will shut
down and ask the user to correct the problem.

BTRFS and ZFS can also detect torn writes, and if you use the (non-default)
ext4 option "metadata_csum" it will also detect torn writes to metadata via
CRC failures. There are other filesystems that can detect and correct torn
writes, too.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread
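[Editor's note] The flush/FUA guarantees described above can be captured in a toy device model. This is entirely illustrative (the class and its behaviour are a sketch of the semantics, not any real driver): plain writes sit in a volatile cache and may be torn per-sector at power loss, while FUA writes, and anything acknowledged before a completed FLUSH, are whole on stable media.

```python
SECTOR = 512

class SimDisk:
    """Toy write-back cache model of flush/FUA semantics."""

    def __init__(self, size):
        self.stable = bytearray(size)   # survives power loss
        self.cache = {}                 # sector -> data, volatile

    def write(self, offset, data, fua=False):
        assert offset % SECTOR == 0 and len(data) % SECTOR == 0
        for i in range(0, len(data), SECTOR):
            sec = (offset + i) // SECTOR
            chunk = data[i:i + SECTOR]
            if fua:
                # FUA: durable on completion, regardless of size.
                self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
            else:
                self.cache[sec] = chunk

    def flush(self):
        # A completed FLUSH makes every previously acknowledged write durable.
        for sec, chunk in self.cache.items():
            self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
        self.cache.clear()

    def power_fail(self, persists):
        # An arbitrary subset of cached sectors reaches the media:
        # exactly the per-sector tearing the eMMC/NVMe specs permit.
        for sec, chunk in self.cache.items():
            if persists(sec):
                self.stable[sec * SECTOR:(sec + 1) * SECTOR] = chunk
        self.cache.clear()
```

A journalling filesystem run on such a model only trusts data it has covered with a FUA write or a completed flush; everything else must be recoverable from the journal or detectable via CRCs, which is the behaviour this thread demonstrates.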
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 2:47 ` Dave Chinner @ 2022-03-17 3:08 ` Dave Chinner 2022-03-17 6:49 ` Manfred Spraul 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 3:08 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: > On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > > Hi Dave, > > > > On 3/14/22 16:18, Manfred Spraul wrote: > > > Hi Dave, > > > > > > On 3/13/22 23:46, Dave Chinner wrote: > > > > OK, this test is explicitly tearing writes at the storage level. > > > > When there is an update to multiple sectors of the metadata block, > > > > the metadata will be inconsistent on disk while those individual > > > > sector writes are replayed. > > > > > > Thanks for the clarification. > > > > > > I'll modify the test application to never tear write operations and > > > retry. > > > > > > If there are findings, then I'll distribute them. > > > > > I've modified the test app, and with 4000 simulated power failures I have > > not seen any corruptions. > > > > > > Thus: > > > > - With teared write operations: 2 corruptions from ~800 simulated power > > failures > > > > - Without teared write operations: no corruptions from ~4000 simulated power > > failures. > > Good to hear. > > > But: > > > > I've checked the eMMC specification, and the spec allows that teared write > > happen: > > Yes, most storage only guarantees that sector writes are atomic and > so multi-sector writes have no guarantees of being written > atomically. IOWs, all storage technologies that currently exist are > allowed to tear multi-sector writes. > > However, FUA writes are guaranteed to be whole on persistent storage > regardless of size when the hardware signals completion. 
And any
> write that the hardware has signalled as complete before a cache
> flush is received is also guaranteed to be whole on persistent
> storage when the cache flush is signalled as complete by the
> hardware. These mechanisms provide protection against torn writes.
>
> IOWs, it's up to filesystems to guarantee data is on stable storage
> before they trust it fully. Filesystems are pretty good at using
> REQ_FLUSH, REQ_FUA and write completion ordering to ensure that
> anything they need whole and complete on stable storage is actually
> whole and complete.
>
> In the cases where torn writes occur because that haven't been
> covered by a FUA or cache flush guarantee (such as your test),
> filesystems need mechanisms in their metadata to detect such events.
> CRCs are the prime mechanism for this - that's what XFS uses, and it
> was XFS reporting a CRC failure when reading torn metadata that
> started this whole thread.
>
> > Is my understanding correct that XFS support neither eMMC nor NVM devices?
> > (unless there is a battery backup that exceeds the guarantees from the spec)
>
> Incorrect.
>
> They are supported just fine because flush/FUA semantics provide
> guarantees against torn writes in normal operation. IOWs, torn
> writes are something that almost *never* happen in real life, even
> when power fails suddenly. Despite this, XFS can detect it has
> occurred (because broken storage is all too common!), and if it
> can't recovery automatically, it will shut down and ask the user to
> correct the problem.
>
> BTRFS and ZFS can also detect torn writes, and if you use the
> (non-default) ext4 option "metadata_csum" it will also detect torn

Correction - metadata_csum is enabled by default, I just ran the wrong mkfs
command when I tested it a few moments ago.

-Dave.

> writes to metadata via CRC failures. There are other filesystems
> that can detect and correct torn writes, too.
>
> Cheers,
>
> Dave.
> -- > Dave Chinner > david@fromorbit.com > -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 3:08 ` Dave Chinner @ 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner 2022-03-17 14:50 ` Theodore Ts'o 0 siblings, 2 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 6:49 UTC (permalink / raw) To: Dave Chinner, Theodore Ts'o; +Cc: linux-xfs, Spraul Manfred (XC/QMM21-CT) [-- Attachment #1: Type: text/plain, Size: 2972 bytes --] Hi Dave, [+Ted as the topic also applies to ext4] On 3/17/22 04:08, Dave Chinner wrote: > On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: >> On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: >>> Hi Dave, >>> >>> On 3/14/22 16:18, Manfred Spraul wrote: >>> >>> But: >>> >>> I've checked the eMMC specification, and the spec allows that teared write >>> happen: >> Yes, most storage only guarantees that sector writes are atomic and >> so multi-sector writes have no guarantees of being written >> atomically. IOWs, all storage technologies that currently exist are >> allowed to tear multi-sector writes. >> >> However, FUA writes are guaranteed to be whole on persistent storage >> regardless of size when the hardware signals completion. And any >> write that the hardware has signalled as complete before a cache >> flush is received is also guaranteed to be whole on persistent >> storage when the cache flush is signalled as complete by the >> hardware. These mechanisms provide protection against torn writes. My plan was to create a replay application that randomly creates disc images allowed by the writeback_cache_control documentation. https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html And then check that the filesystem behaves as expected/defined. The first step was: Implement the framework and just stop at a random location. >>> Is my understanding correct that XFS support neither eMMC nor NVM devices? 
>>> (unless there is a battery backup that exceeds the guarantees from the spec)
>> Incorrect.
>>
>> They are supported just fine because flush/FUA semantics provide
>> guarantees against torn writes in normal operation. IOWs, torn
>> writes are something that almost *never* happen in real life, even
>> when power fails suddenly. Despite this, XFS can detect it has
>> occurred (because broken storage is all too common!), and if it
>> can't recovery automatically, it will shut down and ask the user to
>> correct the problem.

So for xfs the behavior should be:

- without torn writes: mount is always successful, and there are no errors
  when accessing the content.

- with torn writes: there may be errors that are detected only at runtime;
  these errors may eventually cause a filesystem shutdown. (commented dmesg
  is attached)

The applications I have in mind are embedded systems, i.e. there is no user
who can correct anything; the recovery strategy must be part of the design.

>> BTRFS and ZFS can also detect torn writes, and if you use the
>> (non-default) ext4 option "metadata_csum" it will also detect torn
> Correction - metadata_csum is ienabled by default, I just ran the
> wrong mkfs command when I tested it a few moments ago.

For ext4, I have seen so far only corrupted commit blocks that cause mount
failures.
https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/

But Ted hasn't yet confirmed that this is by design :-)

--
Manfred

[-- Attachment #2: dmesg-final.txt --]
[-- Type: text/plain, Size: 10860 bytes --]

1) setup

[ 1591.878832] loop0: detected capacity change from 0 to 1024000

For info: md5sum of the image file:
b7103b519ada7dc5281d7a42c29a4271  /tmp/mount_img-2536.img

2) mount.
Command: mount -t auto /dev/loop0 x [ 1591.911516] XFS (loop0): Mounting V5 Filesystem [ 1591.945058] XFS (loop0): Starting recovery (logdev: internal) [ 1591.949055] XFS (loop0): resetting quota flags [ 1591.949590] XFS (loop0): Ending recovery (logdev: internal) Especially: Corruption not noticed at mount time. 3) find x -type f [ 1741.033535] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1741.033693] XFS (loop0): Unmount and run xfs_repair [ 1741.033696] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1741.033700] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1741.033704] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1741.033706] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1741.033708] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1741.033711] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1741.033713] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1741.033715] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1741.033717] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1741.033761] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1741.033886] XFS (loop0): Unmount and run xfs_repair [ 1741.033889] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1741.033893] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1741.033896] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1741.033899] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1741.033901] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1741.033903] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1741.033905] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1741.033907] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1741.033909] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1741.033920] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 --> corruption noticed at run time. FS tries to continue. 4) manual playing around in the filesystem: rm -Rf mv a ../b [ 1824.642762] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643000] XFS (loop0): Unmount and run xfs_repair [ 1824.643006] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643014] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643020] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.643025] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643030] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1824.643035] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643040] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643044] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643049] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643145] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643361] XFS (loop0): Unmount and run xfs_repair [ 1824.643366] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643371] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643377] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.643381] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643386] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1824.643390] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643395] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643399] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643403] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643433] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1824.643749] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.643880] XFS (loop0): Unmount and run xfs_repair [ 1824.643883] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.643887] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.643891] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. 
[ 1824.643893] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.643896] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1824.643898] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.643900] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.643903] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.643905] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.643948] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1824.644074] XFS (loop0): Unmount and run xfs_repair [ 1824.644076] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1824.644079] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1824.644081] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1824.644084] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1824.644086] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... 
[ 1824.644088] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1824.644091] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1824.644093] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1824.644095] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1824.644107] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1838.578296] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1838.578452] XFS (loop0): Unmount and run xfs_repair [ 1838.578456] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1838.578460] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1838.578464] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1838.578467] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1838.578470] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1838.578472] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1838.578475] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1838.578477] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1838.578479] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1838.578529] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1838.578699] XFS (loop0): Unmount and run xfs_repair [ 1838.578703] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1838.578707] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1838.578711] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. 
[ 1838.578715] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1838.578719] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1838.578724] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1838.578728] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1838.578732] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1838.578736] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1838.578785] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1876.302019] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 [ 1876.302177] XFS (loop0): Unmount and run xfs_repair [ 1876.302180] XFS (loop0): First 128 bytes of corrupted metadata buffer: [ 1876.302184] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58 XDB3..........oX [ 1876.302188] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1 .......\SZ5#W,L. [ 1876.302191] 00000020: 8d ac 7c b0 38 a9 ec b7 00 00 00 00 00 08 69 25 ..|.8.........i% [ 1876.302194] 00000030: 02 28 0d 40 01 68 00 30 01 c8 00 30 00 00 00 00 .(.@.h.0...0.... [ 1876.302196] 00000040: 00 00 00 00 00 08 69 25 01 2e 02 00 00 00 00 40 ......i%.......@ [ 1876.302199] 00000050: 00 00 00 00 00 0c 0a 4d 02 2e 2e 02 00 00 00 50 .......M.......P [ 1876.302201] 00000060: 00 00 00 00 00 08 69 26 0b 43 44 52 52 4f 4c 53 ......i&.CDRROLS [ 1876.302204] 00000070: 2e 43 46 47 01 00 00 60 00 00 00 00 00 08 69 27 .CFG...`......i' [ 1876.302221] XFS (loop0): metadata I/O error in "xfs_da_read_buf+0xb1/0x110 [xfs]" at daddr 0x86f58 len 8 error 74 [ 1876.302656] XFS (loop0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x12f/0x2b0 [xfs] (fs/xfs/xfs_trans_buf.c:296). Shutting down filesystem. 
[ 1876.302912] XFS (loop0): Please unmount the filesystem and rectify the problem(s) -> Now I broke it :-( 4) umount [ 1931.725585] XFS (loop0): Unmounting Filesystem 5) run xfs_repair. This fails due to a log that first needs to be replayed 6) mount -t auto /dev/loop0 x [ 2041.247530] XFS (loop0): Mounting V5 Filesystem [ 2041.251838] XFS (loop0): Starting recovery (logdev: internal) [ 2041.252877] XFS (loop0): Ending recovery (logdev: internal) 7) umount [ 2047.218062] XFS (loop0): Unmounting Filesystem 8) run xfs_repair This is successful. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 6:49 ` Manfred Spraul @ 2022-03-17 8:24 ` Dave Chinner 2022-03-17 16:09 ` Manfred Spraul 2022-03-17 14:50 ` Theodore Ts'o 1 sibling, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-03-17 8:24 UTC (permalink / raw) To: Manfred Spraul; +Cc: Theodore Ts'o, linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: > Hi Dave, > > [+Ted as the topic also applies to ext4] > > On 3/17/22 04:08, Dave Chinner wrote: > > On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: > > > On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: > > > > Hi Dave, > > > > > > > > On 3/14/22 16:18, Manfred Spraul wrote: > > > > > > > > But: > > > > > > > > I've checked the eMMC specification, and the spec allows that teared write > > > > happen: > > > Yes, most storage only guarantees that sector writes are atomic and > > > so multi-sector writes have no guarantees of being written > > > atomically. IOWs, all storage technologies that currently exist are > > > allowed to tear multi-sector writes. > > > > > > However, FUA writes are guaranteed to be whole on persistent storage > > > regardless of size when the hardware signals completion. And any > > > write that the hardware has signalled as complete before a cache > > > flush is received is also guaranteed to be whole on persistent > > > storage when the cache flush is signalled as complete by the > > > hardware. These mechanisms provide protection against torn writes. > > My plan was to create a replay application that randomly creates disc images > allowed by the writeback_cache_control documentation. > > https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html > > And then check that the filesystem behaves as expected/defined. 
We already have that tool that exercises stepwise flush/fua aware write recovery for filesystem testing: dm-logwrites was written and integrated into fstests years ago (2016?) by Josef Bacik for testing btrfs recovery, but it was a generic solution that all filesystems can use to test failure recovery.... See, for example, common/dmlogwrites and tests/generic/482 - g/482 uses fsstress to randomly modify the filesystem while dm-logwrites records all the writes made by the filesystem. It then replays them one flush/fua at a time, mounting the filesystem to ensure that it can recover the filesystem, then runs filesystem checkers to ensure that the filesystem does not have any corrupt metadata. Then it replays to the next flush/fua and repeats. tools/dm-logwrite-replay provides a script and documents the methodology to run step by step through replay of g/482 failures to be able to reliably reproduce and diagnose the cause of the failure. There's no need to re-invent the wheel if we've already got a perfectly good one... > > > > Is my understanding correct that XFS support neither eMMC nor NVM devices? > > > > (unless there is a battery backup that exceeds the guarantees from the spec) > > > Incorrect. > > > > > > They are supported just fine because flush/FUA semantics provide > > > guarantees against torn writes in normal operation. IOWs, torn > > > writes are something that almost *never* happen in real life, even > > > when power fails suddenly. Despite this, XFS can detect it has > > > occurred (because broken storage is all too common!), and if it > > > can't recovery automatically, it will shut down and ask the user to > > > correct the problem. > > So for xfs the behavior should be: > > - without torn writes: Mount always successful, no errors when accessing the > content. Yes. Of course, there are software bugs, so mounts, recovery and subsequent repair testing can still fail. > - with torn writes: There may be error that will be detected only at > runtime. 
The errors may at the end cause a file system shutdown. Yes, and they may even prevent the filesystem from being mounted because recovery trips over them (e.g. processing pending unlinked inodes or replaying incomplete intents). > (commented dmesg is attached) > > The application I have in mind are embedded systems. > I.e. there is no user that can correct something, the recovery strategy must > be included in the design. Good luck with that - storage hardware fails in ways that no existing filesystem can recover automatically from 100% of the time. And very few even attempt to do so because it is largely an impossible requirement to fulfil. Torn writes are just the tip of the iceberg.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
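The dm-logwrites workflow Dave describes above can be sketched roughly as follows. Device names (/dev/sdb as the device under test, /dev/sdc as the log device) and mark names are placeholders, and the exact `replay-log` options should be checked against the fstests sources and the kernel's dm-log-writes documentation:

```shell
# Stack a log-writes target on the device under test: every write to
# /dev/mapper/log is recorded (with FLUSH/FUA markers) on the log device.
dmsetup create log --table "0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"

# Run the workload against the logged device, dropping named marks at
# interesting points so a later replay can stop there.
mkfs.xfs /dev/mapper/log
dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt && fsstress -d /mnt/test -n 1000; umount /mnt
dmsetup message log 0 mark test

dmsetup remove log

# Replay the recorded writes onto the raw device up to a mark (the fstests
# harness instead steps one FLUSH/FUA point at a time), then attempt
# mount/recovery and run the filesystem checker on the result.
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
```

This is essentially what generic/482 automates: replay to each flush/fua point, mount, unmount, and fsck in a loop.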
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 8:24 ` Dave Chinner @ 2022-03-17 16:09 ` Manfred Spraul 0 siblings, 0 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 16:09 UTC (permalink / raw) To: Dave Chinner; +Cc: Theodore Ts'o, linux-xfs, Spraul Manfred (XC/QMM21-CT) Hi Dave, On 3/17/22 09:24, Dave Chinner wrote: > On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: >> Hi Dave, >> >> [+Ted as the topic also applies to ext4] >> >> On 3/17/22 04:08, Dave Chinner wrote: >>> On Thu, Mar 17, 2022 at 01:47:05PM +1100, Dave Chinner wrote: >>>> On Wed, Mar 16, 2022 at 09:55:04AM +0100, Manfred Spraul wrote: >>>>> Hi Dave, >>>>> >>>>> On 3/14/22 16:18, Manfred Spraul wrote: >>>>> >>>>> But: >>>>> >>>>> I've checked the eMMC specification, and the spec allows that teared write >>>>> happen: >>>> Yes, most storage only guarantees that sector writes are atomic and >>>> so multi-sector writes have no guarantees of being written >>>> atomically. IOWs, all storage technologies that currently exist are >>>> allowed to tear multi-sector writes. >>>> >>>> However, FUA writes are guaranteed to be whole on persistent storage >>>> regardless of size when the hardware signals completion. And any >>>> write that the hardware has signalled as complete before a cache >>>> flush is received is also guaranteed to be whole on persistent >>>> storage when the cache flush is signalled as complete by the >>>> hardware. These mechanisms provide protection against torn writes. >> My plan was to create a replay application that randomly creates disc images >> allowed by the writeback_cache_control documentation. >> >> https://www.kernel.org/doc/html/latest/block/writeback_cache_control.html >> >> And then check that the filesystem behaves as expected/defined. 
> We already have that tool that exercises stepwise flush/fua aware > write recovery for filesystem testing: dm-logwrites was written and > integrated into fstests years ago (2016?) by Josef Bacik for testing > btrfs recovery, but it was a generic solution that all filesystems > can use to test failure recovery.... > > See, for example, common/dmlogwrites and tests/generic/482 - g/482 > uses fsstress to randomly modify the filesystem while dm-logwrites > records all the writes made by the filesystem. It then replays them > one flush/fua at a time, mounting the filesystem to ensure that it > can recover the filesystem, then runs filesystem checkers to ensure > that the filesystem does not have any corrupt metadata. Then it > replays to the next flush/fua and repeats. > > tools/dm-logwrite-replay provides a script and documents the > methodology to run step by step through replay of g/482 failures to > be able to reliably reproduce and diagnose the cause of the failure. > > There's no need to re-invent the wheel if we've already got a > perfectly good one... Thanks a lot for the hint! I was wondering whether a replay tool might exist and came up with nbd. The feedback was that none existed, so I wrote one. I didn't think of dm. I'll look at dm-log-writes. >>>>> Is my understanding correct that XFS support neither eMMC nor NVM devices? >>>>> (unless there is a battery backup that exceeds the guarantees from the spec) >>>> Incorrect. >>>> >>>> They are supported just fine because flush/FUA semantics provide >>>> guarantees against torn writes in normal operation. IOWs, torn >>>> writes are something that almost *never* happen in real life, even >>>> when power fails suddenly. Despite this, XFS can detect it has >>>> occurred (because broken storage is all too common!), and if it >>>> can't recovery automatically, it will shut down and ask the user to >>>> correct the problem.
>> So for xfs the behavior should be: >> >> - without torn writes: Mount always successful, no errors when accessing the >> content. > Yes. > > Of course, there are software bugs, so mounts, recovery and > subsequent repair testing can still fail. > >> - with torn writes: There may be error that will be detected only at >> runtime. The errors may at the end cause a file system shutdown. > Yes, and they may even prevent the filesystem from being mounted > because recovery trips over them (e.g. processing pending unlinked > inodes or replaying incomplete intents). > >> (commented dmesg is attached) >> >> The application I have in mind are embedded systems. >> I.e. there is no user that can correct something, the recovery strategy must >> be included in the design. > Good luck with that - storage hardware fails in ways that no > existing filesystem can recover automatically from 100% of the time. > And very few even attempt to do so because it is largely an > impossible requirement to fulfil. Torn writes are just the tip of > the iceberg.... Yes :-( -- Manfred ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner @ 2022-03-17 14:50 ` Theodore Ts'o 2022-03-17 16:03 ` Manfred Spraul 1 sibling, 1 reply; 11+ messages in thread From: Theodore Ts'o @ 2022-03-17 14:50 UTC (permalink / raw) To: Manfred Spraul; +Cc: Dave Chinner, linux-xfs, Spraul Manfred (XC/QMM21-CT) On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: > > > > BTRFS and ZFS can also detect torn writes, and if you use the > > > (non-default) ext4 option "metadata_csum" it will also detect torn > > Correction - metadata_csum is ienabled by default, I just ran the > > wrong mkfs command when I tested it a few moments ago. > > For ext4, I have seen so far only corrupted commit blocks that cause mount > failures. > > https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/ Ext4 uses FUA writes (if available) to write out the commit block. If a FUA write can result in torn writes, in my opinion that's a bug with the storage device, or if eMMC devices don't respect FUA writes correctly, then we should just disable FUA writes entirely. In the absence of FUA, ext4 does assume that we can write out the commit block as a 4k write, and then issue a cache flush. If your simulator assumes that the 4k write can be torn, on the assumption that there is a narrow race between the issuance of the 4k write, the device writing 1-3 512 byte sectors, and then due to a power failure, the cache flush doesn't complete and the result is a torn write --- quite frankly, I'm not sure how any system using checksums can deal with that situation. I think we can only assume that that case is in reality quite rare, even if it's technically allowed by the spec. - Ted ^ permalink raw reply [flat|nested] 11+ messages in thread
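The failure mode Ted describes — a 4 KiB commit block submitted as eight 512-byte sector writes, with power failing partway through — can be illustrated with a small standalone sketch. The block layout and plain CRC32 used here are stand-ins for illustration, not ext4's or XFS's actual on-disk format or CRC32c:

```python
import binascii

SECTOR = 512
BLOCK = 4096

def make_block(body: bytes) -> bytes:
    # A toy metadata block: 4-byte CRC32 header followed by the body.
    assert len(body) == BLOCK - 4
    return binascii.crc32(body).to_bytes(4, "little") + body

def verify(block: bytes) -> bool:
    # Recompute the body checksum and compare against the stored header.
    return int.from_bytes(block[:4], "little") == binascii.crc32(block[4:])

def torn_write(old: bytes, new: bytes, sectors_done: int) -> bytes:
    # Power fails after `sectors_done` of the eight sectors reached the
    # medium: the result is a prefix of the new block, suffix of the old.
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]

old = make_block(b"\xaa" * (BLOCK - 4))
new = make_block(b"\x55" * (BLOCK - 4))

# The checksum detects a tear anywhere strictly inside the block ...
for n in range(1, BLOCK // SECTOR):
    assert not verify(torn_write(old, new, n))
# ... while 0 or 8 completed sectors leave a self-consistent (old or new)
# block. Detection is all a CRC can offer; atomicity has to come from the
# device honoring FUA/flush ordering.
assert verify(torn_write(old, new, 0))
assert verify(torn_write(old, new, 8))
```

This mirrors Ted's point: once the tear has happened, any checksum-based filesystem can only refuse the block, which is exactly the mount failure seen with the corrupted ext4 commit blocks.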
* Re: Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 2022-03-17 14:50 ` Theodore Ts'o @ 2022-03-17 16:03 ` Manfred Spraul 0 siblings, 0 replies; 11+ messages in thread From: Manfred Spraul @ 2022-03-17 16:03 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-xfs, Spraul Manfred (XC/QMM21-CT) On 3/17/22 15:50, Theodore Ts'o wrote: > On Thu, Mar 17, 2022 at 07:49:02AM +0100, Manfred Spraul wrote: >>>> BTRFS and ZFS can also detect torn writes, and if you use the >>>> (non-default) ext4 option "metadata_csum" it will also detect torn >>> Correction - metadata_csum is ienabled by default, I just ran the >>> wrong mkfs command when I tested it a few moments ago. >> For ext4, I have seen so far only corrupted commit blocks that cause mount >> failures. >> >> https://lore.kernel.org/all/8fe067d0-6d57-9dd7-2c10-5a2c34037ee1@colorfullife.com/ > Ext4 uses FUA writes (if available) to write out the commit block. If > a FUA write can result in torn writes, in my opinion that's a bug with > the storage device, or if eMMC devices don't respect FUA writes > correctly, then we should just disable FUA writes entirely. > > In the absence of FUA, ext4 does assume that we can write out the > commit block as a 4k write, and then issue a cache flush. If your > simulator assumes that the 4k write can be torn, on the assumption > that there is a narrow race between the issuance of the 4k write, the > device writing 1-3 512 byte sectors, and then due to a power failure, > the cache flush doesn't complete and the result is a torn write --- > quite frankly, I'm not sure how any system using checksums can deal > with that situation. I think we can only assume that that case is in > reality quite rare, even if it's technically allowed by the spec. 
Just checking the eMMC spec (JESD 84-B51A), Table 40 (Admitted Data Sector Size, Address Mode and Reliable Write Granularity): native-sector-size 4 kB devices with emulation mode off have a write granularity of 4 kB; otherwise the granularity is 512 bytes. So, to avoid the risk of torn writes for ext4, emulation mode should be disabled. For XFS, the spec provides no solution (e.g. a 20 kB write that crosses a 32 kB boundary). But, obviously: the real issues identified were much simpler, and I have no evidence that torn writes are a real risk. -- Manfred ^ permalink raw reply [flat|nested] 11+ messages in thread
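The problematic case mentioned above — a write that spans more than one reliable-write unit and so may be persisted partially — reduces to a simple alignment check. This is a hypothetical helper, with the 32 kB granule taken from the discussion; a real device reports its own reliable write granularity:

```python
def may_tear(offset: int, length: int, granule: int = 32 * 1024) -> bool:
    """True if [offset, offset+length) spans more than one reliable-write
    unit, i.e. the spec allows the write to be persisted only partially."""
    # The first and last byte of the I/O fall into different granule-sized
    # windows exactly when the write crosses a granule boundary.
    return offset // granule != (offset + length - 1) // granule

# A 20 kB write is safe from tearing only at alignments where it fits
# entirely inside one 32 kB window:
assert not may_tear(offset=0, length=20 * 1024)
assert may_tear(offset=24 * 1024, length=20 * 1024)
```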
end of thread, other threads:[~2022-03-17 16:09 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-03-13 15:47 Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58 Manfred Spraul 2022-03-13 22:46 ` Dave Chinner 2022-03-14 15:18 ` Manfred Spraul 2022-03-16 8:55 ` Manfred Spraul 2022-03-17 2:47 ` Dave Chinner 2022-03-17 3:08 ` Dave Chinner 2022-03-17 6:49 ` Manfred Spraul 2022-03-17 8:24 ` Dave Chinner 2022-03-17 16:09 ` Manfred Spraul 2022-03-17 14:50 ` Theodore Ts'o 2022-03-17 16:03 ` Manfred Spraul
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox