linux-block.vger.kernel.org archive mirror
* [QUESTION] Debugging some file data corruption
@ 2025-11-11 17:01 Calvin Owens
  2025-11-11 21:02 ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Calvin Owens @ 2025-11-11 17:01 UTC (permalink / raw)
  To: linux-block; +Cc: linux-raid, linux-btrfs, linux-crypto

Hello all,

I'm looking for help debugging some corruption I recently encountered.
It happened on 6.17.0, and I'm trying to reproduce it on 6.18-rc. This
is not really actionable yet, I'm just looking for advice.

After copying about 10TB of data to a btrfs+luks+mdraid1 across two 18TB
drives, a btrfs scrub of the filesystem on a second machine threw about
ten checksum failures. The filesystem has never experienced a powerfail
or any unclean shutdown.

I ran an MD check, which showed some differences, so I wrote a trivial
userspace tool to log the differences (appended to the end of this
mail). It found 21 locations where the drives differed (39 LBAs).
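
For reference, the check itself can be driven through the standard md
sysfs knobs, roughly as in the sketch below (a sketch only; the md0
name is an assumption, and it needs root):

    import time

    md = "md0"  # assumption: adjust to the actual array name

    # Ask md to re-read both mirrors and flag any blocks that differ,
    # without rewriting anything.
    with open(f"/sys/block/{md}/md/sync_action", "w") as f:
        f.write("check\n")

    # Wait for the check to finish, then report the mismatch counter.
    while True:
        with open(f"/sys/block/{md}/md/sync_action") as f:
            if f.read().strip() == "idle":
                break
        time.sleep(60)

    with open(f"/sys/block/{md}/md/mismatch_cnt") as f:
        print("mismatch_cnt:", f.read().strip())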

The corruptions are clearly not valid ciphertext (non-random). Here's
an example of one mismatched LBA from both mirrors:

    /dev/sdb                                                              /dev/sda

    e3 df 31 cf de 9b 5b 07  b7 c2 c2 ae 1e 3d 78 06  |..1...[......=x.|  e3 df 31 cf de 9b 5b 07  b7 c2 c2 ae 1e 3d 78 06  |..1...[......=x.|
    82 fc fe 1b 79 04 f0 89  4a 8d c3 03 ad 28 7f de  |....y...J....(..|  82 fc fe 1b 79 04 f0 89  4a 8d c3 03 ad 28 7f de  |....y...J....(..|
    59 8b b4 65 d3 1d dd a5  b7 77 58 a9 d9 a3 44 9c  |Y..e.....wX...D.|  59 8b b4 65 d3 1d dd a5  b7 77 58 a9 d9 a3 44 9c  |Y..e.....wX...D.|
    81 a8 03 1e bd e9 3f bb  98 0d ee a4 cd 4c 67 44  |......?......LgD|  81 a8 03 1e bd e9 3f bb  98 0d ee a4 cd 4c 67 44  |......?......LgD|
    5c a5 55 21 c1 ec cc 23  94 08 61 db cd b1 46 20  |\.U!...#..a...F |  5c a5 55 21 c1 ec cc 23  94 08 61 db cd b1 46 20  |\.U!...#..a...F |
    01 18 77 7e d9 c5 d7 27  b0 ef 5f b7 6f b7 d1 ab  |..w~...'.._.o...|  01 18 77 7e d9 c5 d7 27  b0 ef 5f b7 6f b7 d1 ab  |..w~...'.._.o...|
    42 e5 36 f0 d1 9a 11 2c  22 1b a7 16 34 1d 22 a1  |B.6....,"...4.".|  42 e5 36 f0 d1 9a 11 2c  22 1b a7 16 34 1d 22 a1  |B.6....,"...4.".|
    4b 26 62 c3 d0 fa 02 ef  c3 7b 59 a5 4d 29 c0 62  |K&b......{Y.M).b|  4b 26 62 c3 d0 fa 02 ef  c3 7b 59 a5 4d 29 c0 62  |K&b......{Y.M).b|
    aa 32 9e e2 0f b8 6d 2c  f2 4b 56 87 5f 70 cc 26  |.2....m,.KV._p.&|  aa 32 9e e2 0f b8 6d 2c  f2 4b 56 87 5f 70 cc 26  |.2....m,.KV._p.&|
    ae a2 18 a4 25 74 a6 88  98 07 26 75 10 a4 33 ae  |....%t....&u..3.|  ae a2 18 a4 25 74 a6 88  98 07 26 75 10 a4 33 ae  |....%t....&u..3.|
    9b 01 d9 b7 19 d7 5c b1  1d d6 a0 fe 63 0e b8 c5  |......\.....c...|  9b 01 d9 b7 19 d7 5c b1  1d d6 a0 fe 63 0e b8 c5  |......\.....c...|
    00 20 0f 78 80 0b 24 9c  09 c0 0f 2f d7 87 3a 8c  |. .x..$..../..:.|  00 20 0f 78 80 0b 24 9c  09 c0 0f 2f d7 87 3a 8c  |. .x..$..../..:.|
    2d c4 1e 21 c8 45 02 95  33 37 06 50 b0 7f 1c 36  |-..!.E..37.P...6|  2d c4 1e 21 c8 45 02 95  33 37 06 50 b0 7f 1c 36  |-..!.E..37.P...6|
    8e 65 8c a9 e1 11 2b 7b  4f 9b 8b bd c3 3b 97 d9  |.e....+{O....;..|  9f 46 1d 21 c8 45 16 2d  c9 1c 19 fc d8 b6 db ed  |.F.!.E.-........|
    69 0d 76 80 a4 60 c8 51  20 12 67 d8 b6 f5 fa 77  |i.v..`.Q .g....w|  c7 9b 18 21 c8 45 ea aa  20 3e e9 94 a0 8a e9 2c  |...!.E.. >.....,|
    09 2c 1a 16 73 a6 ea 48  c7 6e 39 5a e5 6d c4 34  |.,..s..H.n9Z.m.4|  f0 db 11 21 c8 45 41 29  a2 e2 32 4b c7 ae cb a5  |...!.EA)..2K....|
    ac 2e 12 3b d4 cc cc 7b  9d d7 6c 2c fa 19 9d b3  |...;...{..l,....|  8d 39 12 21 c8 45 78 47  4e 97 0c 46 43 ec 7a ba  |.9.!.ExGN..FC.z.|
    58 26 ab fb 01 32 d7 f8  fb ba c5 84 f4 fe 01 47  |X&...2.........G|  8a 55 16 21 c8 45 48 ad  28 32 9c 3c f3 17 eb a1  |.U.!.EH.(2.<....|
    92 b8 c9 40 d9 77 53 59  f8 3e 3f 41 49 03 4a c1  |...@.wSY.>?AI.J.|  ca cd 16 21 c8 45 ff 6b  c9 19 91 4c d4 0b 85 b2  |...!.E.k...L....|
    43 9a 7e 42 6b da a1 9e  7d de 03 f9 4c 61 ed a3  |C.~Bk...}...La..|  dd 1c 1d 21 c8 45 9d 54  f4 6c 0c cb e2 6c e5 ad  |...!.E.T.l...l..|
    7a 8f 54 f7 13 8f 7c ac  c2 db e0 ef c8 4c 80 80  |z.T...|......L..|  08 f7 19 21 c8 45 7e 49  90 47 7e 8d 1d 49 22 8c  |...!.E~I.G~..I".|
    cd f2 32 10 88 78 23 9f  d7 eb b3 da 98 77 3c 95  |..2..x#......w<.|  a5 8f 1e 21 c8 45 4c f9  b0 bc 74 6d 2e 80 7b 36  |...!.EL...tm..{6|
    fc 3b 8d 8f 46 82 cc 8c  cf f1 16 7d 01 59 d3 6e  |.;..F......}.Y.n|  18 0c 1b 21 c8 45 e3 51  09 5e a9 14 33 00 d9 30  |...!.E.Q.^..3..0|
    2d 8f 43 8a 12 e3 e7 26  8a 76 cc 28 89 4c 19 04  |-.C....&.v.(.L..|  70 51 13 21 c8 45 81 d9  54 45 89 e5 52 0d c2 ba  |pQ.!.E..TE..R...|
    d9 e1 a7 a5 a4 c9 ef 64  2b 6b 91 a3 b1 4a f8 98  |.......d+k...J..|  65 0c 18 21 c8 45 69 e0  a8 86 92 af 41 cf a0 40  |e..!.Ei.....A..@|
    69 18 c7 96 82 17 5b 57  34 75 84 c1 35 9e 38 d1  |i.....[W4u..5.8.|  7c 6d 10 21 c8 45 46 04  00 00 00 00 00 00 00 00  ||m.!.EF.........|
    93 87 c2 c7 28 3b 9d 48  f4 8d 14 17 cb b7 09 90  |....(;.H........|  19 5f 12 21 c8 45 b4 0c  27 37 cf 44 81 29 c7 8a  |._.!.E..'7.D.)..|
    c0 3c 04 ac e0 44 b2 1e  8a 77 7e 19 85 8f ff 6c  |.<...D...w~....l|  3d 17 1b 21 c8 45 50 d7  00 00 00 00 00 00 00 00  |=..!.EP.........|
    ca 7f 3a 5c d3 52 cb b3  cf a8 89 66 a0 a1 16 ef  |..:\.R.....f....|  87 e7 10 21 c8 45 eb 5c  0a 70 b0 de bf be 1d 5f  |...!.E.\.p....._|
    a4 87 d8 1d 57 c4 73 11  4d d8 0a cf a9 d2 43 08  |....W.s.M.....C.|  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    15 2d 4e c8 b2 d7 3b 55  30 59 a1 6e ff 36 83 33  |.-N...;U0Y.n.6.3|  0a f5 10 21 c8 45 d2 3e  00 00 00 00 00 00 00 00  |...!.E.>........|
    82 2b 4f 55 64 ec 5b 34  90 a6 7d 83 b1 af 44 75  |.+OUd.[4..}...Du|  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

The obvious "streak" of repeated bytes on a 16-byte stride followed by
chunks of zeros is very consistent across my 21 corrupt samples. The
streak always repeats a piece of the real data ("21 c8 45" above).
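
If it helps anyone eyeballing the samples, a quick scan for that
signature can look like the sketch below: it flags any short motif that
repeats at the same offset across consecutive 16-byte rows, and counts
all-zero rows, in a 512-byte block (an illustration only, not the tool
I used):

    def scan(block, stride=16, motif_len=3, min_repeats=4):
        """Flag the corruption signature: a short motif repeating at the
        same offset in consecutive 16-byte rows, plus all-zero rows."""
        rows = [block[i:i + stride] for i in range(0, len(block), stride)]
        zero_rows = sum(1 for r in rows if r == bytes(stride))
        hits = []
        for off in range(stride - motif_len + 1):
            motif, run = None, 0
            for r in rows:
                piece = r[off:off + motif_len]
                if piece == motif and piece != bytes(motif_len):
                    run += 1
                    if run + 1 >= min_repeats:
                        hits.append((off, motif.hex()))
                        break
                else:
                    motif, run = piece, 0
        return hits, zero_rows

    # e.g. one of the dumped 512-byte samples (see the link below)
    with open("lba-693943242.sda.bin", "rb") as f:
        print(scan(f.read()))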

I uploaded all 21 samples here:

    https://github.com/jcalvinowens/lkml-debug-mdraid-618

Does this pattern of corruption mean anything to anybody?

Next, I "split" the mdraid1, using read-only loopback devices with an
offset so they appear as individual btrfs filesystems, and ran an
offline btrfs scrub against each individually.

The first drive:

    calvinow@hackpi1 ~ $ sudo losetup -f /dev/sda -o 135266304
    calvinow@hackpi1 ~ $ sudo cryptsetup luksOpen -r /dev/loop0 sda_only_crypt -d keyfiles/f1ce8b7f-cd32-4ccd-baee-8443f60d9d1d.key
    calvinow@hackpi1 ~ $ sudo btrfs check --check-data-csum /dev/mapper/sda_only_crypt
    [1/8] checking log skipped (none written)
    [2/8] checking root items
    [3/8] checking extents
    [4/8] checking free space tree
    [5/8] checking fs roots
    [6/8] checking csums against data
    mirror 1 bytenr 1850055942144 csum 0x7e5e39be4c95c330acb06620b4ad5647d2092d76ad217296213c835ead479608 expected csum 0xc91d405eb407360748a8a3300a098575cdfb559d169b7b0114f1901a76bc0168
    mirror 1 bytenr 2878389952512 csum 0xe8d67f41f097a790465ba475736b517ec18ebf5d4379c85e681d7ec0a2de148c expected csum 0xfa775cf8e22f6c493c9e4a4b35152146c6af6730972413f57472a0b84eaf1b48
    mirror 1 bytenr 3128346521600 csum 0x9ec824a9e4ddcffdf1062f1ba62b9451969a99be14be202f2439453d7d3004d2 expected csum 0xf671fa7c9ff422d63e33ab2308b743f2efec75bc665f4b39d2b3528ed67b8dd2
    mirror 1 bytenr 4804343029760 csum 0x120b9da3ad94bae74b87dd5348bffe2a63abae51ec1967fe4940ef4d0e04f6b2 expected csum 0xcb35ff0f65d3c83d58c973349e95ddca941db192704d90f117f5748d60847871
    mirror 1 bytenr 4864481669120 csum 0xc47f3d34ad469b42a6c4ac53082711d7a54adfc639b8cae0eca68fa85c553d9e expected csum 0x408d55c9ddb82d687d22cfd88cd97b92e9ada7eeafbf1ffe5361d6c8d668979e
    mirror 1 bytenr 5960080613376 csum 0x01b4b8ed8aeb58e10a817df137ecd7897d71c55e13a1e4b26057ad04545ec25f expected csum 0x2ab1dac1d5f588f945ef21c0e1cbaec6e80f0c1e70b7869b6d60ec10bf0904ad
    mirror 1 bytenr 5961402257408 csum 0xb0286e4ffb2dc14544aa9cbfc93b673f0bedabe2bf8897ea55ab56b360cdb2fd expected csum 0x94296eb48c729bc769b30249fd6f9a495e5bcd7ee53f5b9fa804e8e29b61a1a0
    mirror 1 bytenr 6542744866816 csum 0xa94702d753843de35c7c36aa421d577fc3ee266dcecf667ab889038e3816d9cf expected csum 0xa0f23e680a34bee49f92b93018f691a17c2a333b654969304e6f5fd7451efdf3
    mirror 1 bytenr 8062253760512 csum 0xba3a46a1c1b0305eeeea50576d45d42bc4c2df78dc4298ed8a0430f4ffe02b29 expected csum 0x7e0b0bc4e506d453b69fdd195575f5caafa33833fc690f790594eb9070737e92
    ERROR: errors found in csum tree
    [7/8] checking root refs
    [8/8] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    Checking filesystem on /dev/mapper/sda_only_crypt
    UUID: 3bd1727b-c8ae-4876-96b2-9318c1f9556f
    found 8215070863360 bytes used, error(s) found
    total csum bytes: 63585280608
    total tree bytes: 76154945536
    total fs tree bytes: 379305984
    total extent tree bytes: 496074752
    btree space waste bytes: 11008701324
    file data blocks allocated: 8138915917824
     referenced 8335585824768

The second drive:

    calvinow@hackpi1 ~ $ sudo losetup -f /dev/sdb -o 135266304
    calvinow@hackpi1 ~ $ sudo cryptsetup luksOpen -r /dev/loop1 sdb_only_crypt -d keyfiles/f1ce8b7f-cd32-4ccd-baee-8443f60d9d1d.key
    calvinow@hackpi1 ~ $ sudo btrfs check --check-data-csum /dev/mapper/sdb_only_crypt
    [1/8] checking log skipped (none written)
    [2/8] checking root items
    [3/8] checking extents
    [4/8] checking free space tree
    [5/8] checking fs roots
    [6/8] checking csums against data
    mirror 1 bytenr 360507215872 csum 0xd73bd43f8bbe02623cb49d8c2d22098e9f8f7d1e168a4a3ce09afd6afb3fb7b7 expected csum 0xffd5b2ab2e2a988d1fd8161a770533e3fa12ce42a22a608b33da0e354423bcf2
    mirror 1 bytenr 948478869504 csum 0x7013cafbcede6444418ed6bc6dde69a074d95f54ada5c8f397d5a5a7c7401308 expected csum 0x9707f1e1d21f901608c7406928fb0ac4e4329ee84f0cd927711de1933088bfb9
    mirror 1 bytenr 1335329157120 csum 0x53b36149b0b1404e6033ccee7da2f78a771a216226570b153ec842f75579c01b expected csum 0x8c073e81cd73acbed85e10371bffd684b65480104c97484dd8c35dcd1d7292a4
    mirror 1 bytenr 3342155005952 csum 0xae6b7410b5b3525164b37162d62f0686ed69670dd0e636118520b750f4f90c98 expected csum 0x6813929e1c2858de05525d9a9d60b1a609a9c127ec0fbea386db872da06f911a
    mirror 1 bytenr 3394104115200 csum 0x0bbfe4d49e2831a3534ca5e2731d0913f7fb5a0742bfb584280b6e2ab140f79e expected csum 0xbb404a2115d4db21bca0376ec6a2b147438dd5dc12d2ce77657ffce3a341c791
    mirror 1 bytenr 3540474392576 csum 0xafdfb82fc21d4e9ee86f8f4eeeb80f72e821abc7efc16ec0d1543a8fe766f810 expected csum 0x6d431ad6c9a78ab9b1b4523135bf1f92d196cb3784c0012e82277fde1284f6bb
    mirror 1 bytenr 3764206202880 csum 0xd634900646ca8b656e0e461152abe4d56fcf266e66deaa7995723cdbaad1c3f6 expected csum 0x4bc0bb748c93d6764c4477e759fb0fff987398df31124e407a96ad88c1c08f12
    mirror 1 bytenr 5795747512320 csum 0x29ded608e1e53a33a018173cb1b9653ac1f5ae4c58b2728abcd4147dae1e9038 expected csum 0xdf19cadbca20a6b70673a78a18715e950ab9ff8dce9b5263b4a53106ae64c540
    mirror 1 bytenr 6096793878528 csum 0x73a9cf95307e303118f5a46c4150159b22be9300dd853987d44f162bd6ce3106 expected csum 0x1f8cb156b55dbdbf24ff126b182506f42987d59d470847cee0866a59a52d44c3
    mirror 1 bytenr 6684691709952 csum 0x809afb12bca20c21e167cb90d890144f74ac7860996348e78e8d4e8fe67a5815 expected csum 0xc67c8b6ced8d2e7ae7b3897b1ac7432bbec0c2566f759652ab0892095bca6689
    mirror 1 bytenr 7101833793536 csum 0xcaae7fa28faab629bb5826550cc94ed504bf3e8146c451d284b887405b649dde expected csum 0xa2ba3334fd5b18557139577fb0875a6ba142588654dead275407976ab8b6d80c
    mirror 1 bytenr 8067768504320 csum 0xafac926d7032c1f36efe0b0b4473e1eb1f165271a7ac913564c4b3824815f885 expected csum 0x8e2b6e08d78b7f0e60352091c7b4e329d7c1756806dcdbc7b66d3544499615dc
    ERROR: errors found in csum tree
    [7/8] checking root refs
    [8/8] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    Checking filesystem on /dev/mapper/sdb_only_crypt
    UUID: 3bd1727b-c8ae-4876-96b2-9318c1f9556f
    found 8215070863360 bytes used, error(s) found
    total csum bytes: 63585280608
    total tree bytes: 76154945536
    total fs tree bytes: 379305984
    total extent tree bytes: 496074752
    btree space waste bytes: 11008701324
    file data blocks allocated: 8138915917824
     referenced 8335585824768

The full btrfs inspect-internal dump-tree output is about 160GB of text,
and was exactly identical from both mirrors.
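
(Comparing the two dumps doesn't require storing 160GB of text; hashing
the streams as they are generated is enough. A minimal sketch, reusing
the mapper names from the luksOpen steps above:)

    import hashlib
    import subprocess

    def dump_tree_digest(dev):
        # Stream "btrfs inspect-internal dump-tree" through a hash so
        # the output never has to touch the disk.
        h = hashlib.sha256()
        p = subprocess.Popen(["btrfs", "inspect-internal", "dump-tree", dev],
                             stdout=subprocess.PIPE)
        for chunk in iter(lambda: p.stdout.read(1 << 20), b""):
            h.update(chunk)
        if p.wait() != 0:
            raise RuntimeError(f"dump-tree failed for {dev}")
        return h.hexdigest()

    a = dump_tree_digest("/dev/mapper/sda_only_crypt")
    b = dump_tree_digest("/dev/mapper/sdb_only_crypt")
    print("identical" if a == b else "DIFFER", a, b)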

Because the drives are encrypted, the real data are statistically
random, so I was able to use compression to test which blocks were
corrupt. I wrote all the "less" corrupt instances of each of the 21
blocks to drive B, and all the "more" corrupt instances to drive A,
using this python script:

    import bz2
    import os
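    # Note: valid ciphertext is essentially incompressible, so for each
    # mismatched LBA the copy that bz2-compresses better is the corrupt one.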
    
    with open("combined-corrupt-lba-list.txt", "r") as f:
        lbas = [int(x.rstrip()) for x in f.readlines()]
    
    fd_sda = os.open("/dev/sda", os.O_RDWR | os.O_EXCL)
    fd_sdb = os.open("/dev/sdb", os.O_RDWR | os.O_EXCL)
    
    for lba in lbas:
        f1 = f"lba-{lba}.sda.bin"
        with open(f1, "rb") as f:
            d1 = f.read()
        l1 = len(bz2.compress(d1))
    
        f2 = f"lba-{lba}.sdb.bin"
        with open(f2, "rb") as f:
            d2 = f.read()
        l2 = len(bz2.compress(d2))
    
        if l1 < l2:
            print(f"sda bad, sdb good: {lba}")
            os.pwrite(fd_sda, d1, lba * 512)
            os.pwrite(fd_sdb, d2, lba * 512)
        elif l2 < l1:
            print(f"sda good, sdb bad: {lba}")
            os.pwrite(fd_sda, d2, lba * 512)
            os.pwrite(fd_sdb, d1, lba * 512)
        else:
            raise RuntimeError(f"can't tell: {lba}")

This was the output:

    sda good, sdb bad: 693943242
    sda good, sdb bad: 1850713991
    sda good, sdb bad: 2610475259
    sda bad, sdb good: 3630481074
    sda bad, sdb good: 5657885844
    sda bad, sdb good: 6146082269
    sda good, sdb bad: 6563676964
    sda good, sdb bad: 6677722982
    sda good, sdb bad: 6963602424
    sda good, sdb bad: 7400578618
    sda bad, sdb good: 9446775924
    sda bad, sdb good: 9564234206
    sda good, sdb bad: 11389404262
    sda bad, sdb good: 11712464500
    sda bad, sdb good: 11715045836
    sda good, sdb bad: 11979482596
    sda bad, sdb good: 12863063528
    sda good, sdb bad: 13140303458
    sda good, sdb bad: 13955034090
    sda bad, sdb good: 15845534402
    sda good, sdb bad: 15856305388

Then, I re-ran the offline scrubs: drive A now shows all the errors
originally seen across both drives, and drive B is now clean.

Finally, I ran userspace checksums of the full set of files on the
newly clean drive B: they perfectly match an older copy in my backups.
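
(The checksum comparison can be as simple as a per-file digest walk; a
minimal sketch, with placeholder mount points:)

    import hashlib
    import os

    def digest_tree(root):
        # Map path-relative-to-root -> digest of the file contents.
        out = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                h = hashlib.blake2b()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                out[os.path.relpath(path, root)] = h.hexdigest()
        return out

    live = digest_tree("/mnt/driveB")    # placeholder: the now-clean mirror
    backup = digest_tree("/mnt/backup")  # placeholder: the older backup copy
    for rel in sorted(set(live) | set(backup)):
        if live.get(rel) != backup.get(rel):
            print("MISMATCH:", rel)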

This proves that:

    1) RAID mismatches and btrfs checksum failures are strictly 1:1.
    2) For every RAID mismatch, strictly one mirror was corrupted.
    3) No silent corruption occurred; btrfs caught everything.

The hard drives are brand new, so they are my current suspects. I've
used the same two-drive USB enclosure extensively with older HDDs and
never seen a problem. I'm running this FIO job to test them:

    [global]
    numjobs=1
    loops=20
    ioengine=io_uring
    rw=randrw
    percentage_random=5%
    rwmixwrite=95
    iodepth=32
    direct=1
    size=5%
    blocksize_range=1k-32m
    sync=none
    refill_buffers=1
    random_distribution=random
    random_generator=tausworthe64
    verify=xxhash
    verify_fatal=1
    verify_dump=1
    do_verify=1
    verify_async=$ncpus
    
    [hdd-sdb-test]
    filename=/dev/sdb
    
    [hdd-sdc-test]
    filename=/dev/sdc

...but no luck hitting anything after about 18 hours.

Unless anybody else has a better idea, my plan is to "work my way up"
the storage stack with FIO verify jobs like the above. Any advice would
be greatly appreciated :)

Thanks,
Calvin

--

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <string.h>
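
/*
 * Compare two block devices 512 bytes at a time and report every LBA
 * where the contents differ.
 */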

int main(int argc, char **argv)
{
	int fd1, fd2, m_ret = EXIT_SUCCESS;
	size_t lba = 0;

	if (argc < 3)
		errx(1, "Usage: ./test <drv1> <drv2>");

	fd1 = open(argv[1], O_RDONLY);
	if (fd1 == -1)
		err(1, "Can't open %s", argv[1]);

	fd2 = open(argv[2], O_RDONLY);
	if (fd2 == -1)
		err(1, "Can't open %s", argv[2]);

	while (1) {
		int ret1, ret2;
		char buf1[512];
		char buf2[512];

		ret1 = read(fd1, buf1, 512);
		if (ret1 == -1)
			err(2, "Bad read from input #1");

		ret2 = read(fd2, buf2, 512);
		if (ret2 == -1)
			err(2, "Bad read from input #2");

		if (ret1 != ret2)
			err(3, "Files differ in length at lba %lu (%d/%d)",
			    lba, ret1, ret2);

		if (ret1 == 0)
			break;

		if (ret1 != 512)
			err(4, "Short read!?");

		if (memcmp(buf1, buf2, 512)) {
			unsigned i;
			FILE *out;

			m_ret = EXIT_FAILURE;
			printf("LBA %lu differs:\n", lba);
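			/*
			 * XOR the two copies so matching bytes become zeros
			 * and only the corrupted bytes show up in the
			 * hexdump below.
			 */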
			for (i = 0; i < 512; i++)
				buf1[i] ^= buf2[i];

			out = popen("hexdump -C", "w");
			if (!out)
				err(5, "Unable to dump XOR of blocks");

			fflush(stdout);
			fwrite(buf1, 1, 512, out);
			pclose(out);
		}

		lba++;
	}

	close(fd2);
	close(fd1);
	return m_ret;
}


* Re: [QUESTION] Debugging some file data corruption
  2025-11-11 17:01 [QUESTION] Debugging some file data corruption Calvin Owens
@ 2025-11-11 21:02 ` Qu Wenruo
  2025-11-14 19:39   ` Calvin Owens
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2025-11-11 21:02 UTC (permalink / raw)
  To: Calvin Owens, linux-block; +Cc: linux-raid, linux-btrfs, linux-crypto



On 2025/11/12 03:31, Calvin Owens wrote:
> Hello all,
> 
> I'm looking for help debugging some corruption I recently encountered.
> It happened on 6.17.0, and I'm trying to reproduce it on 6.18-rc. This
> is not really actionable yet, I'm just looking for advice.
> 
> After copying about 10TB of data to a btrfs+luks+mdraid1 across two 18TB
> drives,

Recently there was a bug report about direct IO and mdraid1, where
modifying the buffer on the fly can cause different contents on
different mdraid1 mirrors.

Although I guess for your case, since it's btrfs on top and you have
data checksums enabled, we can forget about the direct IO problem, as
btrfs forces a fallback to buffered IO when data csums are involved.
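
For reference, that failure mode is essentially the sketch below
(illustration only, and only against a throwaway RAID1 array, here
assumed at /dev/md_scratch): with O_DIRECT the kernel reads straight
out of the user buffer, once per mirror leg, so mutating the buffer
while the write is in flight can leave the legs with different bytes.

    import mmap
    import os
    import threading

    BLK = 4096
    # Assumption: /dev/md_scratch is a scratch RAID1 array you can clobber.
    fd = os.open("/dev/md_scratch", os.O_WRONLY | os.O_DIRECT)

    buf = mmap.mmap(-1, BLK)        # page-aligned, as O_DIRECT requires
    buf[:] = b"\xaa" * BLK
    stop = threading.Event()

    def scribble():
        # Keep mutating the buffer while the kernel may still be reading
        # it for one or both mirror legs.
        while not stop.is_set():
            buf[0] = (buf[0] + 1) & 0xff

    t = threading.Thread(target=scribble)
    t.start()
    for i in range(10000):
        os.pwrite(fd, buf, i * BLK)  # direct IO straight from the shared buffer
    stop.set()
    t.join()
    os.close(fd)
    # An md "check" afterwards may then report a non-zero mismatch_cnt.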

> a btrfs scrub of the filesystem on a second machine threw about
> ten checksum failures. The filesystem has never experienced a powerfail
> or any unclean shutdown.
> 
> I ran an MD check, which showed some differences, so I wrote a trivial
> userspace tool to log the differences (appended to the end of this
> mail). It found 21 locations where the drives differed (39 LBAs).
> 
> The corruptions are clearly not valid ciphertext (non-random). Here's
> an example of one mismatched LBA from both mirrors:
> 
>      /dev/sdb                                                              /dev/sda
> 
>      e3 df 31 cf de 9b 5b 07  b7 c2 c2 ae 1e 3d 78 06  |..1...[......=x.|  e3 df 31 cf de 9b 5b 07  b7 c2 c2 ae 1e 3d 78 06  |..1...[......=x.|
>      82 fc fe 1b 79 04 f0 89  4a 8d c3 03 ad 28 7f de  |....y...J....(..|  82 fc fe 1b 79 04 f0 89  4a 8d c3 03 ad 28 7f de  |....y...J....(..|
>      59 8b b4 65 d3 1d dd a5  b7 77 58 a9 d9 a3 44 9c  |Y..e.....wX...D.|  59 8b b4 65 d3 1d dd a5  b7 77 58 a9 d9 a3 44 9c  |Y..e.....wX...D.|
>      81 a8 03 1e bd e9 3f bb  98 0d ee a4 cd 4c 67 44  |......?......LgD|  81 a8 03 1e bd e9 3f bb  98 0d ee a4 cd 4c 67 44  |......?......LgD|
>      5c a5 55 21 c1 ec cc 23  94 08 61 db cd b1 46 20  |\.U!...#..a...F |  5c a5 55 21 c1 ec cc 23  94 08 61 db cd b1 46 20  |\.U!...#..a...F |
>      01 18 77 7e d9 c5 d7 27  b0 ef 5f b7 6f b7 d1 ab  |..w~...'.._.o...|  01 18 77 7e d9 c5 d7 27  b0 ef 5f b7 6f b7 d1 ab  |..w~...'.._.o...|
>      42 e5 36 f0 d1 9a 11 2c  22 1b a7 16 34 1d 22 a1  |B.6....,"...4.".|  42 e5 36 f0 d1 9a 11 2c  22 1b a7 16 34 1d 22 a1  |B.6....,"...4.".|
>      4b 26 62 c3 d0 fa 02 ef  c3 7b 59 a5 4d 29 c0 62  |K&b......{Y.M).b|  4b 26 62 c3 d0 fa 02 ef  c3 7b 59 a5 4d 29 c0 62  |K&b......{Y.M).b|
>      aa 32 9e e2 0f b8 6d 2c  f2 4b 56 87 5f 70 cc 26  |.2....m,.KV._p.&|  aa 32 9e e2 0f b8 6d 2c  f2 4b 56 87 5f 70 cc 26  |.2....m,.KV._p.&|
>      ae a2 18 a4 25 74 a6 88  98 07 26 75 10 a4 33 ae  |....%t....&u..3.|  ae a2 18 a4 25 74 a6 88  98 07 26 75 10 a4 33 ae  |....%t....&u..3.|
>      9b 01 d9 b7 19 d7 5c b1  1d d6 a0 fe 63 0e b8 c5  |......\.....c...|  9b 01 d9 b7 19 d7 5c b1  1d d6 a0 fe 63 0e b8 c5  |......\.....c...|
>      00 20 0f 78 80 0b 24 9c  09 c0 0f 2f d7 87 3a 8c  |. .x..$..../..:.|  00 20 0f 78 80 0b 24 9c  09 c0 0f 2f d7 87 3a 8c  |. .x..$..../..:.|
>      2d c4 1e 21 c8 45 02 95  33 37 06 50 b0 7f 1c 36  |-..!.E..37.P...6|  2d c4 1e 21 c8 45 02 95  33 37 06 50 b0 7f 1c 36  |-..!.E..37.P...6|
>      8e 65 8c a9 e1 11 2b 7b  4f 9b 8b bd c3 3b 97 d9  |.e....+{O....;..|  9f 46 1d 21 c8 45 16 2d  c9 1c 19 fc d8 b6 db ed  |.F.!.E.-........|
>      69 0d 76 80 a4 60 c8 51  20 12 67 d8 b6 f5 fa 77  |i.v..`.Q .g....w|  c7 9b 18 21 c8 45 ea aa  20 3e e9 94 a0 8a e9 2c  |...!.E.. >.....,|
>      09 2c 1a 16 73 a6 ea 48  c7 6e 39 5a e5 6d c4 34  |.,..s..H.n9Z.m.4|  f0 db 11 21 c8 45 41 29  a2 e2 32 4b c7 ae cb a5  |...!.EA)..2K....|
>      ac 2e 12 3b d4 cc cc 7b  9d d7 6c 2c fa 19 9d b3  |...;...{..l,....|  8d 39 12 21 c8 45 78 47  4e 97 0c 46 43 ec 7a ba  |.9.!.ExGN..FC.z.|
>      58 26 ab fb 01 32 d7 f8  fb ba c5 84 f4 fe 01 47  |X&...2.........G|  8a 55 16 21 c8 45 48 ad  28 32 9c 3c f3 17 eb a1  |.U.!.EH.(2.<....|
>      92 b8 c9 40 d9 77 53 59  f8 3e 3f 41 49 03 4a c1  |...@.wSY.>?AI.J.|  ca cd 16 21 c8 45 ff 6b  c9 19 91 4c d4 0b 85 b2  |...!.E.k...L....|
>      43 9a 7e 42 6b da a1 9e  7d de 03 f9 4c 61 ed a3  |C.~Bk...}...La..|  dd 1c 1d 21 c8 45 9d 54  f4 6c 0c cb e2 6c e5 ad  |...!.E.T.l...l..|
>      7a 8f 54 f7 13 8f 7c ac  c2 db e0 ef c8 4c 80 80  |z.T...|......L..|  08 f7 19 21 c8 45 7e 49  90 47 7e 8d 1d 49 22 8c  |...!.E~I.G~..I".|
>      cd f2 32 10 88 78 23 9f  d7 eb b3 da 98 77 3c 95  |..2..x#......w<.|  a5 8f 1e 21 c8 45 4c f9  b0 bc 74 6d 2e 80 7b 36  |...!.EL...tm..{6|
>      fc 3b 8d 8f 46 82 cc 8c  cf f1 16 7d 01 59 d3 6e  |.;..F......}.Y.n|  18 0c 1b 21 c8 45 e3 51  09 5e a9 14 33 00 d9 30  |...!.E.Q.^..3..0|
>      2d 8f 43 8a 12 e3 e7 26  8a 76 cc 28 89 4c 19 04  |-.C....&.v.(.L..|  70 51 13 21 c8 45 81 d9  54 45 89 e5 52 0d c2 ba  |pQ.!.E..TE..R...|
>      d9 e1 a7 a5 a4 c9 ef 64  2b 6b 91 a3 b1 4a f8 98  |.......d+k...J..|  65 0c 18 21 c8 45 69 e0  a8 86 92 af 41 cf a0 40  |e..!.Ei.....A..@|
>      69 18 c7 96 82 17 5b 57  34 75 84 c1 35 9e 38 d1  |i.....[W4u..5.8.|  7c 6d 10 21 c8 45 46 04  00 00 00 00 00 00 00 00  ||m.!.EF.........|
>      93 87 c2 c7 28 3b 9d 48  f4 8d 14 17 cb b7 09 90  |....(;.H........|  19 5f 12 21 c8 45 b4 0c  27 37 cf 44 81 29 c7 8a  |._.!.E..'7.D.)..|
>      c0 3c 04 ac e0 44 b2 1e  8a 77 7e 19 85 8f ff 6c  |.<...D...w~....l|  3d 17 1b 21 c8 45 50 d7  00 00 00 00 00 00 00 00  |=..!.EP.........|
>      ca 7f 3a 5c d3 52 cb b3  cf a8 89 66 a0 a1 16 ef  |..:\.R.....f....|  87 e7 10 21 c8 45 eb 5c  0a 70 b0 de bf be 1d 5f  |...!.E.\.p....._|
>      a4 87 d8 1d 57 c4 73 11  4d d8 0a cf a9 d2 43 08  |....W.s.M.....C.|  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>      15 2d 4e c8 b2 d7 3b 55  30 59 a1 6e ff 36 83 33  |.-N...;U0Y.n.6.3|  0a f5 10 21 c8 45 d2 3e  00 00 00 00 00 00 00 00  |...!.E.>........|
>      82 2b 4f 55 64 ec 5b 34  90 a6 7d 83 b1 af 44 75  |.+OUd.[4..}...Du|  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 
> The obvious "streak" of repeated bytes on a 16-byte stride followed by
> chunks of zeros is very consistent across my 21 corrupt samples. The
> streak always repeats a piece of the real data ("21 c8 45" above).
> 
> I uploaded all 21 samples here:
> 
>      https://github.com/jcalvinowens/lkml-debug-mdraid-618
> 
> Does this pattern of corruption mean anything to anybody?

With LUKS in the middle, any corruption pattern becomes very hard for a
human to interpret.

I guess it's not really feasible to reproduce the problem again, since
10TiB of data is involved?

But if you can spend a lot of time waiting for the data copy, would you
mind trying the following combination(s)?

- btrfs on mdraid1
- btrfs RAID1 on the two raw HDDs

> 
> Next, I "split" the mdraid1, using read-only loopback devices with an
> offset so they appear as individual btrfs filesystems, and ran an
> offline btrfs scrub against each individually.
> 
> The first drive:
> 
>      calvinow@hackpi1 ~ $ sudo losetup -f /dev/sda -o 135266304
>      calvinow@hackpi1 ~ $ sudo cryptsetup luksOpen -r /dev/loop0 sda_only_crypt -d keyfiles/f1ce8b7f-cd32-4ccd-baee-8443f60d9d1d.key
>      calvinow@hackpi1 ~ $ sudo btrfs check --check-data-csum /dev/mapper/sda_only_crypt
>      [1/8] checking log skipped (none written)
>      [2/8] checking root items
>      [3/8] checking extents
>      [4/8] checking free space tree
>      [5/8] checking fs roots
>      [6/8] checking csums against data
>      mirror 1 bytenr 1850055942144 csum 0x7e5e39be4c95c330acb06620b4ad5647d2092d76ad217296213c835ead479608 expected csum 0xc91d405eb407360748a8a3300a098575cdfb559d169b7b0114f1901a76bc0168
>      mirror 1 bytenr 2878389952512 csum 0xe8d67f41f097a790465ba475736b517ec18ebf5d4379c85e681d7ec0a2de148c expected csum 0xfa775cf8e22f6c493c9e4a4b35152146c6af6730972413f57472a0b84eaf1b48
>      mirror 1 bytenr 3128346521600 csum 0x9ec824a9e4ddcffdf1062f1ba62b9451969a99be14be202f2439453d7d3004d2 expected csum 0xf671fa7c9ff422d63e33ab2308b743f2efec75bc665f4b39d2b3528ed67b8dd2
>      mirror 1 bytenr 4804343029760 csum 0x120b9da3ad94bae74b87dd5348bffe2a63abae51ec1967fe4940ef4d0e04f6b2 expected csum 0xcb35ff0f65d3c83d58c973349e95ddca941db192704d90f117f5748d60847871
>      mirror 1 bytenr 4864481669120 csum 0xc47f3d34ad469b42a6c4ac53082711d7a54adfc639b8cae0eca68fa85c553d9e expected csum 0x408d55c9ddb82d687d22cfd88cd97b92e9ada7eeafbf1ffe5361d6c8d668979e
>      mirror 1 bytenr 5960080613376 csum 0x01b4b8ed8aeb58e10a817df137ecd7897d71c55e13a1e4b26057ad04545ec25f expected csum 0x2ab1dac1d5f588f945ef21c0e1cbaec6e80f0c1e70b7869b6d60ec10bf0904ad
>      mirror 1 bytenr 5961402257408 csum 0xb0286e4ffb2dc14544aa9cbfc93b673f0bedabe2bf8897ea55ab56b360cdb2fd expected csum 0x94296eb48c729bc769b30249fd6f9a495e5bcd7ee53f5b9fa804e8e29b61a1a0
>      mirror 1 bytenr 6542744866816 csum 0xa94702d753843de35c7c36aa421d577fc3ee266dcecf667ab889038e3816d9cf expected csum 0xa0f23e680a34bee49f92b93018f691a17c2a333b654969304e6f5fd7451efdf3
>      mirror 1 bytenr 8062253760512 csum 0xba3a46a1c1b0305eeeea50576d45d42bc4c2df78dc4298ed8a0430f4ffe02b29 expected csum 0x7e0b0bc4e506d453b69fdd195575f5caafa33833fc690f790594eb9070737e92
>      ERROR: errors found in csum tree
>      [7/8] checking root refs
>      [8/8] checking quota groups skipped (not enabled on this FS)
>      Opening filesystem to check...
>      Checking filesystem on /dev/mapper/sda_only_crypt
>      UUID: 3bd1727b-c8ae-4876-96b2-9318c1f9556f
>      found 8215070863360 bytes used, error(s) found
>      total csum bytes: 63585280608
>      total tree bytes: 76154945536
>      total fs tree bytes: 379305984
>      total extent tree bytes: 496074752
>      btree space waste bytes: 11008701324
>      file data blocks allocated: 8138915917824
>       referenced 8335585824768
> 
> The second drive:
> 
>      calvinow@hackpi1 ~ $ sudo losetup -f /dev/sdb -o 135266304
>      calvinow@hackpi1 ~ $ sudo cryptsetup luksOpen -r /dev/loop1 sdb_only_crypt -d keyfiles/f1ce8b7f-cd32-4ccd-baee-8443f60d9d1d.key
>      calvinow@hackpi1 ~ $ sudo btrfs check --check-data-csum /dev/mapper/sdb_only_crypt
>      [1/8] checking log skipped (none written)
>      [2/8] checking root items
>      [3/8] checking extents
>      [4/8] checking free space tree
>      [5/8] checking fs roots
>      [6/8] checking csums against data
>      mirror 1 bytenr 360507215872 csum 0xd73bd43f8bbe02623cb49d8c2d22098e9f8f7d1e168a4a3ce09afd6afb3fb7b7 expected csum 0xffd5b2ab2e2a988d1fd8161a770533e3fa12ce42a22a608b33da0e354423bcf2
>      mirror 1 bytenr 948478869504 csum 0x7013cafbcede6444418ed6bc6dde69a074d95f54ada5c8f397d5a5a7c7401308 expected csum 0x9707f1e1d21f901608c7406928fb0ac4e4329ee84f0cd927711de1933088bfb9
>      mirror 1 bytenr 1335329157120 csum 0x53b36149b0b1404e6033ccee7da2f78a771a216226570b153ec842f75579c01b expected csum 0x8c073e81cd73acbed85e10371bffd684b65480104c97484dd8c35dcd1d7292a4
>      mirror 1 bytenr 3342155005952 csum 0xae6b7410b5b3525164b37162d62f0686ed69670dd0e636118520b750f4f90c98 expected csum 0x6813929e1c2858de05525d9a9d60b1a609a9c127ec0fbea386db872da06f911a
>      mirror 1 bytenr 3394104115200 csum 0x0bbfe4d49e2831a3534ca5e2731d0913f7fb5a0742bfb584280b6e2ab140f79e expected csum 0xbb404a2115d4db21bca0376ec6a2b147438dd5dc12d2ce77657ffce3a341c791
>      mirror 1 bytenr 3540474392576 csum 0xafdfb82fc21d4e9ee86f8f4eeeb80f72e821abc7efc16ec0d1543a8fe766f810 expected csum 0x6d431ad6c9a78ab9b1b4523135bf1f92d196cb3784c0012e82277fde1284f6bb
>      mirror 1 bytenr 3764206202880 csum 0xd634900646ca8b656e0e461152abe4d56fcf266e66deaa7995723cdbaad1c3f6 expected csum 0x4bc0bb748c93d6764c4477e759fb0fff987398df31124e407a96ad88c1c08f12
>      mirror 1 bytenr 5795747512320 csum 0x29ded608e1e53a33a018173cb1b9653ac1f5ae4c58b2728abcd4147dae1e9038 expected csum 0xdf19cadbca20a6b70673a78a18715e950ab9ff8dce9b5263b4a53106ae64c540
>      mirror 1 bytenr 6096793878528 csum 0x73a9cf95307e303118f5a46c4150159b22be9300dd853987d44f162bd6ce3106 expected csum 0x1f8cb156b55dbdbf24ff126b182506f42987d59d470847cee0866a59a52d44c3
>      mirror 1 bytenr 6684691709952 csum 0x809afb12bca20c21e167cb90d890144f74ac7860996348e78e8d4e8fe67a5815 expected csum 0xc67c8b6ced8d2e7ae7b3897b1ac7432bbec0c2566f759652ab0892095bca6689
>      mirror 1 bytenr 7101833793536 csum 0xcaae7fa28faab629bb5826550cc94ed504bf3e8146c451d284b887405b649dde expected csum 0xa2ba3334fd5b18557139577fb0875a6ba142588654dead275407976ab8b6d80c
>      mirror 1 bytenr 8067768504320 csum 0xafac926d7032c1f36efe0b0b4473e1eb1f165271a7ac913564c4b3824815f885 expected csum 0x8e2b6e08d78b7f0e60352091c7b4e329d7c1756806dcdbc7b66d3544499615dc
>      ERROR: errors found in csum tree
>      [7/8] checking root refs
>      [8/8] checking quota groups skipped (not enabled on this FS)
>      Opening filesystem to check...
>      Checking filesystem on /dev/mapper/sdb_only_crypt
>      UUID: 3bd1727b-c8ae-4876-96b2-9318c1f9556f
>      found 8215070863360 bytes used, error(s) found
>      total csum bytes: 63585280608
>      total tree bytes: 76154945536
>      total fs tree bytes: 379305984
>      total extent tree bytes: 496074752
>      btree space waste bytes: 11008701324
>      file data blocks allocated: 8138915917824
>       referenced 8335585824768
> 
> The full btrfs inspect-internal dump-tree output is about 160GB of text,
> and was exactly identical from both mirrors.
> 
> Because the drives are encrypted, the real data are statistically
> random, so I was able to use compression to test which blocks were
> corrupt. I wrote all the "less" corrupt instances of each of the 21
> blocks to drive B, and all the "more" corrupt instances to drive A,
> using this python script:
> 
>      import bz2
>      import os
>      
>      with open("combined-corrupt-lba-list.txt", "r") as f:
>          lbas = [int(x.rstrip()) for x in f.readlines()]
>      
>      fd_sda = os.open("/dev/sda", os.O_RDWR | os.O_EXCL)
>      fd_sdb = os.open("/dev/sdb", os.O_RDWR | os.O_EXCL)
>      
>      for lba in lbas:
>          f1 = f"lba-{lba}.sda.bin"
>          with open(f1, "rb") as f:
>              d1 = f.read()
>          l1 = len(bz2.compress(d1))
>      
>          f2 = f"lba-{lba}.sdb.bin"
>          with open(f2, "rb") as f:
>              d2 = f.read()
>          l2 = len(bz2.compress(d2))
>      
>          if l1 < l2:
>              print(f"sda bad, sdb good: {lba}")
>              os.pwrite(fd_sda, d1, lba * 512)
>              os.pwrite(fd_sdb, d2, lba * 512)
>          elif l2 < l1:
>              print(f"sda good, sdb bad: {lba}")
>              os.pwrite(fd_sda, d2, lba * 512)
>              os.pwrite(fd_sdb, d1, lba * 512)
>          else:
>              raise RuntimeError(f"can't tell: {lba}")
> 
> This was the output:
> 
>      sda good, sdb bad: 693943242
>      sda good, sdb bad: 1850713991
>      sda good, sdb bad: 2610475259
>      sda bad, sdb good: 3630481074
>      sda bad, sdb good: 5657885844
>      sda bad, sdb good: 6146082269
>      sda good, sdb bad: 6563676964
>      sda good, sdb bad: 6677722982
>      sda good, sdb bad: 6963602424
>      sda good, sdb bad: 7400578618
>      sda bad, sdb good: 9446775924
>      sda bad, sdb good: 9564234206
>      sda good, sdb bad: 11389404262
>      sda bad, sdb good: 11712464500
>      sda bad, sdb good: 11715045836
>      sda good, sdb bad: 11979482596
>      sda bad, sdb good: 12863063528
>      sda good, sdb bad: 13140303458
>      sda good, sdb bad: 13955034090
>      sda bad, sdb good: 15845534402
>      sda good, sdb bad: 15856305388

Considering there are no bad/bad combinations, I strongly doubt that
mdraid1 itself is causing the problem.

Does the mdraid1 have something like the write-behind feature enabled?
> 
> Then, I re-ran the offline scrubs: drive A now shows all the errors
> originally seen across both drives, and drive B is now clean.
> 
> Finally, I ran userspace checksums of the full set of files on the
> newly clean drive B: they perfectly match an older copy in my backups.
> 
> This proves that:
> 
>      1) RAID mismatches and btrfs checksum failures are strictly 1:1.
>      2) For every RAID mismatch, strictly one mirror was corrupted.
>      3) No silent corruption occurred; btrfs caught everything.
> 
> The hard drives are brand new, so they are my current suspects.

I wouldn't suspect the HDDs as the first culprit. Since there was no
power loss, no FLUSH/FUA bugs are involved, and all the corruption hit
data rather than metadata; if it were really the HDDs, I'd expect at
least one or two metadata corruptions as well.


> I've
> used the same two-drive USB enclosure extensively with older HDDs and
> never seen a problem. I'm running this FIO job to test them:
> 
>      [global]
>      numjobs=1
>      loops=20
>      ioengine=io_uring
>      rw=randrw
>      percentage_random=5%
>      rwmixwrite=95
>      iodepth=32
>      direct=1
>      size=5%
>      blocksize_range=1k-32m
>      sync=none
>      refill_buffers=1
>      random_distribution=random
>      random_generator=tausworthe64
>      verify=xxhash
>      verify_fatal=1
>      verify_dump=1
>      do_verify=1
>      verify_async=$ncpus
>      
>      [hdd-sdb-test]
>      filename=/dev/sdb
>      
>      [hdd-sdc-test]
>      filename=/dev/sdc
> 
> ...but no luck hitting anything after about 18 hours.

I haven't had a good experience using fio to find corruption, so if
you hit the problem by simply copying data (I guess through 'cp'?),
then maybe stick to the working reproducer?

Although copying 10TiB onto the HDDs will take over 40 hours, that is
still much longer than your fio workload ran.

Thanks,
Qu
> 
> Unless anybody else has a better idea, my plan is to "work my way up"
> the storage stack with FIO verify jobs like the above. Any advice would
> be greatly appreciated :)
> 
> Thanks,
> Calvin
> 
> --
> 
> #define _GNU_SOURCE
> #include <stdlib.h>
> #include <stdio.h>
> #include <err.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <string.h>
> 
> int main(int argc, char **argv)
> {
> 	int fd1, fd2, m_ret = EXIT_SUCCESS;
> 	size_t lba = 0;
> 
> 	if (argc < 3)
> 		errx(1, "Usage: ./test <drv1> <drv2>");
> 
> 	fd1 = open(argv[1], O_RDONLY);
> 	if (fd1 == -1)
> 		err(1, "Can't open %s", argv[1]);
> 
> 	fd2 = open(argv[2], O_RDONLY);
> 	if (fd2 == -1)
> 		err(1, "Can't open %s", argv[2]);
> 
> 	while (1) {
> 		int ret1, ret2;
> 		char buf1[512];
> 		char buf2[512];
> 
> 		ret1 = read(fd1, buf1, 512);
> 		if (ret1 == -1)
> 			err(2, "Bad read from input #1");
> 
> 		ret2 = read(fd2, buf2, 512);
> 		if (ret2 == -1)
> 			err(2, "Bad read from input #2");
> 
> 		if (ret1 != ret2)
> 			err(3, "Files differ in length at lba %lu (%d/%d)",
> 			    lba, ret1, ret2);
> 
> 		if (ret1 == 0)
> 			break;
> 
> 		if (ret1 != 512)
> 			err(4, "Short read!?");
> 
> 		if (memcmp(buf1, buf2, 512)) {
> 			unsigned i;
> 			FILE *out;
> 
> 			m_ret = EXIT_FAILURE;
> 			printf("LBA %lu differs:\n", lba);
> 			for (i = 0; i < 512; i++)
> 				buf1[i] ^= buf2[i];
> 
> 			out = popen("hexdump -C", "w");
> 			if (!out)
> 				err(5, "Unable to dump XOR of blocks");
> 
> 			fflush(stdout);
> 			fwrite(buf1, 1, 512, out);
> 			pclose(out);
> 		}
> 
> 		lba++;
> 	}
> 
> 	close(fd2);
> 	close(fd1);
> 	return m_ret;
> }
> 



* Re: [QUESTION] Debugging some file data corruption
  2025-11-11 21:02 ` Qu Wenruo
@ 2025-11-14 19:39   ` Calvin Owens
  0 siblings, 0 replies; 3+ messages in thread
From: Calvin Owens @ 2025-11-14 19:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-block, linux-raid, linux-btrfs, linux-crypto

On Wednesday 11/12 at 07:32 +1030, Qu Wenruo wrote:
> With LUKS in the middle, any corruption pattern becomes very hard for a
> human to interpret.
> 
> I guess it's not really feasible to reproduce the problem again, since
> 10TiB of data is involved?

I can try again. It only takes 10 minutes of my time to get it started,
even if it takes a few days to run, I've got spare machines for it.

> But if you can spend a lot of time waiting for the data copy, would you
> mind trying the following combination(s)?
> 
> - btrfs on mdraid1
> - btrfs RAID1 on the two raw HDDs

Will do.

> Considering there are no bad/bad combinations, I strongly doubt that
> mdraid1 itself is causing the problem.
> 
> Does the mdraid1 have something like the write-behind feature enabled?

No, nothing special, I'm creating the array using:

    mdadm --create -l 1 -n 2 --write-zeroes /dev/sda /dev/sdb

> > Then, I re-ran the offline scrubs: drive A now shows all the errors
> > originally seen across both drives, and drive B is now clean.
> > 
> > Finally, I ran userspace checksums of the full set of files on the
> > newly clean drive B: they perfectly match an older copy in my backups.
> > 
> > This proves that:
> > 
> >      1) RAID mismatches and btrfs checksum failures are strictly 1:1.
> >      2) For every RAID mismatch, strictly one mirror was corrupted.
> >      3) No silent corruption occurred; btrfs caught everything.
> > 
> > The hard drives are brand new, so they are my current suspects.
> 
> I wouldn't suspect the HDDs as the first culprit. Since there was no
> power loss, no FLUSH/FUA bugs are involved, and all the corruption hit
> data rather than metadata; if it were really the HDDs, I'd expect at
> least one or two metadata corruptions as well.

I should have mentioned this: there were a few metadata corruptions, but
the first online scrub fixed them (DUP), so I didn't get a chance to see
what their contents were. Here's the full log:

    [Oct 5 00:45] [ T307779] BTRFS: device fsid 3bd1727b-c8ae-4876-96b2-9318c1f9556f devid 1 transid 121 /dev/mapper/md0_crypt (253:1) scanned by mount (307779)
    [  +0.000875] [ T307779] BTRFS info (device dm-1): first mount of filesystem 3bd1727b-c8ae-4876-96b2-9318c1f9556f
    [  +0.000054] [ T307779] BTRFS info (device dm-1): using blake2b (blake2b-256-generic) checksum algorithm
    [  +3.510409] [ T307779] BTRFS info (device dm-1): enabling ssd optimizations
    [  +0.000052] [ T307779] BTRFS info (device dm-1): enabling free space tree
    [  +0.000015] [ T307779] BTRFS info (device dm-1): use zstd compression, level 1
    [  +9.793525] [ T307802] BTRFS info (device dm-1): scrub: started on devid 1
    [Oct 5 01:30] [ T307812] BTRFS error (device dm-1): scrub: fixed up error at logical 316151431168 on dev /dev/mapper/md0_crypt physical 310791110656
    [  +0.000070] [ T307812] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    [Oct 5 05:16] [ T308052] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 1850055917568 on dev /dev/mapper/md0_crypt physical 1858654240768
    [  +0.082724] [ T308052] BTRFS warning (device dm-1): scrub: checksum error at logical 1850055917568 on dev /dev/mapper/md0_crypt, physical 1858654240768 root 5 inode 387 offset 40929132544 length 4096 links 1 (path: REDACTED)
    [  +0.000077] [ T308052] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
    [Oct 5 06:51] [ T312207] BTRFS warning (device dm-1): scrub: tree block 2493640359936 mirror 2 has bad csum, has 0x4086e4014eeb997db83ae7255c333697ed4d740338405795861d6f3d0c7848af want 0xbdbcdc764674915f8899d9c164916bd7aad34693e78e9e9ba479b2339e12ef1c
    [  +0.077111] [ T312207] BTRFS error (device dm-1): scrub: fixed up error at logical 2493640343552 on dev /dev/mapper/md0_crypt physical 2510828601344
    [  +0.000065] [ T312207] BTRFS error (device dm-1): scrub: fixed up error at logical 2493640343552 on dev /dev/mapper/md0_crypt physical 2510828601344
    [  +0.000026] [ T312207] BTRFS error (device dm-1): scrub: fixed up error at logical 2493640343552 on dev /dev/mapper/md0_crypt physical 2510828601344
    [  +0.000016] [ T312207] BTRFS error (device dm-1): scrub: fixed up error at logical 2493640343552 on dev /dev/mapper/md0_crypt physical 2510828601344
    [Oct 5 07:47] [ T312650] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 2878389944320 on dev /dev/mapper/md0_crypt physical 2896685498368
    [  +0.079176] [ T312650] BTRFS warning (device dm-1): scrub: checksum error at logical 2878389944320 on dev /dev/mapper/md0_crypt, physical 2896685498368 root 5 inode 431 offset 2979594240 length 4096 links 1 (path: REDACTED)
    [  +0.000079] [ T312650] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
    [Oct 5 08:42] [ T312757] BTRFS error (device dm-1): scrub: fixed up error at logical 3255455711232 on dev /dev/mapper/md0_crypt physical 3273751265280
    [  +0.000066] [ T312757] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
    [Oct 5 12:31] [ T316045] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 4804342972416 on dev /dev/mapper/md0_crypt physical 4836597170176
    [  +0.106254] [ T316045] BTRFS warning (device dm-1): scrub: checksum error at logical 4804342972416 on dev /dev/mapper/md0_crypt, physical 4836597170176 root 5 inode 12231 offset 626524160 length 4096 links 1 (path: REDACTED)
    [  +0.000082] [ T316045] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
    [Oct 5 12:40] [ T316489] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 4864481624064 on dev /dev/mapper/md0_crypt physical 4896735821824
    [  +0.018418] [ T316489] BTRFS warning (device dm-1): scrub: checksum error at logical 4864481624064 on dev /dev/mapper/md0_crypt, physical 4896735821824 root 5 inode 12268 offset 635633664 length 4096 links 1 (path: REDACTED)
    [  +0.000067] [ T316489] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
    [Oct 5 15:18] [ T317014] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 5960080556032 on dev /dev/mapper/md0_crypt physical 5996629721088
    [  +0.088397] [ T317014] BTRFS warning (device dm-1): scrub: checksum error at logical 5960080556032 on dev /dev/mapper/md0_crypt, physical 5996629721088 root 5 inode 15079 offset 43859968 length 4096 links 1 (path: REDACTED)
    [  +0.000078] [ T317014] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
    [ +11.379069] [ T319380] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 5961402220544 on dev /dev/mapper/md0_crypt physical 5997951385600
    [  +0.000245] [ T319380] BTRFS warning (device dm-1): scrub: checksum error at logical 5961402220544 on dev /dev/mapper/md0_crypt, physical 5997951385600 root 5 inode 15079 offset 612937728 length 4096 links 1 (path: REDACTED)
    [  +0.000038] [ T319380] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
    [Oct 5 16:39] [ T320461] BTRFS error (device dm-1): scrub: fixed up error at logical 6508309970944 on dev /dev/mapper/md0_crypt physical 6551301586944
    [  +0.000066] [ T320461] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
    [Oct 5 16:44] [ T319972] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 6542744813568 on dev /dev/mapper/md0_crypt physical 6585736429568
    [  +0.072550] [ T319972] BTRFS warning (device dm-1): scrub: checksum error at logical 6542744813568 on dev /dev/mapper/md0_crypt, physical 6585736429568 root 5 inode 16518 offset 103481344 length 4096 links 1 (path: REDACTED)
    [  +0.000079] [ T319972] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
    [Oct 5 20:28] [ T320144] BTRFS error (device dm-1): scrub: unable to fixup (regular) error at logical 8062253727744 on dev /dev/mapper/md0_crypt physical 8112761536512
    [  +0.059861] [ T320144] BTRFS warning (device dm-1): scrub: checksum error at logical 8062253727744 on dev /dev/mapper/md0_crypt, physical 8112761536512 root 5 inode 349790 offset 16443854848 length 4096 links 1 (path: REDACTED)
    [  +0.000069] [ T320144] BTRFS error (device dm-1): bdev /dev/mapper/md0_crypt errs: wr 0, rd 0, flush 0, corrupt 11, gen 0
    [Oct 5 21:01] [ T307802] BTRFS info (device dm-1): scrub: finished on devid 1 with status: 0

    UUID:             3bd1727b-c8ae-4876-96b2-9318c1f9556f
    Scrub started:    Sun Oct  5 00:45:20 2025
    Status:           finished
    Duration:         20:15:56
    Total to scrub:   7.54TiB
    Rate:             108.38MiB/s
    Error summary:    verify=4 csum=11
      Corrected:      7
      Uncorrectable:  8
      Unverified:     0

That scrub missed about half the data errors, which makes sense to me:
the mdraid1 "randomly" reads a given block from one or the other
underlying drive.

But the metadata doesn't make sense to me: the one scrub appears to have
seen and fixed *all* the metadata errors, because my later offline
scrubs of each individual drive saw no metadata errors. Seems unlikely.

I'll make sure to only run offline scrubs next time, so I can inspect
the metadata corruptions too (and for the btrfs-raid1, so I can inspect
the corruptions at all before they're fixed up).

> > I've
> > used the same two-drive USB enclosure extensively with older HDDs and
> > never seen a problem. I'm running this FIO job to test them:
> > 
> >      [global]
> >      numjobs=1
> >      loops=20
> >      ioengine=io_uring
> >      rw=randrw
> >      percentage_random=5%
> >      rwmixwrite=95
> >      iodepth=32
> >      direct=1
> >      size=5%
> >      blocksize_range=1k-32m
> >      sync=none
> >      refill_buffers=1
> >      random_distribution=random
> >      random_generator=tausworthe64
> >      verify=xxhash
> >      verify_fatal=1
> >      verify_dump=1
> >      do_verify=1
> >      verify_async=$ncpus
> >      [hdd-sdb-test]
> >      filename=/dev/sdb
> >      [hdd-sdc-test]
> >      filename=/dev/sdc
> > 
> > ...but no luck hitting anything after about 18 hours.
> 
> I haven't had a good experience using fio to find corruption, so if you
> hit the problem by simply copying data (I guess through 'cp'?), then
> maybe stick to the working reproducer?
> 
> Although copying 10TiB onto the HDDs will take over 40 hours, that is
> still much longer than your fio workload ran.

Makes sense. I'll stick to making copies like the original workload,
and report back if I can trigger it again.

Thanks,
Calvin

