Mysterious disappearing corruption and how to diagnose

From: Andy Smith <andy@strugglers.net>
To: linux-btrfs@vger.kernel.org
Subject: Mysterious disappearing corruption and how to diagnose
Date: Fri, 29 Aug 2025 20:45:50 +0000
Message-ID: <aLIRfvDUohR/2mnv@mail.bitfolk.com>

Hi,

I have a btrfs filesystem with 7 devices. Needing a little more
capacity, I decided to replace two of the smaller devices with larger
ones. I ordered two identical 4TB SSDs and used a "btrfs replace …" for
the first and then a "btrfs device remove …" plus "btrfs device add …"
for the second to get them both in there.

After the second of the new SSDs was added, I started receiving log
messages about corruption on the most recently added device (sdh):

2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0
2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)

These messages were seen 19,207 times with sector numbers ranging from
2093128 to 556538024.
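
A count and sector range like the one above can be pulled out of the logs
with a short filter. The following is only a sketch: the function name
summarise_corrected is made up for illustration, it matches on the
"read error corrected" message text shown above, and the conversion to
byte offsets assumes the logged "sector" is in 512-byte units.

```shell
#!/bin/sh
# Summarise btrfs "read error corrected" kernel log lines read from stdin.
# Hypothetical helper, not part of btrfs-progs.
summarise_corrected() {
    # Keep only "<device> <sector>" from each matching line.
    sed -n 's/.*read error corrected:.*(dev \([^ ]*\) sector \([0-9][0-9]*\)).*/\1 \2/p' |
    awk '
        {
            s = $2 + 0                      # force numeric comparison
            if (n == 0 || s < min) min = s
            if (n == 0 || s > max) max = s
            n++
            per_dev[$1]++
        }
        END {
            if (n == 0) { print "no matching lines"; exit }
            printf "lines=%d min_sector=%.0f max_sector=%.0f\n", n, min, max
            # Assumption: sectors are 512 bytes, so offset = sector * 512.
            printf "approx_offset_gb=%.1f..%.1f\n", min * 512 / 1e9, max * 512 / 1e9
            for (d in per_dev) printf "%s=%d\n", d, per_dev[d]
        }'
}
```

It could be fed from wherever the kernel log is kept, for example
"summarise_corrected < /var/log/kern.log" (that path is an assumption
about the logging setup). Under the same 512-byte assumption, the range
2093128 to 556538024 would correspond to roughly the first 1.1 GB to
284.9 GB of the device.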

Upon seeing this I did a "btrfs device remove …" for sdh, shuffled
things about so I could attach an extra device, added back one of the
older SSDs and used "btrfs device add" to add that one back in. So at
this point the filesystem still has 7 devices, sdh is still in the
machine but not part of the filesystem and the filesystem just has
slightly less capacity than it could have.

I did a scrub of the filesystem. This came back clean, as expected (all
of the error logs said errors were corrected).

A "long" SMART self-test of sdh came back clean, which wasn't surprising
because at no point has there been an actual I/O error, only notices of
corruption.

I put an ext4 filesystem on sdh, mounted it and did a run of stress-ng:

$ sudo stress-ng --hdd 32 \
  --hdd-opts wr-seq,rd-rnd \
  --hdd-write-size 8k \
  --hdd-bytes 30g \
  --temp-path /mnt/stress --verify -t 6h

After more than an hour this hadn't detected a single problem so I
aborted it.

I put a btrfs filesystem on sdh and did stress-ng again. No issues
reported.
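
A simpler cross-check than stress-ng, and not something that was run
here, is to write random data, record its checksum as it is generated,
and compare that against what is read back. This is a minimal sketch; the
function name verify_roundtrip and its arguments are hypothetical.

```shell
#!/bin/sh
# Write random data into DIR, read it back and compare SHA-256 digests.
# Usage: verify_roundtrip DIR [SIZE_MIB]   (hypothetical helper)
verify_roundtrip() {
    dir=$1
    size_mib=${2:-64}
    file="$dir/roundtrip.$$"

    # Digest of the data as generated, before it reaches the device.
    want=$(head -c "$((size_mib * 1048576))" /dev/urandom |
           tee "$file" | sha256sum | cut -d' ' -f1)
    sync    # flush dirty pages to the device

    # Read back. Unless the page cache is dropped first, this may be
    # served from RAM rather than from the device.
    got=$(sha256sum < "$file" | cut -d' ' -f1)
    rm -f "$file"

    if [ "$want" = "$got" ]; then
        echo "OK $want"
    else
        echo "MISMATCH want=$want got=$got"
        return 1
    fi
}
```

For example "verify_roundtrip /mnt/stress 1024" would exercise 1 GiB on
the mounted test filesystem. For the read to hit the device, the page
cache would need to be dropped between the write and the read (as root,
"echo 3 > /proc/sys/vm/drop_caches"), or the test file made larger than
RAM.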

As mentioned, this was a pair of new SSDs and the other one is already
part of the filesystem and not giving me any cause for concern. They are
Crucial model CT4000BX500SSD1 (4TB SATA SSD).

It may be difficult to get a replacement or refund if I can't
reproduce broken behaviour.

The shuffling of devices that I had to do can only be temporary, so I
need to decide what I am going to do. The smaller device I had intended
to remove (but now had to add back in for capacity reasons) is 1.7T and
is currently /dev/sdg. I could "btrfs replace /dev/sdg /dev/sdh …" and,
assuming no errors are seen, do a scrub. But if errors were seen I'd want
to remove sdh again quickly, and replace wouldn't be an option then since
sdg is smaller than sdh. "btrfs device remove /dev/sdh …" takes a really
long time.

Maybe I should make a partition on sdh that is only 1.7T of the device
and replace that in, so I could still replace it out if errors are seen?
Though if it behaves I am then going to want to replace it out anyway in
order to replace the full device back in!
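
The partition idea could be sketched as below. Because the replace target
must not be smaller than the source, the sketch takes the exact size of
the smaller device in bytes rather than the rounded "1.7T" and rounds up
to a whole MiB. It only prints a parted command and does not run it. The
function name plan_target_partition, the partition name replace-target
and the 1MiB start offset are all illustrative choices.

```shell
#!/bin/sh
# Print (do not run) a parted command that would create a partition on
# TARGET_DEV at least SRC_BYTES in size, starting at 1MiB.
# Usage: plan_target_partition SRC_BYTES TARGET_DEV   (hypothetical helper)
plan_target_partition() {
    src_bytes=$1
    target_dev=$2
    mib=1048576

    # Round the source size up to a whole number of MiB.
    size_mib=$(( (src_bytes + mib - 1) / mib ))
    end_mib=$(( 1 + size_mib ))

    echo "parted -s $target_dev mklabel gpt mkpart replace-target 1MiB ${end_mib}MiB"
}
```

The size could come from "blockdev --getsize64 /dev/sdg", for example
plan_target_partition "$(blockdev --getsize64 /dev/sdg)" /dev/sdh. The
printed command would wipe the partition table on the target, so it
should be checked by eye before being run.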

Basically I'm totally confused as to how this device was misbehaving
but now apparently isn't. I had thought just maybe it could be the slot
on the backplane that had gone bad but it's still in that slot and I
can't reproduce the problem now.

Any ideas?

Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
packages).

Thanks,
Andy
