public inbox for linux-btrfs@vger.kernel.org
From: Andy Smith <andy@strugglers.net>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Mysterious disappearing corruption and how to diagnose
Date: Fri, 29 Aug 2025 22:17:09 +0000
Message-ID: <aLIm5djb6Ee4T1ot@mail.bitfolk.com>
In-Reply-To: <a48ac216-38f1-4a69-970e-f2ddee2ae8f2@suse.com>

Hi Qu,

Thanks for your reply.

On Sat, Aug 30, 2025 at 06:45:15AM +0930, Qu Wenruo wrote:
> On 2025/8/30 06:15, Andy Smith wrote:
> > After the second of the new SSDs was added in I started receiving logs
> > about corruption on the newest added device (sdh):
> 
> Is this during dev replace/add/remove?

I have been mistaken about the order of operations. The one that
introduced sdh was:

2025-08-25T00:13:22.804904+00:00 strangebrew sudo:     andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdb /dev/sdh /srv/tank

> Would it be possible to provide more context/full dmesg for the incident?

There are just thousands of the message I gave and nothing else, but for
context:

2025-08-25T00:16:13.551484+00:00 strangebrew kernel: [15845362.486304] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh started

2025-08-25T02:31:55.547470+00:00 strangebrew kernel: [15853504.586725] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh finished

So actually these errors appear not during the replace that introduced
sdh but later on. I guess that makes sense since sdh is not being read
from when it's empty!

Let me see what other operations were done after this…

2025-08-25T02:58:36.452870+00:00 strangebrew sudo:     andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdf /dev/sdb /srv/tank
2025-08-25T03:01:29.103489+00:00 strangebrew kernel: [15855278.164899] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb started
2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0
2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)
2025-08-25T11:34:15.115468+00:00 strangebrew kernel: [15886044.574930] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb finished

I have not omitted any "read error corrected" messages between these
times, so during the replace of sdf with sdb only two of the corrupted
reads in fact occurred.
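
As a rough illustration of how these two lines can be compared, the sketch
below pulls the logical offset and device sector out of each "read error
corrected" message and checks whether the gap between the two logical
offsets matches the gap between the two sectors. The function name and the
regular expression are hypothetical, and the 512-byte sector unit is an
assumption rather than something stated in the logs.

```python
import re

# Pattern for the "read error corrected" lines as they appear in the log above.
CORRECTED = re.compile(
    r"read error corrected: ino (?P<ino>\d+) off (?P<off>\d+) "
    r"\(dev (?P<dev>\S+) sector (?P<sector>\d+)\)"
)

SECTOR_SIZE = 512  # assumption: the logged sector number is in 512-byte units


def parse_corrected(line):
    """Return ino/off/dev/sector from one log line, or None if it does not match."""
    m = CORRECTED.search(line)
    if m is None:
        return None
    return {
        "ino": int(m["ino"]),
        "off": int(m["off"]),
        "dev": m["dev"],
        "sector": int(m["sector"]),
    }


# The two messages logged during the replace of sdf with sdb.
lines = [
    "BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)",
    "BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)",
]
first, second = (parse_corrected(line) for line in lines)

# Distance between the two errors in logical address space and on the device.
logical_delta = second["off"] - first["off"]
physical_delta = (second["sector"] - first["sector"]) * SECTOR_SIZE
print(logical_delta, physical_delta)
```

For these two lines both deltas come out as 7376896 bytes, which is
consistent with the 512-byte assumption.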

The vast majority of the "read error corrected" messages happen later
between:

2025-08-26T01:15:18.179736+00:00 strangebrew kernel: [15935308.276936] BTRFS info (device sdh): read error corrected: ino 0 off 18526369787904 (dev /dev/sdh sector 238555224)

and

2025-08-27T04:37:04.973808+00:00 strangebrew kernel: [ 6683.406728] BTRFS info (device sdb): read error corrected: ino 450 off 981975040 (dev /dev/sdh sector 56445920)

(last one ever)

There was also a reboot in between those times.

After seeing the read errors I had made a drive bay available and
re-attached one of the old devices:

2025-08-26T20:52:58.644481+00:00 strangebrew sudo:     andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev add /dev/sdf /srv/tank

I then removed sdh:

2025-08-26T21:03:32.781261+00:00 strangebrew sudo:     andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev remove /dev/sdh /srv/tank

2025-08-27T04:50:16.085385+00:00 strangebrew kernel: [ 7474.522398] BTRFS info (device sdb): device deleted: /dev/sdh

Except for a scrub, no further management of the btrfs filesystem at
/srv/tank was done after this. After this there are only my
investigations into what is going on with sdh.

So thousands of the corrected read errors happened after
2025-08-26T01:15:18.179736+00:00, but no btrfs management operation was
running until 2025-08-26T20:52:58.644481+00:00. Therefore a large number
of them happened during normal operation of the filesystem, long after
both sdh and sdb had been replaced in.
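
The timeline reasoning above can be sketched as a small check of error
timestamps against the management windows. This is only an illustration:
the window list and the `classify` helper are hypothetical, the timestamps
are copied from the log lines quoted in this message, and the start of the
"dev remove" window is assumed to be the time the sudo command was logged.

```python
from datetime import datetime

# (name, start, end) for each management operation, from the timestamps quoted above.
WINDOWS = [
    ("replace sdb -> sdh", "2025-08-25T00:16:13.551484+00:00", "2025-08-25T02:31:55.547470+00:00"),
    ("replace sdf -> sdb", "2025-08-25T03:01:29.103489+00:00", "2025-08-25T11:34:15.115468+00:00"),
    # Assumption: the removal ran from the sudo log entry until "device deleted".
    ("dev remove sdh", "2025-08-26T21:03:32.781261+00:00", "2025-08-27T04:50:16.085385+00:00"),
]


def classify(timestamp):
    """Name the management window containing timestamp, or None if there is none."""
    when = datetime.fromisoformat(timestamp)
    for name, start, end in WINDOWS:
        if datetime.fromisoformat(start) <= when <= datetime.fromisoformat(end):
            return name
    return None


# Timestamps of the "read error corrected" messages quoted in this message.
ERRORS = [
    "2025-08-25T04:52:36.719578+00:00",
    "2025-08-25T05:44:42.139493+00:00",
    "2025-08-26T01:15:18.179736+00:00",
    "2025-08-27T04:37:04.973808+00:00",
]

for ts in ERRORS:
    print(ts, classify(ts) or "normal operation")
```

Run over the full log rather than these four samples, the same check would
separate the errors seen during a replace or remove from those seen while
nothing was being managed.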

These look slightly different in that the number after "ino" isn't 0:

2025-08-26T01:15:18.823475+00:00 strangebrew kernel: [15935308.917636] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 524288 csum 0x8941f998 expected csum 0xec2689f0 mirror 1
2025-08-26T01:15:18.823486+00:00 strangebrew kernel: [15935308.917652] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 655360 csum 0x8941f998 expected csum 0xf3ada24a mirror 1
2025-08-26T01:15:18.823487+00:00 strangebrew kernel: [15935308.918200] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
2025-08-26T01:15:18.823488+00:00 strangebrew kernel: [15935308.918928] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
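
To summarise thousands of lines like these, something along the following
lines could collect the affected (root, ino) pairs and the last reported
error counters. It is a sketch only: the regular expressions are written
against the exact wording of the four lines above, and the variable names
are hypothetical.

```python
import re

# "csum failed" warnings: which root/inode/offset failed, on which mirror.
CSUM = re.compile(
    r"csum failed root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>0x[0-9a-f]+) expected csum (?P<expected>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)
# Per-device error counters printed after each failure.
ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

log = [
    "BTRFS warning (device sdh): csum failed root 534 ino 17578 off 524288 csum 0x8941f998 expected csum 0xec2689f0 mirror 1",
    "BTRFS warning (device sdh): csum failed root 534 ino 17578 off 655360 csum 0x8941f998 expected csum 0xf3ada24a mirror 1",
    "BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 1, gen 0",
    "BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
]

affected = set()  # (root, ino) pairs that had at least one csum failure
counters = {}     # device name -> most recently logged counters

for line in log:
    m = CSUM.search(line)
    if m:
        affected.add((int(m["root"]), int(m["ino"])))
        continue
    m = ERRS.search(line)
    if m:
        counters[m["dev"]] = {
            key: int(m[key]) for key in ("wr", "rd", "flush", "corrupt", "gen")
        }

print(sorted(affected))
print(counters)
```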

> And what's the raid profile?

It is RAID1 profile for data, metadata and system.

Sorry, I realised just after sending that I did not give that
information.

> > Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
> > packages).
> 
> And the kernel version is not that ideal, it's still supported and receiving
> backports, but I'm not really sure how many big refactor/rework/fixes are
> missing on that LTS kernel.

Okay. Yes, it is still supported by Debian, so they are still publishing
updates for the related LTS kernel, but I am relying here on fixes going
into the LTS kernel in the first place.

Thanks,
Andy
