* Mysterious disappearing corruption and how to diagnose
@ 2025-08-29 20:45 Andy Smith
  2025-08-29 21:15 ` Qu Wenruo
  2025-08-30 22:41 ` Chris Murphy
  0 siblings, 2 replies; 9+ messages in thread

From: Andy Smith @ 2025-08-29 20:45 UTC (permalink / raw)
To: linux-btrfs

Hi,

I have a btrfs filesystem with 7 devices. Needing a little more
capacity, I decided to replace two of the smaller devices with larger
ones. I ordered two identical 4TB SSDs and used a "btrfs replace …" for
the first, and then a "btrfs device remove …" plus "btrfs device add …"
for the second, to get them both in there.

After the second of the new SSDs was added, I started receiving log
messages about corruption on the newly added device (sdh):

2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0
2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)

These messages were seen 19,207 times, with sector numbers ranging from
2093128 to 556538024.

Upon seeing this I removed sdh with "btrfs device remove …", shuffled
things about so I could attach an extra drive, and used "btrfs device
add …" to add one of the older SSDs back in. So at this point the
filesystem still has 7 devices, sdh is still in the machine but not
part of the filesystem, and the filesystem just has slightly less
capacity than it could have.

I did a scrub of the filesystem. This came back clean, as expected (all
of the error logs said the errors were corrected).
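(For reference, counting the messages and finding the sector range can
be done with something along these lines; the regex and the two
embedded sample lines below are illustrative only, not the exact
command used:)

```python
import re

# Illustrative sketch: count btrfs "read error corrected" messages and
# report the range of affected sectors. Two of the real log lines are
# embedded here as sample input; in practice you would read the log file.
log = """\
kernel: BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
kernel: BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)
"""

sectors = [int(s) for s in
           re.findall(r"read error corrected: .*?sector (\d+)", log)]
print(len(sectors), min(sectors), max(sectors))  # → 2 238168896 238183304
```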
A "long" SMART self-test of sdh came back clean, which wasn't
surprising because at no point has there been an actual I/O error, only
reports of corruption.

I put an ext4 filesystem on sdh, mounted it and did a run of stress-ng:

$ sudo stress-ng --hdd 32 \
    --hdd-opts wr-seq,rd-rnd \
    --hdd-write-size 8k \
    --hdd-bytes 30g \
    --temp-path /mnt/stress --verify -t 6h

After more than an hour this hadn't detected a single problem, so I
aborted it. I then put a btrfs filesystem on sdh and ran stress-ng
again. No issues reported.

As mentioned, this was a pair of new SSDs, and the other one is already
part of the filesystem and not giving me any cause for concern. They
are Crucial model CT4000BX500SSD1 (4TB SATA SSD). It may be difficult
to get a replacement or refund if I can't reproduce the broken
behaviour.

The shuffling of devices that I had to do can only be temporary, so I
need to decide what I am going to do. The smaller device I had intended
to remove (but have now had to add back in for capacity reasons) is
1.7T and is currently /dev/sdg. I could "btrfs replace /dev/sdg
/dev/sdh …" and, assuming no errors were seen, do a scrub; but if
errors were seen I'd want to remove sdh again quickly. replace then
wouldn't be an option, since sdg is smaller than sdh, and "btrfs device
remove sdh …" takes a really long time.

Maybe I should make a partition on sdh covering only 1.7T of the device
and replace that in, so that I could still replace it out if errors
were seen? Though if it behaves, I am then going to want to replace it
out anyway in order to replace the full device back in!

Basically I'm totally confused as to how this device was misbehaving
but now apparently isn't. I had thought it could just maybe be the slot
on the backplane that had gone bad, but it's still in that slot and I
can't reproduce the problem now.

Any ideas?

Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
packages).

Thanks,
Andy

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 20:45 Mysterious disappearing corruption and how to diagnose Andy Smith
@ 2025-08-29 21:15 ` Qu Wenruo
  2025-08-29 22:17   ` Andy Smith
  2025-08-30 22:41 ` Chris Murphy
  1 sibling, 1 reply; 9+ messages in thread

From: Qu Wenruo @ 2025-08-29 21:15 UTC (permalink / raw)
To: Andy Smith, linux-btrfs

On 2025/8/30 06:15, Andy Smith wrote:
> Hi,
>
> I have a btrfs filesystem with 7 devices. Needing a little more
> capacity, I decided to replace two of the smaller devices with larger
> ones. I ordered two identical 4TB SSDs and used a "btrfs replace …" for
> the first and then a "btrfs device remove …" plus "btrfs device add …"
> for the second to get them both in there.
>
> After the second of the new SSDs was added in I started receiving logs
> about corruption on the newest added device (sdh):

Is this during the dev replace/add/remove?

Would it be possible to provide more context / the full dmesg for the
incident?

And what's the raid profile?

> 2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
> 2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
> 2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0

This mostly means the metadata was simply not there (the blocks read
back as all zeroes), so btrfs went to the other good copy and wrote the
good data back.

> 2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)
>
> These messages were seen 19,207 times with sector numbers ranging from
> 2093128 to 556538024.
>
> Upon seeing this I did a "btrfs device remove …" for sdh, shuffled
> things about so I could attach an extra device, added back one of the
> older SSDs and used "btrfs device add" to add that one back in. So at
> this point the filesystem still has 7 devices, sdh is still in the
> machine but not part of the filesystem and the filesystem just has
> slightly less capacity than it could have.
>
> I did a scrub of the filesystem. This came back clean, as expected (all
> of the error logs said errors were corrected).
>
> A "long" SMART self-test of sdh came back clean, which wasn't surprising
> because at no point has there been an actual I/O error, only notices of
> corruption.
>
> I put an ext4 filesystem on sdh, mounted it and did a run of stress-ng:
>
> $ sudo stress-ng --hdd 32 \
>     --hdd-opts wr-seq,rd-rnd \
>     --hdd-write-size 8k \
>     --hdd-bytes 30g \
>     --temp-path /mnt/stress --verify -t 6h
>
> After more than an hour this hadn't detected a single problem so I
> aborted it.
>
> I put a btrfs filesystem on sdh and did stress-ng again. No issues
> reported.
>
> As mentioned, this was a pair of new SSDs and the other one is already
> part of the filesystem and not giving me any cause for concern. They are
> Crucial model CT4000BX500SSD1 (4TB SATA SSD).
>
> It may be difficult to get a replacement or refund if I can't
> reproduce broken behaviour.

So I believe this is related to the handling of a certain raid profile,
and I hope it's not RAID56, as that is known to have write-hole
problems.

>
> The shuffling of devices that I had to do can only be temporary, so I
> need to decide what I am going to do. The smaller device I had intended
> to remove (but now had to add back in for capacity reasons) is 1.7T and
> is currently /dev/sdg. I could "btrfs replace /dev/sdg /dev/sdh …" and
> assuming no errors seen do a scrub, but if errors were seen I'd want to
> remove sdh again quickly. replace then wouldn't be an option since sdg
> is smaller than sdh.
> "btrfs remove sdh …" takes a really long time.
>
> Maybe I should make a partition on sdh that is only 1.7T of the device
> and replace that in, so I could still replace it out if errors are seen?
> Though if it behaves I am then going to want to replace it out anyway in
> order to replace the full device back in!
>
> Basically I'm totally confused as to how this device was misbehaving
> but now apparently isn't. I had thought just maybe it could be the slot
> on the backplane that had gone bad but it's still in that slot and I
> can't reproduce the problem now.
>
> Any ideas?
>
> Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
> packages).

And the kernel version is not ideal. It's still supported and receiving
backports, but I'm not really sure how many of the big
refactors/reworks/fixes are missing from that LTS kernel.

Thanks,
Qu

>
> Thanks,
> Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 21:15 ` Qu Wenruo
@ 2025-08-29 22:17   ` Andy Smith
  2025-08-29 22:58     ` Qu Wenruo
  0 siblings, 1 reply; 9+ messages in thread

From: Andy Smith @ 2025-08-29 22:17 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

Thanks for your reply.

On Sat, Aug 30, 2025 at 06:45:15AM +0930, Qu Wenruo wrote:
> On 2025/8/30 06:15, Andy Smith wrote:
> > After the second of the new SSDs was added in I started receiving logs
> > about corruption on the newest added device (sdh):
>
> Is this during dev replace/add/remove?

I was mistaken about the order of operations. The one that introduced
sdh was:

2025-08-25T00:13:22.804904+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdb /dev/sdh /srv/tank

> Would it be possible to provide more context/full dmesg for the incident?

There are just thousands of the messages I gave and nothing else, but
for context:

2025-08-25T00:16:13.551484+00:00 strangebrew kernel: [15845362.486304] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh started

2025-08-25T02:31:55.547470+00:00 strangebrew kernel: [15853504.586725] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh finished

So these errors actually appeared not during the replace that
introduced sdh, but later on. I guess that makes sense, since sdh is
not being read from while it's empty!
Let me see what other operations were done after this…

2025-08-25T02:58:36.452870+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdf /dev/sdb /srv/tank
2025-08-25T03:01:29.103489+00:00 strangebrew kernel: [15855278.164899] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb started
2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0
2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)
2025-08-25T11:34:15.115468+00:00 strangebrew kernel: [15886044.574930] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb finished

I have not omitted any "read error corrected" messages between these
times, so in fact only two of the corrupted reads occurred during the
replace of sdf with sdb.

The vast majority of the "read error corrected" messages happened
later, between:

2025-08-26T01:15:18.179736+00:00 strangebrew kernel: [15935308.276936] BTRFS info (device sdh): read error corrected: ino 0 off 18526369787904 (dev /dev/sdh sector 238555224)

and

2025-08-27T04:37:04.973808+00:00 strangebrew kernel: [ 6683.406728] BTRFS info (device sdb): read error corrected: ino 450 off 981975040 (dev /dev/sdh sector 56445920)

(the last one ever). There was also a reboot in between those times.
After seeing the read errors I had made a drive bay available and
re-attached one of the old devices:

2025-08-26T20:52:58.644481+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev add /dev/sdf /srv/tank

I then removed sdh:

2025-08-26T21:03:32.781261+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev remove /dev/sdh /srv/tank
2025-08-27T04:50:16.085385+00:00 strangebrew kernel: [ 7474.522398] BTRFS info (device sdb): device deleted: /dev/sdh

Except for a scrub, no further management of the btrfs filesystem at
/srv/tank was done after this; everything later is only my
investigation of what is going on with sdh.

So thousands of the corrected read errors happened after
2025-08-26T01:15:18.179736+00:00, but no btrfs management operation was
happening until 2025-08-26T20:52:58.644481+00:00, so a large number of
them happened during normal operation of the filesystem, long after the
replacement in of both sdh and sdb.

These look slightly different, in that the number after "ino" isn't 0:

2025-08-26T01:15:18.823475+00:00 strangebrew kernel: [15935308.917636] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 524288 csum 0x8941f998 expected csum 0xec2689f0 mirror 1
2025-08-26T01:15:18.823486+00:00 strangebrew kernel: [15935308.917652] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 655360 csum 0x8941f998 expected csum 0xf3ada24a mirror 1
2025-08-26T01:15:18.823487+00:00 strangebrew kernel: [15935308.918200] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
2025-08-26T01:15:18.823488+00:00 strangebrew kernel: [15935308.918928] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

> And what's the raid profile?

It is the RAID1 profile for data, metadata and system. Sorry, I
realised just after sending that I did not give that information.
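(Assuming the "sector" numbers in these messages are in 512-byte units,
which I believe is the convention for these kernel messages but have
not verified, the overall affected range maps to device offsets roughly
as follows; a quick sketch:)

```python
SECTOR_BYTES = 512  # assumed unit of the "sector" field in the messages

def sector_to_gib(sector: int) -> float:
    """Convert a reported sector number to a device byte offset in GiB."""
    return sector * SECTOR_BYTES / 2**30

# min and max sector numbers seen across the 19,207 messages
print(round(sector_to_gib(2093128), 1))    # lowest affected offset, ~1.0 GiB
print(round(sector_to_gib(556538024), 1))  # highest affected offset, ~265.4 GiB
```

So under that assumption, the errors are spread across roughly the
first 265 GiB of the 4TB device rather than being clustered in one
spot.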
> > Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
> > packages).
>
> And the kernel version is not that ideal, it's still supported and
> receiving backports, but I'm not really sure how many big
> refactor/rework/fixes are missing on that LTS kernel.

Okay. Yes, it is still supported by Debian, so they are still
publishing updates for the related LTS kernel, but I am relying here on
fixes going into the LTS kernel in the first place.

Thanks,
Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 22:17   ` Andy Smith
@ 2025-08-29 22:58     ` Qu Wenruo
  2025-08-29 23:48       ` Andy Smith
  0 siblings, 1 reply; 9+ messages in thread

From: Qu Wenruo @ 2025-08-29 22:58 UTC (permalink / raw)
To: Andy Smith, Qu Wenruo; +Cc: linux-btrfs

On 2025/8/30 07:47, Andy Smith wrote:
> Hi Qu,
>
> Thanks for your reply.
>
> On Sat, Aug 30, 2025 at 06:45:15AM +0930, Qu Wenruo wrote:
>> On 2025/8/30 06:15, Andy Smith wrote:
>>> After the second of the new SSDs was added in I started receiving logs
>>> about corruption on the newest added device (sdh):
>>
>> Is this during dev replace/add/remove?
>
> I have been mistaken about the order of operations. The one that
> introduced sdh was:
>
> 2025-08-25T00:13:22.804904+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdb /dev/sdh /srv/tank
>
>> Would it be possible to provide more context/full dmesg for the incident?
>
> There's just thousands of the message I gave and nothing else, but for
> context:
>
> 2025-08-25T00:16:13.551484+00:00 strangebrew kernel: [15845362.486304] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh started
>
> 2025-08-25T02:31:55.547470+00:00 strangebrew kernel: [15853504.586725] BTRFS info (device sdh): dev_replace from /dev/sdb (devid 9) to /dev/sdh finished
>
> So actually these errors appear not during the replace that introduced
> sdh but later on. I guess that makes sense since sdh is not being read
> from when it's empty!
>
> Let me see what other operations were done after this…
>
> 2025-08-25T02:58:36.452870+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs replace start /dev/sdf /dev/sdb /srv/tank
> 2025-08-25T03:01:29.103489+00:00 strangebrew kernel: [15855278.164899] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb started
> 2025-08-25T04:52:36.719565+00:00 strangebrew kernel: [15861945.864876] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526171987968 have 0
> 2025-08-25T04:52:36.719578+00:00 strangebrew kernel: [15861945.867728] BTRFS info (device sdh): read error corrected: ino 0 off 18526171987968 (dev /dev/sdh sector 238168896)
> 2025-08-25T05:44:42.139479+00:00 strangebrew kernel: [15865071.325433] BTRFS error (device sdh): bad tree block start, mirror 1 want 18526179364864 have 0
> 2025-08-25T05:44:42.139493+00:00 strangebrew kernel: [15865071.328345] BTRFS info (device sdh): read error corrected: ino 0 off 18526179364864 (dev /dev/sdh sector 238183304)
> 2025-08-25T11:34:15.115468+00:00 strangebrew kernel: [15886044.574930] BTRFS info (device sdh): dev_replace from /dev/sdf (devid 12) to /dev/sdb finished
>
> I have not omitted any "read error corrected" messages between these
> times so during replace of sdf with sdb in fact only two of the
> corrupted reads occurred.
>
> The vast majority of the "read error corrected" messages happen later
> between:
>
> 2025-08-26T01:15:18.179736+00:00 strangebrew kernel: [15935308.276936] BTRFS info (device sdh): read error corrected: ino 0 off 18526369787904 (dev /dev/sdh sector 238555224)
>
> and
>
> 2025-08-27T04:37:04.973808+00:00 strangebrew kernel: [ 6683.406728] BTRFS info (device sdb): read error corrected: ino 450 off 981975040 (dev /dev/sdh sector 56445920)
>
> (last one ever)
>
> There was also a reboot in between those times.
>
> After seeing the read errors I had made a drive bay available and
> re-attached one of the old devices:
>
> 2025-08-26T20:52:58.644481+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev add /dev/sdf /srv/tank
>
> I then removed sdh:
>
> 2025-08-26T21:03:32.781261+00:00 strangebrew sudo: andy : TTY=pts/9 ; PWD=/home/andy ; USER=root ; COMMAND=/usr/bin/btrfs dev remove /dev/sdh /srv/tank
>
> 2025-08-27T04:50:16.085385+00:00 strangebrew kernel: [ 7474.522398] BTRFS info (device sdb): device deleted: /dev/sdh
>
> Except for a scrub no further management is done of the btrfs filesystem
> at /srv/tank after this. After this is only my investigations of what is
> going on with sdh.
>
> So thousands of the corrected read errors happen after
> 2025-08-26T01:15:18.179736+00:00 but there was no btrfs management
> operation happening until 2025-08-26T20:52:58.644481+00:00 therefore a
> large number of them happened during normal operation of the filesystem,
> long after the replacement in of both sdh and sdb.
>
> These look slightly different in that the number after "ino" isn't 0:
>
> 2025-08-26T01:15:18.823475+00:00 strangebrew kernel: [15935308.917636] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 524288 csum 0x8941f998 expected csum 0xec2689f0 mirror 1
> 2025-08-26T01:15:18.823486+00:00 strangebrew kernel: [15935308.917652] BTRFS warning (device sdh): csum failed root 534 ino 17578 off 655360 csum 0x8941f998 expected csum 0xf3ada24a mirror 1
> 2025-08-26T01:15:18.823487+00:00 strangebrew kernel: [15935308.918200] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> 2025-08-26T01:15:18.823488+00:00 strangebrew kernel: [15935308.918928] BTRFS error (device sdh): bdev /dev/sdh errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>
>> And what's the raid profile?
>
> It is RAID1 profile for data, metadata and system.

So that means no RAID56 write hole involved, and any error really is an
error.
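As an aside, the repeated csum 0x8941f998 in the quoted messages can be
compared against the checksum of an all-zero block. Below is a
reference bit-by-bit CRC-32C sketch for doing that; note this is an
assumption-laden comparison, since btrfs's internal seeding and the
byte order used when printing the value in dmesg may differ from this
textbook form:

```python
# Reference bit-at-a-time CRC-32C (Castagnoli polynomial, reflected),
# validated against the standard check value. Caveat: btrfs's internal
# seed/finalization and the byte order of the csum printed in dmesg may
# differ, so treat any comparison with 0x8941f998 as approximate.
def crc32c(data: bytes, crc: int = 0xFFFFFFFF) -> int:
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

assert crc32c(b"123456789") == 0xE3069283  # well-known CRC-32C check value

# What an all-zero 4 KiB block checksums to under these conventions:
print(hex(crc32c(b"\x00" * 4096)))
```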
And it's really RAID1 saving the day here; otherwise you would have
lost both metadata and data.

My concern is that something may have gone wrong in device replace,
such that the duplicated writes (which go to both the old and the new
device) only reached the older device. That would explain the metadata
being all zeroes.

Furthermore, the data csum 0x8941f998 is special: it is the CRC32C of a
4K block of all zeroes.

So this means either btrfs did something wrong while replacing the
disk, so that the old data was not properly written to sdh, or the disk
has some quirk that caused it to miss certain writes.

>
> Sorry, I realised just after sending that I did not give that
> information.
>
>>> Debian 12, kernel 6.1.0-38-amd64, btrfs-progs v6.2 (all from Debian
>>> packages).
>>
>> And the kernel version is not that ideal, it's still supported and receiving
>> backports, but I'm not really sure how many big refactor/rework/fixes are
>> missing on that LTS kernel.
>
> Okay. Yes it is still supported by Debian so they are still publishing
> updates for the related LTS kernel but I am relying here on fixes going
> in to LTS kernel in the first place.

In v6.4 we reworked the scrub code (and of course introduced some
regressions), but overall it should have made the error reporting more
consistent. I don't remember the old behavior exactly, but the newer
code will still report recoverable errors.

I know you have run a scrub and it should have fixed all the missing
writes, but would you mind booting some live USB or a newer LTS kernel
(6.12 recommended) and re-running the scrub to see if any errors are
reported?

It looks like only a minority of writes went missing rather than all
writes being lost, but it is still a little worrying.

Thanks,
Qu

>
> Thanks,
> Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 22:58     ` Qu Wenruo
@ 2025-08-29 23:48       ` Andy Smith
  2025-08-30  0:03         ` Qu Wenruo
  2025-08-30  8:20         ` Martin Steigerwald
  0 siblings, 2 replies; 9+ messages in thread

From: Andy Smith @ 2025-08-29 23:48 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

Hi Qu,

On Sat, Aug 30, 2025 at 08:28:25AM +0930, Qu Wenruo wrote:
> On 2025/8/30 07:47, Andy Smith wrote:
> > Okay. Yes it is still supported by Debian so they are still publishing
> > updates for the related LTS kernel but I am relying here on fixes going
> > in to LTS kernel in the first place.
>
> In v6.4 we reworked the scrub code (and of course introduced some
> regression), but overall it should make the error reporting more
> consistent.
>
> I didn't remember the old behavior, but the newer behavior will still
> report errors on recoverable errors.
>
> I know you have ran scrub and it should have fixed all the missing
> writes, but mind to use some liveUSB or newer LTS kernel (6.12
> recommended) and re-run the scrub to see if any error reported?

I am doing a bit of travelling over the next few days, and I would not
like to change kernels on this non-server-grade system with no
out-of-band management while I am not close by. So I will leave things
as they are, with sdh outside of the filesystem, for now.

When I return I will upgrade the kernel, scrub, and if that is clean
put sdh back into the filesystem and then scrub again. The Debian
bookworm-backports repository has a linux-image-amd64 package at
version 6.12.38-1~bpo12+1.

When I go to put sdh back into the filesystem, I can do so with a
"replace" because sdh > sdg. Unless you think it would be better in
some way to do a "remove" and then an "add" this time?

Thanks,
Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 23:48       ` Andy Smith
@ 2025-08-30  0:03         ` Qu Wenruo
  2025-10-13 15:58           ` Andy Smith
  2025-08-30  8:20         ` Martin Steigerwald
  1 sibling, 1 reply; 9+ messages in thread

From: Qu Wenruo @ 2025-08-30 0:03 UTC (permalink / raw)
To: Andy Smith; +Cc: Qu Wenruo, linux-btrfs

On 2025/8/30 09:18, Andy Smith wrote:
> Hi Qu,
>
> On Sat, Aug 30, 2025 at 08:28:25AM +0930, Qu Wenruo wrote:
>> On 2025/8/30 07:47, Andy Smith wrote:
>>> Okay. Yes it is still supported by Debian so they are still publishing
>>> updates for the related LTS kernel but I am relying here on fixes going
>>> in to LTS kernel in the first place.
>>
>> In v6.4 we reworked the scrub code (and of course introduced some
>> regression), but overall it should make the error reporting more consistent.
>>
>> I didn't remember the old behavior, but the newer behavior will still report
>> errors on recoverable errors.
>>
>> I know you have ran scrub and it should have fixed all the missing writes,
>> but mind to use some liveUSB or newer LTS kernel (6.12 recommended) and
>> re-run the scrub to see if any error reported?
>
> I do a bit of travelling the next few days and I will not like to change
> kernels on this non-server-grade system with no out-of-band management
> while I am not close by. So, I will leave things with sdh outside of the
> filesystem for now.

Have a good trip.

>
> When I return I will upgrade the kernel, scrub and if clean put sdh back
> into the filesystem then scrub again. The Debian bookworm-backports
> repository has a linux-image-amd64 package at version 6.12.38-1~bpo12+1.
>
> When I go to put sdh back in to the filesystem, I can do so with a
> "replace" because sdh > sdg. Unless you think it would be better in some
> way to do a "remove" and then an "add" this time?

Replace is far more efficient and faster than remove-then-add.
The latter will relocate all chunks of the source device to the other
devices (which may not even have enough space), then add the new device
empty, so you would need to rebalance all those chunks again to get
them onto the new device.

So a plain dev-replace will be best.

Thanks,
Qu

>
> Thanks,
> Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-30  0:03         ` Qu Wenruo
@ 2025-10-13 15:58           ` Andy Smith
  0 siblings, 0 replies; 9+ messages in thread

From: Andy Smith @ 2025-10-13 15:58 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

On Sat, Aug 30, 2025 at 09:33:49AM +0930, Qu Wenruo wrote:
> On 2025/8/30 09:18, Andy Smith wrote:
> > Hi Qu,
> >
> > On Sat, Aug 30, 2025 at 08:28:25AM +0930, Qu Wenruo wrote:
> > > On 2025/8/30 07:47, Andy Smith wrote:
> > > > Okay. Yes it is still supported by Debian so they are still publishing
> > > > updates for the related LTS kernel but I am relying here on fixes going
> > > > in to LTS kernel in the first place.
> > >
> > > In v6.4 we reworked the scrub code (and of course introduced some
> > > regression), but overall it should make the error reporting more
> > > consistent.
> > >
> > > I didn't remember the old behavior, but the newer behavior will still
> > > report errors on recoverable errors.
> > >
> > > I know you have ran scrub and it should have fixed all the missing
> > > writes, but mind to use some liveUSB or newer LTS kernel (6.12
> > > recommended) and re-run the scrub to see if any error reported?
> >
> > I do a bit of travelling the next few days and I will not like to change
> > kernels on this non-server-grade system with no out-of-band management
> > while I am not close by. So, I will leave things with sdh outside of the
> > filesystem for now.
>
> Have a good trip.
>
> >
> > When I return I will upgrade the kernel, scrub and if clean put sdh back
> > into the filesystem then scrub again. The Debian bookworm-backports
> > repository has a linux-image-amd64 package at version 6.12.38-1~bpo12+1.
> >
> > When I go to put sdh back in to the filesystem, I can do so with a
> > "replace" because sdh > sdg. Unless you think it would be better in some
> > way to do a "remove" and then an "add" this time?
>
> Replace is way more efficient/faster than remove then add.
> The latter will relocate all chunks of the source device to other devices
> (which may not even have enough space), and add the new device empty.
> Thus you will need to rebalance all those chunks again to reach the new
> device.
>
> So just plain dev-replace will be the best.

Just to close this off, in a slightly unsatisfactory way: in the last
few days I found time to upgrade the machine to Debian 13, so that is
kernel 6.12.48+deb13 and btrfs-progs v6.14.

I did a scrub in the state things were in when we last corresponded
(suspect SSD not in the filesystem) and that came back all clean. I
then did a replace to get the temporary drive out and the suspect SSD
back in, which went without incident. Then I did a scrub again, and
again all was fine.

So I now can't see any problem, and I can't reproduce what was
happening before.

Thanks,
Andy
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 23:48       ` Andy Smith
  2025-08-30  0:03         ` Qu Wenruo
@ 2025-08-30  8:20         ` Martin Steigerwald
  1 sibling, 0 replies; 9+ messages in thread

From: Martin Steigerwald @ 2025-08-30 8:20 UTC (permalink / raw)
To: Qu Wenruo, Andy Smith; +Cc: Qu Wenruo, linux-btrfs

Hi.

Andy Smith - 30.08.25, 01:48:45 CEST:
> > I know you have ran scrub and it should have fixed all the missing
> > writes, but mind to use some liveUSB or newer LTS kernel (6.12
> > recommended) and re-run the scrub to see if any error reported?
>
> I do a bit of travelling the next few days and I will not like to change
> kernels on this non-server-grade system with no out-of-band management
> while I am not close by. So, I will leave things with sdh outside of
> the filesystem for now.
>
> When I return I will upgrade the kernel, scrub and if clean put sdh back
> into the filesystem then scrub again. The Debian bookworm-backports
> repository has a linux-image-amd64 package at version
> 6.12.38-1~bpo12+1.

Backports also has btrfs-progs 6.14-1~bpo12+1, which might be helpful
to have over the default btrfs-progs in Debian 12, which is
6.2-1+deb12u1.

Of course there is also the option to upgrade to Debian 13, which was
released recently.

Best,
-- 
Martin
* Re: Mysterious disappearing corruption and how to diagnose
  2025-08-29 20:45 Mysterious disappearing corruption and how to diagnose Andy Smith
  2025-08-29 21:15 ` Qu Wenruo
@ 2025-08-30 22:41 ` Chris Murphy
  1 sibling, 0 replies; 9+ messages in thread

From: Chris Murphy @ 2025-08-30 22:41 UTC (permalink / raw)
To: Andy Smith, Btrfs BTRFS

On Fri, Aug 29, 2025, at 4:45 PM, Andy Smith wrote:
> The shuffling of devices that I had to do can only be temporary, so I
> need to decide what I am going to do. The smaller device I had intended
> to remove (but now had to add back in for capacity reasons) is 1.7T and
> is currently /dev/sdg. I could "btrfs replace /dev/sdg /dev/sdh …" and
> assuming no errors seen do a scrub, but if errors were seen I'd want to
> remove sdh again quickly. replace then wouldn't be an option since sdg
> is smaller than sdh. "btrfs remove sdh …" takes a really long time.

I haven't checked in a while, but I think `replace` does not do a
filesystem resize following completion: dev_item.total_bytes remains
the same, regardless of the block device size. If that's still true,
then don't resize it following the replace for the time being.
Alternatively, if it is or must be resized, you can shrink it first;
then you can use `btrfs replace` instead of `btrfs device remove`. That
partial shrink is significantly less than the one implied by device
remove.

--
Chris Murphy