BTRFS error count 754 after reboot on Debian kernel 6.12.17

public inbox for linux-btrfs@vger.kernel.org

* BTRFS error count 754 after reboot on Debian kernel 6.12.17
@ 2025-03-22 13:44 Russell Coker
  2025-03-22 21:36 ` Qu Wenruo
  2025-04-01  4:04 ` Chris Murphy
  0 siblings, 2 replies; 9+ messages in thread
From: Russell Coker @ 2025-03-22 13:44 UTC (permalink / raw)
  To: linux-btrfs

[/dev/sdd1].write_io_errs    753
[/dev/sdd1].read_io_errs     1

I have a test system which has a strange problem where the BTRFS error count 
on one device (out of four) goes to 754 after a reboot.

There are no BTRFS errors in the kernel message log after booting up.  There 
are no log entries in /var/log/kern.log about BTRFS issues.  When I look at 
the console as it's shutting down I don't see any errors being logged, so 
either there are no errors logged or there are 753 errors logged in the final 
split second before power off or reboot so that I don't even see them.

This is repeatable and it's 754 every time.

After I get the error I remove the device from the array and add it again.  I 
can run it for days without problem with data being written to that device and 
read from it without error.

But when I reboot it says 754 errors.  When I swapped that device with another 
one in a different drive bay the same device has errors and the other device 
doesn't.  So it's not related to the drive bay it's related to the SSD.

The system is a Dell PowerEdge T630.

The SSD could have a fault, but if so why does it only show up on reboot and 
why 754 errors every time?

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-03-22 13:44 BTRFS error count 754 after reboot on Debian kernel 6.12.17 Russell Coker
@ 2025-03-22 21:36 ` Qu Wenruo
  2025-03-23  4:11   ` Russell Coker
  2025-04-01  4:04 ` Chris Murphy
  1 sibling, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2025-03-22 21:36 UTC (permalink / raw)
  To: russell, linux-btrfs



On 2025/3/23 00:14, Russell Coker wrote:
> [/dev/sdd1].write_io_errs    753
> [/dev/sdd1].read_io_errs     1
>
> I have a test system which has a strange problem where the BTRFS error count
> on one device (out of four) goes to 754 after a reboot.

Would you mind providing the full dmesg, just in case?

Ideally it should cover the initial device removal/add, to be extra safe.

>
> There are no BTRFS errors in the kernel message log after booting up.  There
> are no log entries in /var/log/kern.log about BTRFS issues.  When I look at
> the console as it's shutting down I don't see any errors being logged, so
> either there are no errors logged or there are 753 errors logged in the final
> split second before power off or reboot so that I don't even see them.
>
> This is repeatable and it's 754 every time.
>
> After I get the error I remove the device from the array and add it again.

How did you do the removal and add?

"btrfs device remove" then "btrfs device add"? Or just power the machine
down and physically add/remove the device?

In the latter case it won't reset the internal error count inside btrfs.

>  I
> can run it for days without problem with data being written to that device and
> read from it without error.
>
> But when I reboot it says 754 errors.  When I swapped that device with another
> one in a different drive bay the same device has errors and the other device
> doesn't.  So it's not related to the drive bay it's related to the SSD.

Again, how did you do the swap?

>
> The system is a Dell PowerEdge T630.
>
> The SSD could have a fault, but if so why does it only show up on reboot and
> why 754 errors every time?
>
The error counters are stored inside the filesystem; they record every error
a device has hit in the past.

You need to inform btrfs, either with a proper btrfs device removal/add or by
telling btrfs to zero the counters.

Thanks,
Qu


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-03-22 21:36 ` Qu Wenruo
@ 2025-03-23  4:11   ` Russell Coker
  0 siblings, 0 replies; 9+ messages in thread
From: Russell Coker @ 2025-03-23  4:11 UTC (permalink / raw)
  To: linux-btrfs, Qu Wenruo

On Sunday, 23 March 2025 08:36:56 AEDT Qu Wenruo wrote:
> > This is repeatable and it's 754 every time.
> > 
> > After I get the error I remove the device from the array and add it again.
> 
> How did you do the removal and add?

btrfs dev rem ...
btrfs dev add ...

> "btrfs device remove" then "btrfs device add"? Or just power the machine
> down and physically add/remove the device?

I didn't physically remove the devices because working out which device is 
which /dev/sd node is not easy at all.  For unrelated reasons the assignment 
of /dev/sd nodes is apparently random.
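
One possible way to make that mapping easier is sketched below in Python. It assumes a udev-based system that populates /dev/disk/by-id with persistent symlinks; the helper name `stable_names` is purely illustrative and nothing like it was used in this thread.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device node to the persistent names pointing at it.

    Assumption: by_id_dir contains symlinks such as those udev creates;
    if the directory is missing, an empty mapping is returned.
    """
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the current /dev/sdX node.
            node = os.path.realpath(link)
            mapping.setdefault(node, []).append(name)
    return mapping


if __name__ == "__main__":
    for node, names in sorted(stable_names().items()):
        print(node, ", ".join(names))
```

The persistent names usually embed the drive model and serial number, so they stay the same even when the /dev/sd assignment changes between boots.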

> > The system is a Dell PowerEdge T630.
> > 
> > The SSD could have a fault, but if so why does it only show up on reboot
> > and why 754 errors every time?
> 
> The error counters are stored inside the filesystem; they record every error
> a device has hit in the past.
> 
> You need to inform btrfs, either with a proper btrfs device removal/add or by
> telling btrfs to zero the counters.

Yes, I had done the proper device removal, but either the counters weren't 
properly reset or some other errors were occurring.  I even tried using dd to 
write 1G of data from /dev/zero over the device after removal and that didn't 
help.

The problem has gone away now.  I was about to do another run to get dmesg 
output for you but accidentally removed the wrong device.  After I added that 
back then removed and re-added the correct device the problem went away.  I 
can now reboot without seeing an error count of 754.  I presume that forcing a 
bunch of metadata to be moved around changed things somehow.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-03-22 13:44 BTRFS error count 754 after reboot on Debian kernel 6.12.17 Russell Coker
  2025-03-22 21:36 ` Qu Wenruo
@ 2025-04-01  4:04 ` Chris Murphy
  2025-04-01  5:18   ` Russell Coker
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2025-04-01  4:04 UTC (permalink / raw)
  To: Russell Coker, Btrfs BTRFS



On Sat, Mar 22, 2025, at 9:44 AM, Russell Coker wrote:
> [/dev/sdd1].write_io_errs    753
> [/dev/sdd1].read_io_errs     1
>
> I have a test system which has a strange problem where the BTRFS error count 
> on one device (out of four) goes to 754 after a reboot.
>
> There are no BTRFS errors in the kernel message log after booting up.  There 
> are no log entries in /var/log/kern.log about BTRFS issues.  When I look at 
> the console as it's shutting down I don't see any errors being logged, so 
> either there are no errors logged or there are 753 errors logged in the final 
> split second before power off or reboot so that I don't even see them.
>
> This is repeatable and it's 754 every time.
>
> After I get the error I remove the device from the array and add it again.  I 
> can run it for days without problem with data being written to that device and 
> read from it without error.
>
> But when I reboot it says 754 errors.  When I swapped that device with another 
> one in a different drive bay the same device has errors and the other device 
> doesn't.  So it's not related to the drive bay it's related to the SSD.
>
> The system is a Dell PowerEdge T630.
>
> The SSD could have a fault, but if so why does it only show up on reboot and 
> why 754 errors every time?


These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.

It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.
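
To illustrate how the figure of 754 is made up, here is a minimal Python sketch that parses text in the format shown at the top of the thread and adds up the counters per device. The function names `parse_dev_stats` and `total_errors` are hypothetical, and the regular expression assumes every line looks like `[device].counter_name value`.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/sdd1].write_io_errs    753".
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<count>\d+)$")


def parse_dev_stats(text):
    """Return {device: {counter_name: value}} from stats-style text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match["dev"]][match["name"]] = int(match["count"])
    return dict(stats)


def total_errors(stats):
    """Sum every counter for each device."""
    return {dev: sum(counters.values()) for dev, counters in stats.items()}


# The two lines reported in the original message.
SAMPLE = """\
[/dev/sdd1].write_io_errs    753
[/dev/sdd1].read_io_errs     1
"""

if __name__ == "__main__":
    print(total_errors(parse_dev_stats(SAMPLE)))
```

With the two reported lines, the total for /dev/sdd1 is 753 write errors plus 1 read error. The counters say nothing about when those errors happened.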

-- 
Chris Murphy


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-04-01  4:04 ` Chris Murphy
@ 2025-04-01  5:18   ` Russell Coker
  2025-04-01 16:00     ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Russell Coker @ 2025-04-01  5:18 UTC (permalink / raw)
  To: Btrfs BTRFS, Chris Murphy

On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> These are likely old errors. You'd need to check old logs to see when the
> write errors occurred. These statistics are just a counter. You can reset
> them with `btrfs dev stats -z` and they'll go back to zero.
> 
> It's a simple counter. It could be 754 errors seen one time. Or it could be 1
> error seen 754 times. Or any combination of multiple errors multiple times
> adding up to 754 errors.

Is "btrfs dev stats -z" covered by removing the device from the set and adding 
it again?  If so I did that but it kept recurring.  The fact that the error 
count was there in the first place wasn't the unexpected thing, it was the 
fact that it kept coming back and had no log entries about it.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/





* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-04-01  5:18   ` Russell Coker
@ 2025-04-01 16:00     ` Chris Murphy
  2025-04-02  2:32       ` Russell Coker
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2025-04-01 16:00 UTC (permalink / raw)
  To: Russell Coker, Btrfs BTRFS



On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
> On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
>> These are likely old errors. You'd need to check old logs to see when the
>> write errors occurred. These statistics are just a counter. You can reset
>> them with `btrfs dev stats -z` and they'll go back to zero.
>> 
>> It's a simple counter. It could be 754 errors seen one time. Or it could be 1
>> error seen 754 times. Or any combination of multiple errors multiple times
>> adding up to 754 errors.
>
> Is "btrfs dev stats -z" covered by removing the device from the set and adding 
> it again?  If so I did that but it kept recurring.  The fact that the error 
> count was there in the first place wasn't the unexpected thing, it was the 
> fact that it kept coming back and had no log entries about it.

Removing it with a `btrfs` command? Or physically disconnecting and reconnecting?

The statistics are per device, persistently stored in the device b-tree, which lives in a metadata block group. So this metadata could be on any device in a multi-device Btrfs, not necessarily on the device that produced the errors.

I'd like to think that upon `btrfs device remove` or `btrfs replace` the device's stats are also removed from the dev tree. But I haven't tested it, and I'm not sure what the code says should happen.


-- 
Chris Murphy


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-04-01 16:00     ` Chris Murphy
@ 2025-04-02  2:32       ` Russell Coker
  2025-04-02  2:42         ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Russell Coker @ 2025-04-02  2:32 UTC (permalink / raw)
  To: Btrfs BTRFS, Chris Murphy

On Wednesday, 2 April 2025 03:00:33 AEDT Chris Murphy wrote:
> On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
> > On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> >> These are likely old errors. You'd need to check old logs to see when the
> >> write errors occurred. These statistics are just a counter. You can reset
> >> them with `btrfs dev stats -z` and they'll go back to zero.
> >> 
> >> It's a simple counter. It could be 754 errors seen one time. Or it could be
> >> 1 error seen 754 times. Or any combination of multiple errors multiple
> >> times adding up to 754 errors.
> > 
> > Is "btrfs dev stats -z" covered by removing the device from the set and
> > adding it again?  If so I did that but it kept recurring.  The fact that
> > the error count was there in the first place wasn't the unexpected thing,
> > it was the fact that it kept coming back and had no log entries about it.
> 
> Removing it with a `btrfs` command? Or physically disconnecting and
> reconnecting?

btrfs commands.

> The statistics are per device, persistently stored in the device b-tree,
> which lives in a metadata block group. So this metadata could be on any
> device in a multi-device Btrfs, not necessarily on the device that produced
> the errors.

It should be on the device itself and once the device is subject to a btrfs 
dev rem command it should be gone for good.

> I'd like to think that upon `btrfs device remove` or `btrfs replace` the
> device's stats are also removed from the dev tree. But I haven't tested it,
> and I'm not sure what the code says should happen.

After doing the btrfs dev add it reports 0 errors until after reboot.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/





* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-04-02  2:32       ` Russell Coker
@ 2025-04-02  2:42         ` Chris Murphy
  2025-04-02  5:01           ` Russell Coker
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2025-04-02  2:42 UTC (permalink / raw)
  To: Russell Coker, Btrfs BTRFS



On Tue, Apr 1, 2025, at 10:32 PM, Russell Coker wrote:
> On Wednesday, 2 April 2025 03:00:33 AEDT Chris Murphy wrote:
>> On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
>> > On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
>> >> These are likely old errors. You'd need to check old logs to see when the
>> >> write errors occurred. These statistics are just a counter. You can reset
>> >> them with `btrfs dev stats -z` and they'll go back to zero.
>> >> 
>> >> It's a simple counter. It could be 754 errors seen one time. Or it could be
>> >> 1 error seen 754 times. Or any combination of multiple errors multiple
>> >> times adding up to 754 errors.
>> > 
>> > Is "btrfs dev stats -z" covered by removing the device from the set and
>> > adding it again?  If so I did that but it kept recurring.  The fact that
>> > the error count was there in the first place wasn't the unexpected thing,
>> > it was the fact that it kept coming back and had no log entries about it.
>> 
>> Removing it with a `btrfs` command? Or physically disconnecting and
>> reconnecting?
>
> btrfs commands.
>
>> The statistics are per device, persistently stored in the device b-tree,
>> which lives in a metadata block group. So this metadata could be on any
>> device in a multi-device Btrfs, not necessarily on the device that produced
>> the errors.
>
> It should be on the device itself and once the device is subject to a btrfs 
> dev rem command it should be gone for good.
>
>> I'd like to think that upon `btrfs device remove` or `btrfs replace` the
>> device's stats are also removed from the dev tree. But I haven't tested it,
>> and I'm not sure what the code says should happen.
>
> After doing the btrfs dev add it reports 0 errors until after reboot.

Uhh, well I'm confused then.

What happens if you use `btrfs dev stats -z $MNT` and then reboot? 
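
One way to run that experiment with a record of the result is to save the stats output to a file before the reboot (after zeroing) and again after it, then compare the two. The Python sketch below does the comparison on text in the format shown at the top of the thread. The function names and the BEFORE/AFTER snapshots are hypothetical. It also assumes device names are stable across the reboot, which may not hold on a system where /dev/sd assignment changes.

```python
import re

# Matches lines such as "[/dev/sdd1].write_io_errs    753".
STAT_LINE = re.compile(r"^\[([^\]]+)\]\.(\w+)\s+(\d+)$")


def read_counters(text):
    """Map (device, counter_name) -> value for each recognisable stats line."""
    counters = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            dev, name, value = match.groups()
            counters[(dev, name)] = int(value)
    return counters


def changed_counters(before_text, after_text):
    """Return {(device, counter): (before, after)} for counters that differ."""
    before = read_counters(before_text)
    after = read_counters(after_text)
    changes = {}
    for key in sorted(set(before) | set(after)):
        # A counter missing from one snapshot is treated as zero.
        old, new = before.get(key, 0), after.get(key, 0)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots: zeroed before the reboot, nonzero afterwards.
BEFORE = "[/dev/sdd1].write_io_errs    0\n[/dev/sdd1].read_io_errs     0\n"
AFTER = "[/dev/sdd1].write_io_errs    753\n[/dev/sdd1].read_io_errs     1\n"

if __name__ == "__main__":
    for (dev, name), (old, new) in changed_counters(BEFORE, AFTER).items():
        print(f"{dev} {name}: {old} -> {new}")
```

An empty result after the reboot would suggest the reset persisted. A non-empty one would show exactly which counters came back.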



-- 
Chris Murphy


* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
  2025-04-02  2:42         ` Chris Murphy
@ 2025-04-02  5:01           ` Russell Coker
  0 siblings, 0 replies; 9+ messages in thread
From: Russell Coker @ 2025-04-02  5:01 UTC (permalink / raw)
  To: Btrfs BTRFS, Chris Murphy

On Wednesday, 2 April 2025 13:42:32 AEDT Chris Murphy wrote:
> What happens if you use `btrfs dev stats -z $MNT` and then reboot?

I didn't try that and after I accidentally removed a different device the 
problem stopped recurring so I can't try that now.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



