* BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Russell Coker @ 2025-03-22 13:44 UTC
To: linux-btrfs

[/dev/sdd1].write_io_errs    753
[/dev/sdd1].read_io_errs     1

I have a test system which has a strange problem where the BTRFS error count on one device (out of four) goes to 754 after a reboot.

There are no BTRFS errors in the kernel message log after booting up. There are no log entries in /var/log/kern.log about BTRFS issues. When I look at the console as it's shutting down I don't see any errors being logged, so either there are no errors logged or there are 753 errors logged in the final split second before power off or reboot so that I don't even see them.

This is repeatable and it's 754 every time.

After I get the error I remove the device from the array and add it again. I can run it for days without problem with data being written to that device and read from it without error.

But when I reboot it says 754 errors. When I swapped that device with another one in a different drive bay the same device has errors and the other device doesn't. So it's not related to the drive bay it's related to the SSD.

The system is a Dell PowerEdge T630.

The SSD could have a fault, but if so why does it only show up on reboot and why 754 errors every time?

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Qu Wenruo @ 2025-03-22 21:36 UTC
To: russell, linux-btrfs

On 2025/3/23 00:14, Russell Coker wrote:
> [/dev/sdd1].write_io_errs    753
> [/dev/sdd1].read_io_errs     1
>
> I have a test system which has a strange problem where the BTRFS error count on one device (out of four) goes to 754 after a reboot.

Would you mind providing the full dmesg, just in case?
It would be best if it also covers the initial device removal/add, to be extra safe.

> There are no BTRFS errors in the kernel message log after booting up. There are no log entries in /var/log/kern.log about BTRFS issues. When I look at the console as it's shutting down I don't see any errors being logged, so either there are no errors logged or there are 753 errors logged in the final split second before power off or reboot so that I don't even see them.
>
> This is repeatable and it's 754 every time.
>
> After I get the error I remove the device from the array and add it again.

How did you do the removal and add?

"btrfs device remove" then "btrfs device add"? Or did you just power the machine down and physically add/remove the device?

In the latter case it won't reset the internal error count inside btrfs.

> I can run it for days without problem with data being written to that device and read from it without error.
>
> But when I reboot it says 754 errors. When I swapped that device with another one in a different drive bay the same device has errors and the other device doesn't. So it's not related to the drive bay it's related to the SSD.

Again, how did you do the swap?

> The system is a Dell PowerEdge T630.
>
> The SSD could have a fault, but if so why does it only show up on reboot and why 754 errors every time?

The error counters are stored inside the fs; they record every error a device has hit in the past.

You need to inform btrfs, either through a proper btrfs device removal/add or by making btrfs zero the counters.

Thanks,
Qu
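Zeroing the counters, the second option mentioned above, is done with `btrfs device stats -z`, which prints the counters and then resets them. A minimal sketch, assuming the filesystem is mounted at a hypothetical /mnt/test and the commands run as root; the `BTRFS` and `MNT` variables and the function name are illustrative only and not part of btrfs-progs:

```shell
#!/bin/sh
# Sketch only: BTRFS, MNT and show_then_zero_stats are illustrative names.
BTRFS="${BTRFS:-btrfs}"      # btrfs-progs binary to call
MNT="${MNT:-/mnt/test}"      # hypothetical mount point, adjust as needed

show_then_zero_stats() {
    # Print the persistent per-device error counters for the filesystem.
    "$BTRFS" device stats "$1" || return 1
    # Print them once more and reset them to zero afterwards (-z).
    "$BTRFS" device stats -z "$1"
}

# Usage (as root): show_then_zero_stats "$MNT"
```

Running the plain `device stats` call first leaves a record of the old values in the terminal scrollback before they are cleared.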
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Russell Coker @ 2025-03-23 4:11 UTC
To: linux-btrfs, Qu Wenruo

On Sunday, 23 March 2025 08:36:56 AEDT Qu Wenruo wrote:
> > This is repeatable and it's 754 every time.
> >
> > After I get the error I remove the device from the array and add it again.
>
> How did you do the removal and add?

btrfs dev rem ...
btrfs dev add ...

> "btrfs device remove" then "btrfs device add"? Or just power the machine down and physically add/remove the device?

I didn't physically remove the devices because working out which device is which /dev/sd node is not easy at all. For unrelated reasons the assignment of /dev/sd nodes is apparently random.

> > The system is a Dell PowerEdge T630.
> >
> > The SSD could have a fault, but if so why does it only show up on reboot and why 754 errors every time?
>
> The error counters are stored inside the fs, it records all the history errors a device hit in the past.
>
> You need to inform btrfs by either proper btrfs device removal/add, or make btrfs to zero the counters.

Yes, I had done the proper device removal, but either the counters weren't properly reset or some other errors were occurring. I even tried dding 1G of data from /dev/zero over the device after removal and that didn't help.

The problem has gone away now. I was about to do another run to get dmesg output for you but accidentally removed the wrong device. After I added that back, then removed and re-added the correct device, the problem went away. I can now reboot without seeing an error count of 754.

I presume that forcing a bunch of metadata to be moved around changed things somehow.
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Chris Murphy @ 2025-04-01 4:04 UTC
To: Russell Coker, Btrfs BTRFS

On Sat, Mar 22, 2025, at 9:44 AM, Russell Coker wrote:
> [/dev/sdd1].write_io_errs    753
> [/dev/sdd1].read_io_errs     1
>
> I have a test system which has a strange problem where the BTRFS error count on one device (out of four) goes to 754 after a reboot.
>
> There are no BTRFS errors in the kernel message log after booting up. There are no log entries in /var/log/kern.log about BTRFS issues. When I look at the console as it's shutting down I don't see any errors being logged, so either there are no errors logged or there are 753 errors logged in the final split second before power off or reboot so that I don't even see them.
>
> This is repeatable and it's 754 every time.
>
> After I get the error I remove the device from the array and add it again. I can run it for days without problem with data being written to that device and read from it without error.
>
> But when I reboot it says 754 errors. When I swapped that device with another one in a different drive bay the same device has errors and the other device doesn't. So it's not related to the drive bay it's related to the SSD.
>
> The system is a Dell PowerEdge T630.
>
> The SSD could have a fault, but if so why does it only show up on reboot and why 754 errors every time?

These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.

It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.

-- 
Chris Murphy
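Since the statistics are plain counters, the figure of 754 is just the sum of the individual values that `btrfs dev stats` prints. A small sketch of that arithmetic, assuming the two-column output format shown in the original report; `sum_dev_errors` is a hypothetical helper name:

```shell
#!/bin/sh
# Sum every counter found in `btrfs device stats` output read from stdin.
# Each line is expected to look like: [/dev/sdd1].write_io_errs    753
sum_dev_errors() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# The two counters from the original report add up to the reported total.
printf '%s\n' '[/dev/sdd1].write_io_errs 753' '[/dev/sdd1].read_io_errs 1' \
    | sum_dev_errors    # prints 754
```

On a live system the same helper could be fed from `btrfs dev stats $MNT | sum_dev_errors`, where `$MNT` is the mount point.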
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Russell Coker @ 2025-04-01 5:18 UTC
To: Btrfs BTRFS, Chris Murphy

On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.
>
> It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.

Is "btrfs dev stats -z" covered by removing the device from the set and adding it again? If so, I did that but it kept recurring. The fact that the error count was there in the first place wasn't the unexpected thing; it was the fact that it kept coming back and had no log entries about it.
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Chris Murphy @ 2025-04-01 16:00 UTC
To: Russell Coker, Btrfs BTRFS

On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
> On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> > These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.
> >
> > It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.
>
> Is "btrfs dev stats -z" covered by removing the device from the set and adding it again? If so I did that but it kept recurring. The fact that the error count was there in the first place wasn't the unexpected thing, it was the fact that it kept coming back and had no log entries about it.

Removing it with a `btrfs` command? Or physically disconnecting and reconnecting?

The statistics are per device and are stored persistently in the device b-tree, which lives in a metadata block group. So this metadata could be on any device in a multiple-device Btrfs, not necessarily on the device that produced the errors.

I'd like to think that upon `btrfs device remove` or `btrfs replace` the device's stats are also removed from the dev tree. But I haven't tested it, and I'm not sure what the code says should happen.

-- 
Chris Murphy
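One way to check whether a stats item for a removed device is still present in the device tree is to dump that tree and look for the stats entries. This is only a sketch under several assumptions not confirmed anywhere in this thread: that `btrfs inspect-internal dump-tree -t dev` is run as root (preferably on an unmounted filesystem, since a dump of a mounted one can be stale), and that the stats keys are labelled `DEV_STATS` in the dump output. The function names are hypothetical:

```shell
#!/bin/sh
# Sketch only: function names and the DEV_STATS label are assumptions.
BTRFS="${BTRFS:-btrfs}"

# From dump-tree output on stdin, keep each line mentioning DEV_STATS
# together with the three lines that follow it.
filter_dev_stats() {
    awk '/DEV_STATS/ { n = 4 } n > 0 { print; n-- }'
}

# Dump the device tree of one member device and show its stats items.
dump_dev_stats() {
    "$BTRFS" inspect-internal dump-tree -t dev "$1" | filter_dev_stats
}

# Usage (as root): dump_dev_stats /dev/sdd1
```

Comparing the output before `btrfs device remove`, after it, and after the following `btrfs device add` would show whether an old item survives the removal.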
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Russell Coker @ 2025-04-02 2:32 UTC
To: Btrfs BTRFS, Chris Murphy

On Wednesday, 2 April 2025 03:00:33 AEDT Chris Murphy wrote:
> On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
> > On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> > > These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.
> > >
> > > It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.
> >
> > Is "btrfs dev stats -z" covered by removing the device from the set and adding it again? If so I did that but it kept recurring. The fact that the error count was there in the first place wasn't the unexpected thing, it was the fact that it kept coming back and had no log entries about it.
>
> Removing it with a `btrfs` command? Or physically disconnecting and reconnecting?

btrfs commands.

> The statistics are per device, persistently stored in the device b-tree which is metadata block group. So this metadata could be on any device in a multiple device Btrfs, not necessarily on the device that produced the errors.

It should be on the device itself and once the device is subject to a btrfs dev rem command it should be gone for good.

> I'd like to think upon `btrfs device remove` or `btrfs replace` the device's stats are also removed from dev tree. But I haven't tested it, and I'm not sure what the code says should happen.

After doing the btrfs dev add it reports 0 errors until after reboot.
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Chris Murphy @ 2025-04-02 2:42 UTC
To: Russell Coker, Btrfs BTRFS

On Tue, Apr 1, 2025, at 10:32 PM, Russell Coker wrote:
> On Wednesday, 2 April 2025 03:00:33 AEDT Chris Murphy wrote:
> > On Tue, Apr 1, 2025, at 1:18 AM, Russell Coker wrote:
> > > On Tuesday, 1 April 2025 15:04:20 AEDT Chris Murphy wrote:
> > > > These are likely old errors. You'd need to check old logs to see when the write errors occurred. These statistics are just a counter. You can reset them with `btrfs dev stats -z` and they'll go back to zero.
> > > >
> > > > It's a simple counter. It could be 754 errors seen one time. Or it could be 1 error seen 754 times. Or any combination of multiple errors multiple times adding up to 754 errors.
> > >
> > > Is "btrfs dev stats -z" covered by removing the device from the set and adding it again? If so I did that but it kept recurring. The fact that the error count was there in the first place wasn't the unexpected thing, it was the fact that it kept coming back and had no log entries about it.
> >
> > Removing it with a `btrfs` command? Or physically disconnecting and reconnecting?
>
> btrfs commands.
>
> > The statistics are per device, persistently stored in the device b-tree which is metadata block group. So this metadata could be on any device in a multiple device Btrfs, not necessarily on the device that produced the errors.
>
> It should be on the device itself and once the device is subject to a btrfs dev rem command it should be gone for good.
>
> > I'd like to think upon `btrfs device remove` or `btrfs replace` the device's stats are also removed from dev tree. But I haven't tested it, and I'm not sure what the code says should happen.
>
> After doing the btrfs dev add it reports 0 errors until after reboot.

Uhh, well I'm confused then.

What happens if you use `btrfs dev stats -z $MNT` and then reboot?

-- 
Chris Murphy
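The experiment suggested here (zero the counters, reboot, look again) can be made repeatable by saving the reported values before the reboot and comparing afterwards. A rough sketch, assuming root access and a snapshot file kept somewhere that survives the reboot; both function names are hypothetical:

```shell
#!/bin/sh
# Sketch only: snapshot_stats and stats_unchanged are illustrative names.
BTRFS="${BTRFS:-btrfs}"

# Before the reboot: zero the counters, then record what btrfs reports.
snapshot_stats() {    # $1 = mount point, $2 = file to write
    "$BTRFS" device stats -z "$1" > /dev/null || return 1
    "$BTRFS" device stats "$1" > "$2"
}

# After the reboot: succeed only if the counters still match the snapshot.
# Any differing lines are printed by diff.
stats_unchanged() {   # $1 = mount point, $2 = snapshot file
    "$BTRFS" device stats "$1" | diff "$2" -
}

# Usage (as root), with $MNT as the mount point:
#   snapshot_stats "$MNT" /root/stats.before    # then reboot
#   stats_unchanged "$MNT" /root/stats.before
```

Because the /dev/sd node assignment on this machine was described as apparently random, a difference in device names alone would also show up in the diff, so the counter values are the part worth reading.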
* Re: BTRFS error count 754 after reboot on Debian kernel 6.12.17
From: Russell Coker @ 2025-04-02 5:01 UTC
To: Btrfs BTRFS, Chris Murphy

On Wednesday, 2 April 2025 13:42:32 AEDT Chris Murphy wrote:
> What happens if you use `btrfs dev stats -z $MNT` and then reboot?

I didn't try that and after I accidentally removed a different device the problem stopped recurring so I can't try that now.