* Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Marc MERLIN @ 2016-02-23 21:59 UTC (permalink / raw)
To: linux-btrfs
I have a freshly created md raid5 array (md5), with drives that I specifically
scanned one by one, block by block, and for good measure, I also scanned
the entire software raid with a check command, which took 3 days to run.
Everything passed.
Then I made a bcache of that device, with a brand-new ssd that otherwise
seems to work fine as the cache, and dmcrypted the result:
md5 - bcache - dmcrypt - btrfs
ssd /
Now I'm copying data over with btrfs send, and I'm seeing messages like
this slowly show up, with the write counter going up one by one:
BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0, flush 0, corrupt 0, gen 0
Where is the documentation for those counters?
Is the write error fatal, or a recovered error?
Should I consider that my filesystem is corrupted as soon as any of
those counters go up?
(I couldn't find an exact meaning of each of them)
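To track how those counters move over time, the message can be picked apart mechanically. A minimal Python sketch, assuming the kernel line always has the shape quoted above; the function name is made up for illustration:

```python
import re

# Shape of the counter summary as quoted in this thread; other kernel
# versions may format it differently (assumption).
_ERRS_RE = re.compile(
    r"bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return (bdev, {counter: value}) for a matching log line, else None."""
    match = _ERRS_RE.search(line)
    if match is None:
        return None
    counters = {name: int(match.group(name)) for name in COUNTER_NAMES}
    return match.group("bdev"), counters
```

Feeding it each line of the kernel log and keeping the last result per bdev gives the current totals without retyping them by hand.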
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-23 23:17 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN posted on Tue, 23 Feb 2016 13:59:11 -0800 as excerpted:
> I have a freshly created md5 array, with drives that I specifically
> scanned one by one block by block, and for good measure, I also scanned
> the entire software raid with a check command which took 3 days to run.
>
> Everything passed.
>
> Then, I made a bcache of that device, an ssd that seems to work fine
> otherwise (brand new), and dmcrypted the result
>
> md5 - bcache - dmcrypt - btrfs
> ssd /
>
> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
> show up and the write counter go up one by one.
> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0,
> flush 0, corrupt 0, gen 0
>
> Where is the documentation for those counters?
> Is the write error fatal, or a recovered error?
> Should I consider that my filesystem is corrupted as soon as any of
> those counters go up?
> (I couldn't find an exact meaning of each of them)
I believe all formal documentation of what the error counters actually
mean is developer-level -- "Trust the Source, Luke."
Unless something has recently been added to the wiki documenting them,
the only admin/user level documentation is the brief mention in the
btrfs-device manpage under stats. Beyond that, there is only what can be
gathered from the error counter names themselves, from comments here on
this list, and from reading between the lines or simply observing real
behavior and the kernel log when errors increment.
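For what it's worth, the five abbreviations in the kernel message appear to line up with the five per-device fields that `btrfs device stats` prints. A minimal Python sketch for reading that output, assuming the usual `[device].field value` layout; the helper names are hypothetical:

```python
import re

# Kernel log abbreviation -> field name shown by `btrfs device stats`.
LOG_TO_STATS = {
    "wr": "write_io_errs",
    "rd": "read_io_errs",
    "flush": "flush_io_errs",
    "corrupt": "corruption_errs",
    "gen": "generation_errs",
}

# One output line, e.g. "[/dev/sdb].write_io_errs   0" (layout is an assumption).
_STATS_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<field>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Parse `btrfs device stats` text into {device: {field: count}}."""
    stats = {}
    for line in text.splitlines():
        match = _STATS_RE.match(line.strip())
        if match:
            fields = stats.setdefault(match.group("dev"), {})
            fields[match.group("field")] = int(match.group("value"))
    return stats


def nonzero_counters(stats):
    """Keep only the devices and fields whose counters have incremented."""
    result = {}
    for dev, fields in stats.items():
        hits = {name: value for name, value in fields.items() if value > 0}
        if hits:
            result[dev] = hits
    return result
```

The text would come from running `btrfs device stats` on the mounted filesystem; the sketch only parses it and does not try to interpret what a non-zero counter means.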
Yet another point supporting the "btrfs is still stabilizing, not yet
fully stable" position, I suppose, as it could definitely be argued that
those counters and their visibility, including display in the kernel log
at mount time, are definitely intended to be consumed at the admin-user
level, and that it follows that they should be documented at the admin-
user level before the filesystem can properly be defined as fully stable.
Meanwhile, not saying my own admin-user viewpoint is gospel, by any
stretch, but with the intent of hopefully helping make sense of things...
From my own experience of some months with a failing ssd (as part of a
raid1 pair with an ssd that was working fine, so I could and did
regularly scrub the errors and took advantage of the checksummed raid1
pairing to let it go much further than I would have in other
circumstances, simply to observe how things worked as it degraded)...
Write error counter increments should be accompanied by kernel log events
telling you more -- what level of the device stack is returning the
errors that propagate up to the filesystem level, for instance. Expected
would be either bus level timeouts and resets, or storage device errors.
If it's storage device errors, SMART data should show an increasing raw
value for reallocated sectors or the like (smartctl -A). If it's bus errors,
it could be bad cabling (bad connections or bad shielding, or using
SATA-150 certified cables for SATA-600 or some such), or, as I saw on an
old and failing mobo (when I pulled it there were bulging and some
exploded capacitors) a few years ago, failing filter-capacitors on the
mobo signalling paths. Bad power, including the possibility of an
overloaded UPS that hit one guy I know, is notorious for both this sort
of issue and memory problems, as well.
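The SMART side of that check can be scripted. A minimal Python sketch, assuming the ten-column attribute table that `smartctl -A` prints for ATA drives; the attribute names watched here are common ones but vary by vendor, and the function name is made up:

```python
# Attributes commonly watched for media problems (the first two) and for
# link/cabling problems (the CRC count). Names differ between vendors,
# so treat this list as an assumption to adjust per drive.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def smart_raw_values(smartctl_output, watched=WATCHED):
    """Extract integer RAW_VALUEs for the watched attributes."""
    raw = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value = fields[1], fields[9]
        if name in watched and value.isdigit():
            raw[name] = int(value)
    return raw
```

Comparing the result from two runs some days apart shows whether any raw value is actually increasing, which matters more than a single reading.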
Of course bus timeout errors can also be due to lower timeouts on the bus
(typically 30-second) than on the device (often 2-minute retry time, on
consumer-level devices), but there are others here with far more knowledge
in that area, including what to do to try to fix it, than I have, and the
various options to fix it have been posted multiple times by now, and
likely will be posted here again.
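The timeout mismatch described above can be checked with a few lines. A minimal Python sketch, assuming the kernel's per-device command timeout is exposed in seconds at `/sys/block/<disk>/device/timeout`; the function names are hypothetical, and the drive's own retry time has to be supplied by the caller:

```python
from pathlib import Path


def kernel_command_timeout(disk):
    """Read the kernel command timeout in seconds for a disk such as 'sda'.

    The sysfs path is an assumption about how the system exposes it.
    """
    return int(Path(f"/sys/block/{disk}/device/timeout").read_text().strip())


def timeout_mismatch(kernel_timeout_s, drive_recovery_s):
    """True when the drive may still be retrying after the bus has given up."""
    return drive_recovery_s > kernel_timeout_s
```

With the figures mentioned above, `timeout_mismatch(30, 120)` is true: a 30-second bus timeout expires well before a 2-minute drive retry finishes.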
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-23 23:22 UTC (permalink / raw)
To: linux-btrfs
Duncan posted on Tue, 23 Feb 2016 23:17:06 +0000 as excerpted:
> Marc MERLIN posted on Tue, 23 Feb 2016 13:59:11 -0800 as excerpted:
>
>> I have a freshly created md5 array, with drives that I specifically
>> scanned one by one block by block, and for good measure, I also scanned
>> the entire software raid with a check command which took 3 days to run.
>>
>> Everything passed.
>>
>> Then, I made a bcache of that device, an ssd that seems to work fine
>> otherwise (brand new), and dmcrypted the result
>>
>> md5 - bcache - dmcrypt - btrfs
>> ssd /
>>
>> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
>> show up and the write counter go up one by one.
>> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0,
>> flush 0, corrupt 0, gen 0
>>
>> Where is the documentation for those counters?
>> Is the write error fatal, or a recovered error?
>> Should I consider that my filesystem is corrupted as soon as any of
>> those counters go up?
>> (I couldn't find an exact meaning of each of them)
>
> I believe all formal documentation of what the error counters actually
> mean is developer-level -- "Trust the Source, Luke."
Forgot to mention, tho you're probably already considering it, if this is
the same raid5-backed btrfs you were complaining about being slow in the
other thread, and considering redoing with bcache to an ssd added, as
seems very likely, if it /is/ actually storage device or bus errors, that
could be one reason the previous one was getting so slow... Maybe it
wasn't btrfs after all.
* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Marc MERLIN @ 2016-02-24 0:19 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Tue, Feb 23, 2016 at 11:22:47PM +0000, Duncan wrote:
> Forgot to mention, tho you're probably already considering it, if this is
> the same raid5-backed btrfs you were complaining about being slow in the
> other thread,
No, that's another one :)
This one was remade from scratch after the filesystem on it got
corrupted.
5 x 4TB swraid5   64GB SSD
        \         /
          bcache
          dmcrypt
          btrfs
Smart is 100% for all 5 drives, and they passed an extensive test before
I built the new raid and filesystem on them.
> and considering redoing with bcache to an ssd added, as
> seems very likely, if it /is/ actually storage device or bus errors, that
> could be one reason the previous one was getting so slow... Maybe it
> wasn't btrfs after all.
Good thinking, although in this case, it's a different filesystem.
This filesystem is, however, on a SATA port multiplier with a 2 meter
cable to an external disk array.
As a result, bandwidth to it is going to be slow-ish, and the long cable
could be adding I/O errors.
On Tue, Feb 23, 2016 at 11:17:06PM +0000, Duncan wrote:
> I believe all formal documentation of what the error counters actually
> mean is developer-level -- "Trust the Source, Luke."
Haha, I know that one :)
Although to be fair, I was really hoping someone could tell me what
they're supposed to mean, so that I could update the wiki to capture that info.
> Yet another point supporting the "btrfs is still stabilizing, not yet
> fully stable" position, I suppose, as it could definitely be argued that
> those counters and their visibility, including display in the kernel log
> at mount time, are definitely intended to be consumed at the admin-user
> level, and that it follows that they should be documented at the admin-
> user level before the filesystem can properly be defined as fully stable.
Yes :) and I'm happy to help make this a reality in the wiki at least.
> Write error counter increments should be accompanied by kernel log events
> telling you more -- what level of the device stack is returning the
> errors that propagate up to the filesystem level, for instance. Expected
> would be either bus level timeouts and resets, or storage device errors.
I agree, and I get 0 such errors here, which is why it's weird.
> If it's storage device errors, SMART data should show increasing raw
> value reallocated sectors or the like (smartctl -A). If it's bus errors,
Correct, and they are all at 0.
> it could be bad cabling (bad connections or bad shielding, or using
> SATA-150 certified cables for SATA-600 or some such), or, as I saw on an
Cabling is indeed a likely culprit, I'm just surprised that if it's the
case, the sata layer is showing me nothing (I'm doing tail -f
/var/log/kern.log and usually I'd see sata or PMP errors there)
> old and failing mobo (when I pulled it there were bulging and some
> exploded capacitors) a few years ago, failing filter-capacitors on the
> mobo signalling paths. Bad power, including the possibility of an
> overloaded UPS that hit one guy I know, is notorious for both this sort
> of issue and memory problems, as well.
All true, but wouldn't all of these show up as actual disk errors by the
underlying driver involved too?
Marc
* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-24 0:38 UTC (permalink / raw)
To: linux-btrfs
Marc MERLIN posted on Tue, 23 Feb 2016 16:19:44 -0800 as excerpted:
> Cabling is indeed a likely culprit, I'm just surprised that if it's the
> case, the sata layer is showing me nothing (I'm doing tail -f
> /var/log/kern.log and usually I'd see sata or PMP errors there)
That /is/ surprising. No explanation, there, tho I don't know enough
about such errors to know if they /always/ tend to show up in the logs,
or not, only that mine generally have.
* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Marc MERLIN @ 2016-03-07 15:13 UTC (permalink / raw)
To: linux-btrfs
On Tue, Feb 23, 2016 at 01:59:11PM -0800, Marc MERLIN wrote:
> I have a freshly created md5 array, with drives that I specifically
> scanned one by one block by block, and for good measure, I also scanned
> the entire software raid with a check command which took 3 days to run.
>
> Everything passed.
>
> Then, I made a bcache of that device, an ssd that seems to work fine
> otherwise (brand new), and dmcrypted the result
>
> md5 - bcache - dmcrypt - btrfs
> ssd /
>
> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
> show up and the write counter go up one by one.
> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0, flush 0, corrupt 0, gen 0
>
> Where is the documentation for those counters?
> Is the write error fatal, or a recovered error?
> Should I consider that my filesystem is corrupted as soon as any of
> those counters go up?
> (I couldn't find an exact meaning of each of them)
>
Sadly, this problem hasn't gone away:
[ 2381.333412] BTRFS error (device dm-5): bdev /dev/mapper/oldds1 errs: wr 298, rd 0, flush 0, corrupt 0, gen 0
I'm really trying to make sense out of it.
Are those recovered errors (bad IO, command was retried, things worked
after that), or fatal errors (data loss)?
That md5 is in a disk shelf at the end of a longish eSATA cable. It's
possible that the cable is bad, or it could be something else entirely.
I'm still trying to understand the error so that I can diagnose and
address it properly.
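To put a number on how fast the counter is climbing, two readings can be compared. A minimal Python sketch, assuming both messages refer to the same filesystem and that the counters were not reset in between (for instance with `btrfs device stats -z`); the function name is made up:

```python
def counter_deltas(before, after):
    """Return {counter: increase} for every counter that changed."""
    deltas = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change != 0:
            deltas[name] = change
    return deltas


# The two readings quoted in this thread (2016-02-23 and 2016-03-07).
first = {"wr": 17, "rd": 0, "flush": 0, "corrupt": 0, "gen": 0}
latest = {"wr": 298, "rd": 0, "flush": 0, "corrupt": 0, "gen": 0}
print(counter_deltas(first, latest))
```

Only the write counter has moved between the two readings, which at least narrows the question to what the layers below btrfs are doing with writes.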
Thanks,
Marc