Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0

linux-btrfs.vger.kernel.org archive mirror

* Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
@ 2016-02-23 21:59 Marc MERLIN
  2016-02-23 23:17 ` Duncan
  2016-03-07 15:13 ` Marc MERLIN
  0 siblings, 2 replies; 6+ messages in thread
From: Marc MERLIN @ 2016-02-23 21:59 UTC
  To: linux-btrfs

I have a freshly created md raid5 array, with drives that I specifically
scanned one by one block by block, and for good measure, I also scanned
the entire software raid with a check command which took 3 days to run.

Everything passed.

Then, I made a bcache of that device, an ssd that seems to work fine
otherwise (brand new), and dmcrypted the result

md raid5 - bcache - dmcrypt - btrfs
     ssd /

Now, I'm copying data over with btrfs send, and I'm seeing these slowly
show up and the write counter go up one by one.
BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0, flush 0, corrupt 0, gen 0

Where is the documentation for those counters?
Is the write error fatal, or a recovered error?
Should I consider that my filesystem is corrupted as soon as any of
those counters go up?
(I couldn't find an exact meaning of each of them)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-23 23:17 UTC
  To: linux-btrfs

Marc MERLIN posted on Tue, 23 Feb 2016 13:59:11 -0800 as excerpted:

> I have a freshly created md raid5 array, with drives that I specifically
> scanned one by one block by block, and for good measure, I also scanned
> the entire software raid with a check command which took 3 days to run.
> 
> Everything passed.
> 
> Then, I made a bcache of that device, an ssd that seems to work fine
> otherwise (brand new), and dmcrypted the result
> 
> md raid5 - bcache - dmcrypt - btrfs
>      ssd /
> 
> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
> show up and the write counter go up one by one.
> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0,
> flush 0, corrupt 0, gen 0
> 
> Where is the documentation for those counters?
> Is the write error fatal, or a recovered error?
> Should I consider that my filesystem is corrupted as soon as any of
> those counters go up?
> (I couldn't find an exact meaning of each of them)

I believe all formal documentation of what the error counters actually 
mean is developer-level -- "Trust the Source, Luke."

Unless something has recently been added to the wiki documenting them,
admin/user level documentation is limited to the brief mention under
stats in the btrfs-device manpage.  Beyond that, there is only what can
be gathered from the error counter names themselves, from comments here
on this list, and from observing real behavior and the kernel log when
the counters increment.
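
For reference, the five abbreviations in the log line correspond to the
counter names printed by btrfs device stats: write_io_errs, read_io_errs,
flush_io_errs, corruption_errs and generation_errs.  The Python sketch
below pulls them out of a log line so they can be recorded and compared.
The names parse_errs_line and LOG_TO_STATS and the regular expression are
illustrative assumptions, not part of btrfs, and the log prefix has varied
between kernel versions, so the pattern may need adjusting.

```python
import re

# Abbreviation used in the kernel log line -> name shown by `btrfs device stats`.
LOG_TO_STATS = {
    "wr": "write_io_errs",
    "rd": "read_io_errs",
    "flush": "flush_io_errs",
    "corrupt": "corruption_errs",
    "gen": "generation_errs",
}

# Assumed line shape, taken from the messages quoted in this thread.
LINE_RE = re.compile(
    r"BTRFS (?:error|warning|info) \(device (?P<fsdev>[^)]+)\): "
    r"bdev (?P<bdev>\S+) errs: (?P<counters>.*)$"
)


def parse_errs_line(line):
    """Return the filesystem device, block device and counters, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    counters = {}
    # The counter list looks like "wr 17, rd 0, flush 0, corrupt 0, gen 0".
    for name, value in re.findall(r"(\w+) (\d+)", match.group("counters")):
        counters[LOG_TO_STATS.get(name, name)] = int(value)
    return {
        "fsdev": match.group("fsdev"),
        "bdev": match.group("bdev"),
        "counters": counters,
    }
```

Lines that do not match the assumed shape return None rather than raising,
so the function can be run over a whole kernel log.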

Yet another point supporting the "btrfs is still stabilizing, not yet 
fully stable" position, I suppose, as it could definitely be argued that 
those counters and their visibility, including display in the kernel log 
at mount time, are definitely intended to be consumed at the admin-user 
level, and that it follows that they should be documented at the admin-
user level before the filesystem can properly be defined as fully stable.

Meanwhile, not saying my own admin-user viewpoint is gospel, by any 
stretch, but with the intent of hopefully helping make sense of things...

From my own experience of some months with a failing ssd (as part of a 
raid1 pair with an ssd that was working fine, so I could and did 
regularly scrub the errors and took advantage of the checksummed raid1 
pairing to let it go much further than I would have in other 
circumstances, simply to observe how things worked as it degraded)...

Write error counter increments should be accompanied by kernel log events 
telling you more -- what level of the device stack is returning the 
errors that propagate up to the filesystem level, for instance.  Expected 
would be either bus level timeouts and resets, or storage device errors.  

If it's storage device errors, SMART data should show increasing raw 
value reallocated sectors or the like (smartctl -A).  If it's bus errors, 
it could be bad cabling (bad connections or bad shielding, or using 
SATA-150 certified cables for SATA-600 or some such), or, as I saw on an 
old and failing mobo (when I pulled it there were bulging and some 
exploded capacitors) a few years ago, failing filter-capacitors on the 
mobo signalling paths.  Bad power, including the possibility of an 
overloaded UPS that hit one guy I know, is notorious for both this sort 
of issue and memory problems, as well.
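
As a rough illustration of the SMART check, the Python sketch below reads
the text printed by smartctl -A and extracts the raw values of a few
attributes.  Reallocated_Sector_Ct and Current_Pending_Sector point at the
storage device itself, while UDMA_CRC_Error_Count usually points at the
link (cabling).  The function name raw_smart_values is hypothetical, and
the parsing assumes the usual ten-column attribute table; attribute names
and availability differ between drive models.

```python
DEFAULT_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def raw_smart_values(smartctl_output, wanted=DEFAULT_ATTRIBUTES):
    """Map attribute name -> integer raw value for the wanted attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten
        # columns, the tenth being RAW_VALUE.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in wanted and raw.isdigit():
            values[name] = int(raw)
    return values
```

Comparing the result of two runs taken some time apart shows whether any
of the raw values is increasing.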

Of course, bus timeout errors can also occur when the timeout on the bus 
(typically 30 seconds) is shorter than the retry time on the device 
(often 2 minutes on consumer-level devices).  Others here have far more 
knowledge in that area than I have, including what to do to try to fix 
it.  The various options have been posted multiple times by now, and 
likely will be posted here again.
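
The timeout comparison can be sketched in Python as follows.  On Linux the
per-disk command timeout for SCSI/SATA disks is exposed in seconds in
/sys/block/<disk>/device/timeout.  The function names and the
device_recovery_seconds parameter are assumptions made for illustration;
the drive's own retry time is not read here and has to be supplied by the
caller (for example the 2-minute figure mentioned above).

```python
from pathlib import Path


def command_timeout_seconds(disk, sys_block=Path("/sys/block")):
    """Return the kernel command timeout for a disk such as 'sda', or None."""
    path = Path(sys_block) / disk / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Missing file (not a SCSI/SATA disk) or unexpected content.
        return None


def timeout_is_shorter_than_recovery(disk, device_recovery_seconds,
                                     sys_block=Path("/sys/block")):
    """True if the kernel would give up before the drive finishes retrying."""
    timeout = command_timeout_seconds(disk, sys_block)
    if timeout is None:
        return None
    return timeout < device_recovery_seconds
```

The sys_block parameter exists only so the functions can be pointed at a
test directory instead of the real sysfs tree.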

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-23 23:22 UTC
  To: linux-btrfs

Duncan posted on Tue, 23 Feb 2016 23:17:06 +0000 as excerpted:

> Marc MERLIN posted on Tue, 23 Feb 2016 13:59:11 -0800 as excerpted:
> 
>> I have a freshly created md raid5 array, with drives that I specifically
>> scanned one by one block by block, and for good measure, I also scanned
>> the entire software raid with a check command which took 3 days to run.
>> 
>> Everything passed.
>> 
>> Then, I made a bcache of that device, an ssd that seems to work fine
>> otherwise (brand new), and dmcrypted the result
>> 
>> md raid5 - bcache - dmcrypt - btrfs
>>      ssd /
>> 
>> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
>> show up and the write counter go up one by one.
>> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0,
>> flush 0, corrupt 0, gen 0
>> 
>> Where is the documentation for those counters?
>> Is the write error fatal, or a recovered error?
>> Should I consider that my filesystem is corrupted as soon as any of
>> those counters go up?
>> (I couldn't find an exact meaning of each of them)
> 
> I believe all formal documentation of what the error counters actually
> mean is developer-level -- "Trust the Source, Luke."

Forgot to mention, tho you're probably already considering it, if this is 
the same raid5-backed btrfs you were complaining about being slow in the 
other thread, and considering redoing with bcache to an ssd added, as 
seems very likely, if it /is/ actually storage device or bus errors, that 
could be one reason the previous one was getting so slow...  Maybe it 
wasn't btrfs after all.


* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Marc MERLIN @ 2016-02-24  0:19 UTC
  To: Duncan; +Cc: linux-btrfs

On Tue, Feb 23, 2016 at 11:22:47PM +0000, Duncan wrote:
> Forgot to mention, tho you're probably already considering it, if this is 
> the same raid5-backed btrfs you were complaining about being slow in the 
> other thread, 

No, that's another one :)
This one was remade from scratch after the filesystem on it got
corrupted.
5 x 4TB swraid5      64GB SSD
          bcache
          dmcrypt
          btrfs

Smart is 100% for all 5 drives, and they passed an extensive test before
I built the new raid and filesystem on them.

> and considering redoing with bcache to an ssd added, as 
> seems very likely, if it /is/ actually storage device or bus errors, that 
> could be one reason the previous one was getting so slow...  Maybe it 
> wasn't btrfs after all.

Good thinking, although in this case, it's a different filesystem.

This filesystem is, however, on a SATA port multiplier with a 2 meter
cable to an external disk array.
As a result, bandwidth to it is going to be slow-ish, and the long cable
could be adding I/O errors.

On Tue, Feb 23, 2016 at 11:17:06PM +0000, Duncan wrote:
> I believe all formal documentation of what the error counters actually 
> mean is developer-level -- "Trust the Source, Luke."
 
Haha, I know that one :)
Although to be fair, I was more hoping someone would tell me what
they're supposed to mean, so that I could update the wiki with that info.

> Yet another point supporting the "btrfs is still stabilizing, not yet 
> fully stable" position, I suppose, as it could definitely be argued that 
> those counters and their visibility, including display in the kernel log 
> at mount time, are definitely intended to be consumed at the admin-user 
> level, and that it follows that they should be documented at the admin-
> user level before the filesystem can properly be defined as fully stable.
 
Yes :) and I'm happy to help make this a reality in the wiki at least.
 
> Write error counter increments should be accompanied by kernel log events 
> telling you more -- what level of the device stack is returning the 
> errors that propagate up to the filesystem level, for instance.  Expected 
> would be either bus level timeouts and resets, or storage device errors.  
 
I agree, and I get 0 such errors here, which is why it's weird.

> If it's storage device errors, SMART data should show increasing raw 
> value reallocated sectors or the like (smartctl -A).  If it's bus errors, 

Correct, and they are all at 0.

> it could be bad cabling (bad connections or bad shielding, or using 
> SATA-150 certified cables for SATA-600 or some such), or, as I saw on an 

Cabling is indeed a likely culprit, I'm just surprised that if it's the
case, the sata layer is showing me nothing (I'm doing tail -f
/var/log/kern.log and usually I'd see sata or PMP errors there)

> old and failing mobo (when I pulled it there were bulging and some 
> exploded capacitors) a few years ago, failing filter-capacitors on the 
> mobo signalling paths.  Bad power, including the possibility of an 
> overloaded UPS that hit one guy I know, is notorious for both this sort 
> of issue and memory problems, as well.

All true, but wouldn't all of these show up as actual disk errors by the
underlying driver involved too?

Marc

* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Duncan @ 2016-02-24  0:38 UTC
  To: linux-btrfs

Marc MERLIN posted on Tue, 23 Feb 2016 16:19:44 -0800 as excerpted:

> Cabling is indeed a likely culprit, I'm just surprised that if it's the
> case, the sata layer is showing me nothing (I'm doing tail -f
> /var/log/kern.log and usually I'd see sata or PMP errors there)

That /is/ surprising.  No explanation, there, tho I don't know enough 
about such errors to know if they /always/ tend to show up in the logs, 
or not, only that mine generally have.


* Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
From: Marc MERLIN @ 2016-03-07 15:13 UTC
  To: linux-btrfs

On Tue, Feb 23, 2016 at 01:59:11PM -0800, Marc MERLIN wrote:
> I have a freshly created md raid5 array, with drives that I specifically
> scanned one by one block by block, and for good measure, I also scanned
> the entire software raid with a check command which took 3 days to run.
> 
> Everything passed.
> 
> Then, I made a bcache of that device, an ssd that seems to work fine
> otherwise (brand new), and dmcrypted the result
> 
> md raid5 - bcache - dmcrypt - btrfs
>      ssd /
> 
> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
> show up and the write counter go up one by one.
> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0, flush 0, corrupt 0, gen 0
> 
> Where is the documentation for those counters?
> Is the write error fatal, or a recovered error?
> Should I consider that my filesystem is corrupted as soon as any of
> those counters go up?
> (I couldn't find an exact meaning of each of them)
> 

Sadly, this problem hasn't gone away
[ 2381.333412] BTRFS error (device dm-5): bdev /dev/mapper/oldds1 errs: wr 298, rd 0, flush 0, corrupt 0, gen 0

I'm really trying to make sense out of it.
Are those recovered errors (bad IO, command was retried, things worked
after that), or fatal errors (data loss)?

That md raid5 is in a disk shelf at the end of a longish eSATA cable. It's
possible that the cable is bad, or it could be something else entirely.
I'm still trying to understand the error so that I can diagnose and
address it properly.
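
To see which counters are moving between two such log lines, a comparison
along the following lines may help.  This is only a Python sketch: the
function names are hypothetical, and it assumes both lines end in the
"errs: wr N, rd N, ..." list shown above.

```python
import re


def counters_from_line(line):
    """Extract {'wr': 17, 'rd': 0, ...} from the text after 'errs:'."""
    tail = line.split("errs:", 1)[1]
    return {name: int(value) for name, value in re.findall(r"(\w+) (\d+)", tail)}


def increased_counters(earlier_line, later_line):
    """Return only the counters that grew, with the size of the increase."""
    before = counters_from_line(earlier_line)
    after = counters_from_line(later_line)
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }
```

Applied to the two lines quoted in this message (wr 17 earlier, wr 298 now),
the only counter that grew is wr, by 281.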

Thanks,
Marc
