Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0


From: Marc MERLIN <marc@merlins.org>
To: Duncan <1i5t5.duncan@cox.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
Date: Tue, 23 Feb 2016 16:19:44 -0800
Message-ID: <20160224001944.GX22487@merlins.org>
In-Reply-To: <pan$84c48$1f5c952c$5c23be23$5799c3a@cox.net> <pan$1ce2f$38765775$42544d39$1c9fd0a5@cox.net>

On Tue, Feb 23, 2016 at 11:22:47PM +0000, Duncan wrote:
> Forgot to mention, tho you're probably already considering it, if this is 
> the same raid5-backed btrfs you were complaining about being slow in the 
> other thread, 

No, that's another one :)
This one was remade from scratch after the filesystem on it got
corrupted.
5 x 4TB swraid5      64GB SSD
          bcache
          dmcrypt
          btrfs

SMART is 100% for all 5 drives, and they passed an extensive test before
I built the new raid and filesystem on them.

> and considering redoing with bcache to an ssd added, as 
> seems very likely, if it /is/ actually storage device or bus errors, that 
> could be one reason the previous one was getting so slow...  Maybe it 
> wasn't btrfs after all.

Good thinking, although in this case, it's a different filesystem.

This filesystem is, however, on a SATA port multiplier with a 2 meter
cable to an external disk array.
As a result, bandwidth to it is going to be slow-ish, and the long cable
could be causing I/O errors.

On Tue, Feb 23, 2016 at 11:17:06PM +0000, Duncan wrote:
> I believe all formal documentation of what the error counters actually 
> mean is developer-level -- "Trust the Source, Luke."
 
Haha, I know that one :)
Although to be fair, I was more offering that someone tell me what
they're supposed to mean, and I'd update the wiki to capture that info.
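
As a starting point for such a wiki entry, here is a minimal sketch that
splits the log line from the subject into its five counters. The mapping
from the short names (wr, rd, flush, corrupt, gen) to the longer names
printed by `btrfs device stats` is the part I'm relying on; the function
name parse_errs and the exact regex are my own assumptions, not anything
taken from the kernel source.

```python
import re

# Pattern for the counter summary as it appears in the subject line.
# Assumption: the kernel always prints the five counters in this order.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

# Short names in the log line -> names shown by `btrfs device stats`.
LONG_NAMES = {
    "wr": "write_io_errs",
    "rd": "read_io_errs",
    "flush": "flush_io_errs",
    "corrupt": "corruption_errs",
    "gen": "generation_errs",
}


def parse_errs(line):
    """Return (device, {long_name: count}), or None if the line does not match."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    counts = {LONG_NAMES[k]: int(m.group(k)) for k in LONG_NAMES}
    return m.group("dev"), counts


if __name__ == "__main__":
    sample = ("BTRFS error (device dev): bdev /dev/xx errs: "
              "wr 22, rd 0, flush 0, corrupt 0, gen 0")
    print(parse_errs(sample))
```

This only labels the numbers; it says nothing yet about what each counter
is supposed to mean, which is the part that still needs documenting.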

> Yet another point supporting the "btrfs is still stabilizing, not yet 
> fully stable" position, I suppose, as it could definitely be argued that 
> those counters and their visibility, including display in the kernel log 
> at mount time, are definitely intended to be consumed at the admin-user 
> level, and that it follows that they should be documented at the admin-
> user level before the filesystem can properly be defined as fully stable.
 
Yes :) and I'm happy to help make this a reality in the wiki at least.
 
> Write error counter increments should be accompanied by kernel log events 
> telling you more -- what level of the device stack is returning the 
> errors that propagate up to the filesystem level, for instance.  Expected 
> would be either bus level timeouts and resets, or storage device errors.  
 
I agree, and I get 0 such errors here, which is why it's weird.

> If it's storage device errors, SMART data should show increasing raw 
> value relocated sectors or the like (smartctl -A).  If it's bus errors, 

Correct, and they are all at 0.

> it could be bad cabling (bad connections or bad shielding, or using 
> SATA-150 certified cables for SATA-600 or some such), or, as I saw on an 

Cabling is indeed a likely culprit. I'm just surprised that, if that's
the case, the SATA layer is showing me nothing (I'm running tail -f
/var/log/kern.log, and usually I'd see SATA or PMP errors there).
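
To make that check less dependent on me watching a terminal, here is a
rough sketch that counts btrfs counter lines and SATA-layer lines in a
saved log. It assumes libata messages carry an "ataN:" or "ataN.MM:"
prefix, which is what I usually see; the helper names classify and
summarize are made up for this example, and the patterns are deliberately
loose.

```python
import re
import sys

# Assumption: libata messages are prefixed with "ataN: " or "ataN.MM: ".
ATA_RE = re.compile(r"\bata\d+(?:\.\d+)?: ")
# The btrfs per-device counter summary, as in the subject line.
BTRFS_ERRS_RE = re.compile(r"BTRFS .*\berrs: wr \d+")


def classify(line):
    """Return "btrfs", "ata", or None for a single log line."""
    if BTRFS_ERRS_RE.search(line):
        return "btrfs"
    if ATA_RE.search(line):
        return "ata"
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"btrfs": 0, "ata": 0}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    # Default path matches the file I'm tailing; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"
    with open(path, errors="replace") as fh:
        print(summarize(fh))
```

A non-zero "btrfs" count next to a zero "ata" count is the situation I'm
describing: write errors reported by btrfs with nothing from the SATA
layer underneath.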

> old and failing mobo (when I pulled it there were bulging and some 
> exploded capacitors) a few years ago, failing filter-capacitors on the 
> mobo signalling paths.  Bad power, including the possibility of an 
> overloaded UPS that hit one guy I know, is notorious for both this sort 
> of issue and memory problems, as well.

All true, but wouldn't all of these show up as actual disk errors by the
underlying driver involved too?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

