linux-btrfs.vger.kernel.org archive mirror
From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0
Date: Tue, 23 Feb 2016 23:17:06 +0000 (UTC)	[thread overview]
Message-ID: <pan$1ce2f$38765775$42544d39$1c9fd0a5@cox.net> (raw)
In-Reply-To: 20160223215911.GA13811@merlins.org

Marc MERLIN posted on Tue, 23 Feb 2016 13:59:11 -0800 as excerpted:

> I have a freshly created md5 array, with drives that I specifically
> scanned one by one block by block, and for good measure, I also scanned
> the entire software raid with a check command which took 3 days to run.
> 
> Everything passed.
> 
> Then, I made a bcache of that device, an ssd that seems to work fine
> otherwise (brand new), and dmcrypted the result
> 
> md5 - bcache - dmcrypt - btrfs ssd /
> 
> Now, I'm copying data over with btrfs send, and I'm seeing these slowly
> show up and the write counter go up one by one.
> BTRFS error (device dm-7): bdev /dev/mapper/oldds1 errs: wr 17, rd 0,
> flush 0, corrupt 0, gen 0
> 
> Where is the documentation for those counters?
> Is the write error fatal, or a recovered error?
> Should I consider that my filesystem is corrupted as soon as any of
> those counters go up?
> (I couldn't find an exact meaning of each of them)

I believe all formal documentation of what the error counters actually 
mean is developer-level -- "Trust the Source, Luke."

Unless something has recently been added to the wiki documenting them, the 
only admin/user-level documentation is the brief mention in the 
btrfs-device manpage under stats, plus whatever can be gathered from the 
counter names themselves, from comments here on this list, and from simply 
observing real behavior and the kernel log when the counters increment -- 
often by reading between the lines.
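For what it's worth, the counter names as "btrfs device stats <mnt>" prints 
them line up one-to-one with the wr/rd/flush/corrupt/gen abbreviations in 
the kernel message. A minimal sketch -- the sample output below is invented 
for illustration, not taken from a real device:

```shell
# The five per-device counters, in the format "btrfs device stats <mnt>"
# prints them. write_io_errs is the "wr" number from the kernel log line;
# flag any counter that has gone nonzero.
nonzero=$(awk -F'[. ]+' '$3 + 0 > 0 { print $2, "=", $3 }' <<'EOF'
[/dev/mapper/oldds1].write_io_errs    17
[/dev/mapper/oldds1].read_io_errs     0
[/dev/mapper/oldds1].flush_io_errs    0
[/dev/mapper/oldds1].corruption_errs  0
[/dev/mapper/oldds1].generation_errs  0
EOF
)
echo "$nonzero"
```

On a live system you would of course run the btrfs command itself instead 
of parsing a canned sample.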

Yet another point supporting the "btrfs is still stabilizing, not yet 
fully stable" position, I suppose. It could certainly be argued that those 
counters and their visibility, including the display in the kernel log at 
mount time, are intended to be consumed at the admin-user level, and that 
they should therefore be documented at that level before the filesystem 
can properly be called fully stable.


Meanwhile, not saying my own admin-user viewpoint is gospel, by any 
stretch, but with the intent of hopefully helping make sense of things...

From my own experience of some months with a failing ssd (as part of a 
raid1 pair with an ssd that was working fine, so I could and did 
regularly scrub the errors and took advantage of the checksummed raid1 
pairing to let it go much further than I would have in other 
circumstances, simply to observe how things worked as it degraded)...

Write error counter increments should be accompanied by kernel log events 
telling you more -- what level of the device stack is returning the 
errors that propagate up to the filesystem level, for instance.  Expected 
would be either bus level timeouts and resets, or storage device errors.  
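To illustrate what to look for (hedged sketch; the log lines below are 
invented stand-ins -- on a live box you would pipe dmesg or the journal 
through the same grep):

```shell
# Pull the lower-level errors that typically accompany btrfs write-error
# increments out of the kernel log. Sample lines stand in for dmesg here:
# a libata exception/reset (bus level) and a block-layer I/O error.
matches=$(grep -E 'I/O error|hard resetting|exception Emask' <<'EOF'
[ 1234.5] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1234.6] ata3: hard resetting link
[ 1234.7] blk_update_request: I/O error, dev sdc, sector 123456
EOF
)
echo "$matches"
```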

If it's storage device errors, SMART data should show an increasing raw 
value for reallocated sectors or the like (smartctl -A).  If it's bus 
errors, it could be bad cabling (bad connections or bad shielding, or 
SATA-150 certified cables used for SATA-600 or some such), or failing 
filter capacitors on the mobo signalling paths, as I saw on an old and 
failing mobo a few years ago (when I pulled it, there were bulging and 
some exploded capacitors).  Bad power, including the possibility of an 
overloaded UPS, which hit one guy I know, is notorious for both this sort 
of issue and for memory problems as well.
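A quick way to watch the SMART side of that (sketch only; the attribute 
table below is abridged and the values invented):

```shell
# Extract the raw reallocated-sector count (attribute 5) from
# "smartctl -A /dev/sdX"-style output. RAW_VALUE is the last column;
# a raw value that keeps rising across checks points at a dying drive.
realloc=$(awk '$2 == "Reallocated_Sector_Ct" { print $NF }' <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   0
EOF
)
echo "reallocated (raw): $realloc"
```

Current_Pending_Sector is worth the same treatment, since pending sectors 
are the ones that haven't been successfully remapped yet.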

Of course, bus timeout errors can also be due to the bus timeout 
(typically 30 seconds) being lower than the device's own retry time (often 
2 minutes on consumer-level devices).  There are others here with far more 
knowledge in that area than I have, including what to do to fix it; the 
various options have been posted to this list multiple times by now, and 
likely will be posted here again.
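For the archives, the usual shape of those fixes, echoed here rather than 
executed (the device name is hypothetical, and not every consumer drive 
accepts SCT ERC -- check smartctl -l scterc support on yours first):

```shell
# Timeout-mismatch mitigation, two alternatives: either make the drive
# give up before the kernel's ~30 s timeout (SCT ERC, here 7.0 seconds),
# or raise the kernel's per-device timeout above the drive's ~2 min
# worst-case retries. Printed instead of run so nothing is touched.
dev=sdX   # hypothetical; substitute your actual device
cmds="smartctl -l scterc,70,70 /dev/$dev
echo 180 > /sys/block/$dev/device/timeout"
echo "$cmds"
```

The first command is drive-side and does not survive a power cycle on most 
drives, so it typically goes in a boot script; the second is kernel-side 
and likewise needs reapplying per boot.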

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Thread overview: 6+ messages
2016-02-23 21:59 Documentation for BTRFS error (device dev): bdev /dev/xx errs: wr 22, rd 0, flush 0, corrupt 0, gen 0 Marc MERLIN
2016-02-23 23:17 ` Duncan [this message]
2016-02-23 23:22   ` Duncan
2016-02-24  0:19   ` Marc MERLIN
2016-02-24  0:38     ` Duncan
2016-03-07 15:13 ` Marc MERLIN
