Re: uncorrectable errors after btrfs replace


From: Stefan Behrens <sbehrens@giantdisaster.de>
To: Chris Murphy <lists@colorremedies.com>
Cc: Stuart Pook <slp644161@pook.it>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: uncorrectable errors after btrfs replace
Date: Mon, 02 Sep 2013 18:23:06 +0200
Message-ID: <5224BB6A.7040900@giantdisaster.de>
In-Reply-To: <3E91F3EB-AE7F-47FE-AAD3-00B08F761911@colorremedies.com>

On Sun, 25 Aug 2013 20:07:32 -0600, Chris Murphy wrote:
> On Aug 25, 2013, at 4:10 PM, Stuart Pook <slp644161@pook.it> wrote:
>>
>> I emailed them to Stefan Behrens & Chris Murphy.  Please let me know if you did not get them (presumably because they are too big).
> 
> Observations:
> 
> 1. The problems started before the start of the provided log.
> 
> 2. smartd reports sdb at 100°C. The spec sheet for the WD2002FAEX gives a maximum of 60°C. It's possible the raw value isn't actually °C, so you'll need to look at the VALUE, WORST and THRESH columns of smartctl -a to determine whether it is at, or has hit, the threshold. It seems possible the drives are being cooked.
> 
> sdc is an ST2000DL004, for which Google finds this:
> http://forums.seagate.com/t5/Desktop-HDD-Desktop-SSHD/BEWARE-the-so-called-Samsung-HD204UI/m-p/166856
> 
> It also looks to be running hot. 
> 
> 3. The first ata error seems to be 8b/10b encoding related; it could be a connector problem, a port problem, a drive problem, or a firmware bug. Note that Emask 0x10 is bit 4 (AC_ERR_ATA_BUS in libata.h), which fits a link problem; the NCQ marker is a different bit, 0x400:
> AC_ERR_NCQ              = (1 << 10), /* marker for offending NCQ qc */
> 
> 4. Hundreds of these:
> ata10.00: failed command: READ FPDMA QUEUED
> 
> This implies a possible incompatibility between this drive and the controller. Disabling NCQ on the drive (setting the queue depth to 1) may fix the problem:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/550559
> 
> https://ata.wiki.kernel.org/index.php/Libata_FAQ
> 
> echo 1 > /sys/block/sdX/device/queue_depth
> 
> 
> I can't tell you which /dev/ node corresponds to ata10.00 because the log is incomplete, so I don't know which drive is giving you a hard time with NCQ. The catch is that disabling NCQ on just one drive will slow it down compared to the others, and I don't know how tolerant btrfs is of devices with different speeds.
> 
> 
> 
> 5. Tens of thousands of checksum errors on both dm-11 and dm-12. 
> 
> 6. Many instances of 
>  btrfs: unable to fixup (regular) error at logical 53281xxxxxx on dev /dev/dm-11
> 
> So kernel messages have been screaming about bus-related problems for some time and were ignored. btrfs did what it could and reported hundreds to thousands of errors in dmesg, but the user space tools didn't warn the user that the operations had effectively failed.
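
On item 3: the Emask field is a bit mask, so it can be checked against the flag definitions directly. A minimal sketch in Python, assuming the AC_ERR_* bit positions below match include/linux/libata.h (they are written from memory and should be checked against the kernel tree in use). decode_emask is a hypothetical helper for illustration, not a kernel or libata interface:

```python
# Bit positions as recalled from include/linux/libata.h
# (enum ata_completion_errors). ASSUMPTION: verify against your kernel tree.
AC_ERR_BITS = {
    0: "AC_ERR_DEV",
    1: "AC_ERR_HSM",
    2: "AC_ERR_TIMEOUT",
    3: "AC_ERR_MEDIA",
    4: "AC_ERR_ATA_BUS",
    5: "AC_ERR_HOST_BUS",
    6: "AC_ERR_SYSTEM",
    7: "AC_ERR_INVALID",
    8: "AC_ERR_OTHER",
    9: "AC_ERR_NODEV_HINT",
    10: "AC_ERR_NCQ",
}


def decode_emask(emask):
    """Return the names of all flags set in an Emask value (hypothetical helper)."""
    return [name for bit, name in sorted(AC_ERR_BITS.items())
            if emask & (1 << bit)]


if __name__ == "__main__":
    for value in (0x10, 0x400, 0x410):
        print(hex(value), decode_emask(value))
```

Under these assumptions 0x10 decodes to AC_ERR_ATA_BUS only, and AC_ERR_NCQ would appear as 0x400, so the first error points at the link rather than at NCQ by itself.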

Right, I assume that the WD6400AAKS failed to read the 250,000 blocks because of heat or SATA link issues. In that case the user space tools should have warned and aborted the operation, because there is hope that the read errors disappear once the disk has cooled down or the SATA link issues are fixed.

There is also the other use case, in which such unrecoverable read errors are expected: a disk that is about to die.

What is missing is a configuration option that selects whether to abort or to continue on unrecoverable read errors. An even better solution would be to implement an optional verify pass or scrub run at the end, and to declare the operation finished only when this additional check succeeds.
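
To make the proposal concrete, here is a rough sketch of the control flow in Python. It is only a model of the idea, not btrfs code: replace_device, ReplaceAborted, the on_error values and the verify callback are all hypothetical names, and nothing here reflects how the replace operation is actually implemented.

```python
class ReplaceAborted(Exception):
    """Raised when the copy stops at the first unrecoverable read error."""


def replace_device(blocks, read_block, write_block,
                   on_error="abort", verify=None):
    """Copy blocks from source to target (hypothetical model, not btrfs).

    read_block(n) returns the data or raises OSError on an unrecoverable
    read error; write_block(n, data) stores it on the target.
    on_error selects the policy: "abort" or "continue".
    verify, if given, is a final check (e.g. a scrub) returning True/False.
    Returns (status, failed_blocks).
    """
    if on_error not in ("abort", "continue"):
        raise ValueError("on_error must be 'abort' or 'continue'")

    failed = []
    for n in blocks:
        try:
            data = read_block(n)
        except OSError:
            if on_error == "abort":
                # Transient cause suspected (heat, link): stop and say so.
                raise ReplaceAborted(f"unrecoverable read error at block {n}")
            # Dying disk: salvage what is readable, remember what is not.
            failed.append(n)
            continue
        write_block(n, data)

    # Only report success when nothing failed and the final check passes.
    verified = verify() if verify is not None else True
    if failed or not verified:
        return "incomplete", failed
    return "finished", failed
```

The point of the sketch is that neither policy reports success silently: "abort" raises an error, and "continue" returns "incomplete" together with the list of blocks that could not be read.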
