Re: strange i/o errors with btrfs on raid/lvm

From: "Jogi Hofmüller" <jogi@mur.at>
To: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: strange i/o errors with btrfs on raid/lvm
Date: Mon, 28 Sep 2015 10:39:53 +0200
Message-ID: <5608FCD9.2000902@mur.at>
In-Reply-To: <CAJCQCtTaagpqurnjLeN7Mw8oBv59uWYtMbOb5REo8NgiP+UjjA@mail.gmail.com>

Hi Chris, all,

On 2015-09-25 at 22:47, Chris Murphy wrote:
> On Fri, Sep 25, 2015 at 2:26 PM, Jogi Hofmüller <jogi@mur.at> wrote:
> 
>> That was right while the RAID was in degraded state and rebuilding.
> 
> On the guest:
> 
> Aug 28 05:17:01 vm kernel: [140683.741688] BTRFS info (device vdc):
> disk space caching is enabled
> Aug 28 05:17:13 vm kernel: [140695.575896] BTRFS warning (device vdc):
> block group 13988003840 has wrong amount of free space

The device vdc is the backup device.  That's where we collect snapshots
of our mail spool.  I was able to fix that by remounting with
-o clear_cache, as you suggested.  However, it is not directly related
to the I/O error problem.  The backup device sits on a different RAID
than the mail spool.

> On the host, there are no messages that correspond to this time index,
> but a bit over an hour and a half later are when there are sas error
> messages, and the first reported write error.
> 
> I see the rebuild event starting:
> 
> Aug 28 07:04:23 host mdadm[2751]: RebuildStarted event detected on md
> device /dev/md/0
> 
> But there are subsequent sas errors still, including hard resetting of
> the link, and additional read errors. This continues more than once...

That is smartd still trying to read the failed device (sda).

> And then
> 
> Aug 28 17:06:49 host mdadm[2751]: RebuildFinished event detected on md
> device /dev/md/0, component device  mismatches found: 2048 (on raid
> level 10)

Right.  I totally missed the 'mismatches found' part :(
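
Since that part of the message is easy to overlook, here is a minimal
Python sketch for pulling it out of a log.  It assumes each event is on a
single line (the mail client wrapped the quote above) and that the
wording matches the quoted mdadm message; the function name is made up
for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple

# Pattern modelled on the single mdadm line quoted above; other mdadm
# versions may word the event differently (assumption).
_MISMATCH_RE = re.compile(
    r"RebuildFinished event detected on md device (?P<dev>\S+?),"
    r".*?mismatches found: (?P<count>\d+)"
)


def find_rebuild_mismatches(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (device, mismatch count) for each RebuildFinished event
    that reports mismatches.  Lines without such a report are skipped."""
    for line in lines:
        match = _MISMATCH_RE.search(line)
        if match:
            yield match.group("dev"), int(match.group("count"))
```

Feeding it the lines of the host's syslog (wherever that lives on the
system in question) would list every rebuild that ended with a nonzero
mismatch report.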

> and also a number of SMART warnings about seek error on another device
> 
> Aug 28 17:35:55 host smartd[3146]: Device: /dev/sda [SAT], SMART Usage
> Attribute: 7 Seek_Error_Rate changed from 180 to 179

Still the failed device.  These messages continued until we replaced the
disk with a new one.

> But 2048 mismatches found after a rebuild is a problem. So there's
> already some discrepancy in the mdadm raid10. And mdadm raid1 (or 10)
> cannot resolve mismatches because which block is correct is ambiguous.
> So that means something is definitely going to get corrupt. Btrfs, if
> the metadata profile is DUP can recover from that. But data can't.
> Normally this results in an explicit Btrfs message about a checksum
> mismatch and no ability to fix it, but will still report the path to
> affected file.  But I'm not finding that.

I ran checkarray on md0 and that reduced the mismatch_cnt to 384.
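
For reference, the counter can be read straight from sysfs.  The sketch
below is only an illustration: it assumes the usual md sysfs layout
(/sys/block/<array>/md/mismatch_cnt) and that the value is counted in
512-byte sectors, as the kernel md documentation describes.  The
function names and the sysfs_root parameter are hypothetical and not
part of any tool.

```python
from pathlib import Path

# Assumption: md reports mismatch_cnt in 512-byte sectors.
SECTOR_BYTES = 512


def read_mismatch_cnt(md_name: str = "md0", sysfs_root: str = "/sys/block") -> int:
    """Return the mismatch count recorded by the last check/repair/resync."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def mismatch_bytes(sectors: int) -> int:
    """Convert a sector count to bytes."""
    return sectors * SECTOR_BYTES
```

Under that assumption, 384 sectors correspond to mismatch_bytes(384) ==
196608 bytes, i.e. 192 KiB that differ between the mirror halves.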

What I still don't understand is why it is possible to make a backup of
a file that is not accessible in the file system.  All files that
produce an I/O error upon access are fine on the backup drive.  It's
even possible to restore a file from the backup drive and then it is
read/writable again.

Another thing I cannot explain is why the only files affected are those
that get read/written lots of times.

And finally, the question is why none of the other logical volumes
that reside on the same RAID experiences any problems.  There are several
other logical volumes containing btrfs and ext4 file systems.

Anyhow, thanks for all the suggestions so far.

Cheers,

PS:  my messages with attached log files got forwarded to /dev/null
because they exceeded the 100000 char limit :(
-- 
j.hofmüller

We are all idiots with deadlines.                       - Mike West
