Re: Unable to fixup (regular) error in RAID1 fs

Linux Btrfs filesystem development

From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Unable to fixup (regular) error in RAID1 fs
Date: Wed, 29 Oct 2014 03:02:14 +0000 (UTC)
Message-ID: <pan$926c2$2d969c3e$ebb0bb58$d02b674@cox.net>
In-Reply-To: a3d3305c6149a1aa5e2b3f29d76fff93@miceliux.com

Juan Orti posted on Tue, 28 Oct 2014 16:54:19 +0100 as excerpted:

> [ 3713.086292] BTRFS: unable to fixup (regular) error at logical 
> 483011874816 on dev /dev/sdb2
> [ 3713.092577] BTRFS: checksum error at logical 483011948544 on dev 
> /dev/sdb2, sector 628793528, root 2500, inode 1436631, offset 
> 4059963392, length 4096, links 1 (path: 
> juan/.local/share/gnome-boxes/images/boxes-unknown)
> [ 3713.092584] BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 
> 38, gen 0
> [ 3713.093035] BTRFS: unable to fixup (regular) error at logical 
> 483011948544 on dev /dev/sdb2
> 
> Why can't it fix the errors? a bad device? smartctl says the disk is ok. 
> I'm currently running a full scrub to see if it finds more errors. What 
> should I do?

Btrfs raid1, and I see you have it for both data and metadata.

During normal operation, when btrfs comes across a block that doesn't
match its checksum, it will look to see if there's another copy (which
there is with raid1, which has exactly two copies) of that block and,
if so, will try to use it instead.  If the second copy matches the
checksum, all is fine: btrfs returns the good copy to whatever was
reading it, and will in fact attempt to rewrite the bad copy using
the good copy.
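
To make that read path concrete, here is a minimal sketch of the logic
as I understand it.  It is an illustration only, not kernel code: the
function name read_with_repair and the list-of-copies model are made up
for this example, and zlib.crc32 merely stands in for the crc32c
checksum btrfs actually uses.

```python
import zlib


def read_with_repair(mirrors, expected_csum):
    """Return the first copy in `mirrors` whose checksum matches.

    `mirrors` is a list of bytes objects, one per copy (two for raid1).
    Bad copies are overwritten in place with the good copy.  If no copy
    matches, raise IOError, which is the "unable to fixup" case.
    NOTE: zlib.crc32 is an assumption standing in for btrfs's crc32c.
    """
    good = None
    for data in mirrors:
        if zlib.crc32(data) == expected_csum:
            good = data
            break

    if good is None:
        # Both copies are bad: nothing to repair from, the read fails.
        raise IOError("unable to fixup: no copy matches the checksum")

    # Rewrite every copy that fails the checksum using the good copy.
    for i, data in enumerate(mirrors):
        if zlib.crc32(data) != expected_csum:
            mirrors[i] = good

    return good
```

With one bad and one good copy the caller gets the good data and the
bad copy is replaced; with two bad copies the caller gets an error,
which is why the absence of application-level read errors matters.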

Those "unable to fixup" errors seem to indicate that it can't find a
good copy to update the bad copy with -- both copies ended up bad.
Either that or it found the good copy and returned it to whatever was
reading, but couldn't rewrite the bad copy, for some reason.

I'm not sure which of those interpretations is correct, but given
that you didn't see anything else bad happening (no apps reporting
read errors, etc), I'd guess the second, because otherwise whatever
was doing the read should have gotten an error back.

Doing a scrub, as you already did, is the first thing I'd try here,
since normal operation won't catch all the errors.

BUT, you report that the scrub found no errors, which is weird.
The log says there are corruption errors, but scrub says there
aren't.

The easiest explanation for something like that is that the errors
were temporary.  If it happens again or regularly, consider running
memtest86+ or the like, as it could be bad memory.  Do you have ECC RAM?

Another question.  Do you have skinny metadata on that btrfs?  If you
do, btrfs should mention "skinny extents" when mounting the filesystem.
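
If you'd rather not eyeball the kernel log, a small helper like the
following can do the check.  It is only a sketch: the function name is
made up for this example, and it relies on nothing more than the
assumption stated above, that the mount messages contain the phrase
"skinny extents".  The exact wording of the kernel message may differ
between kernel versions.

```python
def mentions_skinny_extents(log_text):
    """Return True if any line of log_text contains "skinny extents".

    The match is case-insensitive and does not depend on the rest of
    the message, since the exact kernel wording is an assumption here.
    """
    return any("skinny extents" in line.lower()
               for line in log_text.splitlines())
```

Feed it the text of your kernel log, for example the output of dmesg
saved to a file.  Reading the kernel log may need root on some systems.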

The reason I'm asking is that, if I'm reading the patch descriptions
correctly, a recently posted patch deals with a specific skinny-metadata
bug where wrong results would occasionally be returned, resulting in
errors.  Not being a dev, I don't have the technical ability to know for
sure whether this is connected, but it sounds like the sort of thing
I'd expect from a bug that intermittently returned bad data: odd
apparent corruption errors in normal use that scrub can't see, even
though scrub is designed to catch, and if possible fix, exactly that
sort of corruption.

Anyway, if scrub says no corruption, then for a potential corruption
error I'd be inclined to trust scrub, so I think the filesystem is fine.
But if so, I'm worried about what might be triggering these
intermittent errors.  Certainly watch for more of them, and if you're
running skinny-metadata, consider finding and applying that patch.
Whether or not you are, also be on the lookout for further hints
of failing memory, and/or run a good memory checker for a few hours
to see whether it reports all is well.
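
One way to watch for more of them is to track the per-device counters
btrfs prints in lines like the "bdev /dev/sdb2 errs: ..." one you
quoted.  The sketch below parses that line format as it appears in your
log.  The function names are made up for this example, and it assumes
each message sits on a single line (your mail client wrapped them).

```python
import re

# Matches the counter line format seen in the quoted log, e.g.
# "BTRFS: bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0"
ERRS_RE = re.compile(
    r"BTRFS: bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_errs(line):
    """Return (device, counters) for a counter line, else None."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


def latest_counters(lines):
    """Return {device: counters}, keeping the last counter line seen."""
    latest = {}
    for line in lines:
        parsed = parse_errs(line)
        if parsed is not None:
            dev, counters = parsed
            latest[dev] = counters
    return latest
```

Run latest_counters over the log now and again later: if "corrupt"
keeps climbing past the 38 in your log, the errors are not a one-off.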

But as they say about some kinds of potential cancer reports at times,
sometimes watchful waiting is the best you can do, hoping no further
symptoms show up, but being alert in case they do, to try something
more drastic, that isn't warranted /unless/ they do.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
