Re: Unocorrectable errors with RAID1


From: Christoph Groth <christoph@grothesque.org>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Unocorrectable errors with RAID1
Date: Mon, 16 Jan 2017 16:42:31 +0100
Message-ID: <87fukjdna0.fsf@grothesque.org>
In-Reply-To: <85a62769-0607-4be5-3c5b-5091bebea07e@gmail.com> (Austin S. Hemmelgarn's message of "Mon, 16 Jan 2017 08:24:37 -0500")

Austin S. Hemmelgarn wrote:
> On 2017-01-16 06:10, Christoph Groth wrote:

>> root@mim:~# btrfs fi df /
>> Data, RAID1: total=417.00GiB, used=344.62GiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID1: total=40.00MiB, used=68.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID1: total=3.00GiB, used=1.35GiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=464.00MiB, used=0.00B

> Just a general comment on this, you might want to consider 
> running a full balance on this filesystem, you've got a huge 
> amount of slack space in the data chunks (over 70GiB), and 
> significant space in the Metadata chunks that isn't accounted 
> for by the GlobalReserve, as well as a handful of empty single 
> profile chunks which are artifacts from some old versions of 
> mkfs.  This isn't of course essential, but keeping ahead of such 
> things does help sometimes when you have issues.

Thanks!  So slack is the difference between "total" and "used"?  I 
saw that the manpage of "btrfs balance" explains this a bit in its 
"examples" section.  Are you aware of any more in-depth 
documentation?  Or does one have to look at the source at this level?
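
As a rough illustration of that arithmetic, here is a minimal Python 
sketch that computes "total" minus "used" for each line of the 
"btrfs fi df" output quoted above.  The helper names (parse_fi_df, 
slack_gib) are made up for this example, and the regular expression 
assumes the line format shown in this message; other btrfs-progs 
versions may print it differently.

```python
import re

# Binary unit suffixes as they appear in the quoted output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Assumed line format: "Data, RAID1: total=417.00GiB, used=344.62GiB"
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)$"
)


def parse_fi_df(text):
    """Return a list of (kind, profile, total_bytes, used_bytes) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a chunk summary line
        total = float(m["total"]) * UNITS[m["tunit"]]
        used = float(m["used"]) * UNITS[m["uunit"]]
        rows.append((m["kind"], m["profile"], total, used))
    return rows


def slack_gib(rows):
    """Map 'Kind, profile' to allocated-but-unused space in GiB."""
    return {
        f"{kind}, {profile}": (total - used) / UNITS["GiB"]
        for kind, profile, total, used in rows
    }


# Output quoted earlier in this message (before the balance).
BEFORE = """\
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
"""

if __name__ == "__main__":
    for name, gib in slack_gib(parse_fi_df(BEFORE)).items():
        print(f"{name}: {gib:.2f} GiB slack")
```

For the "Data, RAID1" line this gives 417.00 - 344.62 = 72.38 GiB, 
which matches the "over 70GiB" mentioned in the quoted comment.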

I ran

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs balance start -dusage=25 -musage=25 /

This resulted in

root@mim:~# btrfs fi df /
Data, RAID1: total=365.00GiB, used=344.61GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=2.00GiB, used=1.35GiB
GlobalReserve, single: total=460.00MiB, used=0.00B

I hope that one day there will be a daemon that silently performs 
all the necessary btrfs maintenance in the background when system 
load is low!

>> * So scrubbing is not enough to check the health of a btrfs 
>> file system?  It’s also necessary to read all the files?

> Scrubbing checks data integrity, but not the state of the data. 
> IOW, you're checking that the data and metadata match with the 
> checksums, but not necessarily that the filesystem itself is 
> valid.

I see, but what should one then do to detect problems such as mine 
as soon as possible?  Periodically calculate hashes for all files? 
I’ve never seen a recommendation to do that for btrfs.
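
To make the "periodically calculate hashes" idea concrete, here is a 
minimal Python sketch, not a recommended btrfs procedure.  It reads 
every regular file under a directory, records a SHA-256 digest, and 
compares two such manifests.  All function names are hypothetical.  
The assumptions are that merely reading each file is enough to 
surface read errors (they are recorded as "ERROR: ..." entries), and 
that a changed digest is only a hint, since it cannot tell corruption 
from a legitimate modification.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash one file in fixed-size chunks to keep memory use bounded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files (sockets, FIFOs, devices).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            rel = os.path.relpath(path, root)
            try:
                manifest[rel] = sha256_of(path)
            except OSError as exc:
                # A read failure is itself worth recording.
                manifest[rel] = f"ERROR: {exc}"
    return manifest


def changed_files(old, new):
    """Paths present in both manifests whose recorded value differs."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

A periodic job could store the manifest (for example as JSON) and 
report the result of changed_files(previous, current) for files that 
are not expected to change.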

> There are a few things you can do to mitigate the risk of not 
> using ECC RAM though:
> * Reboot regularly, at least weekly, and possibly more 
> frequently.
> * Keep the system cool, warmer components are more likely to 
> have transient errors.
> * Prefer fewer numbers of memory modules when possible.  Fewer 
> modules means less total area that could be hit by cosmic rays 
> or other high-energy radiation (the main cause of most transient 
> errors).

Thanks for the advice, I think I buy the regular reboots.

As a consequence of my problem I think I’ll stop using RAID1 on 
the file server, since this only protects against dead disks, 
which evidently is only part of the problem.  Instead, I’ll make 
sure that the laptop that syncs with the server has an SSD that is 
big enough to hold all the data that is on the server as well (1 
TB SSDs are affordable now).  This way, instead of disk-level 
redundancy, I’ll have machine-level redundancy.  When something 
like the current problem hits one of the two machines, I should 
still have a usable second machine with all the data on it.
