From: Christoph Groth
To: "Austin S. Hemmelgarn"
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Unocorrectable errors with RAID1
References: <87o9z7dzvd.fsf@grothesque.org> <85a62769-0607-4be5-3c5b-5091bebea07e@gmail.com>
Date: Mon, 16 Jan 2017 16:42:31 +0100
In-Reply-To: <85a62769-0607-4be5-3c5b-5091bebea07e@gmail.com> (Austin S. Hemmelgarn's message of "Mon, 16 Jan 2017 08:24:37 -0500")
Message-ID: <87fukjdna0.fsf@grothesque.org>

Austin S. Hemmelgarn wrote:
> On 2017-01-16 06:10, Christoph Groth wrote:
>> root@mim:~# btrfs fi df /
>> Data, RAID1: total=417.00GiB, used=344.62GiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID1: total=40.00MiB, used=68.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID1: total=3.00GiB, used=1.35GiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=464.00MiB, used=0.00B

> Just a general comment on this, you might want to consider
> running a full balance on this filesystem, you've got a huge
> amount of slack space in the data chunks (over 70GiB), and
> significant space in the Metadata chunks that isn't accounted
> for by the GlobalReserve, as well as a handful of empty single
> profile chunks which are artifacts from some old versions of
> mkfs. This isn't of course essential, but keeping ahead of such
> things does help sometimes when you have issues.

Thanks! So slack is the difference between "total" and "used"? I saw
that the manpage of "btrfs balance" explains this a bit in its
"examples" section. Are you aware of any more in-depth documentation?
Or does one have to look at the source at this level?
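
To make the slack arithmetic concrete, here is a minimal Python sketch, assuming that slack simply means "total" minus "used" per chunk type, as asked above. The names `parse_df` and `slack_gib` and the regular expression are illustrative only, and they handle nothing beyond the `btrfs fi df` output format quoted in this mail.

```python
import re

# Binary unit multipliers as printed in the `btrfs fi df` output above.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# Matches lines such as "Data, RAID1: total=417.00GiB, used=344.62GiB".
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)$"
)

# The output quoted in the mail, copied verbatim.
SAMPLE = """\
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
"""


def parse_df(text):
    """Return a list of (kind, profile, total_bytes, used_bytes) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a chunk summary line
        total = float(m["total"]) * UNITS[m["tunit"]]
        used = float(m["used"]) * UNITS[m["uunit"]]
        rows.append((m["kind"], m["profile"], total, used))
    return rows


def slack_gib(rows):
    """Map "kind, profile" to allocated-but-unused space in GiB."""
    return {
        f"{kind}, {profile}": (total - used) / 2**30
        for kind, profile, total, used in rows
    }


if __name__ == "__main__":
    for name, slack in slack_gib(parse_df(SAMPLE)).items():
        print(f"{name}: {slack:.2f} GiB slack")
```

For the Data, RAID1 line this gives 417.00 - 344.62 = 72.38 GiB, which matches the "over 70GiB" mentioned in the quoted reply.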
I ran

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /
btrfs balance start -dusage=25 -musage=25 /

This resulted in

root@mim:~# btrfs fi df /
Data, RAID1: total=365.00GiB, used=344.61GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=2.00GiB, used=1.35GiB
GlobalReserve, single: total=460.00MiB, used=0.00B

I hope that one day there will be a daemon that silently performs all
the necessary btrfs maintenance in the background when system load is
low!

>> * So scrubbing is not enough to check the health of a btrfs
>> file system? It’s also necessary to read all the files?

> Scrubbing checks data integrity, but not the state of the data.
> IOW, you're checking that the data and metadata match with the
> checksums, but not necessarily that the filesystem itself is
> valid.

I see, but what should one then do to detect problems such as mine as
soon as possible? Periodically calculate hashes for all files? I’ve
never seen a recommendation to do that for btrfs.

> There are a few things you can do to mitigate the risk of not
> using ECC RAM though:
> * Reboot regularly, at least weekly, and possibly more
> frequently.
> * Keep the system cool, warmer components are more likely to
> have transient errors.
> * Prefer fewer numbers of memory modules when possible. Fewer
> modules means less total area that could be hit by cosmic rays
> or other high-energy radiation (the main cause of most transient
> errors).

Thanks for the advice, I think I buy the regular reboots.

As a consequence of my problem I think I’ll stop using RAID1 on the
file server, since it only protects against dead disks, which
evidently is only part of the problem. Instead, I’ll make sure that
the laptop that syncs with the server has an SSD that is big enough to
hold all the data that is on the server as well (1 TB SSDs are
affordable now). This way, instead of disk-level redundancy, I’ll have
machine-level redundancy.
When something like the current problem hits one of the two machines, I should still have a usable second machine with all the data on it.
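
For reference, the "periodically calculate hashes for all files" idea from earlier in this mail could look roughly like the Python sketch below. It is not a btrfs feature or a recommended procedure, only an illustration; the function names and the manifest layout are hypothetical. It records a SHA-256 digest and the modification time of every regular file, and on a later run reports files whose content changed although the modification time did not.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each relative path under root to (mtime_ns, sha256 hex digest)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files; symlinks and special files are skipped.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            rel = os.path.relpath(path, root)
            manifest[rel] = (os.stat(path).st_mtime_ns, file_sha256(path))
    return manifest


def silent_changes(old, new):
    """Paths whose digest differs while the modification time is unchanged."""
    return sorted(
        rel
        for rel, (mtime, digest) in new.items()
        if rel in old and old[rel][0] == mtime and old[rel][1] != digest
    )
```

A read error raised while hashing is left unhandled on purpose, since an unreadable file is exactly the kind of problem such a run is meant to surface. Comparing against the second machine's manifest instead of an older local one would work the same way.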