From: Goffredo Baroncelli <kreijack@inwind.it>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: cryptearth <cryptearth@cryptearth.de>,
	linux-btrfs@vger.kernel.org, Josef Bacik <jbacik@fb.com>
Subject: Re: using raid56 on a private machine
Date: Tue, 6 Oct 2020 19:12:04 +0200	[thread overview]
Message-ID: <db963644-ac4b-cf19-dcf6-795ff92413e8@inwind.it> (raw)
In-Reply-To: <20201006012427.GD21815@hungrycats.org>

On 10/6/20 3:24 AM, Zygo Blaxell wrote:
> On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
[...]

>>
>> I have only a few suggestions:
>> 1) don't store valuable data on BTRFS with the raid5/6 profile. Use it
>> if you want to experiment and want to help the development of BTRFS,
>> but be ready to face the loss of all data. (Very unlikely, but the
>> bigger the filesystem is, the more difficult a restore of the data
>> becomes in case of problems.)
> 
> Losing all of the data seems unlikely given the bugs that exist so far.
> The known issues are related to availability (it crashes a lot and
> isn't fully usable in degraded mode) and small amounts of data loss
> (like 5 blocks per TB).

From what I read on the mailing list, when a problem is too complex to solve,
to the point that the filesystem has to be reformatted, quite often the main issue
is not to "extract" the data, but the availability of additional space to "store"
the extracted data.

> 
> The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
> raid5 or raid6 for metadata can result in total loss of the filesystem,
> but you can use raid1 or raid1c3 for metadata instead.
> 
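
As a sketch of what that looks like in practice (device names and the
mount point are placeholders; raid1c3 needs a reasonably recent kernel,
5.5 or newer):

   # new filesystem: raid6 for data, raid1c3 for metadata
   mkfs.btrfs -d raid6 -m raid1c3 /dev/sd[b-i]

   # or a one-time conversion of the metadata of an existing filesystem
   # (this is a profile conversion, not the periodic metadata balancing
   # discussed further down)
   btrfs balance start -mconvert=raid1c3 /mnt
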
>> 2) don't fill the filesystem to more than 70-80%. If you go beyond
>> this limit, the likelihood of hitting the "dark and scary corners"
>> quickly increases.
> 
> Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
> capacity, some of the bigger ones even higher.  If there were issues at
> 80% I expect I would have noticed them.  There were some performance
> issues with full filesystems on kernels using space_cache=v1, but
> space_cache=v2 went upstream 4 years ago, and other significant
> performance problems a year before that.

My suggestion was more about having enough headroom not to stress the filesystem
than "if you go beyond this limit you will have problems".

A BTRFS problem that confuses users is that you can still have free space but be
unable to allocate a new metadata chunk.

See
https://lore.kernel.org/linux-btrfs/6e6565b2-58c6-c8c1-62d0-6e8357e41a42@gmx.com/T/#t


Having the filesystem filled to 99% means that you have to watch the filesystem
carefully (and balance it) to avoid scenarios like this.
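
For example (paths are placeholders; the 50% threshold is just a
common starting point):

   # check how much space is still unallocated
   btrfs filesystem usage /mnt

   # compact data block groups that are less than half full, giving
   # their space back to the unallocated pool; metadata is not touched
   btrfs balance start -dusage=50 /mnt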

On the other hand, 1% of 1 TB (a small filesystem by today's standards) is about
10 GB, which everybody should consider enough....

  
> The last few GB is a bit of a performance disaster and there are
> some other gotchas, but that's an absolute number, not a percentage.

True, it is sufficient to have a few GB free (i.e. not allocated to any chunk)
on *enough* disks...

However, these requirements are a bit complex for a new BTRFS user to
understand.
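
For instance, a quick way to check whether *each* device still has
some unallocated space (a sketch; /mnt is a placeholder):

   # per-device breakdown; look at the "Unallocated" line of every device
   btrfs device usage /mnt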

> 
> Never balance metadata.  That is a ticket to a dark and scary corner.
> Make sure you don't do it, and that you don't accidentally install a
> cron job that does it.
> 
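
For what it's worth, a quick way to check whether some package has
installed such a job (a rough sketch; the paths depend on the
distribution):

   # look for scheduled balance invocations
   grep -r "btrfs balance" /etc/cron* /etc/systemd/system 2>/dev/null

A data-only filtered balance like the one shown above is the safe
variant; an unfiltered balance (or one with -m filters) rewrites the
metadata as well.
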
>> 3) run scrub periodically and after a power failure; better yet, use
>> an uninterruptible power supply (this is true for every RAID, even
>> the MD one).
> 
> scrub also provides early warning of disk failure, and detects disks
> that are silently corrupting your data.  It should be run not less than
> once a month, though you can skip months where you've already run a
> full-filesystem read for other reasons (e.g. replacing a failed disk).
> 
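
A minimal sketch of that routine (/mnt and the schedule are
placeholders):

   # start a scrub in the background and check on it later
   btrfs scrub start /mnt
   btrfs scrub status /mnt

   # per-device error counters, useful as an early warning
   btrfs device stats /mnt

   # monthly, e.g. from root's crontab:
   #   0 3 1 * *  btrfs scrub start /mnt
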
>> 4) I don't have any data to support this, but as an occasional reader
>> of this mailing list I have the feeling that combining BTRFS with LUKS
>> (or bcache) raises the likelihood of a problem.

> I haven't seen that correlation.  All of my machines run at least one
> btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
> bcache on test machines doing power-fail tests.


> 
> That said, there are additional hardware failure risks involved in
> caching (more storage hardware components = more failures) and the
> system must be designed to tolerate and recover from these failures.
> 
> When cache disks fail, just uncache and run scrub to repair.  btrfs
> checksums will validate the data on the backing HDD (which will be badly
> corrupted after a cache SSD failure) and will restore missing data from
> other drives in the array.
> 
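
With lvmcache that recovery could look roughly like this (a sketch;
the VG/LV names and the mount point are placeholders, and the exact
steps depend on how the cache was built and on how the SSD failed,
possibly requiring --force):

   # detach the failed cache from the logical volume
   lvconvert --uncache vg0/hdd_lv

   # let the btrfs checksums find and repair the corrupted blocks
   btrfs scrub start /mnt
   btrfs scrub status /mnt
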
> It's definitely possible to configure bcache or lvmcache incorrectly,
> and then you will have severe problems.  Each HDD must have a separate
> dedicated SSD.  No sharing between cache devices is permitted.  They must
> use separate cache pools.  If one SSD is used to cache two or more HDDs
> and the SSD fails, it will behave the same as a multi-disk failure and
> probably destroy the filesystem.  So don't do that.
> 
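
With bcache that means one cache set per backing device, roughly (a
sketch; device names are placeholders):

   # each HDD gets its own, dedicated SSD
   make-bcache -C /dev/nvme0n1 -B /dev/sda
   make-bcache -C /dev/nvme1n1 -B /dev/sdb
   # ...and so on; never reuse a -C device for a second HDD
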
> Note that firmware in the SSDs used for caching must respect write
> ordering, or the cache will do severe damage to the filesystem on
> just about every power failure.  It's a good idea to test hardware
> in a separate system through a few power failures under load before
> deploying them in production.  Most devices are OK, but a few percent
> of models out there have problems so severe they'll damage a filesystem
> in a single-digit number of power loss events.  It's fairly common to
> encounter users who have lost a btrfs on their first or second power
> failure with a problematic drive.  If you're stuck with one of these
> disks, you can disable write caching and still use it, but there will
> be added write latency, and in the long run it's better to upgrade to
> a better disk model.
> 
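
On a SATA disk the volatile write cache can be turned off with hdparm,
e.g. (a sketch; /dev/sdX is a placeholder, and on some models the
setting does not survive a power cycle, so it may have to be reapplied
at boot):

   hdparm -W 0 /dev/sdX    # 0 = write cache off, 1 = on
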
>> 5) pay attention that having an 8-disk RAID raises the likelihood of a
>> failure by about an order of magnitude compared to a single disk! RAID6
>> (or any other RAID) mitigates that, in the sense that it creates a
>> time window in which it is possible to do maintenance (e.g. a disk
>> replacement) before data is lost.
>> 6) leave room in the disk array for an additional disk (to use
>> when a disk replacement is needed)
>> 7) avoid USB disks, because they are not reliable
>>
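
Regarding 6): when the moment comes, the spare slot lets you do a live
replacement, roughly like this (a sketch; device names and the mount
point are placeholders):

   # put the new disk into the spare slot, then
   btrfs replace start /dev/failing /dev/new /mnt
   btrfs replace status /mnt
   # if the old device is already dead/missing, use its devid
   # (from 'btrfs filesystem show') instead of /dev/failing
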
>>
>>>
>>> Any information appreciated.
>>>
>>>
>>> Greetings from Germany,
>>>
>>> Matt
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
