From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Marc MERLIN <marc@merlins.org>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>,
Josef Bacik <josef@toxicpanda.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
Chris Murphy <lists@colorremedies.com>,
Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Re: Suggestions for building new 44TB Raid5 array
Date: Sat, 11 Jun 2022 19:44:43 -0400
Message-ID: <YqUo678pVtaRuF1V@hungrycats.org>
In-Reply-To: <20220611045120.GN22722@merlins.org>

On Fri, Jun 10, 2022 at 09:51:20PM -0700, Marc MERLIN wrote:
> so, my apologies to all for the thread of death that is hopefully going
> to be over soon. I still want to help Josef fix the tools though,
> hopefully we'll get that filesystem back to a mountable state.
>
> That said, it's been over 2 months now, and I do need to get this
> filesystem back up from backup, so I ended up buying new drives (5x
> 11TiB in raid5).
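(For reference, raid5 usable capacity works out to (N-1) × member size, so five 11 TiB members give the 44 TiB in the subject line. A quick sketch of the arithmetic:)

```python
# raid5 usable capacity: one member's worth of space goes to parity
n_members = 5
member_tib = 11
usable_tib = (n_members - 1) * member_tib
print(usable_tib)  # 44
```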
>
> Given the pretty massive corruption that happened in ways that I still
> can't explain, I'll make sure to turn off all the drive write caches,
> but I'm not sure I want to trust bcache anymore even though
> I had it in writethrough mode.
>
> Here's the Email from March, questions still apply:
>
> Kernel will be 5.16. Filesystem will be 24TB and contain mostly bigger
> files (100MB to 10GB).
>
> 1) mdadm --create /dev/md7 --level=5 --consistency-policy=ppl --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=internal
> 2) echo 0fb96f02-d8da-45ce-aba7-070a1a8420e3 > /sys/block/bcache64/bcache/attach
> gargamel:/dev# cat /sys/block/md7/bcache/cache_mode
> [writethrough] writeback writearound none
> 3) cryptsetup luksFormat --align-payload=2048 -s 256 -c aes-xts-plain64 /dev/bcache64
> 4) cryptsetup luksOpen /dev/bcache64 dshelf1
> 5) mkfs.btrfs -m dup -L dshelf1 /dev/mapper/dshelf1
>
> Any other btrfs options I should set for format to improve reliability
> first and performance second?
> I'm told I should use space_cache=v2; is it the default now with btrfs-progs 5.10.1-2?
It's default with current btrfs-progs. I'm not sure what the cutoff
version is, but it doesn't matter--you can convert to v2 on first mount,
which will be fast on an empty filesystem.
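For completeness, a sketch of the one-time conversion (device and mount
point paths assumed from your step 5 above); the free-space tree flags
should show up in the superblock afterward:

```shell
# convert to the v2 free space cache (free-space tree) on first rw mount
mount -o space_cache=v2 /dev/mapper/dshelf1 /mnt/dshelf1

# verify: compat_ro flags should include FREE_SPACE_TREE and FREE_SPACE_TREE_VALID
btrfs inspect-internal dump-super /dev/mapper/dshelf1 | grep -i free_space
```

Once converted, the flag is persistent and later mounts don't need the option.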
> As for bcache, I'm really thinking about dropping it, unless I'm told
> it should be safe to use.
I would not recommend the cache in this configuration for resilience
because it doesn't keep device failures in separate failure domains.
Common SSD failure modes (e.g. silent data corruption, dropped writes)
can be detected but not repaired, and can affect any part of the
filesystem when viewed through the cache.
Unfortunately, a cache is only resilient with btrfs raid1 using SSD+HDD
cached device pairs, so that a failure of any single SSD or HDD affects
at most one btrfs device. That configuration works reasonably well, but
you'll need a pile more disks (both HDD and SSD) to match the capacity.
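To put a number on "a pile more disks": btrfs raid1 keeps two copies of
everything, so usable capacity is half the raw total. A rough sketch,
assuming the same 11 TiB drive size as above:

```python
import math

# btrfs raid1 stores two copies, so raw capacity needed is twice the target
target_usable_tib = 44
member_tib = 11
hdds_needed = math.ceil(2 * target_usable_tib / member_tib)
print(hdds_needed)  # 8 HDDs, each paired with its own cache SSD in this layout
```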
btrfs raid5 of SSD+HDD devices doesn't work--it will keep all IO accesses
below the cache's sequential IO size cutoff, which will wear out the SSDs
too fast (in addition to the other btrfs raid5 problems). Same problem
with raid10 or raid0.
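The wear problem falls out of the sizes involved: btrfs stripes its
raid5/raid10/raid0 profiles across member devices in fixed 64 KiB
elements, well under bcache's default 4 MiB sequential_cutoff, so every
per-device IO looks random to the cache and hits the SSD. (The numbers
below are the documented defaults, used here only for illustration.)

```python
# per-device IO size under btrfs striped profiles vs. bcache's bypass threshold
stripe_element_kib = 64            # btrfs per-device stripe element
sequential_cutoff_kib = 4 * 1024   # bcache default sequential_cutoff (4 MiB)

# every striped IO lands below the cutoff, so nothing bypasses the SSD cache
always_cached = stripe_element_kib < sequential_cutoff_kib
print(always_cached)  # True
```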
I've tested btrfs with both bcache and lvmcache. I mostly use lvmcache,
and have had no problems with it. bcache had problems in testing, so
I've never used bcache outside of test environments.
bcache has a few sharp edges when SSD devices fail that prevent
recovery while the filesystem stays online. Its access patterns also
seem to trigger service-interrupting firmware bugs in some SSD models
that lvmcache's do not: failures that are common on one
vendor/model/firmware but never happen on any other, and that occur
much more often, or only at all, when bcache is in use.
I have not lost data with bcache when SSD corruption is not present--it
survived hundreds of power-fail crash test cycles and came back after
all the SSD firmware crashes in testing--but the service interruptions
from crashing firmware and the inability to recover from a failed drive
while keeping the filesystem online were a problem. We worked around
this by using lvmcache instead.
If your IO subsystem has problems with write dropping, then it's going
to be much worse with any cache. Neither bcache nor lvmcache has
any sort of hardening against SSD corruption or failure. Both
fail badly on SSD corruption tests even in writethrough mode.
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/
>
Thread overview: 24+ messages
2022-06-11 4:51 Suggestions for building new 44TB Raid5 array Marc MERLIN
2022-06-11 9:30 ` Roman Mamedov
[not found] ` <CAK-xaQYc1PufsvksqP77HMe4ZVTkWuRDn2C3P-iMTQzrbQPLGQ@mail.gmail.com>
2022-06-11 14:52 ` Marc MERLIN
2022-06-11 17:54 ` Roman Mamedov
2022-06-12 17:31 ` Marc MERLIN
2022-06-12 21:21 ` Roman Mamedov
2022-06-13 17:46 ` Marc MERLIN
2022-06-13 18:06 ` Roman Mamedov
2022-06-14 4:51 ` Marc MERLIN
2022-06-13 18:10 ` Zygo Blaxell
2022-06-13 18:13 ` Marc MERLIN
2022-06-13 18:29 ` Roman Mamedov
2022-06-13 20:08 ` Zygo Blaxell
2022-06-14 6:36 ` Torbjörn Jansson
2022-06-20 20:37 ` Andrea Gelmini
2022-06-21 5:26 ` Zygo Blaxell
2022-07-06 9:09 ` Andrea Gelmini
2022-06-11 23:44 ` Zygo Blaxell [this message]
2022-06-14 11:03 ` ronnie sahlberg
[not found] ` <5e1733e6-471e-e7cb-9588-3280e659bfc2@aqueos.com>
2022-06-20 15:01 ` Marc MERLIN
2022-06-20 15:52 ` Ghislain Adnet
2022-06-20 16:27 ` Marc MERLIN
2022-06-20 17:02 ` Andrei Borzenkov
2022-06-20 17:26 ` Marc MERLIN