From: cryptearth <cryptearth@cryptearth.de>
To: linux-btrfs@vger.kernel.org
Subject: Re: using raid56 on a private machine
Date: Tue, 6 Oct 2020 07:50:18 +0200
Message-ID: <6c516b70-d85f-6115-88e9-295adf4221b3@cryptearth.de>
In-Reply-To: <20201006012427.GD21815@hungrycats.org>
Hello Zygo,
that's quite a lot of information I wasn't aware of.
// In advance: sorry for the wall of text. This mail got a bit longer
than I expected.
I guess one point I still have to get my head around is the difference
between meta-blocks and data-blocks: I don't even know if and how my
current raid is capable of detecting any errors other than instant
failures, like corruption of structural metadata or of the actual data
blocks themselves.
Up until now I have never encountered any data corruption, nor any read
or write issues: I was always able to read back all data exactly as I
had written it. There's only one application that uses rather big files
(only in the range of 1GB-3GB) which somehow keeps corrupting itself
(it's GTA V). The files that fail are files which, at least in my eyes,
should only be opened for reading during runtime (as they contain
assets like models and textures), but they are actually opened in
read/write mode, so I suspect that for some odd reason the game itself
keeps writing to those files and thereby keeps corrupting itself. Other
big files, like other types of images (all about 4GB and more) or
database files, which I also often read from and write to, never had
any of these issues - but I guess that's just GTA V at its best. Aside
from some other rather strange CRM I had to use, it's one of the worst
pieces of modern software I know.
The types of errors I encountered, and which led to me replacing the
drives marked as failed, look about like this: the monitoring software
of this AMD fakeraid at some point pops up one of those notifications
telling me that one of the drives failed, was set to offline, and the
raid was set to critical. Looking into the logs it only says that some
operation <hex-code> failed at some address <another hex-code> on some
drive (port number) and that the BSL (bad sector list) was updated.
This comes up a few times and then there's the line about that drive
going offline - so it's a burst error. But even Google couldn't tell me
what those operation codes mean. So I don't know whether a drive failed
because of a read or write error, some parity or checksum issue, or for
whatever other reason. All the information I get is that there's a
burst error, the drive is marked as bad, some list is updated and the
array is set to a degraded state.
But, in contrast to what I've tested so far with BtrFS, it's not as if
the array goes offline or becomes unavailable after a reboot. I can
keep using it, and as RAID5 - at least this implementation - seems to
always calculate and check some sort of checksum anyway, there isn't
even any performance penalty. To get it running again all I have to do
is shut down the system, replace the drive, boot up again (the hardware
doesn't support hotplug) and hit rebuild in the raid control panel -
which takes only a couple of hours with my 3TB drives.
But, as already mentioned, as this is only RAID5, each rebuild is like
gambling and hoping that no other drive fails until the rebuild is
finished. If another drive went bad during a rebuild, the data would
most likely be lost. And recovery would be even harder as it's AMD's
proprietary stuff - and from what I was able to find, AMD has denied
help even to businesses, let alone to me as a private person. All I
could do would be to replace the board with some compatible one, but I
don't even know whether it just has to have an SB950 chipset or whether
it has to be the exact same board. The "bios-level" interface seems to
be implemented as an option ROM of its own - so it shouldn't depend on
the specific board.
Anyway, long story short: I don't want to wait until that catastrophe
occurs, but rather prevent it by changing my setup. Instead of relying
on something fused onto the motherboard, my friend suggested using a
simple "dumb" HBA and doing all the work in software, like Linux mdadm
or BtrFS/ZFS. As I learned over the past few days while reading up on
BtrFS' RAID-like capabilities, RAID isn't as simple as I thought until
now: an array can actually suffer from one (or more) drives returning
corrupted data instead of just failing, and a typical hardware RAID
controller or many "rather simple" software raid implementations can't
tell the difference between the actual data and a not-so-good copy. As
it was explained in some talk: some implementations work in a way that,
if a data block becomes corrupted and the parity check fails, the
parity - which could actually be used to recover the correct data - is
thrown away and overwritten with new parity re-calculated from the
corrupted data block. Hence special filesystems like BtrFS and ZFS are
recommended, as they keep additional information like per-block
checksums and can actually tell whether it was the parity or a data
block that became corrupted.
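If I understand the man pages correctly, with BtrFS that kind of
corruption would show up during a scrub and in the per-device error
counters - something like this (the mount point is of course just a
placeholder for wherever I'd mount the array):

    # read everything and repair blocks whose checksums don't match,
    # using the redundant copies / parity
    btrfs scrub start -Bd /mnt/array

    # per-device counters of read/write/corruption errors seen so far
    btrfs device stats /mnt/array

Please correct me if scrub isn't the right tool for that.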
As ZFS seems to have some not-so-clear license situation preventing it
from being included in the kernel, I took a look at BtrFS - which
doesn't quite seem to fit my needs yet. Sure, I could go with RAID 1+0
- but that would still result in only about 12TB of usable space while
actually throwing in 3 more 3TB drives, and I actually planned to
increase the usable size of my array instead of just bumping its
redundancy. As for metadata: I've read up on the RAID1 and RAID1c3/c4
profiles. Although RAID1c3 is the recommendation for a RAID6 (it stores
3 copies, so there should be at least one copy left even after a double
failure), is using RAID1c4 also an option? I wouldn't mind giving up a
bit of the available space for an extra metadata copy if it helps
prevent data loss in the case of a drive failure.
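Just so I get the commands right: I assume creating such an array would
look roughly like this (the eight device names and the label are only
placeholders, and raid1c4 instead of raid1c3 if that's a sensible
choice):

    mkfs.btrfs -L array -d raid6 -m raid1c3 /dev/sd[b-i]

    # or, trading a bit more space for an extra metadata copy:
    mkfs.btrfs -L array -d raid6 -m raid1c4 /dev/sd[b-i]

As far as I understand, the raid1c3/raid1c4 profiles also need a
reasonably recent kernel - please correct me if that's wrong.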
You also wrote to never balance metadata. But how would I restore the
lost metadata redundancy after a drive replacement if I only re-balance
the data blocks? Does the metadata get updated and re-distributed in
the background while the data blocks are restored during a rebuild? Or
is this more like "redundancy builds up again over time through the
regular algorithms"? I may still have something wrong in my
understanding of "using multiple disks in one array", as I currently
assume that everything gets rebuilt - metadata as well - but I guess
BtrFS works differently on this topic?
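So after a disk dies, would the right sequence be something like the
following, with no balance involved at all? (devid 3 and /dev/sdx are
of course just placeholders for the dead and the new disk)

    # rebuild the missing device's data and metadata onto the new disk
    btrfs replace start 3 /dev/sdx /mnt/array
    btrfs replace status /mnt/array

    # afterwards, verify everything against the checksums
    btrfs scrub start -Bd /mnt/array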
Yes, I can tolerate the loss of some data, as I do have an extra backup
of my important data, and as this is only for my personal use I guess
any data lost for lack of a proper backup is on me anyway. But seeing
BtrFS and ZFS used in something like 45-drive arrays for crucial data
with high-availability requirements, I'd like to find a way to set up
my array like a RAID6, so it can withstand a double failure but is also
still available during a rebuild. And although BtrFS looked promising
during my tests, when I was able to mount and use an array with
WinBtrfs - which would also solve my additional quest of finding some
way to use the same volume on both Linux and Windows - it doesn't seem
to be ready for my plan yet, or at least not with the knowledge I have
so far. I'm open to any suggestions and explanations, as I obviously
still have quite a lot to learn - and, if I do set up a BtrFS volume,
I'll likely need some help doing it "the right way" for what I'd like.
Thanks to everyone, and sorry again for the rather long mail
Matt
On 06.10.2020 at 03:24, Zygo Blaxell wrote:
> On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
>> On 10/5/20 6:59 PM, cryptearth wrote:
>>> Hello there,
>>>
>>> as I plan to use an 8-drive RAID6 with BtrFS, I'd like to ask about
>>> the current status of BtrFS RAID5/6 support, or whether I should go
>>> with a more traditional mdadm array.
> Definitely do not use a single mdadm raid6 array with btrfs. It is
> equivalent to running btrfs with raid6 metadata: mdadm cannot recover
> from data corruption on the disks, and btrfs cannot recover from
> write hole issues in degraded mode. Any failure messier than a total
> instantaneous disk failure will probably break the filesystem.
>
>>> The general status page on the btrfs wiki shows "unstable" for
>>> RAID5/6, and its specific page mentions some issues marked as "not
>>> production ready". It also says not to use it for the metadata but
>>> only for the actual data.
> That's correct. Very briefly, the issues are:
>
> 1. Reads don't work properly in degraded mode.
>
> 2. The admin tools are incomplete.
>
> 3. The diagnostic tools are broken.
>
> 4. It is not possible to recover from all theoretically
> recoverable failure events.
>
> Items 1 and 4 make raid5/6 unusable for metadata (total filesystem loss
> is likely). Use raid1 or raid1c3 for metadata instead. This is likely
> a good idea even if all the known issues are fixed--metadata access
> patterns don't perform well with raid5/6, and the most likely proposals
> to solve the raid5/6 problems will require raid1/raid1c3 metadata to
> store an update journal.
>
> If your application can tolerate small data losses correlated with disk
> failures (i.e. you can restore a file from backup every year if required,
> and you have no requirement for data availability while replacing disks)
> then you can use raid5 now; otherwise, btrfs raid5/6 is not ready yet.
>
>>> I plan to use it for my own personal system at home - and I do
>>> understand that RAID is no replacement for a backup, but I'd rather
>>> like to ask upfront if it's ready to use before I encounter issues
>>> when I use it.
>>> I had already planned to use a more "traditional" mdadm setup and
>>> just format the resulting volume with ext4, but when I asked about
>>> that, many people actually suggested using modern filesystems like
>>> BtrFS or ZFS instead of "old school RAID".
> Indeed, old school raid maximizes your probability of silent data loss
> by allowing multiple disks to inject silent data loss failures and
> firmware bug effects.
>
> btrfs and ZFS store their own data integrity information, so they can
> reliably identify failures on the drives. If redundant storage is used,
> they can recover automatically from failures the drives can't or won't
> report.
>
>>> Do you have any help for me about using BtrFS with RAID6 vs mdadm or ZFS?
>> Zygo collected some useful information about RAID5/6:
>>
>> https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>>
>> However more recently Josef (one of the main developers) declared
>> that BTRFS with RAID5/6 has "...some dark and scary corners..."
>>
>> https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/
> I think my list is a little more...concrete. ;)
>
>>> I also don't really understand what the difference is between
>>> metadata, data, and system.
>>> When I set up a volume and only define RAID6 for the data, it sets
>>> metadata and system data to RAID1 by default, but doesn't this mean
>>> that this important metadata is only stored on two drives instead of
>>> spread across all drives like in a regular RAID6? This would somewhat
>>> negate the benefit of RAID6 to withstand a double failure, like a 2nd
>>> drive failing while rebuilding the first failed one.
>> Correct. In fact Zygo suggested to use RAID6 + RAID1C3.
>>
>> I have only a few suggestions:
>> 1) don't store valuable data on BTRFS with the raid5/6 profile. Use
>> it if you want to experiment and want to help the development of
>> BTRFS. But be ready to face the loss of all data. (Very unlikely, but
>> the bigger the filesystem, the more difficult a restore of the data
>> in case of problems.)
> Losing all of the data seems unlikely given the bugs that exist so far.
> The known issues are related to availability (it crashes a lot and
> isn't fully usable in degraded mode) and small amounts of data loss
> (like 5 blocks per TB).
>
> The above assumes you never use raid5 or raid6 for btrfs metadata. Using
> raid5 or raid6 for metadata can result in total loss of the filesystem,
> but you can use raid1 or raid1c3 for metadata instead.
>
>> 2) don't fill the filesystem more than 70-80%. If you go beyond this
>> limit, the likelihood of hitting the "dark and scary corners" quickly
>> increases.
> Can you elaborate on that? I run a lot of btrfs filesystems at 99%
> capacity, some of the bigger ones even higher. If there were issues at
> 80% I expect I would have noticed them. There were some performance
> issues with full filesystems on kernels using space_cache=v1, but
> space_cache=v2 went upstream 4 years ago, and other significant
> performance problems a year before that.
>
> The last few GB is a bit of a performance disaster and there are
> some other gotchas, but that's an absolute number, not a percentage.
>
> Never balance metadata. That is a ticket to a dark and scary corner.
> Make sure you don't do it, and that you don't accidentally install a
> cron job that does it.
>
>> 3) run scrub periodically and after a power failure; better yet, use
>> an uninterruptible power supply (this is true for all RAID, even the
>> MD one).
> scrub also provides early warning of disk failure, and detects disks
> that are silently corrupting your data. It should be run not less than
> once a month, though you can skip months where you've already run a
> full-filesystem read for other reasons (e.g. replacing a failed disk).
>
>> 4) I don't have any data to support this; but as an occasional reader
>> of this mailing list I have the feeling that combining BTRFS with LUKS
>> (or bcache) raises the likelihood of a problem.
> I haven't seen that correlation. All of my machines run at least one
> btrfs on luks (dm-crypt). The larger ones use lvmcache. I've also run
> bcache on test machines doing power-fail tests.
>
> That said, there are additional hardware failure risks involved in
> caching (more storage hardware components = more failures) and the
> system must be designed to tolerate and recover from these failures.
>
> When cache disks fail, just uncache and run scrub to repair. btrfs
> checksums will validate the data on the backing HDD (which will be badly
> corrupted after a cache SSD failure) and will restore missing data from
> other drives in the array.
>
> It's definitely possible to configure bcache or lvmcache incorrectly,
> and then you will have severe problems. Each HDD must have a separate
> dedicated SSD. No sharing between cache devices is permitted. They must
> use separate cache pools. If one SSD is used to cache two or more HDDs
> and the SSD fails, it will behave the same as a multi-disk failure and
> probably destroy the filesystem. So don't do that.
>
> Note that firmware in the SSDs used for caching must respect write
> ordering, or the cache will do severe damage to the filesystem on
> just about every power failure. It's a good idea to test hardware
> in a separate system through a few power failures under load before
> deploying them in production. Most devices are OK, but a few percent
> of models out there have problems so severe they'll damage a filesystem
> in a single-digit number of power loss events. It's fairly common to
> encounter users who have lost a btrfs on their first or second power
> failure with a problematic drive. If you're stuck with one of these
> disks, you can disable write caching and still use it, but there will
> be added write latency, and in the long run it's better to upgrade to
> a better disk model.
>
>> 5) pay attention that having an 8-disk raid raises the likelihood of
>> a failure by about an order of magnitude compared to a single disk!
>> RAID6 (or any other RAID) mitigates that, in the sense that it
>> creates a time window in which it is possible to do maintenance
>> (e.g. a disk replacement) before data is lost.
>> 6) leave room in the disk array for an additional disk (to use when a
>> disk replacement is needed)
>> 7) avoid USB disks, because they are not reliable
>>
>>
>>> Any information appreciated.
>>>
>>>
>>> Greetings from Germany,
>>>
>>> Matt
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>>