public inbox for linux-btrfs@vger.kernel.org
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Forza <forza@tnonline.net>, Qu Wenruo <wqu@suse.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: RAID56 discussion related to RST. (Was "Re: [RFC ONLY 0/8] btrfs: introduce raid-stripe-tree")
Date: Sun, 24 Jul 2022 19:27:25 +0800	[thread overview]
Message-ID: <390ec19b-3bbd-8eba-ea54-01a31e6b745c@gmx.com> (raw)
In-Reply-To: <dce8715a-3179-6e58-1958-0747a17e2a38@tnonline.net>



On 2022/7/21 22:51, Forza wrote:
>
>
> On 2022-07-19 03:19, Qu Wenruo wrote:
>>
>>
>> On 2022/7/19 05:49, Forza wrote:
>>>
>>>
>>> ---- From: Chris Murphy <lists@colorremedies.com> -- Sent: 2022-07-15
>>> - 22:14 ----
>>>
>>>> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli
>>>> <kreijack@libero.it> wrote:
>>>>>
>>>>> On 14/07/2022 09.46, Johannes Thumshirn wrote:
>>>>>> On 14.07.22 09:32, Qu Wenruo wrote:
>>>>>>> [...]
>>>>>>
>>>>>> Again if you're doing sub-stripe size writes, you're asking stupid
>>>>>> things and
>>>>>> then there's no reason to not give the user stupid answers.
>>>>>>
>>>>>
>>>>> Qu is right: if we consider only full stripe writes, the "raid hole"
>>>>> problem disappears, because if a "full stripe" is not fully written,
>>>>> it is not referenced either.
>>>>>
>>>>>
>>>>> Personally I think that the ZFS variable stripe size may be
>>>>> interesting to evaluate. Moreover, because the BTRFS disk format is
>>>>> quite flexible, we can store different BGs with different numbers
>>>>> of disks.
>>>
>>> We can create new types of BGs too. For example parity BGs.
>>>
>>>>> Let me make an
>>>>> example: if we have 10 disks, we could allocate:
>>>>> 1 BG RAID1
>>>>> 1 BG RAID5, spread over 4 disks only
>>>>> 1 BG RAID5, spread over 8 disks only
>>>>> 1 BG RAID5, spread over 10 disks
>>>>>
>>>>> So if we have short writes, we could put the extents in the RAID1
>>>>> BG; for longer writes we could use a RAID5 BG with 4, 8 or 10
>>>>> disks, depending on the length of the data.
>>>>>
>>>>> Yes, this would require a sort of garbage collector to move the
>>>>> data to the biggest RAID5 BG, but this would avoid (or reduce) the
>>>>> fragmentation which affects the variable stripe size.
>>>>>
>>>>> Doing so we don't need any disk format change and it would be
>>>>> backward compatible.
>>>
>>> Do we need to implement RAID56 in the traditional sense? As the
>>> user/sysadmin I care about redundancy, performance and cost. The
>>> option to create redundancy for any n drives is appealing from a
>>> cost perspective, otherwise I'd use RAID1/10.
>>
>> Have you heard any recent problems related to dm-raid56?
>
> No..?

Then, I'd say their write-intent + journal (PPL for RAID5, full journal
for RAID6) is a tried and true solution.

I see no reason not to follow.
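
The core of an md-style write-intent bitmap is small: persist "this
region is dirty" before touching the stripe, clear it afterwards, and
after an unclean shutdown resync only the dirty regions. A minimal
sketch of the concept (not the actual md or btrfs code; region size is
illustrative):

```python
class WriteIntentBitmap:
    """Toy md-style write-intent bitmap: one bit per region.

    Persisting the dirty bit *before* writing the stripe means that
    after a power loss only regions marked dirty need a parity
    resync; the rest of the array is known to be consistent.
    """
    def __init__(self, region_size=4 * 1024 * 1024):
        self.region_size = region_size
        self.dirty = set()          # stands in for the on-disk bitmap

    def begin_write(self, offset):
        region = offset // self.region_size
        self.dirty.add(region)      # must reach stable storage first
        return region

    def end_write(self, region):
        self.dirty.discard(region)  # cleared lazily in real code

    def regions_to_resync(self):
        # After an unclean shutdown, scrub only these regions.
        return sorted(self.dirty)

bm = WriteIntentBitmap()
r = bm.begin_write(6 * 1024 * 1024)   # dirties region 1
# ...power loss before end_write()...
print(bm.regions_to_resync())         # [1]
```

A journal (PPL or full) additionally logs the data/parity itself, which
is what closes the hole when a device is also missing.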

>
>>
>> If your answer is no, then I guess we already have an answer to your
>> question.
>>
>>>
>>> Since the current RAID56 mode has several important drawbacks
>>
>> Let me be clear:
>>
>> If you can ensure you didn't hit power loss, or after a power loss do a
>> scrub immediately before any new write, then current RAID56 is fine, at
>> least not obviously worse than dm-raid56.
>>
>> (There are still common problems shared between both btrfs raid56 and
>> dm-raid56, like destructive-RMW)
>>
>>> - and that it's officially not recommended for production use - it
>>> is a good idea to construct new btrfs 'redundant-n' profiles that
>>> don't have the inherent issues of traditional RAID.
>>
>> I'd say the complexity is hugely underestimated.
>
> You are probably right. But is it solvable, and is there a vision of
> 'something better' than traditional RAID56?

I'd say, maybe.

I'd prefer some encoding at the file extent level (like compression) to
provide extra data recovery, rather than relying on stripe-based RAID56.

The problem is, such encoding is normally designed to correct a small
percentage of data corruption, but for regular RAID1/10, or even RAID56
on a small number of disks, the percentage is not small.

(Missing 1 disk in a 3-disk RAID5, we are in fact recovering 50% of our
data.)

If we can find a good encoding (probably applied after compression), I'm
100% fine with using it instead of traditional RAID56.
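
To make the 3-disk RAID5 arithmetic concrete: each stripe has two data
strips and one XOR parity strip, so losing one data disk means
rebuilding half of the user data, far beyond what typical ECC codes are
sized for. A toy demonstration:

```python
def xor(a, b):
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe of 3-disk RAID5: two data strips and their parity.
d0 = bytes([1, 2, 3, 4])
d1 = bytes([9, 9, 9, 9])
parity = xor(d0, d1)

# The disk holding d1 dies: rebuild it from the survivors.
rebuilt = xor(d0, parity)
assert rebuilt == d1

# One of the two data strips had to be reconstructed: 50% of the
# user data in this stripe.
print(len(rebuilt) / (len(d0) + len(d1)))   # 0.5
```

An extent-level code would have to tolerate loss ratios this large to
replace RAID56, which is what makes finding a good encoding hard.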

>
>>
>>> For example a non-striped redundant-n profile as well as a striped
>>> redundant-n profile.
>>
>> A non-striped redundant-n profile is already so complex that I can't
>> figure out a working idea right now.
>>
>> But if there is such a way, I'm pretty happy to consider it.
>
> Can we borrow ideas from the PAR2/PAR3 format?
>
> For each extent, create 'par' redundancy metadata that allows for n-%
> or n-copies of recovery, and split this metadata across different
> disks to allow for n total drive failures? Maybe parity data can be
> stored in parity BGs, in metadata itself, or in a special type of
> extent inside data BGs.

The problem is still there: if there is anything representing a stripe,
and any extra info is calculated based on stripes, then we can still
hit the write-hole problem.

If we do a sub-stripe write, we have to update the checksum or whatever,
which can be left out-of-sync by a power loss.
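
That out-of-sync window is exactly the write hole, and it can be shown
in a few lines: update one data strip in place, lose power before the
matching parity write, then try to reconstruct a different strip (toy
model, not btrfs code):

```python
def xor(a, b):
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

# In-place (non-COW) stripe: d0, d1, and parity = d0 ^ d1.
d0, d1 = bytes([1, 1, 1, 1]), bytes([2, 2, 2, 2])
parity = xor(d0, d1)

# Sub-stripe RMW: rewrite d0 only; power is lost before the
# matching parity update reaches disk.
d0 = bytes([7, 7, 7, 7])          # new data on disk
# parity still matches the OLD d0 - this is the write hole.

# Later, the disk holding the untouched d1 fails; reconstruction
# uses the stale parity and returns garbage for d1.
bad_d1 = xor(d0, parity)
print(bad_d1 == bytes([2, 2, 2, 2]))   # False, although d1 was never rewritten
```

Note the corrupted strip is one that was never written at all, which is
why a per-write checksum update alone cannot fix this.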


If you mean an extra tree to store all this extra checksum/recovery info
(i.e., no longer needing the stripe unit at all), then I guess it may be
possible.

For example, if we use a special csum algorithm which takes far more
space than our current 32 bytes per 4K, then we may be able to get
extra redundancy.

There will be some problems, like different metadata/data csums (the
metadata csum is limited to 32 bytes as it's inlined), and far larger
metadata usage for csums.

But those should be more or less solvable.
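
A stripe-free variant of the idea might store, per data extent, an
extra recovery block in its own tree keyed by logical address. The
sketch below uses plain XOR (a real scheme would want a proper erasure
code) and hypothetical names, just to show why the write hole
disappears: the recovery item is COW'd together with the extent, so
data and recovery info can never disagree on disk.

```python
BLOCK = 4096

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class RecoveryTree:
    """Toy 'extra csum tree': one XOR block per extent, keyed by the
    extent's logical address.  No stripes anywhere, so there is no
    RMW and no window where data and recovery info are out of sync.
    """
    def __init__(self):
        self.items = {}

    def insert(self, logical, blocks):
        # XOR of all 4K blocks in the extent - one extra block per
        # extent, vs. 32 bytes per 4K for the current csum tree.
        acc = bytes(BLOCK)
        for b in blocks:
            acc = xor(acc, b)
        self.items[logical] = acc

    def recover(self, logical, blocks, lost_idx):
        # Rebuild one lost 4K block from the survivors + recovery item.
        acc = self.items[logical]
        for i, b in enumerate(blocks):
            if i != lost_idx:
                acc = xor(acc, b)
        return acc

tree = RecoveryTree()
extent = [bytes([i]) * BLOCK for i in range(3)]
tree.insert(1 << 20, extent)
assert tree.recover(1 << 20, extent, lost_idx=2) == extent[2]
```

The cost is visible here too: the recovery item is a full block per
extent, which is the "way larger metadata usage" mentioned above.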

>
>>
>>>
>>>>
>>>> My 2 cents...
>>>>
>>>> Regarding the current raid56 support, in order of preference:
>>>>
>>>> a. Fix the current bugs, without changing format. Zygo has an
>>>> extensive list.
>>>
>>> I agree that relatively simple fixes should be made. But it seems we
>>> will need quite a large rewrite to solve all issues? Is there a
>>> minimum viable option here?
>>
>> Nope. Just see my write-intent code; I already have a prototype
>> working (it just needs new scrub-based recovery code at mount time).
>>
>> And based on my write-intent code, I don't think it's that hard to
>> implement a full journal.
>>
>
> This is good news. Do you see any other major issues that would need
> fixing before RAID56 can be considered production-ready?

Currently I have only finished the write-intent bitmap, which requires
that after a power loss all devices are still available and untouched
data is still correct.

For power loss + missing device, I have to go full journal, but the
code should be pretty similar, thus I'm not that concerned.


The biggest remaining problem is that the write-intent bitmap/full
journal requires regular devices; there is no support for zoned devices
at all.

Thus the zoned folks are not big fans of this solution.

Thanks,
Qu

>
>
>> Thanks,
>> Qu
>>
>>>
>>>> b. Mostly fix the write hole, also without changing the format, by
>>>> only doing COW with full stripe writes. Yes you could somehow get
>>>> corrupt parity still and not know it until degraded operation produces
>>>> a bad reconstruction of data - but checksum will still catch that.
>>>> This kind of "unreplicated corruption" is not quite the same thing as
>>>> the write hole, because it isn't pernicious like the write hole.
>>>
>>> What is the difference to a)? Is the write hole the worst issue?
>>> Judging from the #btrfs channel discussions there seem to be other
>>> quite severe issues, for example real data corruption risks in
>>> degraded mode.
>>>
>>>> c. A new de-clustered parity raid56 implementation that is not
>>>> backwards compatible.
>>>
>>> Yes. We have a good opportunity to work out something much better
>>> than current implementations. We could have redundant-n profiles
>>> that also work with tiered storage like ssd/nvme, similar to the
>>> metadata-on-ssd idea.
>>>
>>> Variable stripe width has been brought up before, but received cool
>>> responses. Why is that? IMO it could improve random 4k IOs by doing
>>> the equivalent of RAID1 instead of RMW, while also closing the write
>>> hole. Perhaps there is a middle ground to be found?
>>>
>>>
>>>>
>>>> Ergo, I think it's best to not break the format twice. Even if a new
>>>> raid implementation is years off.
>>>
>>> I very much agree here. Btrfs already suffers in public opinion from
>>> the lack of a stable and safe-for-data RAID56, and requiring several
>>> non-compatible changes isn't going to help.
>>>
>>> I also think it's important that the 'temporary' changes actually
>>> lead to a stable filesystem. Because what is the point otherwise?
>>>
>>> Thanks
>>> Forza
>>>
>>>>
>>>> Metadata-centric workloads suck on parity raid anyway. If Btrfs
>>>> always does full stripe COW, it won't matter even if the performance
>>>> is worse, because no one should use parity raid for this workload
>>>> anyway.
>>>>
>>>>
>>>> --
>>>> Chris Murphy
>>>
>>>

