Re: Using btrfs raid5/6

public inbox for linux-btrfs@vger.kernel.org

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>, Qu Wenruo <wqu@suse.com>,
	Scoopta <mlist@scoopta.email>,
	linux-btrfs@vger.kernel.org
Subject: Re: Using btrfs raid5/6
Date: Mon, 9 Dec 2024 21:34:07 -0500
Message-ID: <Z1eonzLzseG2_vny@hungrycats.org>
In-Reply-To: <cfa74363-b310-49f0-b4bf-07a98c1be972@gmx.com>

On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
> On 2024/12/7 18:07, Andrei Borzenkov wrote:
> > 06.12.2024 07:16, Qu Wenruo wrote:
> > > On 2024/12/6 14:29, Andrei Borzenkov wrote:
> > > > 05.12.2024 23:27, Qu Wenruo wrote:
> > > > > On 2024/12/6 03:23, Andrei Borzenkov wrote:
> > > > > > 05.12.2024 01:34, Qu Wenruo wrote:
> > > > > > > On 2024/12/5 05:47, Andrei Borzenkov wrote:
> > > > > > > > 04.12.2024 07:40, Qu Wenruo wrote:
> > > > > > > > > 
> > > > > > > > > On 2024/12/4 14:04, Scoopta wrote:
> > > > > > > > > > I'm looking to deploy btrfs raid5/6 and have read some of the
> > > > > > > > > > previous
> > > > > > > > > > posts here about doing so "successfully." I want to make sure I
> > > > > > > > > > understand the limitations correctly. I'm looking to replace an
> > > > > > > > > > md+ext4
> > > > > > > > > > setup. The data on these drives is replaceable but obviously
> > > > > > > > > > ideally I
> > > > > > > > > > don't want to have to replace it.
> > > > > > > > > 
> > > > > > > > > 0) Use kernel newer than 6.5 at least.
> > > > > > > > > 
> > > > > > > > > That version introduced a more comprehensive check for any RAID56
> > > > > > > > > RMW,
> > > > > > > > > so that it will always verify the checksum and rebuild when
> > > > > > > > > necessary.
> > > > > > > > > 
> > > > > > > > > This should mostly solve the write hole problem, and we even have
> > > > > > > > > some
> > > > > > > > > test cases in the fstests already verifying the behavior.
> > > > > > > > 
> > > > > > > > Write hole happens when data can *NOT* be rebuilt because data is
> > > > > > > > inconsistent between different strips of the same stripe. How btrfs
> > > > > > > > solves this problem?
> > > > > > > 
> > > > > > > An example please.
> > > > > > 
> > > > > > You start with stripe
> > > > > > 
> > > > > > A1,B1,C1,D1,P1
> > > > > > 
> > > > > > You overwrite A1 with A2
> > > > > 
> > > > > This already falls into NOCOW case.
> > > > > 
> > > > > No guarantee for data consistency.
> > > > > 
> > > > > For COW cases, the new data are always written into unused slot, and
> > > > > after crash we will only see the old data.
> > > > 
> > > > Do you mean that btrfs only does full stripe write now? As I recall from
> > > > the previous discussions, btrfs is using fixed size stripes and it can
> > > > fill unused strips. Like
> > > > 
> > > > First write
> > > > 
> > > > A1,B1,...,...,P1
> > > > 
> > > > Second write
> > > > 
> > > > A1,B1,C2,D2,P2
> > > > 
> > > > I.e. A1 and B1 do not change, but C2 and D2 are added.
> > > > 
> > > > Now, if parity is not updated before crash and D gets lost we have
> > > 
> > > After crash, C2/D2 is not referenced by anyone.
> > > So we won't need to read C2/D2/P2 because it's just unallocated space.
> > 
> > You do need to read C2/D2 to build parity and to reconstruct any missing
> > block. Parity no more matches C2/D2. Whether C2/D2 are actually
> > referenced by upper layers is irrelevant for RAID5/6.
> 
> Nope, in that case whatever garbage is in C2/D2, btrfs just do not care.
> 
> Just try it yourself.
> 
> You can even mkfs without discarding the device, then btrfs has garbage
> for unwritten ranges.
> 
> Then do btrfs care those unallocated space nor their parity?
> No.
> 
> Btrfs only cares full stripe that has at least one block being referred.
> 
> For vertical stripe that has no sector involved, btrfs treats it as
> nocsum, aka, as long as it can read it's fine. If it can not be read
> from the disk (missing dev etc), just use the rebuild data.
> 
> Either way for unused sector it makes no difference.

The assumption Qu made here is that btrfs never writes data blocks to the
same stripe from two or more different transactions, without freeing and
allocating the entire stripe in between.  If that assumption were true,
there would be no write hole in the current implementation.

The reality is that btrfs does exactly the opposite, as in Andrei's second
example.  This causes potential data loss of the first transaction's
data if the second transaction's write is aborted by a crash.  After the
first transaction, the parity and uninitialized data blocks can be used
to recover any data block in the first transaction.  When the second
transaction is aborted with some but not all of the blocks updated, the
parity will no longer be usable to reconstruct the data blocks from _any_
part of the stripe, including the first transaction's committed data.
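
To make that concrete, here is a toy model of Andrei's second example
using plain XOR parity.  It is only a sketch, not btrfs code: the block
contents, the 4-byte block size, and the helper names are all made up
for illustration, and it assumes the crash lands after C2 reaches disk
but before D2 or the new parity does.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column (raid5-style parity)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def write_hole_demo():
    # Transaction 1 commits A1 and B1.  C and D are unallocated and hold
    # whatever was on the disk before (hypothetical garbage values).
    a1, b1 = b"AAAA", b"BBBB"
    c_old, d_old = b"\x13\x37\x00\x42", b"\xde\xad\xbe\xef"
    p1 = xor_blocks([a1, b1, c_old, d_old])

    # Parity plus the uninitialized blocks can rebuild committed data.
    assert xor_blocks([b1, c_old, d_old, p1]) == a1

    # Transaction 2 writes C2 and D2 into the same stripe, but the crash
    # hits after C2 reached disk and before D2 or the new parity did.
    c2 = b"CCCC"
    on_disk = {"B": b1, "C": c2, "D": d_old, "P": p1}

    # Later the device holding A fails; rebuild A from the survivors.
    rebuilt_a = xor_blocks(list(on_disk.values()))
    return a1, rebuilt_a

a1, rebuilt_a = write_hole_demo()
print("A1 recovered intact:", rebuilt_a == a1)  # prints False
```

A1 was committed and never rewritten, yet it can no longer be rebuilt,
because the stale parity still describes the old contents of C.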

Technically, in this event, the second transaction's data is _also_
lost, but as Qu mentioned above, that data isn't part of a committed
transaction, so the damaged data won't appear in the filesystem after a
crash, corrupted or otherwise.

The potential data loss does not become actual data loss until the stripe
goes into degraded mode, where the out-of-sync parity block is needed to
recover a missing or corrupt data block.  If the stripe was already in
degraded mode during the crash, data loss is immediate.

If the drives are all healthy, the parity block can be recomputed
by a scrub, as long as the scrub is completed between a crash and a
drive failure.

If drives are missing or corrupt and parity hasn't been properly updated,
then data block reconstruction cannot occur.  btrfs will reject the
reconstructed block when its csum doesn't match, resulting in an
uncorrectable error.
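
The two outcomes above (scrub first, or drive failure first) can be
sketched the same way.  Again this is a toy model under stated
assumptions: zlib.crc32 stands in for the btrfs data csum, the stripe
is a dict of four data blocks plus one XOR parity block, and scrub()
and read_degraded() are hypothetical helpers, not kernel functions.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def scrub(stripe):
    """All data blocks readable: recompute parity from the data blocks."""
    repaired = dict(stripe)
    repaired["P"] = xor_blocks([stripe[name] for name in "ABCD"])
    return repaired

def read_degraded(stripe, missing, csums):
    """Rebuild `missing` from every other block, then verify its csum."""
    survivors = [block for name, block in stripe.items() if name != missing]
    rebuilt = xor_blocks(survivors)
    if zlib.crc32(rebuilt) != csums[missing]:
        raise IOError(f"csum mismatch rebuilding block {missing}")
    return rebuilt

a1, b1 = b"AAAA", b"BBBB"
d_old = b"\xde\xad\xbe\xef"
stale_parity = xor_blocks([a1, b1, b"\x13\x37\x00\x42", d_old])
# State after the aborted second transaction: C was overwritten, P was not.
after_crash = {"A": a1, "B": b1, "C": b"CCCC", "D": d_old, "P": stale_parity}
csums = {"A": zlib.crc32(a1), "B": zlib.crc32(b1)}  # committed blocks only

# Scrub completes before the drive fails: A is still recoverable.
assert read_degraded(scrub(after_crash), "A", csums) == a1

# Drive fails before any scrub: the reconstructed block is rejected.
try:
    read_degraded(after_crash, "A", csums)
    outcome = "recovered"
except IOError:
    outcome = "uncorrectable"
print(outcome)  # prints uncorrectable
```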

There are several options to fix the write hole:

1.  Modify btrfs so it behaves the way Qu thinks it does:  no allocations
within a partially filled raid56 stripe, unless the stripe was empty
at the beginning of the current transaction (i.e. multiple RMW writes
are OK, as long as they all disappear in the same crash event).  This
ensures a stripe is never written from two separate btrfs transactions,
eliminating the write hole.  This option requires an allocator change,
and some rework/optimization of how ordered extents are written out.
It also requires more balances--space within partially filled stripes
isn't usable until every data block within the stripe is freed, and
balances do exactly that.

2.  Add a stripe journal.  Requires on-disk format change to add the
journal, and recovery code at startup to replay it.  It's the industry
standard way to fix the write hole in a traditional raid5 implementation,
so it's the first idea everyone proposes.  It's also quite slow if you
don't have dedicated purpose-built hardware for the journal.  It's the
only option for closing the write hole on nodatacow files.

3.  Add a generic remapping layer for all IO blocks to avoid requiring
RMW cycles.  This is the raid-stripe-tree feature, a brute-force approach
that makes RAID profiles possible on ZNS drives.  ZNS drives have similar
but much more strict write-ordering constraints than traditional raid56,
so if the raid stripe tree can do raid5 on ZNS, it should be able to
handle CMR easily ("efficiently" is a separate question).

4.  Replace the btrfs raid5 profile with something else, and deprecate
the raid5 profile.  I'd recommend not considering that option until
after someone delivers a complete, write-hole-free replacement profile,
ready for merging.  The existing raid5 is not _that_ hard to fix; we
already have 3 well-understood options, and one of them doesn't require
an on-disk format change.
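
For anyone unfamiliar with option 2, the general idea of a stripe
journal looks roughly like the sketch below.  This is not a proposed
btrfs on-disk format: the in-memory list standing in for the journal,
the crash_after knob, and all function names are assumptions made up
for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def parity_ok(disk):
    return disk["P"] == xor_blocks([disk[name] for name in "ABCD"])

def journaled_write(disk, journal, updates, crash_after=None):
    """Log the whole stripe update, then apply it in place.

    crash_after is a test knob: stop after that many in-place block
    writes, leaving the journal entry behind as a real crash would.
    """
    new_data = {**disk, **updates}
    entry = dict(updates, P=xor_blocks([new_data[name] for name in "ABCD"]))
    journal.append(entry)          # assumed durable before in-place writes
    for written, (name, block) in enumerate(entry.items()):
        if crash_after is not None and written >= crash_after:
            return False           # simulated crash mid-update
        disk[name] = block
    journal.remove(entry)          # update complete, retire the entry
    return True

def replay(disk, journal):
    """Mount-time recovery: reapply every entry still in the journal."""
    while journal:
        disk.update(journal.pop(0))

disk = {"A": b"AAAA", "B": b"BBBB", "C": bytes(4), "D": bytes(4)}
disk["P"] = xor_blocks([disk[name] for name in "ABCD"])
journal = []

journaled_write(disk, journal, {"C": b"CCCC", "D": b"DDDD"}, crash_after=1)
torn = not parity_ok(disk)         # C2 landed, D2 and new parity did not
replay(disk, journal)
consistent = parity_ok(disk)
```

Every block is written twice (once to the journal, once in place),
which is where the cost comes from without dedicated journal hardware.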

Option 1 is probably the best one:  it doesn't require on-disk format
changes, only changes to the way kernels manage future writes.  Ideally,
the implementation includes an optimization to collect small extent writes
and merge them into full-stripe writes, which will make those _much_
faster on raid56.  The current implementation does multiple unnecessary
RMW cycles when writing multiple separate data extents to the same
stripe, even when the extents are allocated within a single transaction
and collectively the extents fill the entire stripe.

Option 1 won't fix nodatacow files, but that's only a problem if you
use nodatacow files.
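
The allocator rule in option 1 can be modelled in a few lines.  This is
only a sketch of the policy, not the btrfs allocator: StripeAllocator,
NoEligibleStripe, and the per-stripe block counter are hypothetical,
and free_stripe() stands in for a balance relocating everything out of
a stripe.

```python
class NoEligibleStripe(Exception):
    """No stripe was empty at the start of the current transaction."""

class StripeAllocator:
    """A stripe accepts new allocations only if it was completely empty
    when the current transaction began."""

    def __init__(self, n_stripes, blocks_per_stripe):
        self.used = [0] * n_stripes
        self.blocks_per_stripe = blocks_per_stripe
        self.open_stripes = set()
        self.begin_transaction()

    def begin_transaction(self):
        # Only stripes with no allocated blocks are open for writing.
        self.open_stripes = {i for i, n in enumerate(self.used) if n == 0}

    def allocate(self, nblocks=1):
        for i in sorted(self.open_stripes):
            if self.used[i] + nblocks <= self.blocks_per_stripe:
                self.used[i] += nblocks
                return i
        raise NoEligibleStripe("partially filled stripes need a balance")

    def free_stripe(self, i):
        # Models a balance moving every block out of stripe i.
        self.used[i] = 0

alloc = StripeAllocator(n_stripes=2, blocks_per_stripe=4)
first = alloc.allocate(2)      # stripe 0
second = alloc.allocate(1)     # stripe 0 again: same transaction is fine
alloc.begin_transaction()
third = alloc.allocate(1)      # stripe 1: stripe 0 is now off limits
print(first, second, third)    # prints 0 0 1
```

After the second transaction both stripes are partially filled, so the
next transaction has nowhere to allocate until a balance empties one of
them.  That is the extra balance cost mentioned above.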

I suspect options 2 and 3 have so much overhead that they are far
slower than option 1, even counting the extra balances option 1 requires.
With option 1, the extra overhead is in a big batch you can run overnight,
while options 2 and 3 impose continuous overhead on writes, and for
option 3, on reads as well.

> > > So still wrong example.
> > > 
> > 
> > It is the right example, you just prefer to ignore this problem.
> 
> Sure sure, whatever you believe.
> 
> Or why not just read the code on how the current RAID56 works?

The above is a summary of the state of raid56 when I last read the code
in depth (from circa v6.6), combined with direct experience from running
a small fleet of btrfs raid5 arrays and observing how they behave since
2016, and review of the raid-stripe-tree design docs.

> > > Remember we should discuss on the RMW case, meanwhile your case doesn't
> > > even involve RMW, just a full stripe write.
> > > 
> > > > 
> > > > A1,B1,C2,miss,P1
> > > > 
> > > > with exactly the same problem.
> > > > 
> > > > It has been discussed multiple times, that to fix it either btrfs has to
> > > > use variable stripe size (basically, always do full stripe write) or
> > > > some form of journal for pending updates.
> > > 
> > > If taking a correct example, it would be some like this:
> > > 
> > > Existing D1 data, unused D2 , P(D1+D2).
> > > Write D2 and update P(D1+D2), then power loss.
> > > 
> > > Case 0): Power loss after all data and metadata reached disk
> > > Nothing to bother, metadata already updated to see both D1 and D2,
> > > everything is fine.
> > > 
> > > Case 1): Power loss before metadata reached disk
> > > 
> > > This means we will only see D1 as the old data, have no idea there is
> > > any D2.
> > > 
> > > Case 1.0): both D2 and P(D1+D2) reached disk
> > > Nothing to bother, again.
> > > 
> > > Case 1.1): D2 reached disk, P(D1+D2) doesn't
> > > We still do not need to bother anything (if all devices are still
> > > there), because D1 is still correct.
> > > 
> > > But if the device of D1 is missing, we can not recover D1, because D2
> > > and P(D1+D2) is out of sync.
> > > 
> > > However I can argue this is not a simple corruption/power loss, it's two
> > > problems (power loss + missing device), this should count as 2
> > > missing/corrupted sectors in the same vertical stripe.

A raid56 array must still tolerate power failures while it is degraded.
This is table stakes for a modern parity raid implementation.

The raid56 write hole occurs when it is possible for an active stripe
to enter an unrecoverable state.  This is an implementation bug, not a
device failure.

Leaving an inactive stripe in a corrupted state after a crash is OK.
Never modifying any active stripe, so they are never corrupted, is OK.
btrfs corrupts active stripes, which is not OK.

Hopefully this is clear.

> > This is the very definition of the write hole. You are entitled to have
> > your opinion, but at least do not confuse others by claiming that btrfs
> > protects against write hole.
> > 
> > It need not be the whole device - it is enough to have a single
> > unreadable sector which happens more often (at least, with HDD).
> > 
> > And as already mentioned it need not happen at the same (or close) time.
> > The data corruption may happen days and months after lost write. Sure,
> > you can still wave it off as a double fault - but if in case of failed
> > disk (or even unreadable sector) administrator at least gets notified in
> > logs, here it is absolutely silent without administrator even being
> > aware that this stripe is no more redundant and so administrator cannot
> > do anything to fix it.
> > 
> > > As least btrfs won't do any writeback to the same vertical stripe at all.
> > > 
> > > Case 1.2): P(D1+D2) reached disk, D2 doesn't
> > > The same as case 1.1).
> > > 
> > > Case 1.3): Neither D2 nor P(D1+D2) reached disk
> > > 
> > > It's the same as case 1.0, even missing D1 is fine to recover.
> > > 
> > > 
> > > So if you believe powerloss + missing device counts as a single device
> > > missing, and it doesn't break the tolerance of RAID5, then you can count
> > > this as a "write-hole".
> > > 
> > > But to me, this is not a single error, but two error (write failure +
> > > missing device), beyond the tolerance of RAID5.
> > > 
> > > Thanks,
> > > Qu
> > 
> 
>
