Re: Status of RAID5/6 - Zygo Blaxell


From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Andrei Borzenkov <arvidjaar@gmail.com>
Cc: kreijack@inwind.it,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	linux-btrfs@vger.kernel.org
Subject: Re: Status of RAID5/6
Date: Sat, 31 Mar 2018 10:40:38 -0400
Message-ID: <20180331144020.GG2446@hungrycats.org>
In-Reply-To: <28a574db-0f74-b12c-ab5f-400205fd80c8@gmail.com>


On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> 31.03.2018 11:16, Goffredo Baroncelli wrote:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, all the transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> > 
> >> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee.  Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> > 
> > Could you elaborate a bit ?
> > 
> > Generally speaking, updating part of a stripe requires an RMW cycle, because
> > - you need to read all the data in the stripe (plus parity if there is a problem)
> > - then you should write
> > 	- the new data
> > 	- the new parity (calculated from the first read and the new data)
> > 
> > However, the "old" data should be untouched; or are you saying that the "old" data is rewritten with the same data?
> > 
> 
> If an old data block becomes unavailable, it can no longer be
> reconstructed, because the old contents of the "new data" and "new
> parity" blocks are lost. Fortunately, if checksums are in use this does
> not cause silent data corruption, but it effectively means data loss.
> 
> Writing data that belongs to an unrelated transaction affects previous
> transactions precisely because of the RMW cycle. This fundamentally
> violates the btrfs claim of always having either the old or the new
> consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a
log first.  All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.
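
The failure described above can be reproduced with a toy model.  This
is only a sketch under simplifying assumptions: a three-device RAID5
stripe with plain XOR parity, and made-up names (xor_blocks,
reconstruct, the d0/d1/p labels) that do not correspond to anything in
the btrfs code.  It shows a committed block being lost when an RMW
update of a neighbouring block is interrupted and a device then fails.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe: dict, lost: str) -> bytes:
    """Rebuild the block named `lost` from every other block in the stripe."""
    return xor_blocks(*(blk for name, blk in stripe.items() if name != lost))


# One stripe: two data blocks plus parity (hypothetical layout).
old_committed = b"OLD!"                 # written by an old, committed transaction
stripe = {"d0": old_committed, "d1": b"\0\0\0\0"}
stripe["p"] = xor_blocks(stripe["d0"], stripe["d1"])

# With consistent parity, d0 survives the loss of its device.
assert reconstruct(stripe, "d0") == old_committed

# A later transaction does an RMW update of d1.  The new data block
# reaches the disk, but the crash happens before the parity write,
# so stripe["p"] still describes the old contents of d1.
stripe["d1"] = b"NEW!"

# The device holding d0 fails.  Reconstruction now mixes new data
# with stale parity and returns garbage instead of the old block.
recovered = reconstruct(stripe, "d0")
write_hole_hit = recovered != old_committed
```

Here write_hole_hit ends up True: d0 was never part of the interrupted
transaction, yet it can no longer be recovered.  Data checksums would
flag the bad reconstruction, but they cannot bring the block back.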

One way to avoid the logging is to change the btrfs allocation parameters
so that the filesystem doesn't allocate data in RAID stripes that are
already occupied by data from older transactions.  This is similar to
what 'ssd_spread' does, although the ssd_spread option wasn't designed
for this and won't be effective on large arrays.  This avoids modifying
stripes that contain old committed data, but it also means the free space
on the filesystem will become heavily fragmented over time.  Users will
have to run balance *much* more often to defragment the free space.
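
The cost of that allocation policy can be seen in a toy allocator.
Again, this is a sketch under assumptions and not how the btrfs
allocator works: the Stripe class, SLOTS_PER_STRIPE and the helper
names are all hypothetical, and the policy is simplified to "a
transaction may only write into stripes that were completely empty
when it reached them".

```python
from dataclasses import dataclass, field

SLOTS_PER_STRIPE = 4  # data blocks per stripe (assumed, e.g. 5-disk RAID5)


@dataclass
class Stripe:
    # Each slot holds the id of the transaction that wrote it, or None if free.
    slots: list = field(default_factory=lambda: [None] * SLOTS_PER_STRIPE)

    def is_empty(self) -> bool:
        return all(s is None for s in self.slots)


def allocate(stripes, transid, nblocks):
    """Place nblocks for transid, skipping stripes that already hold data.

    Returns the number of blocks that could not be placed.
    """
    remaining = nblocks
    for stripe in stripes:
        if remaining == 0:
            break
        if not stripe.is_empty():
            continue  # occupied by an older transaction: would need RMW
        for i in range(SLOTS_PER_STRIPE):
            if remaining == 0:
                break
            stripe.slots[i] = transid
            remaining -= 1
    return remaining


def count_free(stripes):
    """Slots that are free in the ordinary sense."""
    return sum(s is None for st in stripes for s in st.slots)


def count_usable(stripes):
    """Free slots the no-RMW policy is actually allowed to use."""
    return sum(SLOTS_PER_STRIPE for st in stripes if st.is_empty())


stripes = [Stripe() for _ in range(4)]      # 16 data slots in total

# Three small transactions of one block each: each claims a whole stripe.
for transid in (1, 2, 3):
    assert allocate(stripes, transid, 1) == 0

free_before = count_free(stripes)           # free slots on paper
usable_before = count_usable(stripes)       # free slots without RMW

# A five-block write no longer fits, although most slots are still free.
unplaced = allocate(stripes, 4, 5)
```

In this model 13 of 16 slots are free after the three small writes, but
only 4 of them are usable, so one block of the five-block write is left
unplaced.  Repacking the partially filled stripes, which is what
balance would have to do, is the only way to get the stranded space
back.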


