public inbox for linux-raid@vger.kernel.org
 help / color / mirror / Atom feed
* RAID-6 and write hole with write-intent bitmap
@ 2020-11-24  7:20 Mukund Sivaraman
  2020-11-24 10:10 ` Wols Lists
  0 siblings, 1 reply; 7+ messages in thread
From: Mukund Sivaraman @ 2020-11-24  7:20 UTC (permalink / raw)
  To: linux-raid

Hi all

I am trying to set up an MD RAID-6 array and use the ext4 filesystem in
ordered mode (default) on it. The data gets backed up periodically. I
want the array to be always available.

I would prefer not to use a write-journal if the write-intent bitmap
alone is sufficient for my usage. AIUI the write-hole problem occurs
when there is a crash or abrupt power off *and* disk failures.

* After a crash or abrupt power off, the write-intent bitmap is used to
  rewrite parity where necessary. If there is no disk failure during
  this period, is the RAID-6 array guaranteed to recover without
  corruption?

  With RAID-6, will recovery with write-intent bitmap succeed with 1
  disk failure during the recovery period without a write-journal? i.e.,
  is there a possibility of write hole with 1 disk failure in a RAID-6
  array?

* With RAID-6 with write-intent bitmap in use, ext4 in ordered mode, no
  disk failures, and abrupt power loss, is there any chance of data loss
  in files other than those being written to just before the power loss?

(Apologies if these are silly questions, but I would appreciate answers.)

		Mukund

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24  7:20 RAID-6 and write hole with write-intent bitmap Mukund Sivaraman
@ 2020-11-24 10:10 ` Wols Lists
  2020-11-24 18:50   ` Mukund Sivaraman
  2020-11-28  1:51   ` Nix
  0 siblings, 2 replies; 7+ messages in thread
From: Wols Lists @ 2020-11-24 10:10 UTC (permalink / raw)
  To: Mukund Sivaraman, linux-raid

On 24/11/20 07:20, Mukund Sivaraman wrote:
> Hi all
> 
> I am trying to setup a MD RAID-6 array and use the ext4 filesystem in
> ordered mode (default) on it. The data gets backed up periodically. I
> want the array to be always available.
> 
> I prefer not using a write-journal if it is sufficient for my usage. I
> want to use the write-intent bitmap only. AIUI the write-hole problem
> occurs when there is a crash or abrupt power off *and* disk failures.

No, I don't think so. I'm not sure, but aiui, there is a critical point
where the data is partially saved to disk, and should a power failure
occur at that precise point you have a stripe incompletely saved, and
therefore corrupt. This is why you need a log to fix it ...
> 
> * After a crash or abrupt power off, the write-intent bitmap is used to
>   rewrite parity where necessary. If there is no disk failure during
>   this period, is the RAID-6 array guaranteed to recover without
>   corruption?
> 
>   With RAID-6, will recovery with write-intent bitmap succeed with 1
>   disk failure during the recovery period without a write-journal? i.e.,
>   is there a possibility of write hole with 1 disk failure in a RAID-6
>   array?
> 
> * With RAID-6 with write-intent bitmap in use, ext4 in ordered mode, no
>   disk failures, and abrupt power loss, is there any chance of data loss
>   in files other than those being written to just before the power loss?

Probably. Sod's law, you will have other files on the same stripe and
things could go wrong ... Plus I believe some file systems (including
ext4?) store small files in the directory, not as their own i-node, so
there's a whole bunch of other complications possible, plus if you
corrupt the directory ...
> 
> (Apologies if these are silly questions, but I request answers.)
> 
RULE 0: RAID IS NO SUBSTITUTE FOR BACKUPS.

And if you don't want to lose live data as it is being updated, you need
a journal. Run the correct horse for the course :-)

Cheers.
Wol


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24 10:10 ` Wols Lists
@ 2020-11-24 18:50   ` Mukund Sivaraman
  2020-11-24 20:16     ` Piergiorgio Sartor
                       ` (2 more replies)
  2020-11-28  1:51   ` Nix
  1 sibling, 3 replies; 7+ messages in thread
From: Mukund Sivaraman @ 2020-11-24 18:50 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

Hi Wols

On Tue, Nov 24, 2020 at 10:10:32AM +0000, Wols Lists wrote:
> On 24/11/20 07:20, Mukund Sivaraman wrote:
> > Hi all
> > 
> > I am trying to setup a MD RAID-6 array and use the ext4 filesystem in
> > ordered mode (default) on it. The data gets backed up periodically. I
> > want the array to be always available.
> > 
> > I prefer not using a write-journal if it is sufficient for my usage. I
> > want to use the write-intent bitmap only. AIUI the write-hole problem
> > occurs when there is a crash or abrupt power off *and* disk failures.
> 
> No, I don't think so. I'm not sure, but aiui, there is a critical point
> where the data is partially saved to disk, and should a power failure
> occur at that precise point you have a stripe incompletely saved, and
> therefore corrupt. This is why you need a log to fix it ...

I appreciate that you took time to reply. Thank you. I am also in the
"not sure" group, and we may be served well by an authoritative answer
from someone who is familiar with the code. I also didn't follow whether
you're saying there is a write hole or not. The answer may be
implementation specific too, so I am looking for an answer from someone
who knows the code.

The following may be incorrect as I am a RAID layperson, but AIUI:

(a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
operation with its data on disk A and stripe's parity on disk B may
involve:

1. a read of the stripe
2. update of data on A
3. computation and update of parity A^C^D on B

These are not atomic updates. If power is lost between steps 2 and 3,
upon recovery the mismatch between data and parity for the stripe would
be found and the parity can be updated on B. The data chunk written to A
may be incomplete if power is lost during step 2, but ext4's journal
would return the FS to a consistent state. Moreover, there should not be
any modification/corruption of data in the stripe on disks C and D
(assuming the disks are OK).
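
The XOR arithmetic in steps 1-3 can be sketched in a few lines of
Python (a toy model with single-byte chunks, purely illustrative, not
md's actual code):

```python
# RAID-5 sketch: 4 members A..D, data chunk updated on A, parity on B.
# Power loss between steps 2 and 3 leaves stale parity; a resync can
# detect the mismatch and rewrite parity from the surviving data.
A_old, C, D = 0b1010, 0b0110, 0b0001
B = A_old ^ C ^ D            # parity from an earlier, completed write

A_new = 0b1111               # step 2: data updated on A
# -- power lost here: step 3 (parity rewrite on B) never happens --

assert B != A_new ^ C ^ D    # resync finds the data/parity mismatch
B = A_new ^ C ^ D            # parity rewritten; stripe consistent again
```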

(b) With RAID-6, assuming there are 5 member disks A, B, C, D, E, a
write operation with its data on disk A and stripe's parity on disks
B(p) and C(q) would involve:

1. a read of the stripe
2. update of data on A
3. computation and update of parity on B(p)
4. update of parity on C(q)

These are not atomic updates. If power is lost between steps 2 and 3,
upon recovery the mismatch of data on A would be found and the data
chunk can be updated on A. The data chunk written to A may be incomplete
if power is lost during step 2, but ext4's journal would return the
FS to a consistent state. If power is lost between steps 3 and 4, upon
recovery the mismatch of parity would be found between B(p) and C(q) and
the parity can be updated on B(p) and C(q). Mainly, there should not be
any modification/corruption of data in the stripe on disks D and E
(assuming the disks are OK).
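
For completeness, the dual parity in (b) can be sketched with GF(2^8)
arithmetic (a simplified model using RAID-6's usual generator
polynomial 0x11d; gf_mul/gf_pow are illustrative helpers, not md's
implementation):

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the 0x11d polynomial used by RAID-6."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

data = [0x12, 0x34, 0x56]                 # chunks on A, D, E
p = data[0] ^ data[1] ^ data[2]           # parity on B(p): plain XOR
q = 0
for i, d in enumerate(data):              # parity on C(q): each chunk
    q ^= gf_mul(gf_pow(2, i), d)          # weighted by a power of 2

# Suppose A and B(p) are both lost: XOR the surviving chunks'
# contributions out of q; what remains is 2^0 * d0 = d0.
rem = q ^ gf_mul(gf_pow(2, 1), data[1]) ^ gf_mul(gf_pow(2, 2), data[2])
assert rem == data[0]
```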

The above may be incorrect, so please indicate what happens, and if
there is a write hole, why there is one.

We don't mind if data in files being written to at the time of power
loss are partially written. It can happen with any abrupt power
loss. The concern is if other unrelated parts of the filesystem not
tracked by the filesystem's journal get corrupted because of other data
chunks of a stripe being updated during recovery.

> > 
> > * After a crash or abrupt power off, the write-intent bitmap is used to
> >   rewrite parity where necessary. If there is no disk failure during
> >   this period, is the RAID-6 array guaranteed to recover without
> >   corruption?
> > 
> >   With RAID-6, will recovery with write-intent bitmap succeed with 1
> >   disk failure during the recovery period without a write-journal? i.e.,
> >   is there a possibility of write hole with 1 disk failure in a RAID-6
> >   array?
> > 
> > * With RAID-6 with write-intent bitmap in use, ext4 in ordered mode, no
> >   disk failures, and abrupt power loss, is there any chance of data loss
> >   in files other than those being written to just before the power loss?
> 
> Probably. Sod's law, you will have other files on the same stripe and
> things could go wrong ... Plus I believe some file systems (including
> ext4?) store small files in the directory, not as their own i-node, so
> there's a whole bunch of other complications possible, plus if you
> corrupt the directory ,,,
> > 
> > (Apologies if these are silly questions, but I request answers.)
> > 
> RULE 0: RAID IS NO SUBSTITUTE FOR BACKUPS.

The data is backed up periodically.

> And if you don't want to lose live data as it is being updated, you need
> a journal. Run the correct horse for the course :-)

It is important that the array is available and not in a failed state or
with a corrupted FS due to a power loss. We would also like to avoid
having to restore from backups as much as possible. There is a power
outage about once every week. The system is powered via an inverter (a
lead-acid battery-backed UPS) which switches from mains to battery power
within a few tens of milliseconds of a power loss, a gap that the
server's power supply tolerates. Rarely, the switchover takes longer, or
there is a dip, and the server powers off. So consider that power
outages are somewhat common, and the array should survive them to avoid
extra work for us, regardless of backups.

The write-journal is a relatively new addition to MD, and I am
conservative about using it for now. I have come across failures
reported on the lists[1], it is not clear whether others are using it in
production, and some things, such as how to remove the write journal
from an array, are not documented (a sequence of steps was mentioned in
the commit log of a patch that introduced support[2], but a step was
missing from it, as pointed out in a different mailing-list post[3]).
Please don't take these things as criticism - it is just that the
feature appears to be relatively new. Adding an NVMe SSD to hold the
write journal would add another component to the mix, which I want to
avoid. However, if an authoritative answer indicates the write journal
is required in our case and the implementation is mature, we will try to
adopt it.

[1] https://www.spinics.net/lists/raid/msg62646.html
[2] https://marc.info/?l=linux-raid&m=149063896208043
[3] https://www.spinics.net/lists/raid/msg59940.html

		Mukund

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24 18:50   ` Mukund Sivaraman
@ 2020-11-24 20:16     ` Piergiorgio Sartor
  2020-11-24 21:30     ` antlists
  2020-11-28  1:57     ` Nix
  2 siblings, 0 replies; 7+ messages in thread
From: Piergiorgio Sartor @ 2020-11-24 20:16 UTC (permalink / raw)
  To: Mukund Sivaraman; +Cc: Wols Lists, linux-raid

On Wed, Nov 25, 2020 at 12:20:04AM +0530, Mukund Sivaraman wrote:
> Hi Wols
> 
> On Tue, Nov 24, 2020 at 10:10:32AM +0000, Wols Lists wrote:
> > On 24/11/20 07:20, Mukund Sivaraman wrote:
> > > Hi all
> > > 
> > > I am trying to setup a MD RAID-6 array and use the ext4 filesystem in
> > > ordered mode (default) on it. The data gets backed up periodically. I
> > > want the array to be always available.
> > > 
> > > I prefer not using a write-journal if it is sufficient for my usage. I
> > > want to use the write-intent bitmap only. AIUI the write-hole problem
> > > occurs when there is a crash or abrupt power off *and* disk failures.
> > 
> > No, I don't think so. I'm not sure, but aiui, there is a critical point
> > where the data is partially saved to disk, and should a power failure
> > occur at that precise point you have a stripe incompletely saved, and
> > therefore corrupt. This is why you need a log to fix it ...
> 
> I appreciate that you took time to reply. Thank you. I am also in the
> "not sure" group, and we may be served well by an authoritative answer
> from someone who is familiar with the code. I also didn't follow whether
> you're saying there is a write hole or not. The answer may be
> implementation specific too, so I am looking for an answer from someone
> who knows the code.
> 
> The following may be incorrect as I am a RAID layperson, but AIUI:
> 
> (a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
> operation with its data on disk A and stripe's parity on disk B may
> involve:
> 
> 1. a read of the stripe
> 2. update of data on A
> 3. computation and update of parity A^C^D on B
> 
> These are not atomic updates. If power is lost between steps 2 and 3,
> upon recovery the mismatch between data and parity for the stripe would
> be found and the parity can be updated on B. The data chunk written to A
> may be incomplete if power is lost during step 2, but the ext4's journal
> would return the FS to a consistent state. Moreover, there should not be
> any modification/corruption of data in the stripe on disks C and D
> (assuming the disks are OK).
> 
> (b) With RAID-6, assuming there are 5 member disks A, B, C, D, E, a
> write operation with its data on disk A and stripe's parity on disks
> B(p) and C(q) would involve:
> 
> 1. a read of the stripe
> 2. update of data on A
> 3. computation and update of parity on B(p)
> 4. update of parity on C(q)
> 
> These are not atomic updates. If power is lost between steps 2 and 3,
> upon recovery the mismatch of data on A would be found and the data
> chunk can be updated on A. The data chunk written to A may be incomplete
> if power is lost during step 2, but the ext4's journal would return the
> FS to a consistent state. If power is lost between steps 3 and 4, upon
> recovery the mismatch of parity would be found between B(p) and C(q) and
> the parity can be updated on B(p) and C(q). Mainly, there should not be
> any modification/corruption of data in the stripe on disks D and E
> (assuming the disks are OK).
> 
> The above may be incorrect, so please indicate what happens, and if
> there is a write hole, why there is one.

The write hole happens, generically,
when data is written to a stripe
and, whatever the reason, the write
is not completed to all devices.
So, some chunks contain new data,
some others old data.

A cause could be a sudden power loss.

The parity will likely be wrong,
and it can be fixed, but this will
not help to get proper data (all
new or all old, but not mixed).
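
A toy model of this (hypothetical single-byte chunks, 3 data + 1
parity, not md's code):

```python
# Write hole sketch: the data chunk on disk 0 is rewritten, but power
# fails before the parity chunk is updated, so the stripe is left
# internally inconsistent.
old = [0x11, 0x22, 0x33]
parity = old[0] ^ old[1] ^ old[2]      # consistent before the crash

disks = list(old)
disks[0] = 0x99                        # new data lands on disk 0
# -- crash: parity never updated --

# If disk 1 (which nobody wrote to!) now fails, reconstructing it from
# the stale parity silently yields wrong data for an untouched chunk:
rebuilt = parity ^ disks[0] ^ disks[2]
assert rebuilt != old[1]               # 0x22 expected, garbage returned
```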

> We don't mind if data in files being written to at the time of power
> loss are partially written. It can happen with any abrupt power
> loss. The concern is if other unrelated parts of the filesystem not
> tracked by the filesystem's journal get corrupted because of other data
> chunks of a stripe being updated during recovery.

Well, since nobody knows what was
being written, it could well be that
some metadata is left in an
inconsistent, partially written state.

Then, depending on the FS, this might
or might not be a problem.

> > > * After a crash or abrupt power off, the write-intent bitmap is used to
> > >   rewrite parity where necessary. If there is no disk failure during
> > >   this period, is the RAID-6 array guaranteed to recover without
> > >   corruption?
> > > 
> > >   With RAID-6, will recovery with write-intent bitmap succeed with 1
> > >   disk failure during the recovery period without a write-journal? i.e.,
> > >   is there a possibility of write hole with 1 disk failure in a RAID-6
> > >   array?
> > > 
> > > * With RAID-6 with write-intent bitmap in use, ext4 in ordered mode, no
> > >   disk failures, and abrupt power loss, is there any chance of data loss
> > >   in files other than those being written to just before the power loss?
> > 
> > Probably. Sod's law, you will have other files on the same stripe and
> > things could go wrong ... Plus I believe some file systems (including
> > ext4?) store small files in the directory, not as their own i-node, so
> > there's a whole bunch of other complications possible, plus if you
> > corrupt the directory ,,,
> > > 
> > > (Apologies if these are silly questions, but I request answers.)
> > > 
> > RULE 0: RAID IS NO SUBSTITUTE FOR BACKUPS.
> 
> The data is backed up periodically.
> 
> > And if you don't want to lose live data as it is being updated, you need
> > a journal. Run the correct horse for the course :-)
> 
> It is important that the array is available and not in a failed state or
> with an corrupted FS due to a power loss. We would also like to avoid
> having to go to restoring from backups as much as possible. There is
> power outage about once every week. The system is powered via an
> inverter (a lead-acid battery backed UPS) which switches from mains to
> battery power when there is power loss within a few tens of milliseconds
> that the server's power supply tolerates. Rarely, the switchover time is
> longer, or there is a dip, and the server powers off. So consider that
> power outages are somewhat common and the array should survive it to
> avoid extra work for us, regardless of backups.
> 
> The write-journal is a relatively new addition to MD and I feel
> conservative about using it for now. I have come across failures
> reported on the lists[1], it is not clear if others are using it in
> production, and some things such as how to remove the write journal from
> an array are not documented (there was a sequence of steps which was
> mentioned in the commit log of a patch that introduced support[2], but a
> step was missing in it as pointed out in a different mailing list
> post[3]). Please don't take these things as criticism - it is just that
> the feature appears to be relatively new. Adding an NVMe SSD to hold the
> write journal would add another component to the mix which I want to
> avoid. However if an authoritative answer indicates the write journal is
> required in our case and the implementation is mature, we will try to
> adopt it.

I'm not 100% sure the journal will be
a solution anyway.
I mean, the device holding the journal
should be redundant too, so the journal
itself might be subject to the write hole.

I might be completely wrong here, anyway.

> [1] https://www.spinics.net/lists/raid/msg62646.html
> [2] https://marc.info/?l=linux-raid&m=149063896208043
> [3] https://www.spinics.net/lists/raid/msg59940.html

bye 

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24 18:50   ` Mukund Sivaraman
  2020-11-24 20:16     ` Piergiorgio Sartor
@ 2020-11-24 21:30     ` antlists
  2020-11-28  1:57     ` Nix
  2 siblings, 0 replies; 7+ messages in thread
From: antlists @ 2020-11-24 21:30 UTC (permalink / raw)
  To: Mukund Sivaraman; +Cc: linux-raid



On 24/11/2020 18:50, Mukund Sivaraman wrote:
> (a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
> operation with its data on disk A and stripe's parity on disk B may
> involve:

Close
> 
> 1. a read of the stripe
> 2. update of data in the stripe
2.5 write A^C^D
> 3. computation and update of parity A^C^D on B
> 
> These are not atomic updates. If power is lost between steps 2 and 3,
> upon recovery the mismatch between data and parity for the stripe would
> be found and the parity can be updated on B. The data chunk written to A
> may be incomplete if power is lost during step 2, but the ext4's journal
> would return the FS to a consistent state. Moreover, there should not be
> any modification/corruption of data in the stripe on disks C and D
> (assuming the disks are OK).

I *don't* think that's necessarily true. Yes it probably is true, but 
it's not guaranteed ...
> 
> (b) With RAID-6, assuming there are 5 member disks A, B, C, D, E, a
> write operation with its data on disk A and stripe's parity on disks
> B(p) and C(q) would involve:
> 
> 1. a read of the stripe
> 2. update of data in stripe
2.5 write A^D^E
> 3. computation and update of parity on B(p)
> 4. update of parity on C(q)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24 10:10 ` Wols Lists
  2020-11-24 18:50   ` Mukund Sivaraman
@ 2020-11-28  1:51   ` Nix
  1 sibling, 0 replies; 7+ messages in thread
From: Nix @ 2020-11-28  1:51 UTC (permalink / raw)
  To: Wols Lists; +Cc: Mukund Sivaraman, linux-raid

On 24 Nov 2020, Wols Lists told this:

> On 24/11/20 07:20, Mukund Sivaraman wrote:
>> * With RAID-6 with write-intent bitmap in use, ext4 in ordered mode, no
>>   disk failures, and abrupt power loss, is there any chance of data loss
>>   in files other than those being written to just before the power loss?
>
> Probably. Sod's law, you will have other files on the same stripe and
> things could go wrong ... Plus I believe some file systems (including
> ext4?) store small files in the directory, not as their own i-node, so

ext4 can store small files in the *inode*, not in the containing
directory: that's impossible, since an inode can appear in many
directories at the same time.

... but, of course, inodes on many filesystems are packed into big
tables of inodes, and if you have a write hole hitting an inode write,
you've probably buggered up a bunch of other inodes too. And inodes are
more or less unordered, so that's just smashed a random spray of things
on the disk... it is quite possible that a lot of them are in directories
close in space or in time-of-write to the one you were updating, but not
necessarily. Reach for backups time.

> RULE 0: RAID IS NO SUBSTITUTE FOR BACKUPS.

But backups are a substitute for lack of sleep (for me anyway, because
if I don't have good backups, I can't sodding sleep).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RAID-6 and write hole with write-intent bitmap
  2020-11-24 18:50   ` Mukund Sivaraman
  2020-11-24 20:16     ` Piergiorgio Sartor
  2020-11-24 21:30     ` antlists
@ 2020-11-28  1:57     ` Nix
  2 siblings, 0 replies; 7+ messages in thread
From: Nix @ 2020-11-28  1:57 UTC (permalink / raw)
  To: Mukund Sivaraman; +Cc: Wols Lists, linux-raid

On 24 Nov 2020, Mukund Sivaraman told this:
[...]
> (a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
> operation with its data on disk A and stripe's parity on disk B may
> involve:
>
> 1. a read of the stripe
> 2. update of data on A
> 3. computation and update of parity A^C^D on B
>
> These are not atomic updates. If power is lost between steps 2 and 3,

The writes usually proceed in parallel (because anything else would be
abominably slow). But... the problem is that the writes to the component
disks are also not atomic, and will likely not proceed at the same
rates: only with spindle-synched drives is there anything like a
guarantee of that, and those have been unobtainable for decades. So a
power loss could well lead to 500 sectors of the stripe written on disk
A, 430 sectors written on disk B... and the sectors between sector 430
and 500 are not consistent. (Disk C might well be up around sector 600,
disk D around sector 450 and there's no *way* mere parity or RAID 6
syndromes can recover from the wildly-varying mess between sectors 430
and 600... it's not like it gets recorded anywhere where a disk write
got up to before the power went out, either. But the journal avoids this
in the usual fashion for a journal, by writing out the whole thing first
and committing it to stable storage, so that on restart the incomplete
writes can just be replayed.)
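
The journalling idea can be sketched like this (a hypothetical
in-memory model with made-up names, not md's on-disk journal format):

```python
# Write-ahead sketch: the full stripe update is committed to the
# journal before any member disk is touched, so a crash mid-write can
# always be repaired by replaying the journal entry.
journal = []

def write_stripe(disks, stripe_no, chunks):
    journal.append(("commit", stripe_no, list(chunks)))  # stable first
    for i, c in enumerate(chunks):       # then the members, which may
        disks[i][stripe_no] = c          # be interrupted part-way

def replay(disks):
    for tag, stripe_no, chunks in journal:
        if tag == "commit":              # idempotent redo of each entry
            for i, c in enumerate(chunks):
                disks[i][stripe_no] = c

disks = [{}, {}, {}]
write_stripe(disks, 0, [1, 2, 3])
disks[1][0] = None                       # simulate a torn write
replay(disks)                            # restart: replay fixes it
assert [d[0] for d in disks] == [1, 2, 3]
```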

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-11-28  2:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-24  7:20 RAID-6 and write hole with write-intent bitmap Mukund Sivaraman
2020-11-24 10:10 ` Wols Lists
2020-11-24 18:50   ` Mukund Sivaraman
2020-11-24 20:16     ` Piergiorgio Sartor
2020-11-24 21:30     ` antlists
2020-11-28  1:57     ` Nix
2020-11-28  1:51   ` Nix

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox