RAID5 write hole?

linux-raid.vger.kernel.org archive mirror

* RAID5 write hole?
From: Shaochun Wang @ 2010-06-26 14:31 UTC
  To: linux-raid

Hi:

Recently I heard of the so-called "write hole" problem of RAID5 in
Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
its data disks using Linux software RAID, so I wonder how safe such a
system is.

If the "write hole" is inevitable, will it result in corruption of
the ext4 filesystem?

-- 
Shaochun Wang <scwang@ios.ac.cn>

Jabber: fungusw@jabber.org


* Re: RAID5 write hole?
From: Mikael Abrahamsson @ 2010-06-26 15:42 UTC
  To: Shaochun Wang; +Cc: linux-raid

On Sat, 26 Jun 2010, Shaochun Wang wrote:

> Hi:
>
> Recently I heard of the so-called "write hole" problem of RAID5 in
> Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> its data disks using Linux software RAID, so I wonder how safe such a
> system is.

RAID is never a replacement for backups; corruption can happen at multiple
levels in your system for different reasons. Non-ECC memory can suffer bit
flips that corrupt your data, the write hole can cause data corruption, etc.

Generally, unless you have very high demands on data integrity,
this is not a major problem.

Ext4 has other potential software/filesystem interactions when it comes to
data integrity: it buffers writes for quite some time, so even if you
think your file is saved, it might take many seconds before it is actually
on disk unless your software calls fsync() on it. Most software doesn't,
because fsync() took so long on ext3.
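
To make the fsync() point concrete, here is a minimal Python sketch of an application forcing its own data to stable storage instead of waiting for the filesystem's delayed write-back. The function name durable_write and the file name are made up for illustration.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device


if __name__ == "__main__":
    # Hypothetical file in a scratch directory, used only for the demo.
    demo_path = os.path.join(tempfile.mkdtemp(), "important.dat")
    durable_write(demo_path, b"saved for real")
    with open(demo_path, "rb") as f:
        print(f.read())
```

This only addresses buffering above the block layer; it does nothing about the write hole itself.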

So generally, don't worry too much, but make sure you have backups for 
your important data.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: RAID5 write hole?
From: Shaochun Wang @ 2010-06-26 21:28 UTC
  To: linux-raid

Maybe Sun's ZFS is the ultimate choice!

-- 
Shaochun Wang <scwang@ios.ac.cn>

Jabber: fungusw@jabber.org


* Re: RAID5 write hole?
From: John Hendrikx @ 2010-06-27 10:33 UTC
  To: Shaochun Wang; +Cc: linux-raid

Shaochun Wang wrote:
> Hi:
>
> Recently I heard of the so-called "write hole" problem of RAID5 in
> Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> its data disks using Linux software RAID, so I wonder how safe such a
> system is.
>
> If the "write hole" is inevitable, will it result in corruption of
> the ext4 filesystem?
The write hole occurs if your system crashes during a write operation, 
where one stripe gets updated but the other corresponding stripe does 
not.  This could lead to parity information not matching the 
corresponding data.

If the RAID5 system at least ensures that the data stripe is always
written before parity, then the monthly resync check that mdadm does
should be able to detect this and write new parity information.

At least this way the bad parity does not lurk around forever on your
RAID system, causing numerous problems when a disk finally fails.

The write hole is not inevitable, but would require some special 
measures at the raid level which could affect performance.  And as with 
any corruption, it could definitely corrupt your filesystem.

--John


* Re: RAID5 write hole?
From: Neil Brown @ 2010-06-27 12:16 UTC
  To: John Hendrikx; +Cc: Shaochun Wang, linux-raid

On Sun, 27 Jun 2010 12:33:49 +0200
John Hendrikx <hjohn@xs4all.nl> wrote:

> Shaochun Wang wrote:
> > Hi:
> >
> > Recently I heard of the so-called "write hole" problem of RAID5 in
> > Linux software RAID. I use an ext4 filesystem on my NAS, which assembles
> > its data disks using Linux software RAID, so I wonder how safe such a
> > system is.
> >
> > If the "write hole" is inevitable, will it result in corruption of
> > the ext4 filesystem?
> The write hole occurs if your system crashes during a write operation, 
> where one stripe gets updated but the other corresponding stripe does 
> not.  This could lead to parity information not matching the 
> corresponding data.

Correct.
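
As a toy illustration of the mismatch described above, the following Python sketch models one stripe of a three-disk RAID5 as two data chunks plus their XOR parity, and simulates a crash after a data chunk is written but before the parity is updated. It is a model only, not how md lays out or writes stripes; all names are invented for the example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks (parity for two data chunks)."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_ok(s: dict) -> bool:
    """True when the stored parity matches the stored data chunks."""
    return s["p"] == xor_bytes(s["d0"], s["d1"])


# One stripe on a 3-disk array: two data chunks and one parity chunk.
stripe = {"d0": b"AAAA", "d1": b"BBBB"}
stripe["p"] = xor_bytes(stripe["d0"], stripe["d1"])
consistent_before = parity_ok(stripe)

# Simulated crash: the new d0 reaches its disk, the matching parity write does not.
stripe["d0"] = b"CCCC"
consistent_after_crash = parity_ok(stripe)

print(consistent_before, consistent_after_crash)
```

While all disks are present the data itself is still readable; only the parity is stale.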

> 
> If the RAID5 system at least ensures that the data stripe is always
> written before parity, then the monthly resync check that mdadm does
> should be able to detect this and write new parity information.

This bit isn't so correct.
When the RAID5 is next assembled after the crash, if all devices are present
(i.e. the array is not degraded) then it will check and correct all the
parity blocks immediately.  If you have a write-intent-bitmap configured,
this will be quite quick.  If not it could take hours.

Once the resync has completed you are safe again; any risk from the "write
hole" will have disappeared.
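
A rough sketch of why the bitmap makes the post-crash resync quick: if the array records which regions had writes in flight, only those regions need their parity recomputed. The Python model below illustrates that idea under simplifying assumptions (one bitmap entry per stripe, XOR parity); the data structures and the resync function are hypothetical and do not reflect md's on-disk bitmap format.

```python
from functools import reduce
from typing import List, Optional, Set


def parity(chunks: List[bytes]) -> bytes:
    """XOR all data chunks of a stripe together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def resync(stripes: List[dict], dirty: Optional[Set[int]] = None) -> int:
    """Recompute parity and return how many stripes were examined.

    With dirty=None (no bitmap) every stripe is examined.
    With a bitmap, only stripes marked dirty before the crash are examined.
    """
    to_check = list(range(len(stripes))) if dirty is None else sorted(dirty)
    for i in to_check:
        stripes[i]["p"] = parity(stripes[i]["data"])
    return len(to_check)


def make_array(n: int) -> List[dict]:
    """Build n consistent stripes of two 4-byte data chunks each."""
    stripes = []
    for i in range(n):
        data = [bytes([i % 256] * 4), bytes([(i * 7) % 256] * 4)]
        stripes.append({"data": data, "p": parity(data)})
    return stripes


array = make_array(1000)
array[42]["data"][0] = b"\xff\xff\xff\xff"  # interrupted write: parity now stale
dirty_bitmap = {42}                         # marked dirty before the write began

checked_with_bitmap = resync(array, dirty_bitmap)
checked_without = resync(make_array(1000))
print(checked_with_bitmap, checked_without)
```

In this model the bitmap reduces the work from every stripe to just the one that was being written.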

If your array was degraded when the system crashed, or is degraded on
restart, or degrades before the resync completes, then you could suffer from
the "write hole" ... if a write was interrupted by the crash.

In the first two cases (which are effectively the same case), mdadm will
refuse to assemble the array because it knows it could be suffering from a
write-hole problem.  You need to reassemble with "--force" which means you
acknowledge that there could be corruption due to the write hole.
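
To see why the degraded case is the dangerous one, this toy Python model (XOR parity on a three-disk stripe, invented names, not md's actual code) loses a disk after an interrupted write. The chunk on the missing disk has to be rebuilt from the surviving chunk and the stale parity, and it comes back wrong even though nothing was being written to it.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


d0_old, d1 = b"AAAA", b"BBBB"
p_old = xor_bytes(d0_old, d1)  # parity consistent with the old data

# Crash mid-write: the new d0 reached its disk, the new parity did not.
d0_new = b"CCCC"
p_on_disk = p_old

# The disk holding d1 is now missing; rebuild d1 from what survives.
d1_rebuilt = xor_bytes(d0_new, p_on_disk)

print(d1_rebuilt == d1)
```

The comparison is false: the rebuilt chunk differs from the original d1, which is the corruption that "--force" asks you to acknowledge.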

If you lose a device during the resync you could still suffer from the write
hole, but md doesn't alert you to this.  That could be seen as a
shortcoming, but I'm not sure how it might be fixed.  I wouldn't want the
array to stop working just because there is suddenly a risk of write-hole
based corruption....

> 
> At least this way the bad parity does not lurk around forever on your
> raid system causing numerous problems when a disk finally fails.

Yes, it certainly does not lurk forever - the resync fixes it.

> 
> The write hole is not inevitable, but would require some special 
> measures at the raid level which could affect performance.  And as with 
> any corruption, it could definitely corrupt your filesystem.

The write hole can be "fixed" in two ways that I am aware of.
1/ log all writes (including parity updates) to some stable storage before
   writing them to the RAID5.  This is typically done in "hardware RAID" cards
   using NVRAM for the stable storage.
   Once NVRAM is widely available on commodity server hardware I suspect
   md/raid5 will be enhanced to support this.  I have thought about doing
   this using a RAID1 as the alternate stable storage, but the performance
   cost is unlikely to be acceptable.
2/ use a filesystem which understands the layout of the RAID5 and which
   somehow "knows" which stripes were written "recently" so that it can
   invalidate them (if it cannot verify them) after a crash.  This would
   almost certainly require a copy-on-write discipline in the filesystem.
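
Option 1 can be sketched as a write-ahead log: the new data and parity are first recorded in stable storage, then applied to the array, and after a crash the log is replayed so that data and parity end up consistent. The Python model below is a sketch under that assumption, with a plain list standing in for the NVRAM; none of these names correspond to md interfaces.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


class JournaledStripe:
    """Toy 3-disk RAID5 stripe with a write-ahead log (stand-in for NVRAM)."""

    def __init__(self, d0: bytes, d1: bytes) -> None:
        self.disks = {"d0": d0, "d1": d1, "p": xor_bytes(d0, d1)}
        self.log = []  # pretend this list survives a crash

    def write_d0(self, new_d0: bytes, crash_after_first_write: bool = False) -> None:
        new_p = xor_bytes(new_d0, self.disks["d1"])
        # 1. Log every block of the update before touching the array.
        self.log.append({"d0": new_d0, "p": new_p})
        # 2. Apply to the array; a crash may interrupt this step.
        self.disks["d0"] = new_d0
        if crash_after_first_write:
            return  # simulated crash: the parity write never happened
        self.disks["p"] = new_p
        # 3. The update is fully on disk, so the log entry can be dropped.
        self.log.pop()

    def recover(self) -> None:
        """Replay logged updates after a crash; replaying is idempotent."""
        for entry in self.log:
            self.disks.update(entry)
        self.log.clear()

    def consistent(self) -> bool:
        return self.disks["p"] == xor_bytes(self.disks["d0"], self.disks["d1"])


s = JournaledStripe(b"AAAA", b"BBBB")
s.write_d0(b"CCCC", crash_after_first_write=True)
consistent_after_crash = s.consistent()
s.recover()
print(consistent_after_crash, s.consistent())
```

The cost is visible even in the toy: every update is written twice, once to the log and once to the array.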

NeilBrown


* Re: RAID5 write hole?
From: Shaochun Wang @ 2010-06-29  6:16 UTC
  To: linux-raid

On Sun, Jun 27, 2010 at 10:16:13PM +1000, Neil Brown wrote:
> On Sun, 27 Jun 2010 12:33:49 +0200
> John Hendrikx <hjohn@xs4all.nl> wrote:
> 
> parity blocks immediately.  If you have a write-intent-bitmap configured,
> this will be quite quick.  If not it could take hours.
How do I know whether my RAID5 has a write-intent bitmap enabled?

-- 
Shaochun Wang <scwang@ios.ac.cn>

Jabber: fungusw@jabber.org


* Re: RAID5 write hole?
From: Mikael Abrahamsson @ 2010-06-29  6:23 UTC
  To: Shaochun Wang; +Cc: linux-raid

On Tue, 29 Jun 2010, Shaochun Wang wrote:

> How do I know whether my RAID5 has a write-intent bitmap enabled?

$ cat /proc/mdstat | grep bitmap
       bitmap: 0/8 pages [0KB], 131072KB chunk

If you don't get any bitmap information in there, it's not enabled.
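
The same check can be scripted per array. This Python sketch parses /proc/mdstat-style text and reports which md devices have a bitmap line. The parsing is a simple heuristic (it assumes array stanzas start with "mdN :" and are separated by blank lines), and the sample text is made up for the example.

```python
import re
from typing import Dict


def arrays_with_bitmap(mdstat_text: str) -> Dict[str, bool]:
    """Map each md device name to whether its stanza contains a 'bitmap:' line."""
    result: Dict[str, bool] = {}
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\S*)\s*:", line)
        if m:
            current = m.group(1)       # start of a new array stanza
            result[current] = False
        elif not line.strip():
            current = None             # a blank line ends the stanza
        elif current and line.strip().startswith("bitmap:"):
            result[current] = True
    return result


# Made-up sample in the general shape of /proc/mdstat output.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid5 sdc1[2] sdb1[1] sda1[0]
      1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/8 pages [0KB], 131072KB chunk

md1 : active raid1 sde1[1] sdd1[0]
      976759936 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(arrays_with_bitmap(SAMPLE))
    # On a real system, pass the contents of /proc/mdstat instead of SAMPLE.
```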

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: RAID5 write hole?
From: Shaochun Wang @ 2010-06-29 13:28 UTC
  To: linux-raid

On Tue, Jun 29, 2010 at 08:23:35AM +0200, Mikael Abrahamsson wrote:
> On Tue, 29 Jun 2010, Shaochun Wang wrote:
> 
> $ cat /proc/mdstat | grep bitmap
>        bitmap: 0/8 pages [0KB], 131072KB chunk
> 
> If you don't get any bitmap information in there, it's not enabled.
It seems that I do not have a write-intent bitmap enabled. If a
write-intent bitmap is useful, why is it not enabled by default?

-- 
Shaochun Wang <scwang@ios.ac.cn>

Jabber: fungusw@jabber.org
