linux-raid.vger.kernel.org archive mirror
* blog entry on RAID limitation
       [not found] <8E5ACAE05E6B9E44A2903C693A5D4E8A097F6C86@hqemmail02.nvidia.com>
@ 2006-01-17  8:45 ` Jeff Breidenbach
  2006-01-17 10:40   ` Neil Brown
  2006-01-21 20:43   ` Carlos Carvalho
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Breidenbach @ 2006-01-17  8:45 UTC (permalink / raw)
  To: linux-raid

Is this a real issue or ignorable Sun propaganda?

-----Original Message-----
From: I-Gene Leong
Subject: RE: [colo] OT: Server Hardware Recommendations
Date: Mon, 16 Jan 2006 14:10:33 -0800

There was an interesting blog entry out in relation to Sun's RAID-Z
talking about RAID-5 shortcomings:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

It sounds to me like RAID-1 would also be vulnerable to the write hole
mentioned inside.
- I-Gene

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-17  8:45 ` blog entry on RAID limitation Jeff Breidenbach
@ 2006-01-17 10:40   ` Neil Brown
  2006-01-17 16:42     ` Jacob Madsen
  2006-01-21 20:43   ` Carlos Carvalho
  1 sibling, 1 reply; 10+ messages in thread
From: Neil Brown @ 2006-01-17 10:40 UTC (permalink / raw)
  To: Jeff Breidenbach; +Cc: linux-raid

On Tuesday January 17, jeff@jab.org wrote:
> Is this a real issue or ignorable Sun propaganda?

Well.... the 'raid-5 write hole' is old news.  It's been discussed on
this list several times and doesn't seem to actually stop people from
getting a lot of value out of software raid5.

Nonetheless, their raid-z certainly seems interesting, though I feel
the term is misleading.  raid-z doesn't provide a virtual storage
device in which you can store whatever filesystem you like.  raid-z is
their code name for a particular aspect of the ZFS filesystem.

Though some of these details are guessed and so might be wrong, it
probably goes something like this:

ZFS uses a 'variable block size' which is probably very similar to
what other filesystems call 'extents'.  When an extent is written, a
hash (aka checksum or MIC - message integrity check) is calculated and
stored, probably with the indexing information.  This makes it easy to
check for media errors.
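
A toy sketch of that part (Python; the names and on-disk layout are my
guesses, not ZFS's):

    import hashlib

    def write_extent(dev, offset, data):
        # store the data; keep the checksum with the *pointer*,
        # alongside the indexing information, not with the data itself
        dev[offset:offset + len(data)] = data
        return {"offset": offset, "length": len(data),
                "sum": hashlib.sha256(data).hexdigest()}

    def read_extent(dev, ptr):
        # any media error shows up as a checksum mismatch on read
        data = bytes(dev[ptr["offset"]:ptr["offset"] + ptr["length"]])
        if hashlib.sha256(data).hexdigest() != ptr["sum"]:
            raise IOError("media error detected by checksum")
        return data

    disk = bytearray(1024)
    ptr = write_extent(disk, 0, b"some extent")
    assert read_extent(disk, ptr) == b"some extent"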

Also the extent is possibly written over various devices, quite
possibly at different locations on the different devices.  It might be
written twice, thus producing effective mirroring.  It might be
chopped up into bits with the bits written to different devices and a
parity block written to another device.  This produces an effect
similar to raid5.
This layout can even be different for different blocks.
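
A rough sketch of such per-extent layout choices (Python; again
guessed, not ZFS's actual policy):

    def xor_parity(chunks):
        # byte-wise XOR of the data chunks gives the parity chunk
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def lay_out_extent(data, devices, small=4096):
        # small extent: write it twice (mirroring);
        # large extent: chop it up and add a parity chunk (raid5-like)
        if len(data) <= small:
            return [("copy", devices[0], data), ("copy", devices[1], data)]
        n = len(devices) - 1                  # last device takes parity
        size = -(-len(data) // n)             # ceiling division
        chunks = [data[i*size:(i+1)*size].ljust(size, b"\0")
                  for i in range(n)]
        writes = [("data", dev, c) for dev, c in zip(devices, chunks)]
        writes.append(("parity", devices[-1], xor_parity(chunks)))
        return writes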

On a regular (Ext3 like) filesystem this would be very awkward as
updating a block would be confusingly hard.  However ZFS never updates
in place.  It is 'copy on write' so any change is written to a new
location and updating the indexing and MIC is all part of the package.
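
In sketch form (Python; a single flat block map standing in for the
real indexing):

    def cow_update(dev, free_list, block_map, key, new_data):
        # never overwrite the live copy: write to a fresh location first
        new_off = free_list.pop()
        dev[new_off:new_off + len(new_data)] = new_data
        old = block_map.get(key)
        # only now swing the pointer; until this moment the old block,
        # and the old index entry pointing at it, were still fully valid
        block_map[key] = {"offset": new_off, "length": len(new_data)}
        if old is not None:
            free_list.append(old["offset"])   # old copy can be reclaimed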

Note that not only data blocks, but also indirect blocks and all
metadata, could be duplicated or striped with parity.

This is definitely a clever idea, as are lots of the ideas in ZFS.
But just because someone has had a clever idea, that doesn't reduce
the value of existing clever ideas like raid5.

In general, I think increasing the connection between the filesystem
and the volume manager/virtual storage is a good idea.  Finding the
right balance is not going to be trivial.  ZFS has taken one very
interesting approach.  There are others.

I have a feeling the above isn't as coherent as I would like.  Maybe I
should go to bed....



> 
> -----Original Message-----
> From: I-Gene Leong
> Subject: RE: [colo] OT: Server Hardware Recommendations
> Date: Mon, 16 Jan 2006 14:10:33 -0800
> 
> There was an interesting blog entry out in relation to Sun's RAID-Z
> talking about RAID-5 shortcomings:
> 
> http://blogs.sun.com/roller/page/bonwick?entry=raid_z
> 
> It sounds to me like RAID-1 would also be vulnerable to the write hole
> mentioned inside.

The 'write hole' exists for all raid levels with redundancy.  The
'resync' process after an unclean shutdown closes the hole,
eventually.

With raid-5, a drive failure while the hole is open means potential
undetectable data loss.  With raid-1, a drive failure doesn't imply
data loss even during the hole.
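
A toy demonstration of the raid5 case (Python, with one byte per
device and everything else stripped away):

    # raid5 stripe: two data bytes and their parity
    d0, d1 = 0x11, 0x22
    p = d0 ^ d1                       # parity consistent with d0, d1

    # update d0; the data write reaches the disk, but the machine
    # crashes before the matching parity write does
    d0 = 0x33                         # new data on disk
    # p is still the OLD parity  ->  the stripe is now inconsistent

    # if the disk holding d1 now dies before a resync fixes the parity,
    # raid5 reconstruction returns garbage for a block that was never
    # even being written:
    reconstructed_d1 = d0 ^ p
    assert reconstructed_d1 != 0x22   # 0x00 here: silent data loss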

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-17 10:40   ` Neil Brown
@ 2006-01-17 16:42     ` Jacob Madsen
  2006-01-17 22:47       ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Jacob Madsen @ 2006-01-17 16:42 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:
> In general, I think increasing the connection between the filesystem
> and the volume manager/virtual storage is a good idea.  Finding the
> right balance is not going to be trivial.  ZFS has taken one very
> interesting approach.  There are others.
>   
Just out of curiosity... When you say there are others, are you then
referring to existing solutions or just saying other approaches will be
developed in the future?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-17 16:42     ` Jacob Madsen
@ 2006-01-17 22:47       ` Neil Brown
  2006-01-21 20:50         ` Carlos Carvalho
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2006-01-17 22:47 UTC (permalink / raw)
  To: Jacob Madsen; +Cc: linux-raid

On Tuesday January 17, jacob@mungo.dk wrote:
> Neil Brown wrote:
> > In general, I think increasing the connection between the filesystem
> > and the volume manager/virtual storage is a good idea.  Finding the
> > right balance is not going to be trivial.  ZFS has taken one very
> > interesting approach.  There are others.
> >   
> Just out of curiosity... When you say there are others, are you then
> referring to existing solutions or just saying other approaches will be
> developed in the future?

There was a paper given at the USENIX FAST conference
  
    http://www.cs.wisc.edu/adsl/Publications/fast05-journal-guided.pdf

which discussed modifications to ext3 so that after a crash, it would
tell the underlying raid which blocks might have been undergoing a
'write' at the time of the crash, so that raid5 could resync just
those stripes.  This reduces the resync time much more efficiently
than write-intent logging does.
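
Roughly, in sketch form (Python; the `array` interface is invented for
illustration, not their code):

    from functools import reduce

    def resync(array, hinted_stripes=None):
        # with no hint, every stripe must be checked after a crash;
        # with the journal's hint, only the stripes that may have been
        # mid-write get their parity recomputed
        stripes = (hinted_stripes if hinted_stripes is not None
                   else range(array.n_stripes))
        for s in stripes:
            chunks = [array.read(d, s) for d in array.data_disks]
            parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                            chunks)
            array.write(array.parity_disk, s, parity)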

I have had a project underway for some time (about half a day a week
at the moment) to create a file system which is raid-friendly.  When
configured on a raid5, it will always write a full stripe at a time,
and never over-write live data.  This means that there is no need to
pre-read parity or data, and it completely removes the "write hole".
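
To contrast the two write paths, a sketch (Python; the `array`
interface is invented for illustration, this is not the actual code):

    from functools import reduce

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def small_update_rmw(array, stripe, idx, new_chunk):
        # classic raid5 small write: pre-read old data and old parity,
        # then write both -- a crash between the two writes opens the hole
        old_chunk = array.read(array.data_disks[idx], stripe)
        old_parity = array.read(array.parity_disk, stripe)
        array.write(array.data_disks[idx], stripe, new_chunk)
        array.write(array.parity_disk, stripe,
                    xor(xor(old_parity, old_chunk), new_chunk))

    def full_stripe_write(array, stripe, chunks):
        # a raid-friendly filesystem always has a whole stripe of new
        # data in hand, so parity needs no pre-reads; and because live
        # data is never overwritten, a crash mid-stripe loses nothing
        for disk, chunk in zip(array.data_disks, chunks):
            array.write(disk, stripe, chunk)
        array.write(array.parity_disk, stripe, reduce(xor, chunks))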

NeilBrown


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-17  8:45 ` blog entry on RAID limitation Jeff Breidenbach
  2006-01-17 10:40   ` Neil Brown
@ 2006-01-21 20:43   ` Carlos Carvalho
  1 sibling, 0 replies; 10+ messages in thread
From: Carlos Carvalho @ 2006-01-21 20:43 UTC (permalink / raw)
  To: linux-raid

Jeff Breidenbach (jeff@jab.org) wrote on 17 January 2006 00:45:
 >Is this a real issue or ignorable Sun propaganda?
 >
 >-----Original Message-----
 >From: I-Gene Leong
 >Subject: RE: [colo] OT: Server Hardware Recommendations
 >Date: Mon, 16 Jan 2006 14:10:33 -0800
 >
 >There was an interesting blog entry out in relation to Sun's RAID-Z
 >talking about RAID-5 shortcomings:
 >
 >http://blogs.sun.com/roller/page/bonwick?entry=raid_z
 >
 >It sounds to me like RAID-1 would also be vulnerable to the write hole
 >mentioned inside.

The write-hole exists ONLY when the machine stops without a proper
shutdown AND with an incomplete array (e.g. one disk out of the array
in a raid5).  Sometimes this happens when the machine crashes or power
goes down and on reboot one disk fails.

If I understand Sun's marketing correctly, ZFS always writes full
stripes across all disks, which means the array is never dirty.
Therefore the write-hole indeed doesn't exist.

The problem with the argument is that the write-hole is not such a
big problem on a well-behaved server, because the probability of a
crash and an incomplete array happening simultaneously is very small,
so Sun's feature is not that important.
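
With some purely invented numbers to make the point:

    # illustrative figures only, not measurements
    crashes_per_year = 2           # unclean shutdowns per year
    degraded_fraction = 0.001      # fraction of time the array is incomplete
    exposures_per_year = crashes_per_year * degraded_fraction
    print(1 / exposures_per_year)  # ~500 years between exposures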

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-17 22:47       ` Neil Brown
@ 2006-01-21 20:50         ` Carlos Carvalho
  2006-01-24  3:20           ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Carlos Carvalho @ 2006-01-21 20:50 UTC (permalink / raw)
  To: linux-raid

Neil Brown (neilb@suse.de) wrote on 18 January 2006 09:47:
 >On Tuesday January 17, jacob@mungo.dk wrote:
 >> Neil Brown wrote:
 >> > In general, I think increasing the connection between the filesystem
 >> > and the volume manager/virtual storage is a good idea.

Well, I agree in principle; however, the increase in complexity is
likely to make the whole thing even harder to make as reliable as one
needs... Just consider the complexity of xfs or reiser4...

 >I have had a project underway for some time (about half a day a week
 >at the moment) to create a file system which is raid-friendly.  When
 >configured on a raid5, it will always write a full stripe at a time,
 >and never over-write live data.  This means that there is no need to
 >pre-read parity or data, and it completely removes the "write hole".

This seems interesting not so much because of the write-hole but
because of the possible increase in speed. I'm not going to ask about
the filesystem features because I think you already said in the list
that you want to play with it yourself :-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
@ 2006-01-22 11:52 Rik Herrin
  2006-01-22 23:08 ` Neil Brown
  2006-01-27 13:09 ` Molle Bestefich
  0 siblings, 2 replies; 10+ messages in thread
From: Rik Herrin @ 2006-01-22 11:52 UTC (permalink / raw)
  To: linux-raid

Wouldn't connecting a UPS + using a stable kernel
version remove 90% or so of the "RAID-5 write hole"
problem?  Are there any other means to know when this
has occurred?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-22 11:52 Rik Herrin
@ 2006-01-22 23:08 ` Neil Brown
  2006-01-27 13:09 ` Molle Bestefich
  1 sibling, 0 replies; 10+ messages in thread
From: Neil Brown @ 2006-01-22 23:08 UTC (permalink / raw)
  To: Rik Herrin; +Cc: linux-raid

On Sunday January 22, rikherrin@yahoo.com wrote:
> Wouldn't connecting a UPS + using a stable kernel
> version remove 90% or so of the "RAID-5 write hole"
> problem?  Are there any other means to know when this
> has occurred?

A UPS would help, as would a well-tested stable kernel.
However sometimes power supplies die, so make sure you have two of
them...

Yes, there are several ways to mitigate the risk.  However they cost
money and the risk is actually already very small in the first place.  
Coming up with a by-design approach that removes the risk completely
is still a good goal, providing the performance costs are reasonable
(i.e. almost zero).

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-21 20:50         ` Carlos Carvalho
@ 2006-01-24  3:20           ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2006-01-24  3:20 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

On Saturday January 21, carlos@fisica.ufpr.br wrote:
> Neil Brown (neilb@suse.de) wrote on 18 January 2006 09:47:
>  >On Tuesday January 17, jacob@mungo.dk wrote:
>  >> Neil Brown wrote:
>  >> > In general, I think increasing the connection between the filesystem
>  >> > and the volume manager/virtual storage is a good idea.
> 
> Well, I agree in principle however the increase in complexity is
> likely to make the whole thing even harder to be as reliable as one
> needs... Just consider the complication of xfs or reiser4...

I wouldn't generally expect current filesystems to be enhanced to
understand volume management.  Rather that new filesystems would
include volume management in their core design (like ZFS does).
The ext3 enhancement I mentioned is an exception to this, and it is a
very light-weight integration between FS and LVM.

If you think about it, volume management concepts are largely very
simple, and are very similar to some filesystem concepts.
Doing them both together makes a lot of sense - it leverages the
synergies (if you'll excuse the buzzwords).

A lot of the complexity in volume management comes from trying to
present an illusion of a single large device to the filesystem.  If
you didn't have to construct that illusion, you would need a lot less
code.

The blog entry which started this thread made some comment about how
little code was needed for the raid-z implementation.  I suspect this
is largely because there is no need to create illusions.

> 
>  >I have had a project underway for some time (about half a day a week
>  >at the moment) to create a file system which is raid-friendly.  When
>  >configured on a raid5, it will always write a full stripe at a time,
>  >and never over-write live data.  This means that there is no need to
>  >pre-read parity or data, and it completely removes the "write hole".
> 
> This seems interesting not so much because of the write-hole but
> because of the possible increase in speed. I'm not going to ask about
> the filesystem features because I think you already said in the list
> that you want to play with it yourself :-)

I plan to post my code and doco somewhere once I get to a particular
milestone.  However that milestone seems to keep receding into the
distance....
I want the kernel module that I am writing to have substantial
read-only functionality on a filesystem that spans multiple devices.
However I have been waylaid by having to rewrite the directory
handling, because my first draft didn't provide stable seek offsets
for filenames, and that really is a 'MUST'.
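
For illustration only, one way stable offsets can work is to derive
the seek cookie from the name itself rather than from the entry's
position (a Python sketch, ignoring the hash collisions a real
implementation must handle):

    import zlib

    def dir_offset(name):
        # the 'offset' is a function of the name, not of the entry's
        # position in the directory file, so it stays stable across
        # inserts and deletes of other entries
        return zlib.crc32(name.encode())

    def readdir_from(entries, cookie=0):
        # resume a listing after the entry whose offset equals cookie
        for name in sorted(entries, key=dir_offset):
            off = dir_offset(name)
            if off > cookie:
                yield name, off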

NeilBrown


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: blog entry on RAID limitation
  2006-01-22 11:52 Rik Herrin
  2006-01-22 23:08 ` Neil Brown
@ 2006-01-27 13:09 ` Molle Bestefich
  1 sibling, 0 replies; 10+ messages in thread
From: Molle Bestefich @ 2006-01-27 13:09 UTC (permalink / raw)
  To: linux-raid

Rik Herrin wrote:
> Wouldn't connecting a UPS + using a stable kernel
> version remove 90% or so of the "RAID-5 write hole"
> problem?

There are some RAID systems that you'd rather not have redundant power on.

Think encryption.  As long as a system is online, it's normal for it
to have encryption keys in memory and it's disk systems mounted
through the decryption system.  You wouldn't want someone to be able
to steal your server along with the UPS and stuff it in a van with a
power inverter :-).

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-01-27 13:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <8E5ACAE05E6B9E44A2903C693A5D4E8A097F6C86@hqemmail02.nvidia.com>
2006-01-17  8:45 ` blog entry on RAID limitation Jeff Breidenbach
2006-01-17 10:40   ` Neil Brown
2006-01-17 16:42     ` Jacob Madsen
2006-01-17 22:47       ` Neil Brown
2006-01-21 20:50         ` Carlos Carvalho
2006-01-24  3:20           ` Neil Brown
2006-01-21 20:43   ` Carlos Carvalho
2006-01-22 11:52 Rik Herrin
2006-01-22 23:08 ` Neil Brown
2006-01-27 13:09 ` Molle Bestefich
