* blog entry on RAID limitation
       [not found] <8E5ACAE05E6B9E44A2903C693A5D4E8A097F6C86@hqemmail02.nvidia.com>
@ 2006-01-17  8:45 ` Jeff Breidenbach
  2006-01-17 10:40   ` Neil Brown
  2006-01-21 20:43   ` Carlos Carvalho
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Breidenbach @ 2006-01-17  8:45 UTC (permalink / raw)
  To: linux-raid

Is this a real issue or ignorable Sun propaganda?

-----Original Message-----
From: I-Gene Leong
Subject: RE: [colo] OT: Server Hardware Recommendations
Date: Mon, 16 Jan 2006 14:10:33 -0800

There was an interesting blog entry out in relation to Sun's RAID-Z
talking about RAID-5 shortcomings:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

It sounds to me like RAID-1 would also be vulnerable to the write hole
mentioned inside.

- I-Gene

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-17  8:45 ` blog entry on RAID limitation Jeff Breidenbach
@ 2006-01-17 10:40   ` Neil Brown
  2006-01-17 16:42     ` Jacob Madsen
  2006-01-21 20:43   ` Carlos Carvalho
  1 sibling, 1 reply; 10+ messages in thread
From: Neil Brown @ 2006-01-17 10:40 UTC (permalink / raw)
  To: Jeff Breidenbach; +Cc: linux-raid

On Tuesday January 17, jeff@jab.org wrote:
> Is this a real issue or ignorable Sun propaganda?

Well.... the 'raid-5 write hole' is old news.  It's been discussed on
this list several times and doesn't seem to actually stop people
getting a lot of value out of software raid5.

Nonetheless, their raid-z certainly seems interesting, though I feel
the term is misleading.  raid-z doesn't provide a virtual storage
device in which you can store whatever filesystem you like.  raid-z is
their code name for a particular aspect of the ZFS filesystem.

Though some of these details are guessed and so might be wrong, it
probably goes something like this:

ZFS uses a 'variable block size' which is probably very similar to
what other filesystems call 'extents'.  When an extent is written, a
hash (aka checksum or MIC - message integrity check) is calculated and
stored, probably with the indexing information.  This makes it easy to
check for media errors.

Also the extent is possibly written over various devices, quite
possibly at different locations on the different devices.  It might be
written twice, thus producing effective mirroring.  It might be
chopped up into bits, with the bits written to different devices and a
parity block written to another device.  This produces an effect
similar to raid5.  This layout can even be different for different
blocks.

On a regular (Ext3-like) filesystem this would be very awkward, as
updating a block would be confusingly hard.  However ZFS never updates
in place.  It is 'copy on write', so any change is written to a new
location, and updating the indexing and MIC is all part of the
package.
Note that not only data blocks, but also indirect blocks and all
metadata, could be duplicated or striped with parity.

This is definitely a clever idea, as are lots of the ideas in ZFS.
But just because someone has had a clever idea, that doesn't reduce
the value of existing clever ideas like raid5.

In general, I think increasing the connection between the filesystem
and the volume manager/virtual storage is a good idea.  Finding the
right balance is not going to be trivial.  ZFS has taken one very
interesting approach.  There are others.

I have a feeling the above isn't as coherent as I would like.  Maybe I
should go to bed....

>
> -----Original Message-----
> From: I-Gene Leong
> Subject: RE: [colo] OT: Server Hardware Recommendations
> Date: Mon, 16 Jan 2006 14:10:33 -0800
>
> There was an interesting blog entry out in relation to Sun's RAID-Z
> talking about RAID-5 shortcomings:
>
> http://blogs.sun.com/roller/page/bonwick?entry=raid_z
>
> It sounds to me like RAID-1 would also be vulnerable to the write hole
> mentioned inside.

The 'write hole' exists for all raid levels with redundancy.  The
'resync' process after an unclean shutdown closes the hole,
eventually.  With raid-5, a drive failure while the hole is open means
potential undetectable data loss.  With raid-1, a drive failure
doesn't imply data loss even during the hole.

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread
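The raid-5 vs raid-1 distinction in that last paragraph can be made
concrete with a toy model.  This is an illustrative sketch only (plain
XOR arithmetic in Python, not the md driver's actual code):

```python
# Toy model of the raid-5 'write hole' (illustrative only).

def parity(chunks):
    # raid5 parity is the byte-wise XOR of all data chunks
    out = bytes(len(chunks[0]))
    for c in chunks:
        out = bytes(a ^ b for a, b in zip(out, c))
    return out

# A healthy 3-disk stripe: two data chunks plus parity.
d0, d1 = b"AAAA", b"BBBB"
p = parity([d0, d1])

# Update d0, but crash before the matching parity write; the
# stripe on disk is now internally inconsistent.
d0 = b"CCCC"    # new data reaches disk 0
# p is still parity([b"AAAA", b"BBBB"])  <- the open 'hole'

# If disk 1 now fails, reconstruction derives d1 from d0 and p:
d1_rebuilt = parity([d0, p])
assert d1_rebuilt != b"BBBB"    # silent corruption of a chunk
                                # that was never even being written

# raid-1 in the same situation: one mirror old, one new.
mirror = [b"old!", b"old!"]
mirror[0] = b"new!"    # crash before mirror[1] is updated
# Whichever copy survives a drive failure is real data (either the
# old or the new contents), never reconstructed garbage.
assert all(m in (b"old!", b"new!") for m in mirror)
```

The raid-1 half shows why the hole is more benign there: a surviving
mirror always holds genuine data, old or new, whereas raid-5
reconstruction through stale parity manufactures bytes that were never
written anywhere.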
* Re: blog entry on RAID limitation
  2006-01-17 10:40 ` Neil Brown
@ 2006-01-17 16:42   ` Jacob Madsen
  2006-01-17 22:47     ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Jacob Madsen @ 2006-01-17 16:42 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:
> In general, I think increasing the connection between the filesystem
> and the volume manager/virtual storage is a good idea.  Finding the
> right balance is not going to be trivial.  ZFS has taken one very
> interesting approach.  There are others.
>
Just out of curiosity... When you say there are others, are you
referring to existing solutions, or just saying other approaches will
be developed in the future?

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-17 16:42 ` Jacob Madsen
@ 2006-01-17 22:47   ` Neil Brown
  2006-01-21 20:50     ` Carlos Carvalho
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2006-01-17 22:47 UTC (permalink / raw)
  To: Jacob Madsen; +Cc: linux-raid

On Tuesday January 17, jacob@mungo.dk wrote:
> Neil Brown wrote:
> > In general, I think increasing the connection between the filesystem
> > and the volume manager/virtual storage is a good idea.  Finding the
> > right balance is not going to be trivial.  ZFS has taken one very
> > interesting approach.  There are others.
> >
> Just out of curiosity... When you say there are others, are you then
> referring to existing solutions or just saying other approaches will
> be developed in the future?

There was a paper given at the USENIX FAST conference

  http://www.cs.wisc.edu/adsl/Publications/fast05-journal-guided.pdf

which discussed modifications to ext3 so that after a crash, it would
tell the underlying raid which blocks might have been undergoing a
'write' at the time of the crash, so that raid5 could resync just
those stripes.  This reduces the resync time much more efficiently
than write-intent logging does.

I have had a project underway for some time (about half a day a week
at the moment) to create a file system which is raid-friendly.  When
configured on a raid5, it will always write a full stripe at a time,
and never over-write live data.  This means that there is no need to
pre-read parity or data, and it completely removes the "write hole".

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread
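The full-stripe-write idea can be sketched the same way: a small
update into an existing stripe must first read back the old data and
old parity, while a filesystem that only ever writes whole stripes
computes parity entirely from data it already holds in memory.  An
illustrative Python sketch (not code from the unreleased filesystem
described above):

```python
# Why full-stripe writes need no pre-reads (illustrative only).

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Read-modify-write: updating one chunk of an existing stripe needs
# the OLD data chunk and OLD parity read back from disk first:
old_d0 = b"AAAA"                 # pre-read 1
old_p = xor(b"AAAA", b"BBBB")    # pre-read 2 (parity of the stripe)
new_d0 = b"CCCC"
new_p = xor(xor(old_p, old_d0), new_d0)
# ...then two writes (new_d0 and new_p), with a window between them
# during which the stripe is inconsistent -- the write hole.

# Full-stripe write: gather enough dirty data to fill a whole stripe,
# so parity comes purely from in-memory data.  Zero pre-reads, and no
# live stripe is partially overwritten, so no inconsistency window:
d0, d1, d2 = b"CCCC", b"DDDD", b"EEEE"
p = xor(xor(d0, d1), d2)
```

The speed benefit Carlos mentions downthread falls out of the same
property: the read-modify-write path costs two reads plus two writes
per small update, while the full-stripe path is pure sequential
writing.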
* Re: blog entry on RAID limitation
  2006-01-17 22:47 ` Neil Brown
@ 2006-01-21 20:50   ` Carlos Carvalho
  2006-01-24  3:20     ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Carlos Carvalho @ 2006-01-21 20:50 UTC (permalink / raw)
  To: linux-raid

Neil Brown (neilb@suse.de) wrote on 18 January 2006 09:47:
 >On Tuesday January 17, jacob@mungo.dk wrote:
 >> Neil Brown wrote:
 >> > In general, I think increasing the connection between the filesystem
 >> > and the volume manager/virtual storage is a good idea.

Well, I agree in principle, however the increase in complexity is
likely to make the whole thing even harder to be as reliable as one
needs... Just consider the complication of xfs or reiser4...

 >I have had a project underway for some time (about half a day a week
 >at the moment) to create a file system which is raid-friendly.  When
 >configured on a raid5, it will always write a full stripe at a time,
 >and never over-write live data.  This means that there is no need to
 >pre-read parity or data, and it completely removes the "write hole".

This seems interesting, not so much because of the write hole but
because of the possible increase in speed.  I'm not going to ask about
the filesystem features because I think you already said on the list
that you want to play with it yourself :-)

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-21 20:50 ` Carlos Carvalho
@ 2006-01-24  3:20   ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2006-01-24  3:20 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

On Saturday January 21, carlos@fisica.ufpr.br wrote:
> Neil Brown (neilb@suse.de) wrote on 18 January 2006 09:47:
>  >On Tuesday January 17, jacob@mungo.dk wrote:
>  >> Neil Brown wrote:
>  >> > In general, I think increasing the connection between the filesystem
>  >> > and the volume manager/virtual storage is a good idea.
>
> Well, I agree in principle however the increase in complexity is
> likely to make the whole thing even harder to be as reliable as one
> needs... Just consider the complication of xfs or reiser4...

I wouldn't generally expect current filesystems to be enhanced to
understand volume management.  Rather that new filesystems would
include volume management in their core design (like ZFS does).  The
ext3 enhancement I mentioned is an exception to this, and it is a very
light-weight integration between FS and LVM.

If you think about it, volume management concepts are largely very
simple, and are very similar to some filesystem concepts.  Doing them
both together makes a lot of sense - it leverages the synergies (if
you'll excuse the buzzwords).

A lot of the complexity in volume management comes from trying to
present an illusion of a single large device to the filesystem.  If
you didn't have to construct that illusion, you would need a lot less
code.  The blog entry which started this thread made some comment
about how little code was needed for the raid-z implementation.  I
suspect this is largely because there is no need to create illusions.

> >I have had a project underway for some time (about half a day a week
> >at the moment) to create a file system which is raid-friendly.  When
> >configured on a raid5, it will always write a full stripe at a time,
> >and never over-write live data.  This means that there is no need to
> >pre-read parity or data, and it completely removes the "write hole".
>
> This seems interesting not so much because of the write-hole but
> because of the possible increase in speed. I'm not going to ask about
> the filesystem features because I think you already said in the list
> that you want to play with it yourself :-)

I plan to post my code and doco somewhere once I get to a particular
milestone.  However that milestone seems to keep receding into the
distance....

I want the kernel module that I am writing to have substantial
read-only functionality on a filesystem that spans multiple devices.
However I have been waylaid by having to rewrite the directory
handling, because my first draft didn't provide stable seek offsets
for filenames, and that really is a 'MUST'.

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-17  8:45 ` blog entry on RAID limitation Jeff Breidenbach
  2006-01-17 10:40   ` Neil Brown
@ 2006-01-21 20:43   ` Carlos Carvalho
  1 sibling, 0 replies; 10+ messages in thread
From: Carlos Carvalho @ 2006-01-21 20:43 UTC (permalink / raw)
  To: linux-raid

Jeff Breidenbach (jeff@jab.org) wrote on 17 January 2006 00:45:
 >Is this a real issue or ignorable Sun propaganda?
 >
 >-----Original Message-----
 >From: I-Gene Leong
 >Subject: RE: [colo] OT: Server Hardware Recommendations
 >Date: Mon, 16 Jan 2006 14:10:33 -0800
 >
 >There was an interesting blog entry out in relation to Sun's RAID-Z
 >talking about RAID-5 shortcomings:
 >
 >http://blogs.sun.com/roller/page/bonwick?entry=raid_z
 >
 >It sounds to me like RAID-1 would also be vulnerable to the write hole
 >mentioned inside.

The write hole exists ONLY when the machine stops without a proper
shutdown AND with an incomplete array (e.g. one disk out of the array
in a raid5).  Sometimes this happens when the machine crashes or power
goes down, and on reboot one disk fails.

If I understand Sun's marketing, ZFS always writes full stripes to all
disks, which means the array is never dirty.  Therefore the write hole
indeed doesn't exist.

The problem with the argument is that the write hole is not such a big
problem on a well-behaving server, because the probability of a crash
and an incomplete array happening simultaneously is very small, so
Sun's feature is not so important.

^ permalink raw reply	[flat|nested] 10+ messages in thread
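The "very small probability" claim is a product of two already-small
probabilities, which a back-of-envelope calculation makes visible.
All the rates below are invented purely for illustration, not
measurements:

```python
# Back-of-envelope: how often does the write hole actually bite?
# (All numbers are illustrative assumptions, not measured rates.)
crashes_per_year = 2           # unclean shutdowns on a flaky server
resync_hours = 6               # window during which the array is dirty
disk_failures_per_year = 0.05  # assume ~5% annual failure rate per disk
disks = 4

# Chance that some disk dies inside one post-crash resync window:
p_fail_in_window = disks * disk_failures_per_year * resync_hours / (365 * 24)

# Expected write-hole incidents per year:
p_hole_bites_per_year = crashes_per_year * p_fail_in_window
print(f"{p_hole_bites_per_year:.6f}")   # a few times 1e-4 per year
```

Even with deliberately pessimistic inputs the joint event lands in the
one-in-thousands-of-server-years range, which is the shape of the
argument above; a design that closes the hole for free is still worth
having, but the hole alone rarely justifies panic.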
* Re: blog entry on RAID limitation
@ 2006-01-22 11:52 Rik Herrin
  2006-01-22 23:08 ` Neil Brown
  2006-01-27 13:09 ` Molle Bestefich
  0 siblings, 2 replies; 10+ messages in thread
From: Rik Herrin @ 2006-01-22 11:52 UTC (permalink / raw)
  To: linux-raid

Wouldn't connecting a UPS + using a stable kernel version remove 90%
or so of the "RAID-5 write hole" problem?  Are there any other means
to know when this has occurred?

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-22 11:52 Rik Herrin
@ 2006-01-22 23:08 ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2006-01-22 23:08 UTC (permalink / raw)
  To: Rik Herrin; +Cc: linux-raid

On Sunday January 22, rikherrin@yahoo.com wrote:
> Wouldn't connecting a UPS + using a stable kernel
> version remove 90% or so of the "RAID-5 write hole"
> problem?  Are there any other means to know when this
> has occurred?

A UPS would help, as would a well-tested stable kernel.  However
sometimes power supplies die, so make sure you have two of them...

Yes, there are several ways to mitigate the risk.  However they cost
money, and the risk is actually already very small in the first place.

Coming up with a by-design approach that removes the risk completely
is still a good goal, providing the performance costs are reasonable
(i.e. almost zero).

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: blog entry on RAID limitation
  2006-01-22 11:52 Rik Herrin
  2006-01-22 23:08 ` Neil Brown
@ 2006-01-27 13:09 ` Molle Bestefich
  1 sibling, 0 replies; 10+ messages in thread
From: Molle Bestefich @ 2006-01-27 13:09 UTC (permalink / raw)
  To: linux-raid

Rik Herrin wrote:
> Wouldn't connecting a UPS + using a stable kernel
> version remove 90% or so of the "RAID-5 write hole"
> problem?

There are some RAID systems that you'd rather not have redundant power
on.  Think encryption.  As long as a system is online, it's normal for
it to have encryption keys in memory and its disk systems mounted
through the decryption system.  You wouldn't want someone to be able
to steal your server along with the UPS and stuff it in a van with a
power inverter :-).

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2006-01-27 13:09 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <8E5ACAE05E6B9E44A2903C693A5D4E8A097F6C86@hqemmail02.nvidia.com>
2006-01-17 8:45 ` blog entry on RAID limitation Jeff Breidenbach
2006-01-17 10:40 ` Neil Brown
2006-01-17 16:42 ` Jacob Madsen
2006-01-17 22:47 ` Neil Brown
2006-01-21 20:50 ` Carlos Carvalho
2006-01-24 3:20 ` Neil Brown
2006-01-21 20:43 ` Carlos Carvalho
2006-01-22 11:52 Rik Herrin
2006-01-22 23:08 ` Neil Brown
2006-01-27 13:09 ` Molle Bestefich