* AW: AW: RAID1 and data safety?
From: Schuett Thomas EXT @ 2005-03-29 8:54 UTC
To: linux-raid
I wrote:
>>
>> Still, if there is different data on the two disks due to a previous
>> power failure, the comparison could really be the better choice, couldn't it?
>>
>
Neil wrote:
>What does a comparison of two blocks tell you? That they are
>different, not which one is "right".
>
>A filesystem designed to handle these sort of problems wouldn't suffer
>from data inconsistencies due to power off. It would "know" where it
>was up to and would either re-write or ignore any data that it doesn't
>know to certainly be safe.
But:
If you have a raid1 and a journaling fs, see the following:
If the system crashes at the end of a write transaction,
then the end-of-transaction information may have been written
to hda already, but not to hdb. On the next boot, the
journaling fs may see an overall unclean bit (*probably* a transaction
is pending), so it reads the transaction log.
And here the fault happens:
By chance, it reads the transaction log from hda, then sees that the
transaction was finished, and clears the overall unclean bit.
This cleaning is a write, so it goes to *both* HDs.
Situation now: on hdb there is a pending transaction in the transaction
log, but the overall unclean bit is cleared. This may not be noticed
until, by chance, a year later hda crashes, and you finally face the
fact that there is a corrupt situation on the remaining HD (hdb).
Solution approach: if it had read the transaction log from both HDs,
it would have returned a read fault (because the two copies differ).
A good journaling fs probably stores the end-of-transaction info in a
different block than the start-of-transaction info. Then it can say: if
I cannot properly read the end-of-transaction info, I consider the
transaction as not finished, so I do a rollback.
(Of course this requires readable start-of-transaction info, so it
should be stored separately from the end-of-transaction info.)
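
In rough pseudo-Python, the recovery rule I have in mind would look like
this (only a sketch with made-up names, not a proposal for any real fs):

    def recovery_decision(read_block):
        # read_block(name) returns the stored bytes, or raises IOError if the
        # block cannot be read back consistently (e.g. the mirrors disagree).
        try:
            start = read_block("txn-start")
        except IOError:
            return "nothing pending"       # no evidence a transaction was begun
        try:
            end = read_block("txn-end")    # kept in a *separate* block
        except IOError:
            return "rollback"              # end info unreadable -> not finished
        if end == b"COMMIT:" + start:
            return "replay"                # provably complete, safe to apply
        return "rollback"                  # anything ambiguous -> undo

    # A consistent pair of records is accepted, anything else rolled back:
    records = {"txn-start": b"T17", "txn-end": b"COMMIT:T17"}
    print(recovery_decision(lambda name: records[name]))    # -> replay
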
Does this sound reasonable?
Thomas
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-03-29 9:27 UTC
To: linux-raid

Schuett Thomas EXT <Thomas.Schuett.extern@mchh.siemens.de> wrote:
> And here the fault happens:
> By chance, it reads the transaction log from hda, then sees that the
> transaction was finished, and clears the overall unclean bit.
> This cleaning is a write, so it goes to *both* HDs.

Don't put the journal on the raid device, then - I'm not ever sure why
people do that! (They probably have a reason that is good - to them.)
Or put it on another raid partition/device on the same media, but one
set to error unless replication is perfect (there does seem to be a use
for that policy!).

Peter
* Re: RAID1 and data safety?
From: Neil Brown @ 2005-03-29 10:09 UTC
To: Peter T. Breuer; +Cc: linux-raid

On Tuesday March 29, ptb@lab.it.uc3m.es wrote:
>
> Don't put the journal on the raid device, then - I'm not ever sure why
> people do that! (They probably have a reason that is good - to them.)
>

Not good advice.  DO put the journal on a raid device.  It is much
safer there.

NeilBrown
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-03-29 11:26 UTC
To: linux-raid

Neil Brown <neilb@cse.unsw.edu.au> wrote:
> On Tuesday March 29, ptb@lab.it.uc3m.es wrote:
> >
> > Don't put the journal on the raid device, then - I'm not ever sure why
> > people do that! (They probably have a reason that is good - to them.)
>
> Not good advice.  DO put the journal on a raid device.  It is much
> safer there.

Two journals means two possible sources of unequal information - plus
the two datasets.  We have been through this before.  You get the
journal you deserve.

Peter
* Re: RAID1 and data safety?
From: Lars Marowsky-Bree @ 2005-03-29 12:13 UTC
To: Peter T. Breuer, linux-raid

On 2005-03-29T13:26:32, "Peter T. Breuer" <ptb@lab.it.uc3m.es> wrote:
> > Not good advice.  DO put the journal on a raid device.  It is much
> > safer there.
>
> Two journals means two possible sources of unequal information - plus
> the two datasets.  We have been through this before.  You get the
> journal you deserve.

The RAID never exposes this potential inconsistency to the higher
levels, though.

Indeed, you get what you deserve.

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: RAID1 and data safety?
From: Doug Ledford @ 2005-04-04 22:57 UTC
To: Peter T. Breuer; +Cc: linux-raid

On Tue, 2005-03-29 at 13:26 +0200, Peter T. Breuer wrote:
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > On Tuesday March 29, ptb@lab.it.uc3m.es wrote:
> > >
> > > Don't put the journal on the raid device, then - I'm not ever sure why
> > > people do that! (They probably have a reason that is good - to them.)
> >
> > Not good advice.  DO put the journal on a raid device.  It is much
> > safer there.
>
> Two journals means two possible sources of unequal information - plus
> the two datasets.  We have been through this before.  You get the
> journal you deserve.

No, you don't. You've been through this before and it wasn't any more
correct then than it is now. Most of this seems to center on the fact
that you aren't aware of a few constraints that the linux md subsystem
and the various linux journaling filesystems were written under, and how
each of them meets those constraints at an implementation level, so
allow me to elucidate that for you.

1) All linux filesystems are designed to work on actual, physical hard
drives.

2) The md subsystem is designed to provide fault tolerance for hard
drive failures via redundant storage of information (except raid0 and
linear, which are ignored throughout the rest of this email).

3) The md subsystem is designed to operate seamlessly underneath any
linux filesystem. This implies that it must *act* like an actual,
physical hard drive in order not to violate assumptions made at the
filesystem level.

So here's how those constraints are satisfied in linux.

For constraint #1, specifically as it relates to journaling filesystems:
all the journaling filesystems currently in use started their lives at a
time when the linux block layer did not provide any means of write
barriers. As a result, they used completion events as write barriers.
That is to say, if you needed a write barrier between the
end-of-journal-transaction write and the start of the actual data writes
to the drive, you simply waited for the drive to say that the
end-of-journal-transaction data had actually been written before issuing
any of the writes to the actual filesystem. You then waited for all the
filesystem writes to complete before allowing that journal transaction
to be overwritten.

Additionally, people have mentioned the concept of rollbacks in relation
to journaling filesystems. At least ext3, and likely all journaling
filesystems on linux, don't do rollbacks. They do replays. In order to
do a rollback, you would have to first read the data you are going to
update, save it somewhere, then start the update, and if you crash
somewhere in the update you then read the saved data and put it back in
place of the partially completed update. Obviously this has a
performance impact, because it means that any update requires a
corresponding read/write cycle to save the old data. What they actually
do is transactional updates: they write the update to the journal, wait
for all of the journal writes relevant to a specific transaction group
to complete, then start the writes to the actual filesystem. If you
crash during the update to the filesystem, you replay any and all whole
journal transactions in the ext3 journal, which simply re-issues the
writes so that any that didn't complete get completed.
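
To make that ordering concrete, here is a toy model in Python (made-up
names and in-memory "devices" only - this is the shape of the scheme,
not the actual ext3 code):

    class Dev:
        # Toy block device: a dict of block -> data.  flush() stands in for
        # "wait for the completion event", i.e. the write is on media.
        def __init__(self):
            self.blocks = {}
        def write(self, blk, data):
            self.blocks[blk] = data
        def flush(self):
            pass

    def commit(journal, fs, txn_id, updates):
        for blk, data in updates:
            journal.write((txn_id, blk), data)    # journal body first
        journal.flush()                           # barrier: body is on media
        journal.write((txn_id, "commit"), b"1")   # then the commit record
        journal.flush()                           # barrier: commit is on media
        for blk, data in updates:                 # only now touch the fs proper
            fs.write(blk, data)
        fs.flush()                                # after this the txn can be recycled

    def replay(journal, fs):
        # Re-issue every update belonging to a transaction whose commit record
        # made it to the journal; half-written transactions are simply ignored.
        committed = {t for (t, blk) in journal.blocks if blk == "commit"}
        for (t, blk), data in journal.blocks.items():
            if blk != "commit" and t in committed:
                fs.write(blk, data)               # idempotent, replaying twice is fine
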
You never start the writes until you know they are already committed to
the journal, and you never remove them from the journal until you know
they are all committed to the filesystem proper. That way you are 100%
guaranteed to be able to complete whatever group of filesystem-proper
writes were in process at the time of a crash, returning you to a
consistent state. The main assumption the filesystem relies upon to make
this true is that an issued write request is not returned until it is
complete and on media (or in the drive buffer and the drive is claiming
that even in the event of a power failure it will still make media).

OK, that's the filesystem issues.

For constraint #2, md satisfies this by storing data in a way that any
single drive failure can be compensated for transparently (or more than
one if using a more-than-2-disk raid1 array or raid6). The primary thing
here is that in a recoverable failure scenario, the layers above md must
A) not know the error occurred, B) get the right data when reading, and
C) be able to continue writing to the device, and those writes must be
preserved across reboots and other recovery operations that might take
place to bring the array out of degraded mode. This is where the event
counters come into play. That's what md uses to be able to tell which
drives in an array are up to date versus those that aren't, which is
what's needed to satisfy C.

Now, what Peter has been saying can happen on a raid1 array (but which
can't) is creeping data corruption that's only noticed later because a
write to the md array gets completed on one device but not the other,
and it isn't until you read it later that this shows up. Under normal
failure scenarios (aka not the rather unlikely one posted to this list
recently that involves random drives disappearing and then reappearing
at just the right time), this isn't an issue.

From a very high level perspective, there are only two types of raid1
restarts: ones where the event counters of constituent devices match,
and ones where they don't. If you don't have a raid restart then we
aren't really interested in what's going on, because without a crash
things are going to eventually get written identically on all devices -
it's just a matter of time - and while waiting on that to happen the
page cache returns the right data.

So, if the event counters don't match, then the freshest device is taken
and the non-fresh devices are kicked from the array, so you only get the
most recent data. This is what you get if a raid device is failed out of
the array prior to a system crash or raid restart.

If the event counters match and the state matches on all devices, then
we simply pick a device as the "in sync" device and resync to all the
other mirrors. This mimics the behavior of a physical hard drive in the
sense that if you had multiple writes in flight, it isn't guaranteed
that they all completed, just that whatever we return is all consistent
and that it isn't possible to read the same block twice and get two
different values. This is what happens when a system crashes with all
devices active.

For less common situations, such as a 3-disk raid1 array, you can
actually get a combination of the two above behaviors if you have
something like 1 disk fail, then a crash.
We'll pick one of the two up-to-date disks as the master (we always
simply take the first active-sync disk in the rdev array, so which disk
it actually is depends on the disk's position in the rdev array), sync
it over to the other active-sync disk, and kick the failed disk entirely
out of the array. It is entirely possible that prior to the crash a
write had completed on one of the disks, but not on the one selected as
the master for the resync operation. That write will then be lost upon
restart. Again, this is consistent with a physical hard drive, since the
ordering of multiple outstanding writes is indeterminate.

Raid4/5/6 is similar in that any disk that doesn't have an up-to-date
superblock is kicked from the array. However, we can't just pick a disk
to sync to all the others, yet we still need to provide a guarantee that
two reads won't get different data (where "two different reads" means a
read from a data block versus a read from a reconstructed-from-parity
data block), so we resync all the parity blocks from all the data blocks
so that, should a degraded state happen in the future, stale parity
won't cause the data returned to change. The one failing here relative
to mimicking a real hard drive is that it's possible, with a raid5 array
with say a 64k chunk size and a 256k write, that you could end up with
the first 64k chunk making it to disk, then not the second, then the
third making it, etc. Generally speaking this doesn't happen on real
disks, because sequential writes are done sequentially to media.

So, now we get to the interactions between the md raid devices and the
filesystem level of the OS. This is where ext3 or other journaling
filesystems actually solve the problem I just noted with lost writes on
raid1 arrays and with partial writes leaving stale data in the middle of
new data on raid4/5/6 arrays. If you use a journaling filesystem, then
any in-flight writes to the filesystem proper at the time of a crash are
replayed in their entirety to ensure that they all make it to disk. Any
in-flight writes to the journal will be thrown away and never go to the
filesystem proper. This means that even if two disks are inconsistent
with each other, the resync operation makes them consistent and the
journal replay guarantees they are up to date with the last completed
journal group entry. This is true even when the journal is on the same
raid device as the filesystem, because the journal is written with a
write completion barrier and the md subsystem doesn't complete the
barrier write until it has hit the media on all constituent devices.
That ensures that, regardless of which device is picked as a master for
resync purposes after an unclean shutdown, it is impossible for any of
the filesystem-proper writes to have started without a complete journal
transaction to replay in the event of failure.

Now, if I recall correctly, Peter posted a patch that changed this
semantic in the raid1 code. The raid1 code does not complete a write to
the upper layers of the kernel until it's been completed on all devices,
and his patch made it such that as soon as it hit 1 device it returned
the write to the upper layers of the kernel. Doing this would introduce
the very problem Peter has been thinking the default raid1 stack had all
along. Let's say you are using ext3 and are writing an
end-of-journal-transaction marker. That marker is sent to drives A and
B. Assume drive A is busy completing some reads at the moment, and drive
B isn't and completes the end-of-journal write quickly.
The patch Peter posted (or at least talked about, can't remember which)
would then return a completion event to the ext3 journal code. The ext3
code would then assume the journal was all complete and start issuing
the writes related to that journal transaction en masse. These writes
will then go to drives A and B. Since drive A was busy with some reads,
it gets these writes prior to completing the end-of-transaction write it
already had in its queue. Being a nice, smart SCSI disk with tagged
queuing enabled, it then proceeds to complete the whole queue of writes
in whatever order is most efficient for it. It completes two of the
writes that were issued by the ext3 filesystem after the ext3 filesystem
thought the journal entry was complete, and then the machine has a power
supply failure and nothing else gets written. As it turns out, drive A
is the first drive in the rdev array, so on reboot it's selected as the
master for resync. Now, that means that all the data, journal and
everything else, is going to be copied from drive A to drive B. And
guess what. We never completed that end-of-journal write on drive A, so
when the ext3 filesystem is mounted, that journal transaction is going
to be considered incomplete and *not* get replayed. But we've also
written a couple of the updates from that transaction to disk A already.
Well, there you go, data corruption. So, Peter, if you are still toying
with that patch, it's a *BAAAAAD* idea.

That's what using a journaling filesystem on top of an md device gets
you in terms of what problems the journaling solves for the md device.
In turn, a weakness of any journaling filesystem is that it is
inherently vulnerable to hard disk failures. A drive failure takes out
the filesystem and your machine becomes unusable. Obviously, this very
problem is what md solves for filesystems. Whether talking about the
journal or the rest of the filesystem, if you let a hard drive error
percolate up to the filesystem, then you've failed in the goal of
software raid.

I remember talk once about how putting the journal on the raid device
was bad because it would cause the media in that area of the drive to
wear out faster. The proper response to that is: "So. I don't care. If
that section of media wears out faster, fine by me, because I'm smart
and put both my journal and my filesystem on a software raid device that
allows me to replace the worn-out device with a fresh one without ever
losing any data or suffering a crash." The goal of the md layer is not
to prevent drive wear-out; the goal is to make us tolerant of drive
failures so we don't care when they happen - we simply replace the bad
drive and go on. Since drive failures happen on a fairly regular basis
without md, if the price of not suffering problems as a result of those
failures is that we slightly increase the failure rate due to excessive
writing in the journal area, then fine by me.

In addition, if you use raid5 arrays like I do, then putting the journal
on the raid array is a huge win because of the outrageously high
sequential throughput of a raid5 array. Journals are preallocated at
filesystem creation time and occupy a more or less sequential area on
the disks. Journals are also more or less a ring buffer.
You can tune the journal size to a reasonable multiple of a full stripe
size on the raid5 array (say something like 1 to 10 MB per disk, so in a
5-disk raid5 array I'd use between a 4 and 40 MB journal, depending on
whether I thought I would be doing a lot of large writes of sufficient
size to utilize a large journal), turn on journaling of not just
metadata but all data, and then benefit from the fact that the journal
writes take place as more or less sequential writes, as seen by things
like tiobench benchmark runs. Because the typical filesystem writes are
usually much more random in nature, the journaling overhead can be
reduced to no more than, say, a 25% performance loss while getting the
benefit of both metadata and regular data being journaled. It's
certainly *far* faster than sticking the journal on some other device,
unless that's another very fast raid array.

Anyway, I think the situation can be summed up as this: See Peter try to
admin lots of machines. See Peter imagine problems that don't exist. See
Peter disable features that would make his life easier as Peter takes
steps to circumvent his imaginary problems. See Peter stay at work over
the New Years holiday fixing problems that were likely a result of his
own efforts to avoid problems. Don't be a Peter, listen to Neil.

--
Doug Ledford <dledford@xsintricity.com>
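
A back-of-the-envelope version of the sizing rule of thumb above, with
purely illustrative numbers (a sketch, not a recommendation):

    def raid5_journal_size_mb(n_disks, chunk_kb, per_disk_mb):
        data_disks = n_disks - 1             # one disk's worth of space goes to parity
        stripe_kb = data_disks * chunk_kb    # one full stripe of data
        raw_kb = data_disks * per_disk_mb * 1024
        return (raw_kb // stripe_kb) * stripe_kb / 1024   # round down to whole stripes

    # 5-disk raid5 with 64k chunks: 1 MB/disk -> 4 MB, 10 MB/disk -> 40 MB
    print(raid5_journal_size_mb(5, 64, 1), raid5_journal_size_mb(5, 64, 10))
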
* Re: AW: RAID1 and data safety?
From: Molle Bestefich @ 2005-03-29 9:30 UTC
To: linux-raid

> Does this sound reasonable?

Does to me. Great example! Thanks for painting the pretty picture :-).

Seeing as you're clearly the superior thinker, I'll address your brain
instead of wasting wattage on my own.

Let's say that MD had the feature to read from both disks in a mirror
and perform a comparison on read. Let's say that I had that feature
turned on for 2 mirror arrays (4 disks). I want to get a bit of
performance back though, so I stripe the two mirrored arrays.

Do you see any problem in this scenario? Are we back to "corruption
could happen then", or are we still OK?
* Re: AW: AW: RAID1 and data safety?
From: Neil Brown @ 2005-03-29 10:08 UTC
To: Schuett Thomas EXT; +Cc: linux-raid

On Tuesday March 29, Thomas.Schuett.extern@mchh.siemens.de wrote:
> But:
> If you have a raid1 and a journaling fs, see the following:
> If the system crashes at the end of a write transaction,
> then the end-of-transaction information may have been written
> to hda already, but not to hdb. On the next boot, the
> journaling fs may see an overall unclean bit (*probably* a transaction
> is pending), so it reads the transaction log.
>
> And here the fault happens:
> By chance, it reads the transaction log from hda, then sees that the
> transaction was finished, and clears the overall unclean bit.
> This cleaning is a write, so it goes to *both* HDs.
>
> Situation now: on hdb there is a pending transaction in the transaction
> log, but the overall unclean bit is cleared. This may not be noticed
> until, by chance, a year later hda crashes, and you finally face the
> fact that there is a corrupt situation on the remaining HD (hdb).

Wrong. There is nothing of the sort on hdb.

Due to the system crash the data on hdb is completely ignored. Data
from hda is copied over onto it. Until that copy has completed, nothing
is read from hdb.

You could possibly come up with a scenario where the above happens but,
while the copy from hda->hdb is happening, hda dies completely, so reads
have to start happening from hdb. md could possibly handle this
situation better (ensure a copy has happened for any block before a
read of that block succeeds), but I don't think it is at all likely to
be a real-life problem.

NeilBrown
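
A minimal sketch of that resync behaviour (hypothetical code, not md's
implementation): nothing is served from the stale mirror until the copy
has passed that block, so its differing data can never be observed.

    def resync_read(block, source, target, synced_up_to):
        # Until the resync pointer has passed `block`, only the chosen source
        # mirror is trusted; after that the two copies are identical anyway.
        return target[block] if block < synced_up_to else source[block]

    def resync_tick(source, target, synced_up_to):
        target[synced_up_to] = source[synced_up_to]   # copy one more block across
        return synced_up_to + 1

    # hda chosen as the source after the crash; hdb's stale block is never read
    hda, hdb = [b"new", b"new"], [b"new", b"stale"]
    print(resync_read(1, hda, hdb, 0))    # -> b'new' (from hda, not hdb)
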
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-03-29 11:29 UTC
To: linux-raid

Neil Brown <neilb@cse.unsw.edu.au> wrote:
> Due to the system crash the data on hdb is completely ignored. Data

Neil - can you explain the algorithm that stamps the superblocks with
an event count, once and for all? (Until further amendment :-).

It goes without saying that sb's are not stamped at every write, and the
event count is not incremented at every write, so when, and when?

Thanks

Peter
* Re: RAID1 and data safety?
From: Luca Berra @ 2005-03-29 16:46 UTC
To: linux-raid

On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > Due to the system crash the data on hdb is completely ignored. Data
>
> Neil - can you explain the algorithm that stamps the superblocks with
> an event count, once and for all? (Until further amendment :-).

IIRC it is updated at every event (start, stop, add, remove, fail,
etc.).

> It goes without saying that sb's are not stamped at every write, and
> the event count is not incremented at every write, so when, and when?

The event count is not incremented at every write, but the dirty flag
is, and it is cleared lazily after some idle time. In older code it was
set at array start and cleared only at stop.

So in case of a disk failure the other disks get updated about the
failure. In case of a restart (crash) the array will be dirty and a
coin tossed to choose which mirror to use as an authoritative source
(the coin is biased, but it doesn't matter). At this point any possible
parallel reality is squashed out of existence.

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
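
In toy form, the rule described above might look like this (hypothetical
names, not the md code):

    class Superblock:
        def __init__(self):
            self.events = 0           # bumped on array-level events
            self.dirty = False        # set on write, cleared lazily when idle
        def on_event(self):           # start, stop, add, remove, fail, ...
            self.events += 1
        def on_write(self):
            self.dirty = True
        def on_idle(self):
            self.dirty = False

    def pick_authoritative(members):
        # After an unclean restart every member is dirty with equal event
        # counts, so the first of the freshest members wins the "biased coin".
        newest = max(sb.events for sb in members)
        return next(sb for sb in members if sb.events == newest)
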
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-03-29 18:43 UTC
To: linux-raid

Luca Berra <bluca@comedia.it> wrote:
> On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
> > Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > > Due to the system crash the data on hdb is completely ignored. Data
> >
> > Neil - can you explain the algorithm that stamps the superblocks with
> > an event count, once and for all? (Until further amendment :-).
>
> IIRC it is updated at every event (start, stop, add, remove, fail,
> etc.).

Hmm .. I see it updated sometimes twice and sometimes once between a
setfaulty and a hotadd (no writes in between). There may be a race.

It's a bit of a problem because when I start a bitmap (which is when a
disk is faulted from the array), I copy the event count at that time to
the bitmap. When the disk is re-inserted, I look at the event count on
its sb, and see that it may sometimes be one, sometimes two behind the
count on the bitmap. And then sometimes the array event count jumps by
ten or so. Here's an example:

  md0: repairing old mirror component 300015 (disk 306 >= bitmap 294)

I had done exactly one write on the degraded array. And maybe a
setfaulty and a hotadd. The test cycle before that (exactly the same) I
got:

  md0: repairing old mirror component 300015 (disk 298 >= bitmap 294)

and at the very first separation (first test cycle) I saw

  md0: warning - new disk 300015 nearly too old for repair (disk 292 < bitmap 294)

(Yeah, these are my printk's - so what.) So it's all consistent with the
idea that the event count is incremented more frequently than you say.

Anyway, what you are saying is that if a crash occurs on the node with
the array, then the event counts on BOTH mirrors will be the same. Thus
there is no way of knowing which is the more up to date.

> > It goes without saying that sb's are not stamped at every write, and
> > the event count is not incremented at every write, so when, and when?
>
> The event count is not incremented at every write, but the dirty flag
> is, and it is cleared lazily after some idle time. In older code it was
> set at array start and cleared only at stop.

Hmmm. You mean this

  int sb_dirty;

in the mddev? I don't think that's written out .. well, it may be, if
the whole sb is written, but that's very big. What exactly are you
referring to with "the dirty flag" above?

> So in case of a disk failure the other disks get updated about the
> failure.

Well, yes, but in the case of an array node crash ...

> In case of a restart (crash) the array will be dirty and a coin tossed
> to choose which mirror to use as an authoritative source (the coin is
> biased, but it doesn't matter). At this point any possible parallel
> reality is squashed out of existence.

It is my opinion that one ought always to roll back anything in the
journal (any journal) on a restart, on the grounds that you can't know
for sure whether it went to the other mirror.

Would you like me to make a patch to make sure that writes go to all
mirrors or else error back to the user? The only question in my mind is
how to turn such a policy on or off per array. Any suggestion? I'm not
familiar with most of mdadm's newer capabilities. I'd use the sysctl
interface, but it's not set up to be "per array". It should be.

Peter
* Re: RAID1 and data safety?
From: Mario Holbe @ 2005-03-29 20:07 UTC
To: linux-raid

Peter T. Breuer <ptb@lab.it.uc3m.es> wrote:
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > Due to the system crash the data on hdb is completely ignored. Data
> Neil - can you explain the algorithm that stamps the superblocks with
> an event count, once and for all? (Until further amendment :-).

This has nothing to do with the event count but with the state bits.
The state is set to active/dirty on start and set to clean on stop/ro.
It's just like a filesystem's state, but on the block device layer.

regards
Mario

--
I thought the only thing the internet was good for was porn. -- Futurama
* Re: RAID1 and data safety?
From: Doug Ledford @ 2005-04-04 20:06 UTC
To: Peter T. Breuer; +Cc: linux-raid

On Tue, 2005-03-29 at 13:29 +0200, Peter T. Breuer wrote:
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> > Due to the system crash the data on hdb is completely ignored. Data
>
> Neil - can you explain the algorithm that stamps the superblocks with
> an event count, once and for all? (Until further amendment :-).

I think the best explanation is this: any change in array state that
would necessitate kicking a drive out of the array if it didn't also
make this change in state with the rest of the drives in the array
results in an increment to the event counter and a flush of the
superblocks. For example, a transition of a raid array from ro to rw
mode results in an event update and a flush of superblocks. If the
system then goes down and one drive has the superblock update and
another doesn't, then you know that one was behind the other and to
kick it.

> It goes without saying that sb's are not stamped at every write, and
> the event count is not incremented at every write, so when, and when?

Transition from ro -> rw or from rw -> ro, transition from clean to
dirty or dirty to clean, any change in the distribution of disks in the
superblock (aka a change in the number of working disks, active disks,
spare disks, failed disks, etc.), or any ordering updates of disk
devices in the rdisk array (for example, when a spare is done being
rebuilt to replace a failed device, it gets moved from its current
position in the array to the position it was just rebuilt to replace as
part of the final transition from being rebuilt to being an active,
live component in the array).

--
Doug Ledford <dledford@redhat.com> http://people.redhat.com/dledford
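
Stated as code, the rule is roughly this (a sketch with invented names,
not the actual md superblock code):

    EVENT_TRIGGERS = {
        "ro->rw", "rw->ro",               # array mode transitions
        "clean->dirty", "dirty->clean",
        "disk-added", "disk-removed", "disk-failed",
        "spare-promoted",                 # rebuilt spare takes the failed slot
    }

    def on_state_change(array, change):
        # Any change a lagging member must not miss bumps the counter and
        # rewrites every member's superblock; ordinary data writes do not.
        if change in EVENT_TRIGGERS:
            array["events"] += 1
            for sb in array["members"]:
                sb["events"] = array["events"]    # flushed to each device

    # usage: array = {"events": 0, "members": [{}, {}]}; on_state_change(array, "ro->rw")
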
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-04-08 12:16 UTC
To: linux-raid

I forgot to say "thanks"! Thanks for the breakdown.

Doug Ledford <dledford@redhat.com> wrote (of event count increments):
> I think the best explanation is this: any change in array state that

OK ..

> would necessitate kicking a drive out of the array if it didn't also
> make this change in state with the rest of the drives in the array

Hmmm.

> results in an increment to the event counter and a flush of the
> superblocks.

> Transition from ro -> rw or from rw -> ro, transition from clean to
> dirty or dirty to clean, any change in the distribution of disks in the
> superblock (aka a change in the number of working disks, active disks,
> spare disks, failed disks, etc.), or any ordering updates of disk
> devices in the rdisk array (for example, when a spare is done being
> rebuilt to replace a failed device, it gets moved from its current
> position in the array to the position it was just rebuilt to replace as
> part of the final transition from being rebuilt to being an active,
> live component in the array).

I still see about 8-10 changes in the event count between faulting a
disk out and bringing it back into the array for hot-repair, even if
nothing is written in the meantime. I suppose I could investigate!

Of concern to me (only) is that I observe that a faulted disk seems to
have an event count that is 1-2 counts behind the one stamped on the
bitmap left behind on the array as it starts up in response to the
fault. The number behind varies. Something races.

Peter
* AW: RAID1 and data safety?
From: Schuett Thomas EXT @ 2005-04-07 15:35 UTC
To: 'Doug Ledford'; +Cc: linux-raid

[Please excuse, my mailtool breaks threads ...]
Reply to mail from 2005-04-05

Hello Doug,

many thanks for this highly detailed and structured posting. A few
questions are left:

Is it common today that an (EIDE) HD does not state a write as finished
(aka send completion events, if I got this right) before it was written
to *media*?

I am happy to hear about these "write barriers", even if I am astonished
that they don't bring down the whole system performance (at least for
raid1).

> This is where the event counters come into play. That's what md uses
> to be able to tell which drives in an array are up to date versus
> those that aren't, which is what's needed to satisfy C.

So event counters are the second type of information that gets written
with write barriers. One is the journal data from the (j)fs (and
actually the real data too, to make it gain sense - otherwise the
end-of-transaction write is like a semaphore with only one of the two
parties using it), and the other is the event counter.

> Now, if I recall correctly, Peter posted a patch that changed this
> semantic in the raid1 code. The raid1 code does not complete a write
> to the upper layers of the kernel until it's been completed on all
> devices, and his patch made it such that as soon as it hit 1 device it
> returned the write to the upper layers of the kernel.

I am glad to hear that the behaviour is such that the barrier waits
until *all* media got written. That was one of the things that really
made me worry. I hope the patch was backed out and didn't go into any
distros.

> had in its queue. Being a nice, smart SCSI disk with tagged queuing
> enabled, it then proceeds to complete the whole queue of writes in
> whatever order is most efficient for it.

But just to make sure: your previous statement "...when the linux block
layer did not provide any means of write barriers. As a result, they
used completion events as write barriers." indicates that even a "nice,
smart SCSI disk with tagged queuing enabled" will act as demanded,
because the special way of writing with appended "completion event
testing" makes sure they do?

---

You mentioned data journaling, and it sounded like it works reliably.
Which of the existing journaling fs did you have in mind?

---

Afaik a read only reads from *one* HD (in raid1). So how can I be sure
that *both* HDs are still perfectly o.k.? Am I fine to do a

  cat /dev/hda2 > /dev/null ; cat /dev/hdb2 > /dev/null

even *during* the time the md is active and in r/w use?

best regards,
Thomas
* Re: AW: RAID1 and data safety?
From: Doug Ledford @ 2005-04-07 16:05 UTC
To: Schuett Thomas EXT; +Cc: linux-raid

On Thu, 2005-04-07 at 17:35 +0200, Schuett Thomas EXT wrote:
> [Please excuse, my mailtool breaks threads ...]
> Reply to mail from 2005-04-05
>
> Hello Doug,
>
> many thanks for this highly detailed and structured posting.

You're welcome.

> A few questions are left: Is it common today that an (EIDE) HD does
> not state a write as finished (aka send completion events, if I got
> this right) before it was written to *media*?

Depends on the state of the Write Cache bit in the drive's configuration
page. If this bit is enabled, the drive is allowed to cache writes in
the on-board RAM and complete the command. Should the drive have a power
failure event before the data is written to the drive, then it might get
lost. If the bit is not set, then the drive is supposed to actually have
the data on media before returning (or at the very absolute least, it
should be in a small enough queue of pending writes that, should the
power get lost, it can still write the last bits out during spin down).

> I am happy to hear about these "write barriers", even if I am
> astonished that they don't bring down the whole system performance (at
> least for raid1).

It doesn't bring down system performance because the journaling
filesystem isn't single-task, so to speak. What this means is that when
you have a large number of writes queued up to be flushed, the
journaling fs can create a journal transaction for just some of the
writes, then issue an end-of-journal-transaction, wait for that to
complete, and then proceed to release all those writes to the filesystem
proper. At the same time that the filesystem-proper writes are getting
under way, it can issue another stream of writes to start the next
journal transaction. As soon as all the journal writes are complete, it
can issue an end-of-journal-transaction, wait for it to complete, then
issue all those writes to the filesystem proper.

So you see, it's not that writes to the filesystem and the journal are
exclusive of each other so that one waits entirely on the other; it's
that writes from a single journal transaction are exclusive of the
writes to the filesystem for *that particular transaction*. By keeping
journal transactions going continuously, the journaling filesystem is
able to stream data to the filesystem proper without much degradation;
it's just that the filesystem-proper writes are delayed somewhat from
the corresponding journal transaction writes. Make sense?

> > This is where the event counters come into play. That's what md uses
> > to be able to tell which drives in an array are up to date versus
> > those that aren't, which is what's needed to satisfy C.
>
> So event counters are the second type of information that gets written
> with write barriers. One is the journal data from the (j)fs (and
> actually the real data too, to make it gain sense - otherwise the
> end-of-transaction write is like a semaphore with only one of the two
> parties using it), and the other is the event counter.

Not really. The event counter is *much* coarser grained than journal
entries. A raid array may be in use for years and never have the event
counter get above 20 or so if it stays up most of the time and doesn't
suffer disk add/remove events.
It's really only intended to mark events like drive failures, so that if
you have a drive fail on shutdown, then on reboot we know that it failed
because we did an immediate superblock event counter update on all
drives except the failed one when the failure happened.

> > Now, if I recall correctly, Peter posted a patch that changed this
> > semantic in the raid1 code. The raid1 code does not complete a write
> > to the upper layers of the kernel until it's been completed on all
> > devices, and his patch made it such that as soon as it hit 1 device
> > it returned the write to the upper layers of the kernel.
>
> I am glad to hear that the behaviour is such that the barrier waits
> until *all* media got written. That was one of the things that really
> made me worry. I hope the patch was backed out and didn't go into any
> distros.

No, it never went anywhere. It was just a "Hey guys, I played with this
optimization, here's the patch" type posting, and no one picked it up
for inclusion in any upstream or distro kernels.

> > had in its queue. Being a nice, smart SCSI disk with tagged queuing
> > enabled, it then proceeds to complete the whole queue of writes in
> > whatever order is most efficient for it.
>
> But just to make sure: your previous statement "...when the linux block
> layer did not provide any means of write barriers. As a result, they
> used completion events as write barriers." indicates that even a
> "nice, smart SCSI disk with tagged queuing enabled" will act as
> demanded, because the special way of writing with appended "completion
> event testing" makes sure they do?

Yes. We know that drives are allowed to reorder writes, so any time we
want a barrier for a given write (say you want all journal transactions
complete before writing the end-of-journal entry), then you basically
wait for all your journal transactions to complete before sending the
end-of-journal transaction. You don't have to wait for *all* writes to
the drive to complete, just the journal writes. This is why performance
isn't killed by journaling. The filesystem-proper writes for previous
journal transactions can be taking place while you are doing this
waiting.

> ---
>
> You mentioned data journaling, and it sounded like it works reliably.
> Which of the existing journaling fs did you have in mind?

I use ext3 personally. But that's as much because it's the default
filesystem and I know Stephen Tweedie will fix it if it's broken ;-)

> ---
>
> Afaik a read only reads from *one* HD (in raid1). So how can I be sure
> that *both* HDs are still perfectly o.k.? Am I fine to do a
>
>   cat /dev/hda2 > /dev/null ; cat /dev/hdb2 > /dev/null
>
> even *during* the time the md is active and in r/w use?

It's OK to do this. However, reads happen from both hard drives in a
raid1 array in a sort of round-robin fashion. You don't really know
which reads are going to go where, but each drive will get read from.
Doing what you suggest will get you a full read check on each drive, and
do so safely. Of course, if it's supported on your system, you could
also just enable the SMART daemon and have it tell the drives to do
continuous background media checks to detect sectors that are either
already bad or getting ready to go bad (corrected error conditions).

--
Doug Ledford <dledford@xsintricity.com>
http://www.xsintricity.com/dledford
http://www.livejournal.com/users/deerslayer
AIM: DeerObliterator
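
For what it's worth, a userspace version of that whole-device read check
as a sketch (the device paths are the ones from the question; needs root
and only reads):

    def read_check(path, chunk=1 << 20):
        # Read the device end to end so every sector has to come off the media;
        # unreadable regions are reported rather than silently skipped.
        errors = 0
        with open(path, "rb", buffering=0) as dev:
            while True:
                try:
                    block = dev.read(chunk)
                except OSError as exc:
                    errors += 1
                    print(f"{path}: read error near offset {dev.tell()}: {exc}")
                    dev.seek(chunk, 1)        # skip ahead and keep going
                    continue
                if not block:
                    break
        return errors

    for path in ("/dev/hda2", "/dev/hdb2"):
        print(path, "unreadable chunks:", read_check(path))
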
* Re: RAID1 and data safety?
From: Peter T. Breuer @ 2005-04-10 17:54 UTC
To: linux-raid

Doug Ledford <dledford@xsintricity.com> wrote:
> > > Now, if I recall correctly, Peter posted a patch that changed this
> > > semantic in the raid1 code. The raid1 code does not complete a
> > > write to the upper layers of the kernel until it's been completed
> > > on all devices, and his patch made it such that as soon as it hit 1
> > > device it returned the write to the upper layers of the kernel.
> >
> > I am glad to hear that the behaviour is such that the barrier waits
> > until *all* media got written. That was one of the things that really
> > made me worry. I hope the patch was backed out and didn't go into any
> > distros.
>
> No, it never went anywhere. It was just a "Hey guys, I played with this
> optimization, here's the patch" type posting, and no one picked it up
> for inclusion in any upstream or distro kernels.

I'll just remark that the patch depended on a bitmap, so it _couldn't_
have been picked up (until now?). And anyway, async writes (that's the
name) were switched on by a module/kernel parameter, and were off by
default. I suppose maybe Paul's 2.6 patches also offer the possibility
of async writes (I haven't checked).

It isn't very dangerous - the bitmap marks the write as not done until
all the components have been written, even though the write is acked
back to the kernel after the first of the components has been written.
There are extra openings for data loss if you choose that mode, but
they're relatively improbable. You're likely to lose data under several
circumstances during normal raid1 operation (see for example the "split
brain" discussion!). Choosing to decrease write latency by half against
some minor extra opportunity for data loss is an admin decision that
should be available to you, I think.

Umm ... what's the extra vulnerability? Well, I suppose that with ONE
bitmap, writes could be somewhat delayed to TWO DIFFERENT components in
turn. Then if we lose the array node at that point, writes will be
outstanding to both components, and when we resync neither will have
perfect data to copy back over the other. And we won't even be able to
know which was right, because of the single bitmap. Shrug. We probably
wouldn't have known which mirror component was the good one in any case.
But with TWO bitmaps, we'd know which components were lacking what, and
we could maybe do a better recovery job. Or not. We'd always choose one
component to copy from, and that would overwrite the right data that the
other had.

Even with sync (not async) writes, we could get an array node crash that
left BOTH components of the mirror without some info that the other
component had already had written to it, and then copying from either
component over the other would lose data. Yer pays yer money and yer
takes yer choice. So I don't see it as a big thing. It's a question of
evaluating probabilities, and benefits.

BTW - async writes without the presence of a bitmap also seem to me to
be a valid admin choice. Surely if a single component dies, and the
array stays up, everything will be fine. The problem is when the array
node crashes on its own. And that may cause data loss anyway.

Peter
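
As a sketch of the trade-off being described (hypothetical code, not the
actual patch):

    def async_mirror_write(block, data, mirrors, bitmap, ack):
        # The bitmap bit means "this block may not be on every mirror yet".
        bitmap.add(block)
        for i, mirror in enumerate(mirrors):   # in reality these complete out of order
            mirror[block] = data
            if i == 0:
                ack(block)                     # async mode: the upper layer is acked
                                               # once the *first* component has the data
        bitmap.discard(block)                  # only now is the block known to be everywhere

    # A crash between ack() and discard() leaves the bit set, so a later
    # resync knows exactly which blocks may still differ between the mirrors.
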
* RAID1 and data safety?
From: Molle Bestefich @ 2005-03-16 9:13 UTC
To: linux-raid

Just wondering;

Is there any way to tell MD to do verify-on-write and
read-from-all-disks on a RAID1 array?

I was thinking of setting up a couple of RAID1s with maximum data
safety. I'd like to verify after each write to a disk, plus I'd like to
read from all disks and perform a data comparison whenever something is
read. I'd then run a RAID0 over the RAID1 arrays, to regain some of the
speed lost from all of the excessive checking.

Just wondering if it could be done :-). Apologies if the answer is in
the docs.
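
As a sketch of the read-from-all-disks half of what I mean (toy Python,
obviously not how md works internally):

    def checked_read(block, mirrors):
        copies = [m[block] for m in mirrors]
        if any(c != copies[0] for c in copies[1:]):
            raise IOError(f"mirror mismatch on block {block}")
        return copies[0]

    hda, hdb = {0: b"data"}, {0: b"data"}
    print(checked_read(0, [hda, hdb]))    # -> b'data'; a mismatch would raise instead
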
* Re: RAID1 and data safety?
From: Neil Brown @ 2005-03-21 22:26 UTC
To: Molle Bestefich; +Cc: linux-raid

On Wednesday March 16, molle.bestefich@gmail.com wrote:
> Just wondering;
>
> Is there any way to tell MD to do verify-on-write and
> read-from-all-disks on a RAID1 array?

No.

I would have thought that modern disk drives did some sort of
verify-on-write, else how would they detect write errors, and they are
certainly in the best place to do verify-on-write. Doing it at the md
level would be problematic, as you would have to ensure that you really
were reading from the media and not from some cache somewhere in the
data path. I doubt it would be a mechanism that would actually increase
confidence in the safety of the data.

read-from-all-disks would require at least three drives before there
would be any real value in it. There would be an enormous overhead, but
possibly that could be justified in some circumstances. If we ever
implement background data checking, it might become relatively easy to
implement this.

However I think that checksum-based checking would be more effective,
and that it should be done at the filesystem level. Imagine a
filesystem that could access multiple devices, and where for its index
information it didn't just keep one block address, but rather kept two
block addresses, each on a different device, plus a strong checksum of
the data block. This would allow much the same robustness as
read-from-all-drives with much lower overhead. It is very possible that
Sun's new ZFS filesystem works like this, though I haven't seen precise
technical details.

In summary:
 - you cannot do it now.
 - I don't think md is at the right level to solve these sorts of
   problems. I think a filesystem could do it much better. (I'm working
   on a filesystem .... slowly...)
 - read-from-all-disks might get implemented one day; verify-on-write
   is much less likely.

> Apologies if the answer is in the docs.

It isn't. But it is in the list archives now....

NeilBrown
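
A tiny sketch of that read path (invented structures, nothing to do with
any real filesystem):

    import hashlib

    def read_indexed_block(entry, devices):
        # The index entry names two copies on different devices plus a strong
        # checksum; a copy that fails the checksum is simply skipped.
        for dev_id, addr in entry["copies"]:
            data = devices[dev_id][addr]
            if hashlib.sha1(data).hexdigest() == entry["sha1"]:
                return data
        raise IOError("no copy matches its checksum")

    devices = {0: {7: b"hello"}, 1: {42: b"hello"}}
    entry = {"copies": [(0, 7), (1, 42)],
             "sha1": hashlib.sha1(b"hello").hexdigest()}
    print(read_indexed_block(entry, devices))    # -> b'hello'
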
* Re: RAID1 and data safety?
From: Molle Bestefich @ 2005-03-22 8:48 UTC
To: linux-raid

Neil Brown wrote:
> > Is there any way to tell MD to do verify-on-write and
> > read-from-all-disks on a RAID1 array?
>
> No.
> I would have thought that modern disk drives did some sort of
> verify-on-write, else how would they detect write errors, and they are
> certainly in the best place to do verify-on-write.

Really? My guess was that they wouldn't, because it would lead to less
performance. And that's why read errors crop up at read time.

> Doing it at the md level would be problematic, as you would have to
> ensure that you really were reading from the media and not from some
> cache somewhere in the data path. I doubt it would be a mechanism
> that would actually increase confidence in the safety of the data.

Hmm. Could hack it by reading / writing blocks larger than the cache.
Ugly.

> Imagine a filesystem that could access multiple devices, and where for
> its index information it didn't just keep one block address, but
> rather kept two block addresses, each on a different device, plus a
> strong checksum of the data block. This would allow much the same
> robustness as read-from-all-drives with much lower overhead.

As in, "if the checksum fails, try loading the data blocks [again] from
the other device"? Not sure why a checksum of X data blocks should be
cheaper performance-wise than a comparison between X data blocks, but I
can see the point in that you only have to load the data once and check
the checksum. Not quite the same security, but almost.

> In summary:
>  - you cannot do it now.
>  - I don't think md is at the right level to solve these sorts of
>    problems. I think a filesystem could do it much better. (I'm working
>    on a filesystem .... slowly...)
>  - read-from-all-disks might get implemented one day; verify-on-write
>    is much less likely.
>
> > Apologies if the answer is in the docs.
>
> It isn't. But it is in the list archives now....

Thanks! :-)

(Guess I'll drop the idea for the time being...)