* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) [not found] <200501030916.j039Gqe23568@inv.it.uc3m.es> @ 2005-01-03 10:17 ` Guy 2005-01-03 11:31 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Guy @ 2005-01-03 10:17 UTC (permalink / raw) To: ptb, 'linux raid' See notes below with **. Guy -----Original Message----- From: ptb@inv.it.uc3m.es [mailto:ptb@inv.it.uc3m.es] Sent: Monday, January 03, 2005 4:17 AM To: Guy Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) "Also sprach Guy:" > "Well, you can make somewhere. You only require an 8MB (one cylinder) > partition." > > So, it is ok for your system to fail when this disk fails? You lose the journal, that's all. You can react with a simple tune2fs -O ^journal or whatever is appropriate. And a journal is ONLY there in order to protect you against crashes of the SYSTEM (not the disk), so what was the point of having the journal in the first place? ** When you lose the journal, does the system continue without it? ** Or does it require user intervention? > I don't want system failures when a disk fails, "don't use a journal then" seems to be the easy answer for you, but probably "put it on an ultrasafe medium like gold-plated persistent ram" works better! ** RAM will be lost if you crash or lose power. Your scenario seems to be that you have the disks of your mirror on the same physical system. That's fundamentally dangerous. They're both subject to damage when the system blows up. I don't. I have an array node (where the journal is kept), and a local mirror component and a remote mirror component. That system is doubled, and each half of the double hosts the other's remote mirror component. Each half fails over to the other. ** So, you have 2 systems, 1 fails and the "system" switches to the other. ** I am not going for a 5 nines system. ** I just don't want any down time if a disk fails. ** A disk failing is the most common failure a system can have (IMO). ** In a computer room with about 20 Unix systems, in 1 year I have seen 10 or so disk failures and no other failures. ** Are your 2 systems in the same state? ** They should be at least 50 miles apart (at a minimum). ** Otherwise if your data center blows up, your system is down! ** In my case, this is so rare, it is not an issue. ** Just use off-site tape backups. ** My computer room is for development and testing, no customer access. ** If the data center is gone, the workers have nowhere to work anyway (in my case). ** Some of our customers do have failover systems 50+ miles apart. > so mirror (or RAID5) > everything required to keep your system running. > > "And there is a risk of silent corruption on all raid systems - that is > well known." > I question this.... Why! ** You lost me here. I did not make the above statement. But, in the case of RAID5, I believe it can occur. Your system crashes while a RAID5 stripe is being written, but the stripe is not completely written. During the re-sync, the parity will be adjusted, but it may be more current than 1 or more of the other disks. But this would be similar to what would happen to a non-RAID disk (some data not written). ** Also with RAID1 or RAID5, if corruption does occur without a crash or re-boot, and then a disk fails, the corrupt data will be copied to the replacement disk. With RAID1 there is a 50% risk of copying the corruption, and a 50% risk of correcting it.
With RAID5, the risk % depends on the number of disks in the array. > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you The corruption risk is doubled for a 2-way mirror, and there is a 50% chance of it not being detected at all even if you try and check for it, because you may be reading from the wrong mirror at the time you pass over the imperfection in the check. ** After a crash, md will re-sync the array. ** But during the re-sync, md could be checking for differences and reporting them. ** It won't help correct anything, but it could explain why you may be having problems with your data. ** Since md re-syncs after a crash, I don't think the risk is double. Isn't that simply the most naive calculation? So why would you make your bet? ** I don't understand this. And then of course you don't generally check at all, ever. ** True, but I would like md to report when a mirror is wrong. ** Or when a RAID5 parity is wrong. But whether you check or not, corruptions simply have only a 50% chance of being seen (you look on the wrong mirror when you look), and a 200% chance of occurring (twice as much real estate) wrt normal rate. ** Since md re-syncs after a crash, I don't think the risk is double. ** Also, I don't think most corruption would be detectable (ignoring a RAID problem). ** It depends on the type of data. ** Example: corruption in your MP3 collection would go undetected until someone listened to the corrupt file. In contrast, on a single disk they have a 100% chance of detection (if you look!) and a 100% chance of occurring, wrt normal rate. ** Are you talking about the disk drive detecting the error? ** If so, are you referring to a read error or what? ** Please explain the nature of the detectable error. > know when a difference occurs, if you want. How? ** Compare the 2 halves of the RAID1, or check the parity of RAID5. Peter ** Guy ^ permalink raw reply [flat|nested] 92+ messages in thread
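A minimal Python sketch of the 50% figure claimed above (an editorial addition, not part of the original mail): exactly one copy of a mirrored block is assumed to be silently corrupt, and the post-crash resync source is treated as effectively arbitrary, which is a modelling assumption rather than a statement about how md actually picks the source disk.

    import random

    def propagation_rate(n_copies: int = 2, trials: int = 100_000) -> float:
        """Fraction of resyncs that overwrite the good data with the corrupt copy."""
        propagated = 0
        for _ in range(trials):
            corrupt_copy = 0                        # exactly one copy holds the corruption
            source = random.randrange(n_copies)     # arbitrary resync source
            propagated += (source == corrupt_copy)
        return propagated / trials

    print(propagation_rate(2))   # ~0.50: a 2-way mirror copies the corruption half the time
    print(propagation_rate(3))   # ~0.33: with more copies, propagation is less likely

Under these assumptions the 50% for RAID1 falls out directly; for RAID5 the figure plausibly depends on which block of the stripe the corruption landed in, which is why it varies with the number of disks in the array, as stated above.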
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 10:17 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy @ 2005-01-03 11:31 ` Peter T. Breuer 2005-01-03 17:34 ` Guy 2005-01-03 17:46 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-03 11:31 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > "Also sprach Guy:" > > "Well, you can make somewhere. You only require an 8MB (one cylinder) > > partition." > > > > So, it is ok for your system to fail when this disk fails? > > You lose the journal, that's all. You can react with a simple tune2fs > -O ^journal or whatever is appropriate. And a journal is ONLY there in > order to protect you against crashes of the SYSTEM (not the disk), so > what was the point of having the journal in the first place? > > ** When you lose the journal, does the system continue without it? > ** Or does it require user intervention? I don't recall. It certainly at least puts itself into read-only mode (if that's the error mode specified via tune2fs). And the situation probably changes from version to version. On a side note, I don't know why you think user intervention is not required when a raid system dies. As a matter of likelihood, I have never seen a disk die while IN a working soft (or hard) raid system, and the system continue working afterwards; instead, the normal disaster sequence as I have experienced it is: 1) lightning strikes rails, or a/c goes out and room full of servers overheats. All lights go off. 2) when sysadmin arrives to sort out the smoking wrecks, he finds that 1 in 3 random disks are fried - they're simply the points of failure that died first, and they took down the hardware with them. 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware to piece together the raid arrays from the surviving disks, and hastily does a copy to somewhere very safe and distant, while an assistant holds off howling hordes outside the door with a shotgun. In this scenario, a disk simply acts as the weakest link in a fuse chain, and the whole chain goes down. But despite my dramatisation it is likely that a hardware failure will take out or damage your hardware! IDE disks live on an electric bus connected to other hardware. Try a short circuit and see what happens. You can't even yank them out while the bus is operating if you want to keep your insurance policy. For scsi the situation is better wrt hot-swap, but still not perfect. And you have the electric connections also. That makes it likely that a real nasty hardware failure will do nasty things (tm) to whatever is in the same electric environment. It is possible if not likely that you will lose contact with scsi disks further along the bus, if you don't actually blow the controller. That said, there ARE situations which raid protects you from - simply a "gentle disconnect" (a totally failed disk that goes open circuit), or a "gradual failure" (a disk that runs out of spare sectors). In the latter case the raid will fail the disk completely at the first detected error, which may well be what you want (or may not be!). However, I don't see how you can expect to replace a failed disk without taking down the system.
For that reason you are expected to be running "spare disks" that you can virtually insert hot into the array (caveat, it is possible with scsi, but you will need to rescan the bus, which will take it out of commission for some seconds, which may require you to take the bus offline first, and it MAY be possible with recent IDE buses that purport to support hotswap - I don't know). So I think the relevant question is: "what is it that you are protecting yourself from by this strategy of yours". When you have the scenario, you can evaluate risks. > > I don't want system failures when a disk fails, > > Your scenario seems to be that you have the disks of your mirror on the > same physical system. That's fundamentally dangerous. They're both > subject to damage when the system blows up. [ ... ] I have an array > node (where the journal is kept), and a local mirror component and a > remote mirror component. > > That system is doubled, and each half of the double hosts the other's > remote mirror component. Each half fails over to the other. > ** So, you have 2 systems, 1 fails and the "system" switches to the other. > ** I am not going for a 5 nines system. > ** I just don't want any down time if a disk fails. Well, (1) how likely is it that a disk will fail without taking down the system (2) how likely is it that a disk will fail (3) how likely is it that a whole system will fail I would say that (2) is about 10% per year. I would say that (3) is about 1200% per year. It is therefore difficult to calculate (1), which is your protection scenario, since it doesn't show up very often in the stats! > ** A disk failing is the most common failure a system can have (IMO). Not in my experience. See above. I'd say each disk has about a 10% failure expectation per year. Whereas I can guarantee that an unexpected system failure will occur about once a month, on every important system. If you think about it that is quite likely, since a system is by definition a complicated thing. And then it is subject to all kinds of horrible outside influences, like people rewiring the server room in order to reroute cables under the floor instead of through the ceiling, and the maintenance people spraying the building with insecticide, everywhere, or just "turning off the electricity in order to test it" (that happens about four times a year here - hey, I remember when they tested the giant UPS by turning off the electricity! Wrong switch. Bummer). Yes, you can try and keep these systems out of harm's way on a colocation site, or something, but by then you are at professional level paranoia. For "home systems", whole system failures are far more common than disk failures. I am not saying that RAID is useless! Just the opposite. It is a useful and EASY way of allowing you to pick up the pieces when everything falls apart. In contrast, running a backup regime is DIFFICULT. > ** In a computer room with about 20 Unix systems, in 1 year I have seen 10 > or so disk failures and no other failures. Well, let's see. If each system has 2 disks, then that would be 25% per disk per year, which I would say indicates low quality IDE disks, but is about the level I would agree with as experiential. > ** Are your 2 systems in the same state? No, why should they be? > ** They should be at least 50 miles apart (at a minimum). They aren't - they are in two different rooms. Different systems copy them every day to somewhere else. I have no need for instant survivability across nuclear attack.
> ** Otherwise if your data center blows up, your system is down! True. So what? I don't care. The cost of such a thing is zero, because if my data center goes I get a lot of insurance money and can retire. The data is already backed up elsewhere if anyone cares. > ** In my case, this is so rare, it is not an issue. It's very common here and everywhere else I know! Think about it - if your disks are in the same box then it is statistically likely that when one disk fails it is BECAUSE of some local cause, and that therefore the other disk will also be affected by it. It's your "if your data center burns down" reasoning, applied to your box. > ** Just use off-site tape backups. No way! I hate tapes. I backup to other disks. > ** My computer room is for development and testing, no customer access. Unfortunately, the admins do most of the sabotage. > ** If the data center is gone, the workers have nowhere to work anyway (in > my case). I agree. Therefore who cares. OTOH, if only the server room smokes out, they have plenty of places to work, but nothing to work on. Tut tut. > ** Some of our customers do have failover systems 50+ miles apart. Banks here don't (hey, I wrote the interbank communications encryption software here on the peninsula). They have tapes. As far as I know, the tapes are sent to vaults. It often happens that their systems go down. In fact, I have NEVER managed to connect to their systems via their web page at a time when they were in working order. And I have been trying on and off for about three years. And I often have been in the bank manager's office (discussing mortgages, national debt, etc.) when the bank's internal systems have gone down, nationwide. > > so mirror (or RAID5) > > everything required to keep your system running. > > > > "And there is a risk of silent corruption on all raid systems - that is > > well known." > > I question this.... > > Why! > ** You lost me here. I did not make the above statement. Yes you did. You can see from the quoting that you did. > But, in the case > of RAID5, I believe it can occur. So do I. I am asking why you "question this". > Your system crashes while a RAID5 stripe > is being written, but the stripe is not completely written. This is fairly meaningless. I don't know what precise meaning the word "stripe" has in raid5 but it's irrelevant. Simply, if you write redundant data, whatever way you write it, raid 1, 5, 6 or whatever, there is a possibility that you write only one of the copies before the system goes down. Then when the system comes up it has two different sources of data to choose to believe. > During the > re-sync, the parity will be adjusted, See. "when the system comes up ...". There is no need to go into detail and I don't know why you do! > but it may be more current than 1 or > more of the other disks. But this would be similar to what would happen to > a non-RAID disk (some data not written). No, it would not be similar. You don't seem to understand the mechanism. The mechanism for corruption is that there are two different versions of the data available when the system comes back up, and you and the raid system don't know which is more correct. Or even what it means to be "correct". Maybe the earlier written data is "correct"! > ** Also with RAID1 or RAID5, if corruption does occur without a crash or > re-boot, and then a disk fails, the corrupt data will be copied to the > replacement disk. Exactly so. It's a generic problem with redundant data sources.
You don't know which one to believe when they disagree! > With RAID1 there is a 50% risk of copying the corruption, and a 50% > risk of correcting it. With RAID5, the risk % depends on the number > of disks in the array. It's the same. There are two sources of data that you can believe. The "real data on disk" or "all the other data blocks in the 'stripe' plus the parity block". You get to choose which you believe. > > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you > > The corruption risk is doubled for a 2-way mirror, and there is a 50% > chance of it not being detected at all even if you try and check for it, > because you may be reading from the wrong mirror at the time you pass > over the imperfection in the check. > ** After a crash, md will re-sync the array. It doesn't know which disk to believe is correct. There is a stamp on the disks' superblocks, but it is only updated every so often. If the whole system dies while both disks are OK, I don't know what will be stamped or what will happen (which will be believed) at resync. I suspect it is random. I would appreciate clarification from Neil. > ** But during the re-sync, md could be checking for differences and > reporting them. It could. That might be helpful. > ** It won't help correct anything, but it could explain why you may be > having problems with your data. Indeed, it sounds like a good idea. It could slow down RAID1 resync, but I don't think the impact on RAID5 would be noticeable. > ** Since md re-syncs after a crash, I don't think the risk is double. That is not germane. I already pointed out that you are 50% likely to copy the "wrong" data IF you copy (and WHEN you copy). Actually doing the copy merely brings that calculation into play at the moment of the resync, instead of later, at the moment when one of the two disks actually dies and you have to use the remaining one. > Isn't that simply the most naive calculation? So why would you make > your bet? > ** I don't understand this. Evidently! :) > And then of course you don't generally check at all, ever. > ** True, but I would like md to report when a mirror is wrong. > ** Or when a RAID5 parity is wrong. Software raid does not spin off threads randomly checking data. If you don't use it, you don't get to check at all. So just leaving disks sitting there exposes them to corruption that is checked least of all. > But whether you check or not, corruptions simply have only a 50% chance > of being seen (you look on the wrong mirror when you look), and a 200% > chance of occurring (twice as much real estate) wrt normal rate. > ** Since md re-syncs after a crash, I don't think the risk is double. It remains double whatever you think. The question is whether you detect it or not. You cannot detect it without checking. > ** Also, I don't think most corruption would be detectable (ignoring a RAID > problem). You wouldn't know which disk was right. The disk might know if it was a hardware problem. Incidentally, I wish raid would NOT offline the disk when it detects a read error. It should fall back to the redundant data. I may submit a patch for that. In 2.6 the raid system may even do that. The resync thread comments SAY that it retries reads. I don't know if it actually does. Neil? > ** It depends on the type of data. > ** Example: corruption in your MP3 collection would go undetected until someone listened > to the corrupt file. :-). > In contrast, on a single disk they have a 100% chance of detection (if > you look!) and a 100% chance of occurring, wrt normal rate.
> ** Are you talking about the disk drive detecting the error? No. You are quite right. I should categorise the types of error more precisely. We want to distinguish: 1) hard errors (detectable by the disk firmware) 2) soft errors (not detected by the above) > ** If so, are you referring to a read error or what? Read. > ** Please explain the nature of the detectable error. "wrong data on the disk or as read from the disk". Define "wrong"! > > know when a difference occurs, if you want. > > How? > ** Compare the 2 halves of the RAID1, or check the parity of RAID5. You wouldn't necessarily know which of the two data sources was "correct". Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
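A toy illustration (an added sketch, not Peter's code) of the "two different sources of data" problem he describes: a three-disk RAID5-style stripe with XOR parity where a crash lands between the data write and the matching parity write, so that reading the block directly and reconstructing it from parity plus the other disk give different answers, and nothing in the stripe says which one to believe.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    d0_old, d1 = b"AAAA", b"BBBB"
    parity = xor(d0_old, d1)                       # consistent stripe: parity = d0 XOR d1

    d0_new = b"CCCC"                               # an update to d0 begins...
    d0_on_disk, parity_on_disk = d0_new, parity    # ...crash before the parity is rewritten

    direct_read   = d0_on_disk                     # believe the data disk
    reconstructed = xor(parity_on_disk, d1)        # believe the parity plus the other disk

    print(direct_read, reconstructed)              # b'CCCC' b'AAAA' -- two plausible values
    print(direct_read == reconstructed)            # False

The same ambiguity arises for a two-way mirror when one of the paired writes completes and the other is lost, which is the RAID1 form of the problem discussed in this message.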
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 11:31 ` Peter T. Breuer @ 2005-01-03 17:34 ` Guy 2005-01-03 19:20 ` ext3 Gordon Henderson 2005-01-03 17:46 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten 1 sibling, 1 reply; 92+ messages in thread From: Guy @ 2005-01-03 17:34 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Having a filesystem go into read only mode is a "down system". Not acceptable to me! Maybe ok for a home system, but I don't assume Linux is limited to home use. In my case, this is not acceptable for my home system. Time is money! About user intervention: if the system stops working until someone does something, that is a down system. That is what I meant by user intervention. Replacing a disk on Monday that failed Friday night is what I would expect. This is a normal failure to me. Even if a re-boot is required, as long as it can be scheduled, it is acceptable to me. You and I have had very different failures over the years! In my case, most failures are disks, and most of the time the system continues to work just fine, without user intervention. If spare disks are configured, the array re-builds to the spare. At my convenience, I replace the disk, without a system re-boot. Most Unix systems I have used have SCSI disks. IDE tends to be in home systems. My home system is Linux with 17 SCSI disks. I have replaced a disk without a re-boot, but the disk cabinet is not hot-swap, so I tend to shut down the system to replace a disk. My 20 systems had anywhere from 4 to about 44 disks. You should expect 1 disk failure out of 25-100 disks per year. There are good years and bad! Our largest customer system has more than 300 disks. I don't know the failure rate, but most failures do not take the system down! Our customer systems tend to have hardware RAID systems. HP, EMC, DG (now EMC). If you have a 10% disk failure rate per year, something else is wrong! You may have a bad building ground, or too much current flowing on the building ground line. All sorts of power problems are very common. Most if not all electricians only know the building code. They are not qualified to debug all power problems. I once talked to an expert in the field. He said thunder causes more power problems than lightning! Most buildings use conduit for ground, no separate ground wire. The thunder will shake the conduit and loosen the connections. This causes a bad ground during the thunder, which could crash computer systems (including hardware RAID boxes). Never depend on conduit for ground, always have a separate ground wire. This is just one example of many issues he raised; I don't recall all the details, and I am not an expert on building power. I know of 1 event that matches your most common failure. A PC with a $50 case and power supply: the power supply failed in such a way that it put 120V on the 12V and/or 5V line. Everything in the case was lost! Well, the heat sink was ok. :) The system was not repaired, it went into the trash. But this was a home user's clone PC, not a server. Guy
^ permalink raw reply [flat|nested] 92+ messages in thread
* ext3 .. 2005-01-03 17:34 ` Guy @ 2005-01-03 19:20 ` Gordon Henderson 2005-01-03 19:47 ` ext3 Morten Sylvest Olsen 0 siblings, 1 reply; 92+ messages in thread From: Gordon Henderson @ 2005-01-03 19:20 UTC (permalink / raw) To: linux-raid Been following this with interest as just about everything I'm building these days has raid1 to boot and data (typical small server setup), and raid5 in larger boxes for data and ext3 ... No problems with this yet - several power failures and disks lost and it's all generally behaved as I expected it to. I've hot-changed SCSI drives which have failed and cold changed IDE drives at a convenient time for the server... I did have a problem recently though - had a disk fail in an 8-disk external SCSI array, arranged as a 7+1 RAID5 ... Then 5 minutes later had a 2nd disk fail. So to the upper layers, ext3, userland, etc. that should look like a catastrophic hardware failure -- anything trying to read/write to it should (IMO) have simply returned with IO errors. What actually happened was that the kernel panicked and the whole box ground to a halt. The server could have carried on doing useful stuff without this disk partition, but a big oops and halt wasn't useful. (This is 2.4.27 in case it matters) I didn't have time to work out the why/what/wherefores of the problem; the box was power cycled and brought online minus the external array. Ext3 did its thing and enabled the box to come up in seconds rather than hours (it's a big Dell - it boots Linux faster than it goes through its BIOS!) As for the external array, well, that was resurrected with mdadm with no data lost, but that's another story... Gordon ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 .. 2005-01-03 19:20 ` ext3 Gordon Henderson @ 2005-01-03 19:47 ` Morten Sylvest Olsen 2005-01-03 20:05 ` ext3 Gordon Henderson 0 siblings, 1 reply; 92+ messages in thread From: Morten Sylvest Olsen @ 2005-01-03 19:47 UTC (permalink / raw) To: Gordon Henderson; +Cc: linux-raid > Been following this with interest as just about everything I'm building > these days has raid1 to boot and data (typical small server setup), and > raid5 in larger boxes for data and ext3 ... > > No problems with this yet - several power failures and disks lost and it's > all generally behaved as I expected it to. I've hot-changed SCSI drives > which have failed and cold changed IDE drives at a convenient time for the > server... > > I did have a problem recently though - had a disk fail in an 8-disk > external SCSI array, arranged as a 7+1 RAID5 ... Then 5 minutes later had > a 2nd disk fail. > > So to the upper layers, ext3, userland, etc. that should look like a > catastrophic hardware failure -- anything trying to read/write to it > should (IMO) have simply returned with IO errors. That depends on the options given when the filesystem was mounted. Or the options set in the superblock. The choices are continue, remount read-only or panic. Regards, Morten ---- A: No. Q: Should I include quotations after my reply? ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 .. 2005-01-03 19:47 ` ext3 Morten Sylvest Olsen @ 2005-01-03 20:05 ` Gordon Henderson 0 siblings, 0 replies; 92+ messages in thread From: Gordon Henderson @ 2005-01-03 20:05 UTC (permalink / raw) To: Morten Sylvest Olsen; +Cc: linux-raid On Mon, 3 Jan 2005, Morten Sylvest Olsen wrote: > > So to the upper layers, ext3, userland, etc. that should look like a > > catastrophic hardware failure -- anything trying to read/write to it > > should (IMO) have simply returned with IO errors. > > That depends on the options when the filesystem was mounted. Or the > options set in the superblock. The choices are continue, remount > read-only or panic. however - on the system in question: xena:/home/gordonh# dumpe2fs -h /dev/md5 ... Errors behavior: Continue ... and there are no mount options to say otherwise. There are lots of ext3 whinges in the log-file so I guess it just got fed-up... And this really isn't a linux-raid issue, just something I noticed recently to do with ext3... I did try XFS a while back, but had more problems with it and no satisfactory answers, so gave up on it... The trouble is, you get too used to what works and tend to stick with it... Ah well. Gordon ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 11:31 ` Peter T. Breuer 2005-01-03 17:34 ` Guy @ 2005-01-03 17:46 ` maarten 2005-01-03 19:52 ` maarten ` (2 more replies) 1 sibling, 3 replies; 92+ messages in thread From: maarten @ 2005-01-03 17:46 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > Guy <bugzilla@watkins-home.com> wrote: > > "Also sprach Guy:" > 1) lightning strikes rails, or a/c goes out and room full of servers > overheats. All lights go off. > > 2) when sysadmin arrives to sort out the smoking wrecks, he finds > that 1 in 3 random disks are fried - they're simply the points > of failure that died first, and they took down the hardware with > them. > > 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware > to piece together the raid arrays from the surviving disks, and > hastily does a copy to somewhere very safe and distant, while > an assistant holds off howling hordes outside the door with > a shotgun. > > In this scenario, a disk simply acts as the weakest link in a fuse > chain, and the whole chain goes down. But despite my dramatisation it > is likely that a hardware failure will take out or damage your hardware! > IDE disks live on an electric bus connected to other hardware. Try a > short circuit and see what happens. You can't even yank them out while > the bus is operating if you want to keep your insurance policy. The chance of a PSU blowing up or lightning striking is, reasonably, much lower than that of an isolated disk failure. If this simple fact is not true for you personally, you really ought to reevaluate the quality of your PSU (et al) and / or the building's defenses against a lightning strike... > However, I don't see how you can expect to replace a failed disk > without taking down the system. For that reason you are expected to be > running "spare disks" that you can virtually insert hot into the array > (caveat, it is possible with scsi, but you will need to rescan the bus, > which will take it out of commission for some seconds, which may > require you to take the bus offline first, and it MAY be possible with > recent IDE buses that purport to support hotswap - I don't know). I think the point is not what actions one has to take at time T+1 to replace the disk, but rather whether at time T, when the failure first occurs, the system survives the failure or not. > (1) how likely is it that a disk will fail without taking down the system > (2) how likely is it that a disk will fail > (3) how likely is it that a whole system will fail > > I would say that (2) is about 10% per year. I would say that (3) is > about 1200% per year. It is therefore difficult to calculate (1), which > is your protection scenario, since it doesn't show up very often in the > stats! I don't understand your math. For one, percentage is measured from 0 to 100, not from 0 to 1200. What is that, twelve times 'absolute certainty' that something will occur? But besides that, I'd wager that from your list number (3) has, by far, the smallest chance of occurring. Choosing between (1) and (2) is more difficult; my experience with IDE disks is definitely that a disk failure will take the system down, but that is very biased since I always used non-mirrored swap. I sure can understand a system dying if it loses part of its memory... > > ** A disk failing is the most common failure a system can have (IMO). I fully agree. > Not in my experience. See above.
I'd say each disk has about a 10% > failure expectation per year. Whereas I can guarantee that an > unexpected system failure will occur about once a month, on every > important system. Whoa! What are you running, Windows perhaps?!? ;-) No but seriously, joking aside, you have 12 system failures per year? I would not be alone in thinking that figure is VERY high. My uptimes generally are in the three-digit range, and most *certainly* not in the low 2-digit range. > If you think about it that is quite likely, since a system is by > definition a complicated thing. And then it is subject to all kinds of > horrible outside influences, like people rewiring the server room in > order to reroute cables under the floor instead of through the ceiling, > and the maintenance people spraying the building with insecticide, > everywhere, or just "turning off the electricity in order to test it" > (that happens about four times a year here - hey, I remember when they > tested the giant UPS by turning off the electricity! Wrong switch. > Bummer). If you have building maintenance people and other random staff that can access your server room unattended and unmonitored, you have far worse problems than making decisions about raid levels. IMNSHO. By your description you could almost be the guy in the joke about the recurring 7 o'clock system crash (where the cleaning lady unplugs the server every morning in order to plug in her vacuum cleaner) ;-) > Yes, you can try and keep these systems out of harm's way on a > colocation site, or something, but by then you are at professional > level paranoia. For "home systems", whole system failures are far more > common than disk failures. Don't agree. Not only do disk failures occur more often than full system failures, disk failures are also much more time-consuming to recover from. Compare changing a system board or PSU with changing a drive and finding, copying and verifying a backup (if you even have one that's 100% up to date). > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > 10 or so disk failures and no other failures. > > Well, let's see. If each system has 2 disks, then that would be 25% per > disk per year, which I would say indicates low quality IDE disks, but > is about the level I would agree with as experiential. The point here was, disk failures being more common than other failures... > No way! I hate tapes. I backup to other disks. Then for your sake, I hope they're kept offline, in a safe. > > ** My computer room is for development and testing, no customer access. > > Unfortunately, the admins do most of the sabotage. Change admins. I could understand an admin making typing errors and such, but then again that would not usually lead to a total system failure. Some daemon not working, sure. Good admins review or test their changes, for one thing, and in most cases any such mistake is rectified far more simply and quickly than a failed disk anyway. Except maybe for lilo errors with no boot media available. ;-\ > Yes you did. You can see from the quoting that you did. Or the quoting got messed up. That is known to happen in threads. > > but it may be more current than 1 or > > more of the other disks. But this would be similar to what would happen > > to a non-RAID disk (some data not written). > > No, it would not be similar. You don't seem to understand the > mechanism.
The mechanism for corruption is that there are two different > versions of the data available when the system comes back up, and you > and the raid system don't know which is more correct. Or even what it > means to be "correct". Maybe the earlier written data is "correct"! That is not the whole truth. To be fair, the mechanism works like this: With raid, you have a 50% chance the wrong, corrupted, data is used. Without raid, thus only having a single disk, the chance of using the corrupted data is 100% (obviously, since there is only one source). Or, much more elaborately: Let's assume the chance of a disk corruption occurring is 50%, i.e. 0.5. With raid, you always have a 50% chance of reading faulty data IF one of the drives holds faulty data. For the drives themselves, the chance of both disks being wrong is 0.5x0.5 = 0.25 (scenario A). Similarly, a 25% chance both disks are good (scenario B). The chance of one of the disks being wrong is 50% (scenarios C & D together). In scenarios A & B the outcome is certain. In scenarios C & D the chance of the raid choosing the false mirror is 50%. Accumulating those chances one can say that the chance of reading false data is: in scenario A: 100%; in scenario B: 0%; in scenario C: 50%; in scenario D: 50%. Doing the math, the outcome is still (200% divided by four) = 50%. Ergo: the same as with a single disk. No change. > > In contrast, on a single disk they have a 100% chance of detection (if > > you look!) and a 100% chance of occurring, wrt normal rate. > > ** Are you talking about the disk drive detecting the error? No, you have a zero chance of detection, since there is nothing to compare TO. Raid-1 at least gives you a 50/50 chance to choose the right data. With a single disk, the chance of reusing the corrupted data is 100% and there is no mechanism to detect the odd flipped bit at all. > > How? > > ** Compare the 2 halves of the RAID1, or check the parity of RAID5. > > You wouldn't necessarily know which of the two data sources was > "correct". No, but you have a theoretical choice, and a 50% chance of being right. Not so without raid, where you get no choice, and a 100% chance of getting the wrong data, in the case of a corruption. Maarten -- ^ permalink raw reply [flat|nested] 92+ messages in thread
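The same two-disk argument in general terms (an added worked step; p stands for the per-disk corruption probability that the example above sets to 0.5):

    \[
        P(\text{bad read}) \;=\; p^{2}\cdot 1 \;+\; 2p(1-p)\cdot\tfrac{1}{2} \;+\; (1-p)^{2}\cdot 0 \;=\; p ,
    \]

so for any p, not just 0.5, the chance that a single read from a two-way mirror returns corrupt data equals the single-disk figure, which is the point being made in this message.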
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten @ 2005-01-03 19:52 ` maarten 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 21:36 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy 2 siblings, 1 reply; 92+ messages in thread From: maarten @ 2005-01-03 19:52 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 18:46, maarten wrote: > On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > > Guy <bugzilla@watkins-home.com> wrote: > > Doing the math, the outcome is still (200% divided by four) = 50%. > Ergo: the same as with a single disk. No change. Just for laughs, I calculated this chance also for a three-way raid-1 setup using a lower 'failure possibility' percentage. The outcome does not change. The (statistically higher) chance of a disk failing is exactly offset by the greater likelihood that the raid system chooses one of the good drives to read from. (Obviously this is only valid for raid level 1, not for level 5 or others) Let us (randomly) assume there is a 10% chance of a disk failure. We use three raid-1 disks, numbered 1 through 3. We therefore have eight possible scenarios: A: disk1 fail, disk2 good, disk3 good; B: disk1 good, disk2 fail, disk3 good; C: disk1 good, disk2 good, disk3 fail; D: disk1 fail, disk2 fail, disk3 good; E: disk1 fail, disk2 good, disk3 fail; F: disk1 good, disk2 fail, disk3 fail; G: disk1 fail, disk2 fail, disk3 fail; H: disk1 good, disk2 good, disk3 good. Scenarios A, B and C are similar (one disk failed). Scenarios D, E and F are also similar (two disk failures). Scenarios G and H are special; the chances of those occurring are calculated separately. H: the chance of all good disks is (0.9x0.9x0.9) = 0.729 G: the chance of all disks bad is (0.1x0.1x0.1) = 0.001 The chance of A, B or C (one bad disk) is (0.9x0.9x0.1) = 0.081 The chance of D, E or F (two bad disks) is (0.9x0.1x0.1) = 0.009 The chance of (A, B or C) and (D, E or F) occurring must be multiplied by three as there are three scenarios each. So this becomes: The chance of one bad disk is = 0.243 The chance of two bad disks is = 0.027 Now let's see. It is certain that the raid subsystem will read the good data in H. The chance of that in scenario G is zero. The chance in (A, B or C) is two-thirds. And for D, E or F the chance of the raid system getting the good data is one-third. Let's calculate all this. [ABC] x 0.667 = 0.243 x 0.667 = 0.162 [DEF] x 0.333 = 0.027 x 0.333 = 0.009 [G] x 0 = 0.0 [H] x 1.0 = 0.729 (total added up is 0.9) Conversely, the chance of reading the BAD data: [ABC] x 0.333 = 0.243 x 0.333 = 0.081 [DEF] x 0.667 = 0.027 x 0.667 = 0.018 [G] x 1.0 = 0.001 [H] x 0.0 = 0.0 (total added up is 0.1) Which, again, is exactly the same chance that a single disk will get corrupted, which we assumed above in line one to be 10%. Ergo, using raid-1 does not make the risks of bad data creeping in any worse. Nor does it make it better either. Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
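An added Monte Carlo check of the three-way arithmetic above (not Maarten's code): each mirror is independently corrupt with probability p and a read is served from one mirror chosen at random; the observed rate of bad reads should come out at about p, the same as for a single disk, which is the figure the hand calculation arrives at.

    import random

    def bad_read_rate(p: float = 0.1, n_copies: int = 3, trials: int = 200_000) -> float:
        """Probability that a read from a randomly chosen mirror returns corrupt data."""
        bad_reads = 0
        for _ in range(trials):
            corrupt = [random.random() < p for _ in range(n_copies)]
            bad_reads += corrupt[random.randrange(n_copies)]   # read one copy at random
        return bad_reads / trials

    print(bad_read_rate(0.1, n_copies=3))   # ~0.10 for three mirrors
    print(bad_read_rate(0.1, n_copies=1))   # ~0.10 for a single disk, the baseline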
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 19:52 ` maarten @ 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 23:19 ` Peter T. Breuer 2005-01-04 0:45 ` maarten 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-03 20:41 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > Just for laughs, I calculated this chance also for a three-way raid-1 setup There's no need for you to do this - your calculations are unfortunately not meaningful. > Let us (randomly) assume there is a 10% chance of a disk failure. No, call it "p". That is the correct name. And I presume you mean "an error", not "a failure". > We therefore have eight possible scenarios: Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic algebra is not something I wish to suffer through ... > A: disk1 fail, disk2 good, disk3 good ... > H: disk1 good, disk2 good, disk3 good Was that all? 8 was it? 1 all good, 3 with one good, 3 with two good, 1 with all fail? Have we got the binomial theorem now! > Scenarios A, B and C are similar (one disk failed). Hoot. 3p. > Scenarios D, E and F are > also similar (two disk failures). There is no need for you to consider these scenarios. The probability is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the cube term). > Scenarios G and H are special; the chances > of those occurring are calculated separately. No, they are NOT special. One of them is the chance that everything is OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is the completely forgettable probability p^3 that all three are bad at that spot. > H: the chance of all good disks is (0.9x0.9x0.9) = 0.729 Surprisingly enough, 1-3p, even though you have such improbably large probability p as to make the approximation only approximate! Puhleeze. This is excruciatingly poor baby math! > G: the chance of all disks bad is (0.1x0.1x0.1) = 0.001 Surprise. p^3. > The chance of A, B or C (one bad disk) is (0.9x0.9x0.1) = 0.081 > The chance of D, E or F (two bad disks) is (0.9x0.1x0.1) = 0.009 > > The chance of (A, B or C) and (D, E or F) occurring must be multiplied by > three as there are three scenarios each. So this becomes: > The chance of one bad disk is = 0.243 > The chance of two bad disks is = 0.027 Surprise, surprise. 3p and 3p^2(1-p) (well, call it 3p^2). > Now let's see. It is certain that the raid subsystem will read the good data > in H. The chance of that in scenario G is zero. The chance in (A, B or C) is > two-thirds. And for D, E or F the chance of the raid system getting the good > data is one-third. > > Let's calculate all this. > [ABC] x 0.667 = 0.243 x 0.667 = 0.162 > [DEF] x 0.333 = 0.027 x 0.333 = 0.009 > [G] x 0 = 0.0 > [H] x 1.0 = 0.729 > > (total added up is 0.9) The chance of reading good data is 1 x (1-3p) + 2/3 x 3p, or approx 1-p. Probably exactly so, were I to do the calculation exactly, which I won't. > Conversely, the chance of reading the BAD data: > [ABC] x 0.333 = 0.243 x 0.333 = 0.081 > [DEF] x 0.667 = 0.027 x 0.667 = 0.018 > [G] x 1.0 = 0.001 > [H] x 0.0 = 0.0 > > (total added up is 0.1) It should be p! It is one minus your previous result. Sigh ... 0 x (1-3p) + 1/3 x 3p = p > Which, again, is exactly the same chance that a single disk will get corrupted, which > we assumed above in line one to be 10%. Ergo, using raid-1 does not make the > risks of bad data creeping in any worse. Nor does it make it better either. All false. And baby false at that. Annoying!
Look, the chance of an undetected detectable failure occurring is 0(1-3p) + (2/3)(3p) = 2p and it grows with the number n of disks, as you may expect, being proportional to n-1. With one disk, it is zero. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
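As a sanity check on the first-order arithmetic above, here is a small Python sketch. The reading of "undetected detectable failure" is my own interpretation: at least one of the n mirrors is bad at a given spot, but the mirror actually read there happens to be clean, with reads spread uniformly over the mirrors.

from math import comb

def p_missed(n, p):
    # Exact P(some mirror is bad at the spot, but a uniformly chosen mirror reads clean).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * (n - k) / n
               for k in range(1, n + 1))

for p in (1e-3, 1e-6):
    print(p, p_missed(3, p), 2 * p)    # the exact value hugs the 2p approximation for small p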
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:41 ` Peter T. Breuer @ 2005-01-03 23:19 ` Peter T. Breuer 2005-01-03 23:46 ` Neil Brown 2005-01-04 0:45 ` maarten 1 sibling, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-03 23:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > No, call it "p". That is the correct name. And I presume you mean "an > error", not "a failure". I'll do this thoroughly, so you can see how it goes. Let p = probability of a detectable error occurring on a disk in a unit time p'= ................ undetectable ..................................... Then the probability of an error occurring UNdetected on an n-disk raid array is (n-1)p + np' and on a 1 disk system (a 1-disk raid array :) it is p' OK? (hey, I'm a mathematician, it's obvious to me). Exercise: calculate the effect of majority voting! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
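The exercise at the end of the message above, worked numerically for a three-way mirror under the same small-p model. The definition of "majority voting" here is my own assumption: every read compares all three copies and returns whatever at least two of them agree on, so wrong data survives only when two or more copies are bad at the same spot (treating any two bad copies as agreeing, which makes this an upper bound).

p = 1e-4                                        # per-disk chance of a bad block at a given spot

read_one_mirror = p                             # ordinary RAID-1 read, to first order
majority_vote = 3 * p**2 * (1 - p) + p**3       # two or more copies bad at the same spot

print(read_one_mirror, majority_vote)           # majority voting wins by roughly a factor of 1/(3p)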
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 23:19 ` Peter T. Breuer @ 2005-01-03 23:46 ` Neil Brown 2005-01-04 0:28 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-03 23:46 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > No, call it "p". That is the correct name. And I presume you mean "an > > error", not "a failure". > > I'll do this thoroughly, so you can see how it goes. > > Let > > p = probability of a detectable error occurring on a disk in a unit time > p'= ................ undetectable ..................................... > > Then the probability of an error occurring UNdetected on an n-disk raid > array is > > (n-1)p + np' > > and on a 1 disk system (a 1-disk raid array :) it is > > p' > > OK? (hey, I'm a mathematician, it's obvious to me). It may be obvious, but it is also wrong. But then probability is, I think, the branch of mathematics that has the highest ratio of people who think they understand it to people who actually do (witness the success of lotteries). The probability of an event occurring lies between 0 and 1 inclusive. You have given a formula for a probability which could clearly evaluate to a number greater than 1. So it must be wrong. You have also been very sloppy in your language, or your definitions. What do you mean by a "detectable error occurring"? Is it a bit getting flipped on the media, or the drive detecting a CRC error during read? And what is your scenario for an undetectable error happening? My understanding of drive technology and CRCs suggests that undetectable errors don't happen without some sort of very subtle hardware error, or high level software error (i.e. the wrong data was written - and that doesn't really count). NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 23:46 ` Neil Brown @ 2005-01-04 0:28 ` Peter T. Breuer 2005-01-04 1:18 ` Alvin Oga 2005-01-04 2:07 ` Neil Brown 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 0:28 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > > No, call it "p". That is the correct name. And I presume you mean "an > > > error", not "a failure". > > > > I'll do this thoroughly, so you can see how it goes. > > > > Let > > > > p = probability of a detectable error occurring on a disk in a unit time > > p'= ................ undetectable ..................................... > > > > Then the probability of an error occurring UNdetected on an n-disk raid > > array is > > > > (n-1)p + np' > > > > and on a 1 disk system (a 1-disk raid array :) it is > > > > p' > > > > OK? (hey, I'm a mathematician, it's obvious to me). > > It may be obvious, but it is also wrong. No, it's quite correct. > But then probability is, I > think, the branch of mathematics that has the highest ratio of people > who think they understand it to people who actually do (witness the > success of lotteries). Possibly. But not all of them teach probability at university level (and did so when they were 21, at the University of Cambridge to boot, and continued teaching pure math there at all subjects and all levels until the age of twenty-eight - so puhleeeze don't bother!). > The probability of an event occurring lies between 0 and 1 inclusive. > You have given a formula for a probability which could clearly evaluate > to a number greater than 1. So it must be wrong. The hypothesis here is that p is vanishingly small. I.e. this is a Poisson distribution - the analysis assumes that only one event can occur per unit time. Take the unit to be one second if you like. Does that make it true enough for you? Poisson distros are pre-A level math. > You have also been very sloppy in your language, or your definitions. > What do you mean by a "detectable error occurring"? I mean an error occurs that can be detected (by the experiment you run, which is presumably an fsck, but I don't presume to dictate to you). > Is it a bit > getting flipped on the media, or the drive detecting a CRC error > during read? I don't know. It's whatever your test can detect. You can tell me! > And what is your scenario for an undetectable error happening? Likewise, I don't know. It's whatever error your experiment (presumably an fsck) will miss. > My > understanding of drive technology and CRCs suggests that undetectable > errors don't happen without some sort of very subtle hardware error, They happen all the time - just write a 1 to disk A and a zero to disk B in the middle of the data in some file, and you will have an undetectable error (vis a vis your experimental observation, which is presumably an fsck). > or high level software error (i.e. the wrong data was written - and > that doesn't really count). It counts just fine, since it's what does happen: consider a system crash that happens AFTER one of a pair of writes to the two disk components has completed, but BEFORE the second has completed. Then on reboot your experiment (an fsck) has the task of finding the error (which exists at least as a discrepancy between the two disks), if it can, and shouting at you about it.
All I am saying is that the error is either detectable by your experiment (the fsck), or not. If it IS detectable, then there is a 50% chance that it WON'T be detected, even though it COULD be detected, because the system unfortunately chose to read the wrong disk at that moment. However, the error is twice as likely as with only one disk, whatever it is (you can argue about the real multiplier, but it is about that). And if it is not detectable, it's still twice as likely as with one disk, for the same reason - more real estate for it to happen on. This is just elementary operational research! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:28 ` Peter T. Breuer @ 2005-01-04 1:18 ` Alvin Oga 2005-01-04 4:29 ` Neil Brown 2005-01-04 2:07 ` Neil Brown 1 sibling, 1 reply; 92+ messages in thread From: Alvin Oga @ 2005-01-04 1:18 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tue, 4 Jan 2005, Peter T. Breuer wrote: > Neil Brown <neilb@cse.unsw.edu.au> wrote: > > > Let > > > > > > p = probability of a detectible error occuring on a disk in a unit time > > > p'= ................ indetectible ..................................... > > > i think the definitions and modes of failures is what each reader is interpretting from their perspective ?? > > think, the branch of mathematics that has the highest ratio of people > > who think that understand it to people to actually do (witness the > > success of lotteries). ahh ... but the stock market is the worlds largest casino > Possibly. But not all of them teach probability at university level > (and did so when they were 21, at the University of Cambridge to boot, > and continued teaching pure math there at all subjects and all levels > until the age of twenty-eight - so puhleeeze don't bother!). :-) > I mean an error occurs that can be detected (by the experiment you run, > which is prsumably an fsck, but I don't presume to dictate to you). or more simply, the disk doesnt work .. what you write is not what you get back ?? - below that level, there'd be crc errors, some fixable some not - below that, there'd be disk controller problems with bad block mapping and temperature sensitive failures - below that ... flaky heads and platters and oxide .. > > Is it a bit > > getting flipped on the media, or the drive detecting a CRC error > > during read? different error conditions ... - bit flipping is trivially fixed ... and the user probasbly doesnt know about it - crc error of 1 bit error or 2 bit error or burst errors ( all are different crc errors and ecc problems ) > I don't know. It's whatever your test can detect. You can tell me! i think most people only care about ... can we read the "right data" back some time later after we had previously written it "supposedly correctly" > > And what is your senario for an undetectable error happening? there's lots of undetectable errors ... there's lots of detectable errors that was fixed, so that the user doesnt know abut the underlying errors > Likewise, I don't know. It's whatever error your experiment > (presumably an fsck) will miss. fsck is too high a level to be worried about errors... - it assume the disk is workiing fine and fsck fixes the filesystem inodes and doesnt worry about "disk errors" > > My > > understanding of drive technology and CRCs suggests that undetectable > > errors don't happen without some sort of very subtle hardware error, some crc ( ecc ) will fix it ... some errors are Not fixable "crc" is not used too much ... ecc is used ... > > > or high level software error (i.e. the wrong data was written - and > > that doesn't really count). > > It counts just fine, since it's what does happen :- consider a system > crash that happens AFTER one of a pair of writes to the two disk > components has completed, but BEFORE the second has completed. Then on > reboot your experiment (an fsck) has the task of finding the error > (which exists at least as a discrepency between the two disks), if it > can, and shouting at you about it. a common problem ... 
that data is partially written during a crash very hard to fix .. without knowing what the data should have been > All I am saying is that the error is either detectible by your > experiment (the fsck), or not. or detectable/undetectable/fixable by other "methods" If it IS detectible, then there > is a 50% chance that it WON'T be deetcted, that'd depend on what the failure mode was .. > even though it COULD be > detected, because the system unfortunately chose to read the wrong > disk at that moment. the assumption is that if one writes data ... that the crc/ecc is written somewhere else that is correct or vice versa, but both could be written wrong > And if it is not detectible, it's still twice as likely as with one > disk, for the same reason - more real estate for it to happen on. more "(disk) real estate" increases the places where errors can occur ... but todays, disk drives is lots lots better than the old days and todays dd copying of disk might work, but doing dd on old disks w/ bad oxides will create lots of problems ... == == fun stuff ... how do you make your data more secure ... == and reliable == c ya alvin ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 1:18 ` Alvin Oga @ 2005-01-04 4:29 ` Neil Brown 2005-01-04 8:43 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-04 4:29 UTC (permalink / raw) To: Alvin Oga; +Cc: Peter T. Breuer, linux-raid On Monday January 3, aoga@ns.Linux-Consulting.com wrote: > > > > think, the branch of mathematics that has the highest ratio of people > > > who think they understand it to people who actually do (witness the > > > success of lotteries). > > ahh ... but the stock market is the worlds largest casino and how many people do you know who make money on stock markets. Now compare that with how many lose money on lotteries. Find out the ratio and ..... > > > Possibly. But not all of them teach probability at university level > > (and did so when they were 21, at the University of Cambridge to boot, > > and continued teaching pure math there at all subjects and all levels > > until the age of twenty-eight - so puhleeeze don't bother!). Apparently teaching probability at University doesn't necessarily mean that you understand it. I cannot comment on your understanding, but if you ask google about the Monty Hall problem and include search terms like "professor" or "maths department" you will find plenty of (reported) cases of University staff not getting it. e.g. http://www25.brinkster.com/ranmath/marlright/montynyt.htm "Our math department had a good, self-righteous laugh at your expense," wrote Mary Jane Still, a professor at Palm Beach Junior College. Robert Sachs, a professor of mathematics at George Mason University in Fairfax, Va., expressed the prevailing view that there was no reason to switch doors. They were both wrong. NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
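The Monty Hall problem Neil mentions can be settled by a simulation rather than by argument. A small Python sketch, not anything from the original thread; the host behaviour assumed is the standard one (he always opens a non-chosen door hiding no car).

import random

random.seed(0)
trials = 100_000
stay_wins = switch_wins = 0
for _ in range(trials):
    car = random.randrange(3)
    pick = random.randrange(3)
    # Host opens a door that is neither the contestant's pick nor the car.
    opened = next(d for d in range(3) if d != pick and d != car)
    switched = next(d for d in range(3) if d != pick and d != opened)
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print(stay_wins / trials, switch_wins / trials)   # ~1/3 vs ~2/3: switching wins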
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 4:29 ` Neil Brown @ 2005-01-04 8:43 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 8:43 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Monday January 3, aoga@ns.Linux-Consulting.com wrote: > > > > > > think, the branch of mathematics that has the highest ratio of people > > > > who think that understand it to people to actually do (witness the > > > > success of lotteries). > > > > ahh ... but the stock market is the worlds largest casino > > and how many people do you know who make money on stock markets. Ooh .. several mathematicians (pay is not very high!). > Now compare that with how many loose money on lotteries. I don't have to - I wouldn't place money in a lottery. The expected gain is negative whatever you do. I stick to investments where I have an expectation of a positive gain with at least some strategy. Mind you, as Conway often said, statistics don't apply to improbable events. So you should bet on anything which is not likely to occur more than once or twice a lifetime (theory - if you win, just don't try it again; if you die first, well, you won't care). Stick to someting more certain, like blackjack, if you want to make $. > Apparently teaching probability at University doesn't necessary mean > that you understand it. Perhaps the problem is at your end? > I cannot comment on your understanding, but > if you ask google about the Monty Hall problem and include search > terms like "professor" or "maths department" you will find plenty of > (reported) cases of University staff not getting it. Who cares? It's easy to concoct problems that a person will get wrong if they answer according to intuition. I can do that trick easily on you! (or anyone). > They were both wrong. What are you trying to "prove"? There is no need to be insulting. Simply pick up the technical conversation! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:28 ` Peter T. Breuer 2005-01-04 1:18 ` Alvin Oga @ 2005-01-04 2:07 ` Neil Brown 2005-01-04 2:16 ` Ewan Grantham 2005-01-04 9:40 ` Peter T. Breuer 1 sibling, 2 replies; 92+ messages in thread From: Neil Brown @ 2005-01-04 2:07 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > Then the probability of an error occurring UNdetected on an n-disk raid > > > array is > > > > > > (n-1)p + np' > > > > > > The probability of an event occurring lies between 0 and 1 inclusive. > > You have given a formula for a probability which could clearly evaluate > > to a number greater than 1. So it must be wrong. > > The hypothesis here is that p is vanishingly small. I.e. this is a Poisson > distribution - the analysis assumes that only one event can occur per > unit time. Take the unit to be one second if you like. Does that make > it true enough for you? Sorry, I didn't see any such hypothesis stated and I don't like to assUme. So what you are really saying is that: for sufficiently small p and p' (i.e. p-squared terms can be ignored) the probability of an error occurring undetected approximates (n-1)p + np' this may be true, but I'm still having trouble understanding what your p and p' really mean. > > You have also been very sloppy in your language, or your definitions. > > What do you mean by a "detectable error occurring"? > > I mean an error occurs that can be detected (by the experiment you run, > which is presumably an fsck, but I don't presume to dictate to you). > The whole point of RAID is that fsck should NEVER see any error caused by drive failure. I think we have a major communication failure here, because I have no idea what sort of failure scenario you are imagining. > > Is it a bit > > getting flipped on the media, or the drive detecting a CRC error > > during read? > > I don't know. It's whatever your test can detect. You can tell me! > > > And what is your scenario for an undetectable error happening? > > Likewise, I don't know. It's whatever error your experiment > (presumably an fsck) will miss. But 'fsck's primary purpose is not to detect errors on the disk. It is to repair a filesystem after an unclean shutdown. It can help out a bit after disk corruption, but usually disk corruption (apart from very minimal problems) causes fsck to fail to do anything useful. > > > My > > understanding of drive technology and CRCs suggests that undetectable > > errors don't happen without some sort of very subtle hardware error, > > They happen all the time - just write a 1 to disk A and a zero to disk > B in the middle of the data in some file, and you will have an > undetectable error (vis a vis your experimental observation, which is > presumably an fsck). But this doesn't happen. You *don't* write 1 to disk A and 0 to disk B. I admit that this can actually happen occasionally (but certainly not "all the time"). But when it does, there will be subsequent writes to both A and B with new, correct, data. During the intervening time that block will not be read from A or B. If there is a system crash before correct, consistent data is written, then on restart, disk B will not be read at all until disk A has been completely copied onto it. So again, I fail to see your failure scenario. > > > or high level software error (i.e. the wrong data was written - and > > that doesn't really count).
> > It counts just fine, since it's what does happen: consider a system > crash that happens AFTER one of a pair of writes to the two disk > components has completed, but BEFORE the second has completed. Then on > reboot your experiment (an fsck) has the task of finding the error > (which exists at least as a discrepancy between the two disks), if it > can, and shouting at you about it. No. RAID will not let you see that discrepancy, and will not let the discrepancy last any longer than it takes to read one drive and write the other. > > All I am saying is that the error is either detectable by your > experiment (the fsck), or not. If it IS detectable, then there > is a 50% chance that it WON'T be detected, even though it COULD be > detected, because the system unfortunately chose to read the wrong > disk at that moment. However, the error is twice as likely as with only > one disk, whatever it is (you can argue about the real multiplier, but > it is about that). > > And if it is not detectable, it's still twice as likely as with one > disk, for the same reason - more real estate for it to happen on. Maybe I'm beginning to understand your failure scenario. It involves different data being written to the drives. Correct? That only happens if: 1/ there is a software error 2/ there is an admin error You seem to be saying that if this happens, then raid is less reliable than non-raid. There may be some truth in this, but it is irrelevant. The likelihood of such a software error or admin error happening on a well-managed machine is substantially less than the likelihood of a drive media error, and raid will protect from drive media errors. So using raid might reduce reliability in a tiny number of cases, but will increase it substantially in a vastly greater number of cases. NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
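The recovery behaviour Neil describes (after an unclean shutdown, the stale component is not read at all until the chosen source has been copied over it) can be pictured with a toy model. This is a conceptual Python sketch of the idea only, with invented names; it is not the md implementation.

class ToyMirror:
    """Toy RAID-1 after a crash: reads never touch the stale copy until resync covers it."""
    def __init__(self, source_blocks):
        self.a = list(source_blocks)        # component chosen as the resync source
        self.b = ["stale"] * len(self.a)    # component being rebuilt
        self.synced = 0                     # blocks 0..synced-1 are identical again

    def resync_step(self):
        if self.synced < len(self.a):
            self.b[self.synced] = self.a[self.synced]
            self.synced += 1

    def read(self, i):
        # Spread reads over both copies only where they are known to agree.
        if i < self.synced and i % 2:
            return self.b[i]
        return self.a[i]                    # not yet resynced: source copy only

m = ToyMirror(["x", "y", "z"])
print(m.read(2))                            # "z" from the source; the stale copy is never consulted
for _ in range(3):
    m.resync_step()
print(m.read(2))                            # still "z", now safely servable from either component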
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:07 ` Neil Brown @ 2005-01-04 2:16 ` Ewan Grantham 2005-01-04 2:22 ` Neil Brown 0 siblings, 1 reply; 92+ messages in thread From: Ewan Grantham @ 2005-01-04 2:16 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid I are confused... Which perhaps should be a lesson to a slightly knowledgeable user not to read a thread like this. But having given myself a headache trying to figure this all out, I guess I'll just go ahead and ask directly. I've set up a RAID-5 array using two internal 250 Gig HDs and two external 250 Gig HDs through a USB-2 interface. Each of the externals is on its own card, and the internals are on separate IDE channels. I "thought" I was doing a good thing by doing all of this and then setting them up using an ext3 filesystem. From the reading on here I'm not clear if I should have specified something besides whatever ext3 does by default when you set it up, and if so if it's something I can still do without having to redo everything. Something I'd rather not do to be honest. Thanks in advance, Ewan --- http://a1.blogspot.com - commentary since 2002 ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:16 ` Ewan Grantham @ 2005-01-04 2:22 ` Neil Brown 2005-01-04 2:41 ` Andy Smith 0 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-04 2:22 UTC (permalink / raw) To: Ewan Grantham; +Cc: linux-raid On Monday January 3, ewan.grantham@gmail.com wrote: > I are confused... > > Which perhaps should be a lesson to a slightly knowlegeable user not > to read a thread like this. > > But having given myself a headache trying to figure this all out, I > guess I'll just go ahead and ask directly. > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > external 250 Gig HDs through a USB-2 interface. Each of the externals > is on it's own card, and the internals are on seperate IDE channels. > > I "thought" I was doing a good thing by doing all of this and then > setting them up using an ext3 filesystem. Sounds like a perfectly fine setup (providing always that external cables are safe from stray feet etc). No need to change anything. NeilBrown > > >From the reading on here I'm not clear if I should have specified > something besides whatever ext3 does by default when you set it up, > and if so if it's something I can still do without having to redo > everything. Something I'd rather not do to be honest. > > Thanks in advance, > Ewan > --- > http://a1.blogspot.com - commentary since 2002 ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:22 ` Neil Brown @ 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Andy Smith @ 2005-01-04 2:41 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 891 bytes --] On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > On Monday January 3, ewan.grantham@gmail.com wrote: > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > is on it's own card, and the internals are on seperate IDE channels. > > > > I "thought" I was doing a good thing by doing all of this and then > > setting them up using an ext3 filesystem. > > Sounds like a perfectly fine setup (providing always that external > cables are safe from stray feet etc). > > No need to change anything. Except that Peter says that the ext3 journals should be on separate non-mirrored devices and the reason this is not mentioned in any documentation (md / ext3) is that everyone sees it as obvious. Whether it is true or not it's clear to me that it's not obvious to everyone. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith @ 2005-01-04 3:42 ` Neil Brown 2005-01-04 9:50 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-04 3:42 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Tuesday January 4, andy@strugglers.net wrote: > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > I've set up a RAID-5 array using two internal 250 Gig HDs and two > > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > > is on its own card, and the internals are on separate IDE channels. > > > > > > I "thought" I was doing a good thing by doing all of this and then > > > setting them up using an ext3 filesystem. > > > > Sounds like a perfectly fine setup (providing always that external > > cables are safe from stray feet etc). > > > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. > Whether it is true or not it's clear to me that it's not obvious to > everyone. If Peter says that, then Peter is WRONG. ext3 journals are much safer on mirrored devices than on non-mirrored devices just the same as any other data is safer on mirrored than on non-mirrored. In the case in question, it is raid5, not mirrored, but still raid5 is safer than raid0 or single devices (possibly not quite as safe as raid1). NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 3:42 ` Neil Brown @ 2005-01-04 9:50 ` Peter T. Breuer 2005-01-04 14:15 ` David Greaves 2005-01-04 16:42 ` Guy 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:50 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, andy@strugglers.net wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > If Peter says that, then Peter is WRONG. But Peter does NOT say that. > ext3 journals are much safer on mirrored devices than on non-mirrored That's irrelevant - you don't care what's in the journal, because if your system crashes before committal you WANT the data in the journal to be lost, rolled back, whatever, and you don't want your machine to have acked the write until it actually has gone to disk. Or at least that's what *I* want. But then everyone has different wants and needs. What is obvious, however, are the issues involved. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:50 ` Peter T. Breuer @ 2005-01-04 14:15 ` David Greaves 2005-01-04 15:20 ` Peter T. Breuer 2005-01-04 16:42 ` Guy 1 sibling, 1 reply; 92+ messages in thread From: David Greaves @ 2005-01-04 14:15 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >>ext3 journals are much safer on mirrored devices than on non-mirrored >> >> >That's irrelevant - you don't care what's in the journal, because if >your system crashes before committal you WANT the data in the journal >to be lost, rolled back, whatever, and you don't want your machine to >have acked the write until it actually has gone to disk. > >Or at least that's what *I* want. But then everyone has different >wants and needs. What is obvious, however, are the issues involved. > > err, no. If the journal is safely written to the journal device and the machine crashes whilst updating the main filesystem you want the journal to be replayed, not erased. The journal entries are designed to be replayable to a partially updated filesystem. That's the whole point of journalling filesystems, write the deltas to the journal, make the changes to the fs, delete the deltas from the journal. If the machine crashes whilst the deltas are being written then you won't play them back - but your fs will be consistent. Journaled filesystems simply ensure the integrity of the fs metadata - they don't protect against random acts of application/user level vandalism (ie power failure). David ^ permalink raw reply [flat|nested] 92+ messages in thread
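The write-ahead sequence David describes (write the deltas to the journal, apply them to the filesystem, retire them from the journal, and on recovery replay only complete transactions) can be made concrete with a toy model. This is a conceptual Python sketch with invented names, not how ext3/jbd actually lays out its journal.

class ToyJournalFS:
    def __init__(self):
        self.fs = {}          # the "main filesystem"
        self.journal = []     # list of [deltas, committed_flag]

    def write_txn(self, deltas, crash_point=None):
        self.journal.append([dict(deltas), False])
        if crash_point == "before_commit":
            return                          # crash: journal entry is incomplete
        self.journal[-1][1] = True          # commit record reaches the journal
        if crash_point == "before_checkpoint":
            return                          # crash: main fs not yet updated
        self.fs.update(deltas)              # checkpoint the deltas into the main fs
        self.journal.pop()                  # retire the journal entry

    def recover(self):
        for deltas, committed in self.journal:
            if committed:                   # replay only fully committed transactions
                self.fs.update(deltas)
        self.journal.clear()

fs = ToyJournalFS()
fs.write_txn({"inode7": "new"}, crash_point="before_checkpoint")
fs.recover()
print(fs.fs)    # {'inode7': 'new'}: the committed delta is replayed after the crash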
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:15 ` David Greaves @ 2005-01-04 15:20 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 15:20 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Peter T. Breuer wrote: > > >>ext3 journals are much safer on mirrored devices than on non-mirrored > >That's irrelevant - you don't care what's in the journal, because if > >your system crashes before committal you WANT the data in the journal > >to be lost, rolled back, whatever, and you don't want your machine to > >have acked the write until it actually has gone to disk. > > > >Or at least that's what *I* want. But then everyone has different > >wants and needs. What is obvious, however, are the issues involved. > > If the journal is safely written to the journal device and the machine You don't know it has been. Raid can't tell. > crashes whilst updating the main filesystem you want the journal to be > replayed, not erased. The journal entries are designed to be replayable > to a partially updated filesystem. It doesn't work. You can easily get a block written to the journal on disk A, but not on disk B (supposing raid 1 with disks A and B). According to you "this" should be replayed. Well, which result do you want? Raid has no way of telling. Suppose that A contains the last block to be written to a file, and does not. Yet B is chosen by raid as the "reliable" source. Then what happens? Is the transaction declared "completed" with incomplete data? With incorrect data? Myself I'd hope it were rolled back, whichever of A or B were chosen, because some final annotation was missing from the journal, saying "finished and ready to send" (alternating bit protocol :-). But you can't win ... what if the "final" annotation were written to journal on A but not on B. Then what would happen? Well, then whichever of A or B the raid chose, you'd either get the data rolled forward or backward. Which would you prefer? I'd just prefer that it was all rolled back. > That's the whole point of journalling filesystems, write the deltas to > the journal, make the changes to the fs, delete the deltas from the journal. Consider the above. There is no magic. > If the machine crashes whilst the deltas are being written then you > won't play them back - but your fs will be consistent. What if the delta is written to one journal, but not to the other, when the machine crashes? I outlined the problem above. You can't win this game. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:50 ` Peter T. Breuer 2005-01-04 14:15 ` David Greaves @ 2005-01-04 16:42 ` Guy 2005-01-04 17:46 ` Peter T. Breuer 1 sibling, 1 reply; 92+ messages in thread From: Guy @ 2005-01-04 16:42 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid This may be a stupid question... But it seems obvious to me! If you don't want your journal after a crash, why have a journal? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Tuesday, January 04, 2005 4:51 AM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, andy@strugglers.net wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > If Peter says that, then Peter is WRONG. But Peter does NOT say that. > ext3 journals are much safer on mirrored devices than on non-mirrored That's irrelevant - you don't care what's in the journal, because if your system crashes before committal you WANT the data in the journal to be lost, rolled back, whatever, and you don't want your machine to have acked the write until it actually has gone to disk. Or at least that's what *I* want. But then everyone has different wants and needs. What is obvious, however, are the issues involved. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:42 ` Guy @ 2005-01-04 17:46 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 17:46 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > This may be a stupid question... But it seems obvious to me! > If you don't want your journal after a crash, why have a journal? Journalled fs's have the property that their file systems are always coherent (provided other corruption has not occurred). This is often advantageous in terms of providing you with the ability to at least boot. The fs code is organised so that everything is set up for a metadata change, and then a single "final" atomic operation occurs that finalizes the change. It is THAT property that is desirable. It is not intrinsic to journalled file systems, but in practice only journalled file systems have implemented it. In other words, what I'd like here is a journalled file system with a zero size journal. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
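One familiar userspace analogue of the "stage everything, then one final atomic operation" property described above is the write-to-temporary-file-then-rename idiom. A sketch of the general idea only, assuming a POSIX filesystem where rename() within a directory is atomic; it says nothing about how any journalling filesystem implements its metadata updates.

import os
import tempfile

def atomic_replace(path, data: bytes):
    # Stage the new contents in a temporary file in the same directory...
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # make the staged data durable first
    os.rename(tmp, path)        # ...then one atomic step makes it the live file

atomic_replace("/tmp/example.conf", b"key = value\n")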
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown @ 2005-01-04 9:30 ` Maarten 2005-01-04 10:18 ` Peter T. Breuer 2005-01-04 9:46 ` Peter T. Breuer 2 siblings, 1 reply; 92+ messages in thread From: Maarten @ 2005-01-04 9:30 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 03:41, Andy Smith wrote: > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. > Whether it is true or not it's clear to me that it's not obvious to > everyone. Be that as it may, with all that Peter wrote in the last 24 hours I tend to weigh his expertise a bit less than I did before. YMMV, but his descriptions of his data center do not instill a very high confidence, do they ? While it may be true that genius math people may make lousy server admins (and vice versa), when I read someone claiming there are random undetected errors propagating through raid, yet this person cannot even regulate his own "random, undetected" power supply problems, then I start to wonder. Would you believe that at one point, for a minute I wondered whether Peter was actually a troll ? (yeah, sorry for that, but it happened...) So no, he apparently is employed at a Spanish university, and he even has a Freshmeat project entry, something to do with raid... So I'm left with a blank stare, trying to figure out what to make of it. Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:30 ` Maarten @ 2005-01-04 10:18 ` Peter T. Breuer 2005-01-04 13:36 ` Maarten 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 10:18 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > No need to change anything. > > > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > Be that as it may, with all that Peter wrote in the last 24 hours I tend to > weigh his expertise a bit less than I did before. YMMV, but his descriptions > of his data center do not instill a very high confidence, do they ? It's not "my" data center. It is what it is. I can only control certain things in it, such as the software on the machines, and which machines are bought. Nor is it a "data center", but a working environment for about 200 scientists and engineers, plus thousands of incompetent monkeys. I.e., a university department. It would be good of you to refrain from justifications based on denigration. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 10:18 ` Peter T. Breuer @ 2005-01-04 13:36 ` Maarten 2005-01-04 14:13 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Maarten @ 2005-01-04 13:36 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > It's not "my" data center. It is what it is. I can only control > certain things in it, such as the software on the machines, and which > machines are bought. Nor is it a "data center", but a working > environment for about 200 scientists and engineers, plus thousands of > incompetent monkeys. I.e., a university department. > > It would be good of you to refrain from justifications based on > denigration. I seem to recall you starting off boasting about the systems you had in place, with the rsync mirroring and all-servers-bought-in-duplicate. If then later on your whole secure data center turns out to be a school department, undoubtedly with viruses rampant, students hacking at the schools' systems, peer to peer networks installed on the big fileservers unbeknownst to the admins, and only mains power when you're lucky, yes, then I get a completely other picture than you drew at first. You can't blame me for that. This does not mean you're incompetent, it just means you called a univ IT dept something that it is not, and never will be: secure, stable and organized. In other words, if you dislike being put down, you best not boast so much. Now you'll have to excuse me, I have things to get done today. Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 13:36 ` Maarten @ 2005-01-04 14:13 ` Peter T. Breuer 2005-01-04 19:22 ` maarten 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:13 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > Maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > It's not "my" data center. It is what it is. I can only control > > certain things in it, such as the software on the machines, and which > > machines are bought. Nor is it a "data center", but a working > > environment for about 200 scientists and engineers, plus thousands of > > incompetent monkeys. I.e., a university department. > > > > It would be good of you to refrain from justifications based on > > denigration. > > I seem to recall you starting off boasting about the systems you had in place, I'm not "boasting" about them. They simply ARE. > with the rsync mirroring and all-servers-bought-in-duplicate. If then later That's what there is. Is that supposed to be boasting? The servers are always bought in pairs. They always failover to each other. They contain each others mirrors. Etc. > on your whole secure data center turns out to be a school department, Eh? > undoubtedly with viruses rampant, students hacking at the schools' systems, Sure - that's precisely what there is. > peer to peer networks installed on the big fileservers unbeknownst to the Uh, no. We don't run windos. Well, it is on the clients, but I simply sabaotage them whenever I can :). That saves time. Then they can boot into the right o/s. > admins, and only mains power when you're lucky, yes, then I get a completely > other picture than you drew at first. You can't blame me for that. I don't "draw any picture". I am simply telling you it as it is. > This does not mean you're incompetent, it just means you called a univ IT dept > something that it is not, and never will be: secure, stable and organized. Eh? It's as secure stable and organised as it can be, given that nobody is in charge of anything. > In other words, if you dislike being put down, you best not boast so much. About what! > Now you'll have to excuse me, I have things to get done today. I don't. I just have to go generate some viruses and introduce chaos into some otherwise perfectly stable systems. Ho hum. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:13 ` Peter T. Breuer @ 2005-01-04 19:22 ` maarten 2005-01-04 20:05 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: maarten @ 2005-01-04 19:22 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > > Maarten <maarten@ultratux.net> wrote: > > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > > > It's not "my" data center. It is what it is. I can only control > > > certain things in it, such as the software on the machines, and which > > > machines are bought. Nor is it a "data center", but a working > > > environment for about 200 scientists and engineers, plus thousands of > > > incompetent monkeys. I.e., a university department. > I'm not "boasting" about them. They simply ARE. Are you not boasting about it, simply by providing all the little details no one cares about, except that it makes your story more believable ? If I state my IQ was tested as above 140, am I then boasting, or simply stating a fact ? Stating a fact and boasting are not mutually exclusive. > > on your whole secure data center turns out to be a school department, > > Eh? What, "Eh?" ? Are you taking offense to me calling a "university department" a school ? Is it not what you are, you are an educational institution, ie. a school. > > undoubtedly with viruses rampant, students hacking at the schools' > > systems, > > Sure - that's precisely what there is. Hah. Show me one school where there isn't. > > peer to peer networks installed on the big fileservers unbeknownst to the > > Uh, no. We don't run windos. Well, it is on the clients, but I simply > sabaotage them whenever I can :). That saves time. Then they can boot > into the right o/s. Ehm. p2p exists for linux too. Look into it. Are you so dead certain no student of yours ever found a local root hole ? Then you have more balls than you can carry. > Eh? It's as secure stable and organised as it can be, given that nobody > is in charge of anything. Normal people usually refer to such a state as "an anarchy". Not a perfect example of stability, security or organization by any stretch of the imagination... Maarten -- ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:22 ` maarten @ 2005-01-04 20:05 ` Peter T. Breuer 2005-01-04 21:38 ` Guy 2005-01-04 21:48 ` maarten 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 20:05 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > Maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > > > Maarten <maarten@ultratux.net> wrote: > > > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > > > > > It's not "my" data center. It is what it is. I can only control > > > > certain things in it, such as the software on the machines, and which > > > > machines are bought. Nor is it a "data center", but a working > > > > environment for about 200 scientists and engineers, plus thousands of > > > > incompetent monkeys. I.e., a university department. > > > I'm not "boasting" about them. They simply ARE. > > Are you not boasting about it, simply by providing all the little details no > one cares about, except that it makes your story more believable ? What "little details"? Really, this is most aggravating! > If I state my IQ was tested as above 140, am I then boasting, or simply > stating a fact ? You're being an improbability. > Stating a fact and boasting are not mutually exclusive. But about WHAT? I have no idea what you may consider boasting! > > > on your whole secure data center turns out to be a school department, > > > > Eh? > > What, "Eh?" ? What "secure data center"? I have never mentioned such a thing! We have a big floor of a big building, plus a few labs in the basement, and a few labs in another couple of buildings. We used to locate the servers in secure places in the various buildings, but nowadays we tend to dump most of them in a single highly a/c'ed rack room up here. Mind you, I still have six in my office, and I don't use a/c. I guess others are scattered around the place too. However, all the servers are paired, and fail over to each other, and do each other's mirroring. If I recall correctly, even the pairs fail over to backup pairs. Last xmas I distinctly remember holding up the department on a single surviving server because a faulty cable had intermittently taken out one pair, and a faulty router had taken out another. I forget what had happened to the remaining server. Probably the cleaners switched it off! Anyway, one survived and everything failed over to it, in a planned degradation. It would have been amusing, if I hadn't had to deal with a horrible mail loop caused by mail being bounced by the server with intermittent contact through the faulty cable. There was no way of stopping it, since I couldn't open the building till Jan 6! > Are you taking offense to me calling a "university department" a school ? No - it is a school. "La escuela superior de ...". What the French call an "ecole superior". > Is > it not what you are, you are an educational institution, ie. a school. Schools are not generally universities, except perhaps in the United States! Elsewhere one goes to learn, not to be taught. > > > undoubtedly with viruses rampant, students hacking at the schools' > > > systems, > > > > Sure - that's precisely what there is. > > Hah. Show me one school where there isn't.
It doesn't matter. There is nothing they can do (provided that is, the comp department manages to learn how to configure ldap so that people don't send their passwords in the clear to their server for confirmation ... however, only you and I know that, eh?). > > Uh, no. We don't run windos. Well, it is on the clients, but I simply > > sabaotage them whenever I can :). That saves time. Then they can boot > > into the right o/s. > > Ehm. p2p exists for linux too. Look into it. Are you so dead certain no Do you mean edonkey and emule by that? "p2p" signifies nothing to me except "peer to peer", which is pretty well everything. For example, samba. There's nothing wrong with using such protocols. If you mean using it to download fillums, that's a personal question - we don't check data contents, and indeed it's not clear that we legally could, since the digital information acts here recognise digital "property rights" and "rights to privacy" that we cannot intrude into. Legally, of course. > student of yours ever found a local root hole ? Absolutely. Besides - it would be trivial to do. I do it all the time. That's really not the point - we would see it at once if they decided to do anything with root - all the alarm systems would trigger if _anyone_ does anything with root. All the machines are alarmed like mines, checked daily, byte by byte, and rootkits are easy to see, whenever they turn up. I have a nice collection. Really, I am surprised at you! Any experienced sysadmin would know that such things are trivialities to spot and remove. It is merely an intelligence test, and the attacker does not have more intelligence or experience than the defenders! Quite the opposite. > Then you have more balls than you can carry. ??? > > Eh? It's as secure stable and organised as it can be, given that nobody > > is in charge of anything. > > Normal people usually refer to such a state as "an anarchy". Good - that's the way I like it. Coping with and managing chaos amounts to giving the maximum freedom to all, and preserving their freedoms. Including the freedom to mess up. To help them, I maintain copies of their work for them, and guard them against each other and outside threats. > Not a perfect example of stability, security or organization by any stretch of > the imagination... Sounds great to me! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:05 ` Peter T. Breuer @ 2005-01-04 21:38 ` Guy 2005-01-04 23:53 ` Peter T. Breuer 2005-01-05 0:58 ` Mikael Abrahamsson 2005-01-04 21:48 ` maarten 1 sibling, 2 replies; 92+ messages in thread From: Guy @ 2005-01-04 21:38 UTC (permalink / raw) To: linux-raid Back to MTBF please..... I agree that 1M hours MTBF is very bogus. I don't really know how they compute MTBF. But I would like to see them compute the MTBF of a birthday candle. A birthday candle lasts about 2 minutes (as a guess). I think they would light 1000 candles at the same time. Then monitor them until the first one fails, say at 2 minutes. I think the MTBF would then be computed as 2000 minutes MTBF! But we can be sure that by 2.5 minutes, at least 90% of them would have failed. Guy ^ permalink raw reply [flat|nested] 92+ messages in thread
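Guy's candle thought experiment can be simulated in a few lines of Python. The numbers (a 2-minute mean life with a small spread) are assumptions of mine, chosen only to illustrate how an accumulated "device-minutes per failure" figure can dwarf the typical lifetime when lifetimes are tightly clustered rather than exponentially distributed.

import random

random.seed(0)
candles = 1000
lifetimes = [random.gauss(2.0, 0.1) for _ in range(candles)]   # minutes

t_first = min(lifetimes)                 # stop the test at the first failure
device_minutes = candles * t_first       # accumulated burn time at that point
mtbf_estimate = device_minutes / 1       # one observed failure so far

print(round(t_first, 2), round(mtbf_estimate, 1))
# The first failure lands somewhat under 2 minutes, so the "MTBF" comes out
# in the high hundreds to ~2000 minutes, even though essentially every
# candle is dead by minute 2.5.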
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Guy @ 2005-01-04 23:53 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:53 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > A birthday candle lasts about 2 minutes (as a guess). I think they would > light 1000 candles at the same time. Then monitor them until the first one > fails, say at 2 minutes. I think the MTBF would then be computed as 2000 > minutes MTBF! If the distribution is Poisson (i.e. the probability of dying per unit time is constant over time) then that is correct. I don't know offhand if that is an unbiased estimator. I would imagine not. It would be biased to the short side. > But we can be sure that by 2.5 minutes, at least 90% of them > would have failed. Then you would be sure that the distribution was not Poisson. What is the problem here, exactly? Many different distributions can have the same mean. For example, this one:

deaths per unit time
|
|   /\
|  /  \
| /    \
|/      \
---------->t

and this one

deaths per unit time
|
|\      /
| \    /
|  \  /
|   \/
---------->t

have the same mean. The same mtbf. Is this a surprise? The mean on its own is only one parameter of a distribution - for a Poisson distribution, it is the only parameter, but that is a particular case. For the normal distribution you require both the mean and the standard deviation in order to specify the distribution. You can get very different normal distributions with the same mean! I can't draw a Poisson distribution in ascii, but it has a short sharp rise to the peak, then a long slow decline to infinity. If you were to imagine that half the machines had died by the time the mtbf were reached, you would be very wrong! Many more have died than half. But that long tail of those very few machines that live a LOT longer than the mtbf balances it out. I already did this once for you, but I'll do it again: if the mtbf is ten years, then 10% die every year. Or 90% survive every year. This means that by the time 10 years have passed only 35% have survived (90%^10). So 2/3 of the machines have died by the time the mtbf is reached! If you want to know where the peak of the death rate occurs, well, it looks to me as though it is at the mtbf (but I am calculating mentally, not on paper, so do your own checks). After that deaths become less frequent in the population as a whole. To estimate the mtbf, I would imagine that one averages the proportion of the population that die per month, for several months. But I guess serious applied statisticians have evolved far more sophisticated and more efficient estimators. And then there is the problem that the distribution is bimodal, not pure Poisson. There will be a subpopulation of faulty disks that die off earlier. So they need to discount early measurements in favour of the later ones (bad luck if you get one of the subpopulation of defectives :) - but that's what their return policy is for). Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
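A quick check of the survival arithmetic above for a constant-hazard (exponential) lifetime model. The "10% a year" step is the discrete approximation; the continuous answer is exp(-1). Either way, roughly two thirds of the population is dead by the MTBF, which is the point being made above.

from math import exp

mtbf_years = 10.0
survive_discrete = 0.9 ** 10                     # "90% survive each year", compounded
survive_exact = exp(-mtbf_years / mtbf_years)    # exponential survival at t = MTBF

print(round(survive_discrete, 3), round(survive_exact, 3))   # 0.349 vs 0.368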
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Guy 2005-01-04 23:53 ` Peter T. Breuer @ 2005-01-05 0:58 ` Mikael Abrahamsson 1 sibling, 0 replies; 92+ messages in thread From: Mikael Abrahamsson @ 2005-01-05 0:58 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Tue, 4 Jan 2005, Guy wrote: > light 1000 candles at the same time. Then monitor them until the first one > fails, say at 2 minutes. I think the MTBF would then be computed as 2000 > minutes MTBF! But we can be sure that by 2.5 minutes, at least 90% of them > would have failed. Which is why you, when you purchase a lot of stuff, should ask for an annual return rate value, which probably makes more sense than MTBF, even though these values are related. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:05 ` Peter T. Breuer 2005-01-04 21:38 ` Guy @ 2005-01-04 21:48 ` maarten 2005-01-04 23:14 ` Peter T. Breuer 1 sibling, 1 reply; 92+ messages in thread From: maarten @ 2005-01-04 21:48 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > > Maarten <maarten@ultratux.net> wrote: > > Are you not boasting about it, simply by providing all the little details > > no one cares about, except that it makes your story more believable ? > > What "little details"? Really, this is most aggravating! These little details, as you scribbled, very helpfully I might add, below. ;) | | V > over to backup pairs. Last xmas I distinctly remember holding up the > department on a single surviving server because a faulty cable had > intermittently taken out one pair, and a faulty router had taken out > another. I forget what had happened to the remaining server. Probably > the cleaners switched it off! Anyway, one survived and everything > failed over to it, in a planned degradation. > > It would have been amusing, if I hadn't had to deal with a horrible > mail loop caused by mail being bounced by he server with intermittent > contact through the faulty cable. There was no way of stopping it, > since I couldn't open the building till Jan 6! And another fine example of the various hurdles you encounter ;-) Couldn't you just get the key from someone ? If not, what if you saw something far worse happening, like all servers in one room dying shortly after another, or a full encompassing system compromise going on ?? > > Hah. Show me one school where there isn't. > > It doesn't matter. There is nothing they can do (provided that is, the > comp department manages to learn how to configure ldap so that people > don't send their passwords in the clear to their server for > confirmation ... however, only you and I know that, eh?). Yes. This is not a public mailing list. Ceci n'est pas une pipe. There is nothing they can do... except of course, running p2p nets, spreading viruses, changing their grades, finding out other students' personal info and trying out new ways to collect credit card numbers. Is that what you meant ? > Do you mean edonkey and emule by that? "p2p" signifies nothing to me > except "peer to peer", which is pretty well everything. For example, > samba. There's nothing wrong with using such protocols. If you mean > using it to download fillums, that's a personal question - we don't > check data contents, and indeed it's not clear that we legally could, > since the digital information acts here recognise digital "property > rights" and "rights to privacy" that we cannot intrude into. Legally, > of course. P2p might encompass samba in theory, but the term as used by everybody specifically targets more or less rogue networks that share movies et al. I know of the legal uncertainties associated with it (I'm in the EU too) and I do not condemn the use of them even. It's just that this type of activity can wreak havoc on a network, just from a purely technical standpoint alone. > Absolutely. Besides - it would be trivial to do. I do it all the time. > > That's really not the point - we would see it at once if they decided to > do anything with root - all the alarm systems would trigger if _anyone_ > does anything with root. 
All the machines are alarmed like mines, > checked daily, byte by byte, and rootkits are easy to see, whenever they > turn up. I have a nice collection. Yes, well, someday someone may come up with a way to defeat your alarms and tripwire / AIDE or whatever you have in place... For instance, how do you check for a rogue LKM ? If coded correctly, there is little you can do to find out it is loaded (all the while feeding you the md5 checksums you expect to find, without any of you being the wiser) apart from booting off a set of known good read-only media... AFAIK. > Really, I am surprised at you! Any experienced sysadmin would know that > such things are trivialities to spot and remove. It is merely an > intelligence test, and the attacker does not have more intelligence > or experience than the defenders! Quite the opposite. Uh-huh. Defeating a random worm, yes. Finding a rogue 4777 /tmp/.../bash shell or an extra "..... root /bin/sh" line in inetd.conf is, too. Those things are scriptkiddies at work. But from math students I expect much more, and so should you, I think. You are dealing with highly intelligent people, some of whom already know more about computers than you'll ever know. (the same holds true for me though, as I'm no young student anymore either...) Maarten -- When I answered where I wanted to go today, they just hung up -- Unknown ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:48 ` maarten @ 2005-01-04 23:14 ` Peter T. Breuer 2005-01-05 1:53 ` maarten 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:14 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > > > Maarten <maarten@ultratux.net> wrote: > > > > > Are you not boasting about it, simply by providing all the little details > > > no one cares about, except that it makes your story more believable ? > > > > What "little details"? Really, this is most aggravating! > These little details, as you scribbled, very helpfully I might add, below. ;) > | > | > V > > > over to backup pairs. Last xmas I distinctly remember holding up the > > department on a single surviving server because a faulty cable had > > intermittently taken out one pair, and a faulty router had taken out > > another. I forget what had happened to the remaining server. Probably > > the cleaners switched it off! Anyway, one survived and everything > > failed over to it, in a planned degradation. This is in response to your strange statement that I had a "data center". I hope it gives you a better idea. > > It would have been amusing, if I hadn't had to deal with a horrible > > mail loop caused by mail being bounced by he server with intermittent > > contact through the faulty cable. There was no way of stopping it, > > since I couldn't open the building till Jan 6! > > And another fine example of the various hurdles you encounter ;-) > Couldn't you just get the key from someone ? Xmas holidays are total here, from 24 december to 6 jan. There is nobody in the building. Maybe a security guard doing a round, but they certainly do not have authority to let anyone in. Just the opposite! > If not, what if you saw > something far worse happening, like all servers in one room dying shortly > after another, or a full encompassing system compromise going on ?? Nothing - I could not get in. > > It doesn't matter. There is nothing they can do (provided that is, the > > comp department manages to learn how to configure ldap so that people > > don't send their passwords in the clear to their server for > > confirmation ... however, only you and I know that, eh?). > > Yes. This is not a public mailing list. Ceci n'est pas une pipe. Indeed. Let's keep it to ourselves, then. > There is nothing they can do... except of course, running p2p nets, spreading > viruses, changing their grades, finding out other students' personal info and > trying out new ways to collect credit card numbers. Is that what you meant ? No - they can't do any of those things. P2p nets are not illegal, and we would see the traffic if there were any. They cannot "change their grades" because they do not have access to them - nobody does. They are sent to goodness knows where in a govt bulding somewhere via ssl (an improvement from the times when we had to fill in a computer card marked in ink, for goodness sake, but I haven't done the sending in myself lately, so I don't know the details - I give the list to the secretary rather than suffer). As to reading MY disk, anyone can do that. I don't have secrets, be it marks on anything else. Indeed, my disk will nfs mount on the student machines if they so much as cd to my home directory (but don't tell them that!). 
Of course they'd then have to figure out how to become root in order to change uid so they could read my data, and they can't do that - all the alarms in the building would go off! su isn't even executable, let alone suid, and root login is disabled so many places I forget (heh, .profile in /root says something awful to you, and then exits), and then there are the reapers, the monitors, oh, everything, waiting for just such an opportunity to ring the alarm bells. As to holes in other protocols, I can't even remember a daemon that runs as root nowadays without looking! What? And so what? If they got a root shell, everything would start howling. And then if they got a root shell and did something, all the alarms would go off again as the checks swung in on the hour. Why would they risk it? Na .. we only get break-in attempts from script-kiddies outside, not inside. As to credit card numbers - nobody has one. Students don't earn enough to get credit cards. Heck, even the profs don't! As to personal info, they can see whatever anyone can see. There is no special "personal info" anywhere in particular. If somebody wants to keep their password in a file labelled "my passwords" in a world readable directory of their own creation, called "secret", that is their own lookout. If somebody else steals their "digital identity" (if only they knew what it was) they can change their password - heck they have enough trouble remembering the one they have! I'm not paranoid - this is an ordinary place. All I do is provide copies of their accounts, and failover services for them to use, or to hold up other services. And they have plenty of illegal things to do on their own without involving me. > P2p might encompass samba in theory, but the term as used by everybody > specifically targets more or less rogue networks that share movies et al. Not by me - you must be in a particular clique. This is a networking department! It would be strange if anyone were NOT running a peer to peer system! > do not condemn the use of them even. It's just that this type of activity can > wreak havoc on a network, just from a purely technical standpoint alone. Why should it wreak havoc? We have no problem with bandwidth. We have far more problems when the routing classes deliberately change the network topologies, or some practical implements RIP and one student gets it wrong! There is a time of year when the network bounces like a yo yo because the students are implementing proxy arp and getting it completely wrong! > > That's really not the point - we would see it at once if they decided to > > do anything with root - all the alarm systems would trigger if _anyone_ > > does anything with root. All the machines are alarmed like mines, > > checked daily, byte by byte, and rootkits are easy to see, whenever they > > turn up. I have a nice collection. > > Yes, well, someday someone may come up with a way to defeat your alarms and > tripwire / AIDE or whatever you have in place... For instance, how do you No they won't. And if they do, so what? They will fall over the next one along the line! There is no way they can do it. I couldn't do it if I were trying to avoid me seeing - I'm too experienced as a defender. I can edit a running kernel to reset syscalls that have been altered by adore, and see them. I regularly have I-get-root duels, and I have no problem with getting and keeping root, while letting someone else also stay root. I can get out of a chroot jail, since I have root. I run uml honeypots.
> check for a rogue LKM ? Easily. Not worth mentioning. Apart from the classic error that hidden processes have a different count in /proc than via other syscalls, one can see other giveaways like directories with the wrong entry count, and one can see from the outside open ports that are not visibly occupied by anything on the inside. > If coded correctly, there is little you can do to > find out it is loaded (all the while feeding you the md5 checksums you expect They can't predict what attack I can use against them to see it! And there is no defense against an unknown attack. > to find, They don't know what I expect to find, and they would have to keep the original data around, something which would show up in the free space count. And anyway I don't have to see the md5sums to know when a computer is acting strangely - its entire signature would have changed in terms of reactions to stimuli, the apparent load versus the actual, and so on. You are not being very imaginative! > without any of you being the wiser) apart from booting off a set of > known good read-only media... AFAIK. I don't have to - they don't even know what IS the boot media. > > Really, I am surprised at you! Any experienced sysadmin would know that > > such things are trivialities to spot and remove. It is merely an > > intelligence test, and the attacker does not have more intelligence > > or experience than the defenders! Quite the opposite. > > Uh-huh. Defeating a random worm, yes. Finding a rogue 4777 /tmp/.../bash > shell or an extra "..... root /bin/sh" line in inetd.conf is, too. Those > things are scriptkiddies at work. Sure - and that's all they are. > But from math students I expect much more, I don't, but then neither are these math students - they're telecommunications engineers. > and so should you, I think. You are dealing with highly intelligent people, No I'm not! They are mostly idiots with computers. Most of them couldn't tell you how to copy a file from one place to another (I've seen it), or think of that concept to replace "move to where the file is". If any one of them were to develop to the point of being good enough to even know how to run a script, I would be pleased. I'd even be pleased if the concept of "change your desktop environment to suit yourself" entered their head, along with "indent your code", "keep comments to less than 80 characters", and so on. If someone were to actually be capable of writing something that looked capable, I would be pleased. I've only seen decent code from overseas students - logical concepts don't seem to penetrate the environment here. The first year of the technical school (as distinct from the "superior" school) is spent trying to bring some small percentage of the technical students up to the concept of loops in code - which they mostly cannot grasp. (the technical school has three-year courses, the superior school has six-year courses, though it is not unusual to take eight or nine years, or more). And if they were to be good enough to get root even for a moment, I would be pleased. But of course they aren't - they have enough problems passing the exams and finding somebody else to copy practicals off (which they can do simply by paying). > some of whom already know more about computers than you'll ever know. > (the same holds true for me though, as I'm no young student anymore either...) If anyone were good enough to notice, I would notice. And what would make me notice would not be good. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 23:14 ` Peter T. Breuer @ 2005-01-05 1:53 ` maarten 0 siblings, 0 replies; 92+ messages in thread From: maarten @ 2005-01-05 1:53 UTC (permalink / raw) To: linux-raid On Wednesday 05 January 2005 00:14, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > > > maarten <maarten@ultratux.net> wrote: > > If not, what if you saw > > something far worse happening, like all servers in one room dying shortly > > after another, or a full encompassing system compromise going on ?? > > Nothing - I could not get in. Now that is a sensible solution ! The fans in the server died off, you have 30 minutes before everything overheats and subsequently incinerates the whole building, and you have no way to prevent that. Great ! Well played. > No - they can't do any of those things. P2p nets are not illegal, and > we would see the traffic if there were any. They cannot "change their > grades" because they do not have access to them - nobody does. They are > sent to goodness knows where in a govt bulding somewhere via ssl (an > improvement from the times when we had to fill in a computer card marked > in ink, for goodness sake, but I haven't done the sending in myself > lately, so I don't know the details - I give the list to the secretary > rather than suffer). As to reading MY disk, anyone can do that. I > don't have secrets, be it marks on anything else. Indeed, my disk will > nfs mount on the student machines if they so much as cd to my home > directory (but don't tell them that!). Of course they'd then have to > figure out how to become root in order to change uid so they could read > my data, and they can't do that - all the alarms in the building would > go off! su isn't even executable, let alone suid, and root login is > disabled so many places I forget (heh, .profile in /root ays something > awful to you, and then exits), and then there are the reapers, the > monitors, oh, everything, waiting for just such an opportunity to ring > the alarm bells. As to holes in other protocols, I can't even remenber > a daemon that runs as root nowadays without looking! What? And so > what? If they got a root shell, everything would start howling. And > then if they got a root shell and did something, all the alrms would go > off again as the checks swung in on the hour. Why would they risk it? > Na .. we only get breakin attempts from script-kiddies outside, not > inside. Uh-oh. Where to start. Shall I start by saying that when you exploit a local root hole you _are_ root and there is no need for any su ? Or shall I start by saying that if they can get access to their tests well in advance they need not access to their grades ? Or perhaps... That your alarm bells probably are just as predictable and reliable as your UPS systems ? Let's leave it at that shall we. > > P2p might encompass samba in theory, but the term as used by everybody > > specifically targets more or less rogue networks that share movies et al. > > Not by me - you must be in a particular clique. This is a networking > department! It would be strange if anyone were NOT running a peer to > peer system! Read a newspaper someday, why don't you...? > There is a time of year when the network bounces like a yo yo because > the students are implementing proxy arp and getting it completely > wrong! Yeah. 
So maybe they are proxy-arping that PC you mentioned above that sends the grades over SSL. But nooo, no man in the middle attack there, is there ? > > Yes, well, someday someone may come up with a way to defeat your alarms > > and tripwire / AIDE or whatever you have in place... For instance, how > > do you > > No they won't. And if they do, so what? They will fall over the next > one along the line! There is no way they can do it. I couldn't do it > if I were trying to avoid me seeing - I'm too experienced as a defender. > I can edit a running kernel to reset syscalls that have been altered by > adore, and see them. I regularly have I-get-root duels, and I have no > problem with getting and keeping root, while letting someone else also > stay root. I can get out of a chroot jail, since I have root. I run > uml honeypots. W0w you'r3 5o l33t, P3ter ! But thanks, this solves our mystery here ! If you routinely change syscalls on a running kernel that has already been compromised by a rootkit, then it is no wonder you also flip a bit here and there in random files. So you were the culprit all along !!! > and one can see from the outside open ports that are not visibly > occupied by anything on the inside. Oh suuuuure. Never thought about them NOT opening an extra port to the outside ? By means of trojaning sshd, or login, or whatever. W00t ! Or else by portknocking, which google will yield results for I'm sure. > > If coded correctly, there is little you can do to > > find out it is loaded (all the while feeding you the md5 checksums you > > expect > > They can't predict what attack I can use against them to see it! And > there is no defense against an unknown attack. Nope, rather the reverse is true. YOU don't know how to defend yourself, since you don't know what they'll hit you with, when (I'll bet during the two weeks mandatory absense of christmas!) and where they're coming from. > They don't know what I expect to find, and they would have to keep the > original data around, something which would show up in the free space > count. And anyway I don't have to see the md5sums to know when a > computer is acting strangely - it's entire signature would have changed > in terms of reactions to stimuli, the apparant load versus the actual, > and so on. You are not being very imaginative! They have all the time in the world to research all your procedures, if they even have to. For one, this post is googleable. Next, they can snoop around all year on your system just behaving like the good students they seem, and last but not least you seem pretty vulnerable to a social engineering attack; you tell me -a complete stranger- all about it without the least of effort from my side. A minute longer and you'd have told me how your scripts make the md5 snapshots, what bells you have in place and what time you usually read your logfiles (giving a precise window to work in undetected). Please. Enough with the endless arrogance. You are not invincible. The fact alone that you have a "nice stack of rootkits" already is a clear sign on how well your past endeavours fared stopping intruders... > I don't, but then neither are these math students - they're > telecommunications engineers. Oh, telecom engineers ? Oh, indeed, those guys know nothing about computers whatsoever. Nothing. There isn't a single computer to be found in the telecom industry. > If someone were to actually be capable of writing something that looked > capable, I would be pleased. 
I've only seen decent code from overseas > students - logical concepts don't seem to penetrate the environment > here. The first year of the technical school (as distinct to the > "superior" school) is spent trying bring some small percentage of the > technical students up to the concept of loops in code - which they > mostly cannot grasp. The true blackhat will make an effort NOT to be noticed, so he'll be the last that will try to impress you with an impressive piece of code! It's very strange not to realize even that. I might be paranoid, but you are naive like I've never seen before... > And if they were to be good enough to get root even for a moment, I > would be plee3ed. Of course you would, but then again chances are they will not tell you they got root as that is precisely the point of the whole game. :-) > But of course they aren't - they have enough problems passing the exams > and finding somebody else to copy practicals off (which they can do > simply by paying). Or just copying it off the server directory. > If anyone were good enough to notice, I would notice. And what would > make me notice would not be good. Sure thing, Peter Mitnick... Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown 2005-01-04 9:30 ` Maarten @ 2005-01-04 9:46 ` Peter T. Breuer 2005-01-04 19:02 ` maarten 2 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:46 UTC (permalink / raw) To: linux-raid Andy Smith <andy@strugglers.net> wrote: > [-- text/plain, encoding quoted-printable, charset: us-ascii, 20 lines --] > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > I've set up a RAID-5 array using two internal 250 Gig HDs and two > > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > > is on its own card, and the internals are on separate IDE channels. > > > > > > I "thought" I was doing a good thing by doing all of this and then > > > setting them up using an ext3 filesystem. > > > > Sounds like a perfectly fine setup (providing always that external > > cables are safe from stray feet etc). > > > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. No, I don't say the "SHOULD BE" is obvious. I say the issues are obvious. The "should be" is up to you to decide, based on the obvious issues involved :-). > Whether it is true or not it's clear to me that it's not obvious to > everyone. It's not obvious to anyone, where by "it" I mean whether or not you "should" put a journal on the same raid device. There are pros and cons. I would not. My reasoning is that I don't want data in the journal to be subject to the same kinds of creeping invisible corruption on reboot and resync that raid is subject to. But you can achieve that by simply not putting data in the journal at all. But what good does the journal do you then? Well, it helps you avoid an fsck on reboot. But do you want to avoid an fsck? And reason onwards ... I won't rehash the arguments. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:46 ` Peter T. Breuer @ 2005-01-04 19:02 ` maarten 2005-01-04 19:12 ` David Greaves 2005-01-04 21:08 ` Peter T. Breuer 0 siblings, 2 replies; 92+ messages in thread From: maarten @ 2005-01-04 19:02 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 10:46, Peter T. Breuer wrote: > Andy Smith <andy@strugglers.net> wrote: > > [-- text/plain, encoding quoted-printable, charset: us-ascii, 20 lines > > --] > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > It's not obvious to anyone, where by "it" I mean whether or not you > "should" put a journal on the same raid device. There are pros and > cons. I would not. My reasoning is that I don't want data in the > journal to be subject to the same kinds of creeping invisible corruption > on reboot and resync that raid is subject to. But you can achieve that [ I'll attempt to adress all issues that have come up in this entire thread until now here... please bear with me. ] @Peter: I still need you to clarify what can cause such creeping corruption. There are several possible cases: 1) A bit flipped on the platter or the drive firmware had a 'thinko'. This will be signalled by the CRC / ECC on the drive. You can't flip a bit unnoticed. Or in fact, bits get 'flipped' constantly, therefore the highly sophisticated error correction code in modern drives. If the ECC can't rectify such a read error, it will issue a read error to the OS. Obviously, the raid or FS code handles this error in the usual way; this is what we call a bad sector, and we have routines that handle that perfectly. 2) An incomplete write due to a crash. This can't happen on the drive itself, as the onboard cache will ensure everything that's in there gets written to the platter. I have no reason to doubt what the manufacturer promises here, but it is easy to check if one really wants to; just issue a couple thousand cycles of well timed <write block, kill power to drive> commands, and verify if it all got written. (If not: start a class action suit against the manufacturer) Another possibility is it happening in a higher layer, the raid code or the FS code. Let's examine this further. The raid code does not promise that that can't happen ("MD raid is no substitute for a UPS"). But, the FS helps here. In the case of a journaled FS, the first that must be written is the delta. Then the data, then the delta is removed again. From this we can trivially deduce that indeed a journaled FS will not(*) suffer write reordering; as that is the only way data could get written without there first being a journal delta on disk. So at least that part is correct indeed(!) So in fact, a journaled FS will either have to rely on lower layers *not* reordering writes, or will have to wait for the ACK on the journal delta before issuing the actual_data write command(!). (*) unless it waits for the ACK mentioned above. Further, we thus can split up the write in separate actions: A) the time during which the journal delta gets written B) the time during which the data gets written C) the time during which the journal delta gets removed. Now at what point do or did we crash ? 
If it is at A) the data is consistent, no matter whether the delta got written or not. If it is at B) the data block is in an unknown state and the journal reflects that, so the journal code rolls back. If it is at C) the data is again consistent. Depending on what sense the journal delta makes, there can be a rollback, or not. In either case, the data still remains fully consistent. It's really very simple, no ? Now to get to the real point of the discussion. What changes when we have a mirror ? Well, if you think hard about that: NOTHING. What Peter tends to forget is that there is no magical mixup of drive 1's journal with drive 2's data (yep, THAT would wreak havoc!). At any point in time -whether mirror 1 is chosen as true or mirror 2 gets chosen does not matter as we will see- the metadata+data on _that_ mirror by definition will be one of the cases A through C outlined above. IT DOES NOT MATTER that mirror one might be at stage B and mirror two at stage C. We use but one mirror, and we read from that and the FS rectifies what it needs to rectify. This IS true because the raid code at boot time sees that the shutdown was not clean, and will sync the mirrors. At this point, the FS layer has not even come into play. Only when the resync has finished, the FS gets to examine its journal. -> !! At this point the mirrors are already in sync again !! <- If, for whatever reason, the raid code would NOT have seen the unclean shutdown, _then_ you may have a point, since in that special case it would be possible for the journal entry from mirror one (crashed during stage C) to get used to evaluate the data block on mirror two (being in state B). In those cases, bad things may happen obviously. If I'm not mistaken, this is what happens when one has to assemble --force an array that has had issues. But as far as I can see, that is the only time... Am I making sense so far ? (Peter, this is not addressed to you, as I already know your answer beforehand: I'd be "baby raid tech talk", correct ?) So. What possible scenarios have I overlooked until now...? Oh yeah, possibility number 3). 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. As Neil already pointed out, you gotta trust your CPU to work right otherwise all bets are off. But even if this could happen, there is no blaming the FS or the raid code, as the faulty request was carried out as directed. The drives may not be in sync, but neither the drive, the raid code nor the FS knows this (and cannot reasonably know!) If a bit in RAM gets flipped in between two writes there is nothing except ECC ram that's going to help you. Last possible theoretical case: the bug is actually IN the raid code. Well, in this case, the error will most certainly be reproducible. I cannot speak for the code as I haven't written nor reviewed it (nor would I be able to...) but this really seems far-fetched. Lots of people use and test the code, it would have been spotted at some point. Does this make any sense to anybody ? (I sure hope so...) Maarten -- ^ permalink raw reply [flat|nested] 92+ messages in thread
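A schematic sketch of the A/B/C ordering described above, written as Python for concreteness; make_record() and make_commit() are invented placeholders, and this shows only the ordering/ACK argument, not how ext3 actually lays out its journal:

import os

def make_record(offset, block):
    # hypothetical encoding of a pending update (illustration only)
    return offset.to_bytes(8, "little") + block

def make_commit(offset):
    # hypothetical "record retired" marker (illustration only)
    return offset.to_bytes(8, "little") + b"COMMIT"

def journaled_update(journal_fd, data_fd, offset, new_block):
    # A) write the journal record describing the pending update and wait
    #    for the ACK before touching the data area
    os.write(journal_fd, make_record(offset, new_block))
    os.fsync(journal_fd)
    # B) write the data block in place
    os.pwrite(data_fd, new_block, offset)
    os.fsync(data_fd)
    # C) retire the journal record; a crash before this point is resolved
    #    from the journal on the next mount
    os.write(journal_fd, make_commit(offset))
    os.fsync(journal_fd)

Here the fsync() calls stand in for the "wait for the ACK" that the argument above depends on.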
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:02 ` maarten @ 2005-01-04 19:12 ` David Greaves 2005-01-04 21:08 ` Peter T. Breuer 1 sibling, 0 replies; 92+ messages in thread From: David Greaves @ 2005-01-04 19:12 UTC (permalink / raw) To: maarten; +Cc: linux-raid maarten wrote: >Does this make any sense to anybody ? (I sure hope so...) > >Maarten > Oh yeah! David ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:02 ` maarten 2005-01-04 19:12 ` David Greaves @ 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell ` (3 more replies) 1 sibling, 4 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 21:08 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > @Peter: > I still need you to clarify what can cause such creeping corruption. The classical cause in raid systems is 1) that data is only partially written to the array on system crash and on recovery the inappropriate choice of alternate datasets from the redundant possibles is propagated. 2) corruption occurs unnoticed in a part of the redundant data that is not currently in use, but a disk in the array then drops out, bringing the data with the error into use. On recovery of the failed disk, the error data is then propagated over the correct data. Plus the usual causes. And anything else I can't think of just now. > 1) A bit flipped on the platter or the drive firmware had a 'thinko'. > > This will be signalled by the CRC / ECC on the drive. Bits flip on our client disks all the time :(. It would be nice if it were the case that they didn't, but it isn't. Mind you, I don't know precisely HOW. I suppose more bits than the CRC can recover change, or something, and the CRC coincides. Anyway, it happens. Probably cpu -mediated. Sorry but I haven't kept any recent logs of 1-bit errors in files on readonly file systems for you to look at. > You can't flip a bit > unnoticed. Not by me, but then I run md5sum every day. Of course, there is a question if the bit changed on disk, in ram, or in the cpu's fevered miscalculations. I've seen all of those. One can tell which after a bit more detective work. > Or in fact, bits get 'flipped' constantly, therefore the highly > sophisticated error correction code in modern drives. If the ECC can't > rectify such a read error, it will issue a read error to the OS. Nope. Or at least, we see one-bit errors. > Obviously, the raid or FS code handles this error in the usual way; this is This is not an error, it is a "failure"! An error is a wrong result, not a complete failure. > what we call a bad sector, and we have routines that handle that perfectly. Well, as I recall the raid code, it doesn't handle it correctly - it simply faults the disk implicated offline. Mind you, there are indications in the comments (eg for the resync thread) that it was intended that reads (or writes?) be retried there, but I don't recall any actual code for it. > 2) An incomplete write due to a crash. > > This can't happen on the drive itself, as the onboard cache will ensure Of course it can! I thought you were the one that didn't swallow manufacturer's figures! > everything that's in there gets written to the platter. I have no reason to > doubt what the manufacturer promises here, but it is easy to check if one Oh yes you do. > Another possibility is it happening in a higher layer, the raid code or the FS > code. Let's examine this further. The raid code does not promise that that There's no need to. All these modes are possible and very well known. > In the case of a journaled FS, the first that must be written is the delta. > Then the data, then the delta is removed again. From this we can trivially > deduce that indeed a journaled FS will not(*) suffer write reordering; as Eh, we can't. Or do you mean "suffer" as in "withstand"? Yes, of course it's vulnerable to it. 
> So in fact, a journaled FS will either have to rely on lower layers *not* > reordering writes, or will have to wait for the ACK on the journal delta > before issuing the actual_data write command(!). Stephen (not Neil, sorry) says that ext3 requires just acks after the write has completed. Hans has said that reiserfs required no write reordering (I don't know if that has changed since he said it). (analysis of a putative journal update sequence - depending strongly on ordered writes to the journal area) > A) the time during which the journal delta gets written > B) the time during which the data gets written > C) the time during which the journal delta gets removed. > > Now at what point do or did we crash ? If it is at A) the data is consistent, The FS metadata is ALWAYS consistent. There is no need for this. > no matter whether the delta got written or not. Uh, that's not at issue. The question is whether it is CORRECT, not whether it is consistent. > If it is at B) the data > block is in an unknown state and the journal reflects that, so the journal > code rolls back. Is a rollback correct? I maintain it is always correct. > If it is at C) the data is again consistent. Depending on > what sense the journal delta makes, there can be a rollback, or not. In > either case, the data still remains fully consistent. > It's really very simple, no ? Yes - I don't know why you consistently dive into details and miss the big picture! This is nonsense - the question is not if it is consistent, but if it is CORRECT. Consistency is guaranteed. However, it will likely be incorrect. > Now to get to the real point of the discussion. What changes when we have a > mirror ? Well, if you think hard about that: NOTHING. What Peter tends to > forget is that there is no magical mixup of drive 1's journal with drive 2's > data (yep, THAT would wreak havoc!). There is. Raid knows nothing about journals. The raid read strategy is normally 128 blocks from one disk, then 128 blocks from the next disk - in kernel 2.4. In kernel 2.6 it seems to me that it reads from the disk that it calculates the heads are best positioned for the read (in itself a bogus calculation). As to what happens on a resync rather than a read, well, it will read from one disk or another - so the journals will not be mixed up - but the result will still likely be incorrect, and always consistent (in that case). There is nothing unusual here. Will you please stop fighting about NOTHING? > At any point in time -whether mirror 1 is chosen as true or mirror 2 gets > chosen does not matter as we will see- the metadata+data on _that_ mirror by And what if there are three mirrors? You don't know either the raid read strategy or the raid resync strategy - that is plain. > definition will be one of the cases A through C outlined above. IT DOES NOT > MATTER that mirror one might be at stage B and mirror two at stage C. We use > but one mirror, and we read from that and the FS rectifies what it needs to > rectify. Unfortunately, EVEN given your unwarranted assumption that things are like that, the result is still likely to be incorrect, but will be consistent! > This IS true because the raid code at boot time sees that the shutdown was not > clean, and will sync the mirrors. But it has no way of knowing which mirror is the correct one. > At this point, the FS layer has not even > come into play. Only when the resync has finished, the FS gets to examine > its journal. -> !! At this point the mirrors are already in sync again !! <- Sure! So?
> If, for whatever reason, the raid code would NOT have seen the unclean > shutdown, _then_ you may have a point, since in that special case it would be > possible for the journal entry from mirror one (crashed during stage C) to get > used to evaluate the data block on mirror two (being in state B). In those > cases, bad things may happen obviously. And do you know what happens in the case of a three way mirror, with a 2-1 split on what's in the mirrored journals, and the raid resyncs? (I don't!) > If I'm not mistaken, this is what happens when one has to assemble --force an > array that has had issues. But as far as I can see, that is the only time... > > Am I making sense so far ? (Peter, this is not addressed to you, as I already Not very much. As usual you are bogged down in trivialities, and are missing the big picture :(. There is no need for this little baby step analysis! We know perfectly well that crashing can leave the different journals in different states. I even suppose that half a block can be written to one of them (a sector), instead of a whole block. Are journals written to in sectors or blocks? Logic would say that it should be written in sectors, for atomicity, but I haven't checked the ext3fs code. And then you haven't considered the problem of what happens if only some bytes get sent over the BUS before hitting the disk. What happens? I don't know. I suppose bytes are acked only in units of 512. > know your answer beforehand: I'd be "baby raid tech talk", correct ?) More or less - this is horribly low-level, it doesn't get anywhere. > So. What possible scenarios have I overlooked until now...? All of them. > 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. It doesn't matter! You really cannot see the wood for the trees. > As Neil already pointed out, you gotta trust your CPU to work right otherwise > all bets are off. Tough - when it overheats it can and does do anything. Ditto memory. LKML is full of Linus doing Zen debugging of an oops, saying "oooooooom, ooooooom, you have a one bit flip in bit 7 at address 17436987, ooooooom". > But even if this could happen, there is no blaming the FS > or the raid code, as the faulty request was carried out as directed. The Who's blaming! This is most odd! It simply happens, that's all. > Does this make any sense to anybody ? (I sure hope so...) No. It is neither useful nor sensical, the latter largely because of the former. APART from your interesting layout of the sequence of steps in writing the journal. Tell me, what do you mean by "a delta"? (to be able to roll back it is either a xor of the intended block versus the original, or a copy of the original block plus a copy of the intended block). Note that it is not at all necessary that a journal work that way. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer @ 2005-01-04 22:02 ` Brad Campbell 2005-01-04 23:20 ` Peter T. Breuer 2005-01-04 22:21 ` Neil Brown ` (2 subsequent siblings) 3 siblings, 1 reply; 92+ messages in thread From: Brad Campbell @ 2005-01-04 22:02 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: >>You can't flip a bit >>unnoticed. > > > Not by me, but then I run md5sum every day. Of course, there is a > question if the bit changed on disk, in ram, or in the cpu's fevered > miscalculations. I've seen all of those. One can tell which after a bit > more detective work. > I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise? I have 2 RAID-5 arrays here, a 3x233GiB and a 10x233GiB, and when I install new data on the drives I add the md5sum of that data to an existing database stored on another machine. This gets compared against the data on the arrays weekly and I have yet to see a silent corruption in 18 months. I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such as you are getting. Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on all my other machines). Perhaps I'm lucky? Brad ^ permalink raw reply [flat|nested] 92+ messages in thread
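One way such a periodic check could be scripted - a minimal Python sketch; the database layout (a JSON map from path to digest) is assumed for illustration and is not Brad's actual setup:

import hashlib, json, sys

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def check(database_path):
    # database_path: JSON file mapping file path -> recorded md5 digest
    with open(database_path) as f:
        expected = json.load(f)
    mismatches = [p for p, digest in expected.items() if md5_of(p) != digest]
    for p in mismatches:
        print("MISMATCH:", p, file=sys.stderr)
    return not mismatches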
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:02 ` Brad Campbell @ 2005-01-04 23:20 ` Peter T. Breuer 2005-01-05 5:44 ` Brad Campbell 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:20 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files > and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise? Not much. But I don't bother. It's a majority vote amongst all the identical machines involved and the loser gets rewritten. The script identifies a majority group and a minority group. If the minority is 1 it rewrites it without question. If the minority group is bigger it refers the notice to me. > I have 2 RAID-5 arrays here. a 3x233GiB and a 10x233GiB and I when I install new data on the drives > I add the md5sum of that data to an existing database stored on another machine. This gets compared > against the data on the arrays weekly and I have yet to see a silent corruption in 18 months. Looking at the lists of pending repairs over xmas, I see a pile that will have to be investigated. I am about to do it, since you reminded me to look at these. > I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and > should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. No - it should not show it. > Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such > as you are getting. Then you are not trying to manage hundreds of clients at a time. > Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on > all my other machines). > > Perhaps I'm lucky? You're both not looking in the right way and not running the right experiment. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
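A sketch of the majority-vote rule described above - rewrite a lone dissenter, refer anything bigger to a human; the dict-of-digests input is an assumption about how such a script might be fed, not Peter's actual code:

from collections import Counter

def majority_vote(digests):
    # digests: {hostname: md5 digest of the same file on that machine}
    counts = Counter(digests.values())
    winner, _votes = counts.most_common(1)[0]
    losers = [host for host, d in digests.items() if d != winner]
    if len(losers) <= 1:
        return winner, losers  # a single dissenter gets rewritten without question
    raise RuntimeError("minority larger than one - refer to the admin")

For example, majority_vote({"a": "x1", "b": "x1", "c": "y2"}) reports host c as the one to rewrite.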
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 23:20 ` Peter T. Breuer @ 2005-01-05 5:44 ` Brad Campbell 2005-01-05 9:00 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Brad Campbell @ 2005-01-05 5:44 UTC (permalink / raw) To: linux-raid Peter T. Breuer wrote: > Brad Campbell <brad@wasp.net.au> wrote: > >>I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and >>should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. > > > No - it should not show it. > If a bit has flipped on a parity stripe and thus the parity is inconsistent. When I pop out a disk and put it back in, the array is going to be written from parity data that is not quite right. (The problem I believe you were talking about where you have two identical disks and one is inconsistent, which one do you read from? is similar). And thus the reconstructed array is going to have different contents to the array before I failed the disk. Therefore it should show the error. No? Brad ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 5:44 ` Brad Campbell @ 2005-01-05 9:00 ` Peter T. Breuer 2005-01-05 9:14 ` Brad Campbell 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-05 9:00 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > If a bit has flipped on a parity stripe and thus the parity is inconsistent. When I pop out a disk > and put it back in, the array is going to be written from parity data that is not quite right. (The > problem I believe you were talking about where you have two identical disks and one is inconsistent, > which one do you read from? is similar). And thus the reconstructed array is going to have different > contents to the array before I failed the disk. > > Therefore it should show the error. No? It will not detect it as an error, if that is what you mean. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:00 ` Peter T. Breuer @ 2005-01-05 9:14 ` Brad Campbell 2005-01-05 9:28 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Brad Campbell @ 2005-01-05 9:14 UTC (permalink / raw) To: RAID Linux Peter T. Breuer wrote: > Brad Campbell <brad@wasp.net.au> wrote: > >>If a bit has flipped on a parity stripe and thus the parity is inconsistent. When I pop out a disk >>and put it back in, the array is going to be written from parity data that is not quite right. (The >>problem I believe you were talking about where you have two identical disks and one is inconsistent, >>which one do you read from? is similar). And thus the reconstructed array is going to have different >>contents to the array before I failed the disk. >> >>Therefore it should show the error. No? > > > It will not detect it as an error, if that is what you mean. Now here we have a difference of opinion. I'm detecting errors using md5sums and fsck. If the drive checks out clean 1 minute, but has a bit error in a parity stripe and I remove/re-add a drive the array is going to rebuild that disk from the remaining data and parity. Therefore the data on that drive is going to differ compared to what it was previously. Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error. Brad ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:14 ` Brad Campbell @ 2005-01-05 9:28 ` Peter T. Breuer 2005-01-05 9:43 ` Brad Campbell 2005-01-05 10:04 ` Andy Smith 0 siblings, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-05 9:28 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > I'm detecting errors using md5sums and fsck. > > If the drive checks out clean 1 minute, but has a bit error in a parity stripe and I remove/re-add a > drive the array is going to rebuild that disk from the remaining data and parity. Therefore the data > on that drive is going to differ compared to what it was previously. Indeed. > Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error. If your check can find that type of error, then it will detect it, but it is intrinsically unlikely that an fsck will see it because the "real estate" argument says that it is 99% likely that the error occurs inside a file or in free space rather than in metadata, so it is 99% likely that fsck will not see anything amiss. If you do an md5sum on file contents and compare with a previous md5sum run, then it will be detected provided that the error occurs in a file, but assuming that your disk is 50% full, that is only 50% likely. I.e. "it depends on your test". Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
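The "real estate" argument in numbers - a trivial sketch, assuming corruptions land uniformly over the raw capacity and that only file contents are checksummed:

def md5_sweep_detection_probability(fill_fraction):
    # Chance a uniformly placed corruption lands somewhere an
    # md5-over-files sweep will ever look, i.e. inside a file.
    return fill_fraction

print(md5_sweep_detection_probability(0.5))   # the 50%-full example: a coin toss
print(md5_sweep_detection_probability(0.99))  # a nearly full array: almost certain detection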
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:28 ` Peter T. Breuer @ 2005-01-05 9:43 ` Brad Campbell 2005-01-05 15:09 ` Guy 2005-01-05 10:04 ` Andy Smith 1 sibling, 1 reply; 92+ messages in thread From: Brad Campbell @ 2005-01-05 9:43 UTC (permalink / raw) To: RAID Linux Sorry, sent this privately by mistake. Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. > > I.e. "it depends on your test". brad@srv:~$ df -h | grep md0 /dev/md0 2.1T 2.1T 9.2G 100% /raid I'd say likely :p) Regards, Brad ^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:43 ` Brad Campbell @ 2005-01-05 15:09 ` Guy 2005-01-05 15:52 ` maarten 0 siblings, 1 reply; 92+ messages in thread From: Guy @ 2005-01-05 15:09 UTC (permalink / raw) To: 'Brad Campbell', 'RAID Linux' Dude! That's a lot of mp3 files! :) -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Brad Campbell Sent: Wednesday, January 05, 2005 4:44 AM To: RAID Linux Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Sorry, sent this privately by mistake. Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. > > I.e. "it depends on your test". brad@srv:~$ df -h | grep md0 /dev/md0 2.1T 2.1T 9.2G 100% /raid I'd say likely :p) Regards, Brad - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:09 ` Guy @ 2005-01-05 15:52 ` maarten 0 siblings, 0 replies; 92+ messages in thread From: maarten @ 2005-01-05 15:52 UTC (permalink / raw) To: linux-raid On Wednesday 05 January 2005 16:09, Guy wrote: > Dude! That's a lot of mp3 files! :) Indeed. I "only" have this now: /dev/md1 590G 590G 187M 100% /disk md1 : active raid5 sdb3[4] sda3[3] hda3[0] hdc3[5] hde3[1] hdg3[2] 618437888 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU] ...but 4 new 250 GB disks are on their way as we speak. :-) P.S.: This is my last post for a while, I have very important work to get done the rest of this week. So see you all next time! Regards, Maarten > -----Original Message----- > From: linux-raid-owner@vger.kernel.org > [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Brad Campbell > Sent: Wednesday, January 05, 2005 4:44 AM > To: RAID Linux > Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 > crashing repeatedly and hard) > Peter T. Breuer wrote: > > If you do an md5sum on file contents and compare with a previous md5sum > > run, then it will be detected provided that the error occurs in a file, > > but assuming that your disk is 50% full, that is only 50% likely. > > > > I.e. "it depends on your test". > > brad@srv:~$ df -h | grep md0 > /dev/md0 2.1T 2.1T 9.2G 100% /raid -- ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:28 ` Peter T. Breuer 2005-01-05 9:43 ` Brad Campbell @ 2005-01-05 10:04 ` Andy Smith 1 sibling, 0 replies; 92+ messages in thread From: Andy Smith @ 2005-01-05 10:04 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 612 bytes --] On Wed, Jan 05, 2005 at 10:28:53AM +0100, Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. "If a bit flips in the unused area of the disk and there is no one there to md5sum it, did it really flip at all?" :) Out of interest Peter could you go into some details about how you automate the md5sum of your filesystems? Obviously I can think of ways I would do it but I'm interested to hear how you have it set up first. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell @ 2005-01-04 22:21 ` Neil Brown 2005-01-05 0:08 ` Peter T. Breuer 2005-01-04 22:29 ` Neil Brown 2005-01-05 0:38 ` maarten 3 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-04 22:21 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Uh, that's not at issue. The question is whether it is CORRECT, not > whether it is consistent. > What exactly do you mean by "correct". If I have a program that writes some data: write(fd, buffer, 8192); and then makes sure the data is on disk: fsync(fd); but the computer crashes sometime between when the write call started and the fsync call ended, then I reboot and read back that block of data from disc, what is the "CORRECT" value that I should read back? The answer is, of course, that there is no one "correct" value. It would be correct to find the data that I had tried to write. It would also be correct to find the data that had been in the file before I started the write. If the size of the write is larger than the blocksize of the filesystem, it would also be correct to find a mixture of the old data and the new data. Exactly the same is true at every level of the storage stack. There is a point in time where a write request starts, and a point in time where the request is known to complete, and between those two times the content of the affected area of storage is undefined, and could have any of several (probably 2) "correct" values. After an unclean shutdown of a raid1 array, every (working) device has correct data on it. They may not all be the same, but they are all correct. md arbitrarily chooses one of these correct values, and replicates it across all drives. While it is replicating, all reads are served by the chosen drive. NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
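The same point in runnable form - a small Python sketch in which os.pwrite/os.fsync stand in for the C calls quoted above; between the pwrite and the completion of the fsync a crash may legitimately leave the old data, the new data, or (for a multi-block write) a mixture:

import os

def durable_overwrite(path, offset, buf):
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, buf, offset)  # request starts: on-disk content of this region is now undefined
        os.fsync(fd)                # request known complete: the region holds the new data
    finally:
        os.close(fd)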
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:21 ` Neil Brown @ 2005-01-05 0:08 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-05 0:08 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > > Uh, that's not at issue. The question is whether it is CORRECT, not > > whether it is consistent. > > > > What exactly do you mean by "correct"? Whatever you mean by it - I don't have a preference myself, though I might have an opinion in specific situations. It means whatever you consider it to mean, and it is up to you to make your own definition for yourself, to your own satisfaction in particular circumstances, if you feel you need a constructive definition in other terms (and I don't!). I merely gave the concept a name for you. > If I have a program that writes some data: > write(fd, buffer, 8192); > and then makes sure the data is on disk: > fsync(fd); > > but the computer crashes sometime between when the write call started > and the fsync call ended, then I reboot and read back that block of > data from disc, what is the "CORRECT" value that I should read back? I would say that if nothing on your machine or elsewhere "noticed" you doing the write of any part of the block, then the correct answer is "the block as it was before you wrote any of it". However, if nothing cares at all one way or the other, then it could be anything, what you wrote, what you got, or even any old nonsense. In other words, I would say "anything that conforms with what the universe outside the program has observed of the transaction". If you wish to apply a general one-size-fits-all rule, then I would say "as many blocks as you have written that have been read by other processes which in turn have communicated their state elsewhere should be present on the disk as is necessary to conform with that state". So if you had some other process watching the file grow, you would need to be sure that as much as that other process had seen was actually on the disk. Anyway, I don't care. It's up to you. I merely ask that you assign a probability to it (and don't tell me what it is! Please). > The answer is, of course, that there is no one "correct" value. There is whatever you please to call correct. Please do not fall into a solipsistic trap! It is up to YOU to decide what is correct and assign probabilities. I only pointed out how those probabilities scale with the size of a RAID array WHATEVER THEY ARE. I don't care what they are to you. It's absurd to ask me. I only tell you how they grow with the size of the array. > After an unclean shutdown of a raid1 array, every (working) device > has correct data on it. They may not all be the same, but they are > all correct. No they are not. They are consistent. That is different! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell 2005-01-04 22:21 ` Neil Brown @ 2005-01-04 22:29 ` Neil Brown 2005-01-05 0:19 ` Peter T. Breuer 2005-01-05 0:38 ` maarten 3 siblings, 1 reply; 92+ messages in thread From: Neil Brown @ 2005-01-04 22:29 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > Bits flip on our client disks all the time :(. You seem to be alone in reporting this. I certainly have never experienced anything quite like what you seem to be reporting. Certainly there are reports of flipped bits in memory. If you have non-ecc memory, then this is a real risk and when it happens you replace the memory. Usually it happens with a sufficiently high frequency that the computer is effectively unusable. But bits being flipped on disk, without the drive reporting an error, and without the filesystem very quickly becoming unusable, is (except for your report) unheard of. md/raid would definitely not help that sort of situation at all. NeilBrown ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:29 ` Neil Brown @ 2005-01-05 0:19 ` Peter T. Breuer 2005-01-05 1:19 ` Jure Pe_ar 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-05 0:19 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Bits flip on our client disks all the time :(. > > You seem to be alone in reporting this. I certainly have never > experienced anything quite like what you seem to be reporting. I don't feel the need to prove it to you via actual evidence. You already know of mechanisms which produce such an effect: > Certainly there are reports of flipped bits in memory. .. and that is all the same to your code when it comes to resyncing. You don't care whether the change is real or produced in the cpu, on the bus, or wherever. It still is what you will observe and copy. > If you have > non-ecc memory, then this is a real risk and when it happens you > replace the memory. Sure. > Usually it happens with a sufficiently high > frequency that the computer is effectively unusable. Well, there are many computers that remain usable. When I see bit flips the first thing I request the techs to do is check the memory and keep on checking it until they find a fault. I also ask them to check the fans, clean out dust and so on. In a relatively small percentage of cases, it turns out that the changes are real, on the disk, and persist from reboot to reboot, and move with the disk when one moves it from place to place. I don't know where these come from - perhaps from the drive electronics, perhaps from the disk. > But bits being flipped on disk, without the drive reporting an error, > and without the filesystem very quickly becoming unusable, is (except > for your report) unheard of. As far as I recall it is usually a bit flipped througout a range of consecutive addresses on disk, when it happens. I haven't been monitoring this daily check for about a year now, however, so I don't have any data to show you. > md/raid would definitely not help that sort of situation at all. Nor is there any reason to suggest it should - it just doesn't check. It could. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 0:19 ` Peter T. Breuer @ 2005-01-05 1:19 ` Jure Pe_ar 2005-01-05 2:29 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Jure Pe_ar @ 2005-01-05 1:19 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Wed, 5 Jan 2005 01:19:34 +0100 ptb@lab.it.uc3m.es (Peter T. Breuer) wrote: > Neil Brown <neilb@cse.unsw.edu.au> wrote: > > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > Bits flip on our client disks all the time :(. > > > > You seem to be alone in reporting this. I certainly have never > > experienced anything quite like what you seem to be reporting. > > I don't feel the need to prove it to you via actual evidence. You > already know of mechanisms which produce such an effect: > > > Certainly there are reports of flipped bits in memory. > > .. and that is all the same to your code when it comes to resyncing. > You don't care whether the change is real or produced in the cpu, on the > bus, or wherever. It still is what you will observe and copy. You work with PC servers, so live with it. If you want to have the right to complain about bits being flipped in hardware randomly, go get a job with IBM mainframes or something. And since you like theoretic approach to problems, I might have a suggestion for you: pick a linux kernel subsystem of your choice, think of it as a state machine, roll out all the states and then check which states are not covered by the code. I think that will keep you busy and the result might have some value for the community. -- Jure Pečar http://jure.pecar.org/ - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 1:19 ` Jure Pe_ar @ 2005-01-05 2:29 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-05 2:29 UTC (permalink / raw) To: linux-raid Jure Pe_ar <pegasus@nerv.eu.org> wrote: > And since you like theoretic approach to problems, I might have a suggestion > for you: pick a linux kernel subsystem of your choice, think of it as a > state machine, roll out all the states and then check which states are not > covered by the code. I have no idea what you mean (I suspect you are asking about reachable states). If you want a static analyzer for the linux kernel written by me, you can try ftp://oboe.it.uc3m.es/pub/Programs/c-1.2.2.tgz > I think that will keep you busy and the result might have some value for the > community. If you wish to sneer about something, please try and put some technical expertise and effort into it. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer ` (2 preceding siblings ...) 2005-01-04 22:29 ` Neil Brown @ 2005-01-05 0:38 ` maarten 3 siblings, 0 replies; 92+ messages in thread From: maarten @ 2005-01-05 0:38 UTC (permalink / raw) To: linux-raid [ Spoiler: this text may or may not contain harsh language and/or insulting ] [ remarks, specifically in the middle part. The reader is advised to exert ] [ some mild caution here and there. Sorry for that but my patience can ] [ and does really reach its limits, too. - Maarten ] On Tuesday 04 January 2005 22:08, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > @Peter: > > I still need you to clarify what can cause such creeping corruption. > > 1) that data is only partially written to the array on system crash > and on recovery the inappropriate choice of alternate datasets > from the redundant possibles is propagated. > > 2) corruption occurs unnoticed in a part of the redundant data that > is not currently in use, but a disk in the array then drops out, > bringing the data with the error into use. On recovery of the > failed disk, the error data is then propagated over the correct > data. Congrats, you just described the _symptoms_. We all know the alledged symptoms, if only for you repeating them over and over and over... My question was HOW they [can] occur. Disks don't go around randomly changing bits just because they dislike you, you know. > > 1) A bit flipped on the platter or the drive firmware had a 'thinko'. > > > > This will be signalled by the CRC / ECC on the drive. > > Bits flip on our client disks all the time :(. It would be nice if it > were the case that they didn't, but it isn't. Mind you, I don't know > precisely HOW. I suppose more bits than the CRC can recover change, or > something, and the CRC coincides. Anyway, it happens. Probably cpu > -mediated. Sorry but I haven't kept any recent logs of 1-bit errors in > files on readonly file systems for you to look at. Well I don't think we would want to crosspost this flamefest ^W discussion to a mailinglist where our resident linux blockdevice people hang out, but I'm reasonably certain that the ECC correction in drives is very solid, and very good at correcting multiple bit errors, or at least signalling them so an official read error can be issued. If you experience bit errors that often I'd wager it is in another layer of your setup, be it CPU, network layer, or rogue scriptkiddie admins changing your files on disk. I dunno. What I do know is that nowadays the bits-per-square inch on media (CD, DVD and harddisks alike) is SO high that even during ideal circumstances the head will not read all the low-level bits correctly. It has a host of tricks to compensate for that, first and foremost error correction. If that doesn't help it can retry the read, and if that still doesn't help it can / will adjust the head very slightly in- or outwards to see if that gives a better result. (in all fairness, this happens earlier, during the read of the servo tracks, but it may still adjust slightly). If even after all that the read still fails, it issues a read error to the I/O subsystem, ie. the OS. Now it may be conceivable that a bit gets flipped by a cosmic ray, but the error correction would notice that and correct it. If too many bits got flipped, there comes a point that it will give up and give an error. 
What it will NOT do at this point, AFAIK, is return the entire sector with some bit errors in them. It will either return a good block, or no block at all accompanied by a "bad sector" error. This is logical, as most of the time you're more interested in knowing the data in unretrievable than getting it back fubar'ed. (your undetectable vs detectable, in fact) The points where there is no ECC protection against cosmic rays are in your RAM. I believe the data path between disk and controller has error checks, so do the other electrical paths. So if you see random bit errors, suspect your memory above all and not your I/O layer. Go out and buy some ECC ram, and don't forget to actually enable it in the BIOS. But you may want to change data-cables to your drives nevertheless, just to be safe. > > You can't flip a bit > > unnoticed. > > Not by me, but then I run md5sum every day. Of course, there is a > question if the bit changed on disk, in ram, or in the cpu's fevered > miscalculations. I've seen all of those. One can tell which after a bit > more detective work. Hehe. Oh yeah, sure you can. Would you please elaborate to the group here how in the hell you can distinguish a bit being flipped by the CPU and one being flipped while in RAM ? Cause I'd sure like to see you try...! I suppose lots of terms like axiom, poisson and binomial etc. will be used in your explanation ? Otherwise we might not believe it, you know... ;-) Luckily we don't yet use quantum computers, otherwise just you observing the bit would make it vanish, hehehe. Back to seriousness, tho. > Nope. Or at least, we see one-bit errors. Yep, I'm sure you do. I'm just not sure they originate on the I/O layer. > > Obviously, the raid or FS code handles this error in the usual way; this > > is > > This is not an error, it is a "failure"! An error is a wrong result, not > a complete failure. Be that as it may (it's just language definitions) you perfectly understand what I meant: a "bad sector"-error is issued to the nearest OS layer. > > what we call a bad sector, and we have routines that handle that > > perfectly. > > Well, as I recall the raid code, it doesn't handle it correctly - it > simply faults the disk implicated offline. True, but that is NOT the point. The point is, the error IS detectable; the disk just said as much. We're hunting for your improbable UNdetectable errors, and how they can technically occur. Because you say you see them, but you have not shown us HOW they could even originate. Basically, *I* am doing your research now! > > 2) An incomplete write due to a crash. > > > > This can't happen on the drive itself, as the onboard cache will ensure > > Of course it can! I thought you were the one that didn't swallow > manufacturer's figures! MTBF, no. Because that is purely marketspeak. Technical and _verifiable_ specs I can believe, if only for the fact that I can verify them to be true. I outlined already how you can do that yourself too...: Look, it isn't rocket science. All you'd need is a computer-controlled relay that switches off the drive. Trivially made off the parallel port. Then you write some short code that issues write requests and sends block to the drive and then shuts the drive down with varying timings in between to cover all possibilities. All that in a loop which sends different data to different offsets each time. Then you leave that running for a night or so. The next morning you check all the offsets for your written data and compare. 
Without being overly paranoid, I think someone has already conducted such tests. Ask around on the various ATA mailinglists (if you care enough). But honestly, deploying a working UPS is both more elegant, less expensive and more logical. Who cares if a write gets botched during a power cut to the drive, you just make triple sure that that can never happen: Simple. OS crashes do not cut power to the drive, only PSUs and UPSes can. So cover those two bases and you're set. Child's play. > > Another possibility is it happening in a higher layer, the raid code or > > the FS code. Let's examine this further. The raid code does not promise > > that that > > There's no need to. All these modes are possible and very well known. You're like a stuck record aren't you ? We're searching for the real truth here, and all you say in your defense is "It's the truth!" "It is the truth!" "It IS the truth!" like a small child repeating over and over. You've never attempted to prove any of your wild statements, yet demand from us that we take your word for granted. Not so. Either you prove your point, or at the very least you try not to sabotage people who try to find proof. As I am attempting now. You're harassing me. Go away until you have a meaningful input ! dickwad ! > > In the case of a journaled FS, the first that must be written is the > > delta. Then the data, then the delta is removed again. From this we can > > trivially deduce that indeed a journaled FS will not(*) suffer write > > reordering; as > > Eh, we can't. Or do you mean "suffer" as in "withstand"? Yes, of > course it's vulnerable to it. suffer as in withstand, yes. > > So in fact, a journaled FS will either have to rely on lower layers *not* > > reordering writes, or will have to wait for the ACK on the journal delta > > before issuing the actual_data write command(!). > > A) the time during which the journal delta gets written > > B) the time during which the data gets written > > C) the time during which the journal delta gets removed. > > > > Now at what point do or did we crash ? If it is at A) the data is > > consistent, > > The FS metadata is ALWAYS consistent. There is no need for this. Well, either you agree that an error cannot originate here, or you don't. There is no middle way stating things like the data getting corrupt yet the metadata not showing that. The write gets verified, bit by bit, so I don't see where you're going with this...? > > no matter whether the delta got written or not. > > Uh, that's not at issue. The question is whether it is CORRECT, not > whether it is consistent. Of course it is correct. You want to know how the bit errors originate during crashes. Thus the bit errors are obviously not written _before_ the crash. Because IF they did, your only recourse is to go for one of the options in 3) further below. For now we're only describing how the data that the OS hands the FS lands on disk. Whether the data given to us by the OS is good or not is irrelevant (now). So, get with the program, please. The delta is written first and that's a fact. Next step. > > If it is at B) the data > > block is in an unknown state and the journal reflects that, so the > > journal code rolls back. > > Is a rollback correct? I maintain it is always correct. That is not an issue, you can safely leave that to the FS to figure out. It will most certainly make a more logical decision than you at this point. 
In any case, since the block is not completely written yet, the FS probably has no other choice than to roll back, since it probably misses data... This is a question left for the relevant coders, though. It still is irrelevant to this discussion however. > > If it is at C) the data is again consistent. Depending on > > what sense the journal delta makes, there can be a rollback, or not. In > > either case, the data still remains fully consistent. > > It's really very simple, no ? > > Yes - I don't know why you consistently dive into details and miss the > big picture! This is nonsense - the question is not if it is > consistent, but if it is CORRECT. Consistency is guaranteed. However, > it will likely be incorrect. NO. For crying out loud !! We're NOT EVEN talking about a mirror set here! That comes later on. This is a SINGLE disk, very simple, the FS gets handed data by the OS, the FS directs it to the MD code, the MD code hands it on down. Nothing in here except for a code bug can flip your friggin' bits !! If you indeed think it is a code bug, skip all this chapter and go to 3). Otherwise, just shut the hell up !! > > Now to get to the real point of the discussion. What changes when we > > have a mirror ? Well, if you think hard about that: NOTHING. What Peter > > tends to forget it that there is no magical mixup of drive 1's journal > > with drive 2's data (yep, THAT would wreak havoc!). > > There is. Raid knows nothing about journals. The raid read strategy > is normally 128 blocks from one disk, then 128 blocks from the next > disk - in kernel 2.4 . In kernel 2.6 it seems to me that it reads from > the disk that it calculates the heads are best positoned for the read > (in itself a bogus calculation). As to what happens on a resync rather > than a read, well, it will read from one disk or another - so the > journals will not be mixed up - but the result will still likely > be incorrect, and always consistent (in that case). Irrelevant. You should read on before you open your mouth and start blabbing. > > At any point in time -whether mirror 1 is chosen as true or mirror 2 gets > > chosen does not matter as we will see- the metadata+data on _that_ mirror > > by > > And what if there are three mirrors? You don't know either the raid read > startegy or the raid resync strategy - that is plain. Wanna stick with the program here ? What do you do if your students interrupt you and start about "But what if the theorem is incorrect and we actually have three possible outcomes?" Again: shut up and read on. > > definition will be one of the cases A through C outlined above. IT DOES > > NOT MATTER that mirror one might be at stage B and mirror two at stage C. > > We use but one mirror, and we read from that and the FS rectifies what it > > needs to rectify. > > Unfortunately, EVEN given your unwarranted assumption that things are > like that, the result is still likely to be incorrect, but will be > consistent! Unwarranted...! I took you by the hand and led you all the way here. All the while you whined and whined that that was unneccessary, and now that we got here you say I did not explain nuthin' along the way ?!? You have some nerve, mister. For the <incredibly thick> over here: IS there, yes or no, any other possible state for a disk than either state A, B or C above, at any particular time?? If the answer is YES, fully describe that imaginary state for us. If the answer is NO, shut up and listen. I mean read. Oh hell... 
> > This IS true because the raid code at boot time sees that the shutdown > > was not clean, and will sync the mirrors. > > But it has no way of knowing which mirror is the correct one. Djeez. Are you thick or what? I say it chooses any one, at random, BECAUSE _after_ the rollback of the journaled FS code it will ALWAYS be correct (YES!) AND consistent. You just don't get the concept, do you ? There IS no INcorrect mirror, neither is there a correct mirror. They're both just mirrors in various, as yet undetermined, states of completing a write. Since the journal delta is consistent, it WILL be able to roll back (or through, or forward, or on, or whatever) to a clean state. And it will. (fer cryin' out loud...!!) > > At this point, the FS layer has not even > > come into play. Only when the resync has finished, the FS gets to > > examine its journal. -> !! At this point the mirrors are already in sync > > again !! <- > > Sure! So? So the FS code will find an array in either state A, B or C and take it from there. Just as with any normal single, non-raided disk. Get it now? > > If, for whatever reason, the raid code would NOT have seen the unclean > > shutdown, _then_ you may have a point, since in that special case it > > would be possible for the journal entry from mirror one (crashed during > > stage C) gets used to evaluate the data block on mirror two (being in > > state B). In those cases, bad things may happen obviously. > > And do you know what happens in the case of a three way mirror, with a > 2-1 split on what's in the mirrored journals, and the raid resyncs? Yes. Either at random or intelligently, one is chosen (which one is entirely irrelevant!). Then the raid resync follows, then the FS code finds an array in (hey! again!) either state A, B or C. And it will roll back or roll on to reinstate the clean state. (again: do you get it now????) > (I don't!) Well, sure. That goes without saying. > > If I'm not mistaken, this is what happens when one has to assemble > > --force an array that has had issues. But as far as I can see, that is > > the only time... > > > > Am I making sense so far ? (Peter, this is not addressed to you, as I > > already > > Not very much. As usual you are bogged down in trivialities, and are > missing the big picture :(. There is no need for this little baby step > analysis! We know perfectly well that crashing can leave the different > journals in different states. I even suppose that half a block can be > written to one of them (a sector), instead of a whole block. Are > journals written to in sectors or blocks? Logic would say that it > should be written in sectors, for atomicity, but I haven't checked the > ext3fs code. Man oh man you are pitiful. The elephant is right in front of you, if you'd stick out your arm you would touch it, but you keep repeating there is no elephant in sight. I give up. > And then you haven't considered the problem of what happens if only > some bytes get sent over the BUS before hitting the disk. What happens? > I don't know. I suppose bytes are acked only in units of 512. No shit...! Would that be why they call disks "block devices" ?? Your levels of comprehension amaze me more every time. No of course you can't send half a block or bytes or bits to a drive. Else they would be called serial devices, not block devices, now wouldn't they ? A drive will not ACK anything unless it is received completely (how obvious is that?)
> > More or less - this is horribly low-level, it doesn't get anywhere. Some people seem to disagree with you. Let's just leave it at that shall we ? > > So. What possible scenarios have I overlooked until now...? > > All of them. Oh really. (God. Is there no end to this.) > > 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. > > It doesn't matter! You really cannot see the wood for the trees. I see only Peters right now, and I know I will have nightmares over you. > > As Neil already pointed out, you gotta trust your CPU to work right > > otherwise all bets are off. > > Tough - when it overheats it can and does do anything. Ditto memory. > LKML is full of Linus doing Zen debugging of an oops, saying "oooooooom, > ooooooom, you have a one bit flip in bit 7 at address 17436987, > ooooooom". How this even remotely relates to MD raid, or even I/O in general, completely eludes me. And everyone else, I suppose. But for academic purposes, I'd like to see you discuss something with Linus. He is way more short-tempered than I am, if you read LKML you'd know that. But hey, it's 2005, maybe it's time to add a chapter to the infamous Linus vs AST archives. You might qualify. Oh well, never mind... > > But even if this could happen, there is no blaming the FS > > or the raid code, as the faulty request was carried out as directed. The > > Who's blaming! This is most odd! It simply happens, that's all. Yeah... That is the nature of computers innit ? Unpredictable bastards is what they are. Math is also soooooo unpredictable, I really hate that. (do I really need to place a smiley here?) > > Does this make any sense to anybody ? (I sure hope so...) > > No. It is neither useful nor sensical, the latter largely because of > the former. APART from your interesting layout of the sequence of > steps in writing the journal. Tell me, what do you mean by "a delta"? The entry in the journal that contains the info on what a next data-write will be, where it will take place, and how to reconstruct the data in case of <problem>. (As if that wasn't obvious by now.) > (to be able to rollback it is either a xor of the intended block versus > the original, or a copy of the original block plus a copy of the > intended block). I have no deep knowledge of how the intricacies of a journaled FS work. If I would have, we would not have had this discussion in the first place since I would have said yesterday "Peter you're wrong" and that would've ended all of this right then and there. (oh yes!) If you care to know, go pester other lists about it, or read some reiserfs or ext3 code and find out for yourself. > Note that it is not at all necessary that a journal work that way. To me the sole thing I care about is that it can repair the missing block and how it manages that is of no great concern to me. I do not have to know how a pentium is made in order to use it and program for it. Maarten -- ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:07 ` Neil Brown 2005-01-04 2:16 ` Ewan Grantham @ 2005-01-04 9:40 ` Peter T. Breuer 2005-01-04 11:57 ` Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] Michael Tokarev 2005-01-04 14:03 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves 1 sibling, 2 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:40 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > > Then the probability of an error occurring UNdetected on an n-disk raid > > > > array is > > > > > > > > (n-1)p + np' > > > > > > > > > The probability of an event occurring lies between 0 and 1 inclusive. > > > You have given a formula for a probability which could clearly evaluate > > > to a number greater than 1. So it must be wrong. > > > > The hypothesis here is that p is vanishingly small. I.e. this is a Poisson > > distribution - the analysis assumes that only one event can occur per > > unit time. Take the unit to be one second if you like. Does that make > > it true enough for you? > > Sorry, I didn't see any such hypothesis stated and I don't like to > assUme. You don't have to. It is conventional. It doesn't need saying. > So what you are really saying is that: > for sufficiently small p and p' (i.e. p-squared terms can be ignored) > the probability of an error occurring undetected approximates > (n-1)p + np' > > this may be true, but I'm still having trouble understanding what your > p and p' really mean. Examine your conscience. They're dependent on you. All I say is that they exist. They represent two different classes of error, one detectible by whatever thing like fsck you run as an "experiment", and one not. But you are right in that I have been sloppy about defining what I mean. For one thing I have mixed probabilities "per unit time" and multiplied them by probabilities associated with a single observation (your experiment with fsck or whatever) made at a certain moment. I do that because I know that it would make no difference if I integrated up the instantaneous probabilities and then multiplied. Thus if you want to be more formal, you want to stick some integral signs in and get (n-1) ∫p dt + n ∫p' dt. Or if you wanted to calculate in terms of mean times to a detected event, well, you'd modify that again. But the principle remains the same: the probability of a single undetectible error rises in proportion to the number of disks n, and the probability of a detectible error going undetected rises in proportion to n-1, because your experiment to detect the error will only test one of the possible disks at the crucial point. > > I mean an error occurs that can be detected (by the experiment you run, > > which is presumably an fsck, but I don't presume to dictate to you). > > > > The whole point of RAID is that fsck should NEVER see any error caused > by drive failure. Then I guess you have helped clarify to yourself what type of errors falls in which class! Apparently errors caused by drive failure fall in the class of "indetectible error" for you! But in any case, you are wrong, because it is quite possible for an error to spontaneously arise on a disk which WOULD be detected by fsck. What does fsck detect normally if it is not that!
> I think we have a major communication failure here, because I have no > idea what sort of failure scenario you are imagining. I am not imagining. It is up to you. > > Likewise, I don't know. It's whatever error your experiment > > (presumably an fsck) will miss. > > But 'fsck's primary purpose is not to detect errors on the disk. Of course it is (it does not mix and make cakes - it precisely and exactly detects errors on the disk it is run on, and repairs the filesystem to either work around those errors, or repairs the errors themselves). > It is > to repair a filesystem after an unclean shutdown. Those are "errors on the disk". It is of no interest to fsck how they are caused. Fsck simply has a certain capacity for detecting anomalies (and fixing them). If you have a better test than fsck, by all means run it! > It can help out a > bit after disk corruption, but usually disk corruption (apart from > very minimal problems) causes fsck to fail to do anything useful. I would have naively said you were right simply by the real estate argument - fsck checks only metadata, and metadata occupies abut 1% of the disk real estate only. Nevertheless experience suggests that it is very good at detecting when strange _physical_ things have happened on the disk - I presume that is because physical strangenesses affect a block or two at a time, and are much more likely than a bit error to hit some metadata amongst that. Certainly single bit errors occur relatively undetected by fsck (in conformity with the real estate argument), as I know because I check the md5sums of all files on all machines daily, and they change spontaneously without human intervention :). In readonly areas! (the rate is probably about 1 bit per disk per three months, on average, but I'd have to check that to see if my estimate from memory is accurate). Fsck never finds those. But I do. Shrug - so our definitions of detectible and undetectible error are different. > > They happen all the time - just write a 1 to disk A and a zero to disk > > B in the middle of the data in some file, and you will have an > > undetectible error (vis a vis your experimental observation, which is > > presumably an fsck). > > But this doesn't happen. You *don't* write 1 to disk A and 0 to disk > B. Then write a 1 to disk A and DON'T write a 1 to disk B, but do it over a patch where there is a 0 already. There is no need for you to make such hard going of this! Invent your own examples, please. > I admit that this can actually happen occasionally (but certainly not It happens EVERY time I choose to do it. Or a software agent of my choice decides to do it :). I decide to do it with probability p' (;-). Call me Murphy. Or Maxwell. > "all the time"). But when it does, there will be subsequent writes to > both A and B with new, correct, data. During the intervening time There may or there may not - but if I wish it there will not. I don't see why you have such trouble! > that block will not be read from A or B. You are imagining some particular mechanism that I, and I presume the rest of us, are not. I think you are thinking of raid and how it works. Please clean your thoughts of it .. this part of the argument has nothing particularly to do with raid or any implementation of it. It is more generic than that. 
It is simply the probability of something going "wrong" on n disks and the question of whether you can detect that wrongness with some particular test of yours (and HERE is where raid is slightly involved) that only reads from one of the n disks for each block that it does read. > If there is a system crash before correct, consistent data is written, Exactly. > then on restart, disk B will not be read at all until disk A has been Why do you think so? I know of no mechanism in RAID that records to which of the two disks paired data has been written and to which it has not! Please clarify - this is important. If you are thinking of the "event count" that is stamped on the superblocks, that is only updated from time to time as far as I know! Can you please specify (for my curiosity) exactly when it is updated? That would be useful to know. > completely copied on it. > > So again, I fail to see your failure scenario. Try harder! Neil, there is no need for you to make such hard going of it! If you like, pay a co-worker to put a 1 on one disk and a 0 on another, and see if you can detect it! Errors arise spontaneously on disks, and then there are errors caused by being written by overheated cpus which write a 1 where they meant a 0, just before dying, and then there are errors caused by stuck bits in RAM, and so on. And THEN there are errors caused by writing ONE of a pair of paired writes to a mirror pair, just before the system crashes. It is not hard to think of such things. > > > or high level software error (i.e. the wrong data was written - and > > > that doesn't really count). > > > > It counts just fine, since it's what does happen :- consider a system > > crash that happens AFTER one of a pair of writes to the two disk > > components has completed, but BEFORE the second has completed. Then on > > reboot your experiment (an fsck) has the task of finding the error > > (which exists at least as a discrepancy between the two disks), if it > > can, and shouting at you about it. > > No. RAID will not let you see that discrepancy Of course it won't - that's the point. Raid won't even know it's there! > and will not let the > discrepancy last any longer than it takes to read one drive and write > the other. WHICH drive does it read and which does it write? It has no way of knowing which, does it? > Maybe I'm beginning to understand your failure scenario. > It involves different data being written to the drives. Correct? That is one possible way, sure. But the error on the drive can also change spontaneously! Look, here are some outputs from the daily md5sum run on a group of identical machines: /etc/X11/fvwm2/menudefs.hook: (7) b4262c2eea5fa82d4092f63d6163ead5 : lm003 lm005 lm006 lm007 lm008 lm009 lm010 /etc/X11/fvwm2/menudefs.hook: (1) 36e47f9e6cde8bc120136a06177c2923 : lm011 That file on one of them mutated overnight. > That only happens if: > 1/ there is a software error > 2/ there is an admin error And if there is a hardware error. Hardware can do what it likes. Anyway, I don't care HOW. > You seem to be saying that if this happens, then raid is less reliable > than non-raid. No, I am saying nothing of the kind. I am simply pointing at the probabilities. > There may be some truth in this, but it is irrelevant. > The likelihood of such a software error or admin error happening on a > well-managed machine is substantially less than the likelihood of a > drive media error, and raid will protect from drive media errors. No it won't!
I don't know why you say this either - oh, your definition of "error" must be "when the drive returns a failure for a sector or block read". Sorry, I don't mean anything so specific. I mean anything at all that might be considered an error, such as the mutating bits in the daily check shown above. > So using raid might reduce reliability in a tiny number of cases, but > will increase it substantially in a vastly greater number of cases. Look at the probabilities, nothing else. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
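For readers trying to follow the arithmetic being argued about, the scaling claim, taken exactly as Peter states it and with made-up values for p and p' (they are illustrations, not measurements), is just this:

# Peter's claim, restated numerically: with p the per-disk chance (per
# check interval) of an error his test could catch, and p' the chance of
# one it could not, he puts the undetected-error probability of an n-way
# mirror at (n-1)*p + n*p'.  The figures below are arbitrary placeholders
# chosen only to show how the expression scales with n.
p, p_prime = 1e-6, 1e-7

def undetected(n):
    return (n - 1) * p + n * p_prime

for n in (1, 2, 3):
    print('n=%d mirrors: claimed undetected rate %.2e' % (n, undetected(n)))

Whether this is the right model at all is exactly what Neil disputes through the rest of the thread.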
* Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] 2005-01-04 9:40 ` Peter T. Breuer @ 2005-01-04 11:57 ` Michael Tokarev 2005-01-04 12:40 ` Morten Sylvest Olsen 2005-01-04 12:44 ` Peter T. Breuer 1 sibling, 2 replies; 92+ messages in thread From: Michael Tokarev @ 2005-01-04 11:57 UTC (permalink / raw) To: linux-raid Peter T. Breuer wrote: > Neil Brown <neilb@cse.unsw.edu.au> wrote: [] >>If there is a system crash before correct, consistent data is written, >>then on restart, disk B will not be read at all until disk A has been > > Why do you think so? I know of no mechanism in RAID that records to > which of the two disks paired data has been written and to which it has > not! > > Please clarify - this is important. If you are thinking of the "event > count" that is stamped on the superblocks, that is only updated from > time to time as far as I know! Can you please specify (for my > curiosity) exactly when it is updated? That would be useful to know. Yes, this is the darkest corner in the whole raid stuff for me still. I just looked at the code again, re-read it several times, but the code is a bit.. large to understand in a relatively short time. This very question has bothered me for quite some time now. How does the md code "know" which drive has "more recent" data on it in case of a system crash (power loss, whatever) after one drive has completed the write but another hasn't? The "event counter" isn't updated on every write (it'd be very expensive in both time and disk health -- too much seeking and too many writes to a single block where the superblock is located). For me, and I'm just thinking how it can be done, the only possible solution in this case is to choose a "random" drive and declare it as "up-to-date" -- it will not necessarily be really up-to-date. Or, maybe, write to the "first" drive first and to the "second" next, and assume the first drive has the data written before the second (no guarantee here because of reordering, differences in drive speed etc, but it is -- sort of -- a valid assumption). Speaking of a reasonable filesystem (journalling isn't relevant here, the key word is "reasonable", that is, a system that makes complex operations atomic) and filesystem metadata, choosing a "random" drive as up-to-date makes some sense, at least the metadata will be consistent (not necessarily up to date, i.e., for example, it is still possible to lose some mail file which has been acknowledged by the filesystem AND by the smtp server, but due to choosing the "wrong" (not recent) drive, that file operation has been "rolled back"), but still consistent (I'm not talking about data consistency and integrity, that's another long story). Or, maybe it's better to ask the question slightly (?) differently: recalling "write barriers" etc and raid1 (for simplicity), will the raid code acknowledge a write only after ALL drives have been written to? And thus, having a reasonable filesystem (again), will the filesystem operation (at least metadata) succeed ONLY after the md layer reports that ALL disks have the data written? (This way, it really makes no difference which - fresh or not - drive will be considered up to date after the poweroff in the middle of some write, *at least* for filesystem metadata, and for applications that implement the "commit" concept as needed to correctly implement "reasonable" metadata operations). How it all fits together?
Which drive will be declared "fresh"? How about several (>2) drives in raid1 array? How about data written without a concept of "commits", if "wrong" drive will be chosen -- will it contain some old data in it, while another drive contained new data but was declared "non fresh" at reconstruction? And speaking of the previous question, is there any difference here between md device and single disk, which also does various write reordering and stuff like that? -- I mean, does md layer increase probability to see old data after reboot caused by a power loss (for example) if an app (or whatever) was writing (or even when the filesystem reported the write is complete) some new data during the power loss? Alot of questions.. but I think it's really worth to understand how it all works. Thanks. /mjt ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] 2005-01-04 11:57 ` Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] Michael Tokarev @ 2005-01-04 12:40 ` Morten Sylvest Olsen 2005-01-04 12:44 ` Peter T. Breuer 1 sibling, 0 replies; 92+ messages in thread From: Morten Sylvest Olsen @ 2005-01-04 12:40 UTC (permalink / raw) To: Michael Tokarev; +Cc: linux-raid > Yes, this is the darkest corner in the whole raid stuff for me still. > I just looked at the code again, re-read it several times, but the > code is a bit.. large to understand in a relatively short time. This > very question has bothered me for quite some time now. How does the md code "know" > which drive has "more recent" data on it in case of a system crash (power > loss, whatever) after one drive has completed the write but > another hasn't? The "event counter" isn't updated on every write > (it'd be very expensive in both time and disk health -- too much > seeking and too many writes to a single block where the superblock > is located). > > For me, and I'm just thinking how it can be done, the only possible > solution in this case is to choose a "random" drive and declare it as > "up-to-date" -- it will not necessarily be really up-to-date. Or, > maybe, write to the "first" drive first and to the "second" next, and assume > the first drive has the data written before the second (no guarantee here > because of reordering, differences in drive speed etc, but it is -- > sort of -- a valid assumption). Funny, I've been thinking a lot about this lately, because I use RAID in a strange setup with failover (admittedly a stupid setup, I did not know any better). I have only been looking at scenarios for RAID-1. I can't even begin to think about what might happen with RAID-5. But as the RAID howto says, RAID does not protect you from power failures and the like, and you should have a UPS. The md layer will not acknowledge a write before it has been written to all disks. I have not checked this, but the raid developers are smart people, and otherwise I would lose my sanity. IMHO this means that it doesn't really matter which disk is chosen as the one to synchronize from after restarting. This means that data in files written to during the failure might be corrupted, but metadata should be correct. I.e. depending on which disk was chosen you might lose a little more or less, but only within the limits of a "stripe". This is no different from a failure without raid. The important thing is of course that if the RAID was recovering or running in degraded mode when the power failed, it does not make any wrong decisions about which disk to use when coming back up, if for example the failure of the disk was some temporary thing which the hard reboot corrected. The superblock event-counter is updated on start, stop and failure events. During recovery the superblock on the new disk is not updated until the raid is properly closed. > How it all fits together? > Which drive will be declared "fresh"? The first one possibly, or another one :) One should never assume anything about this. > How about several (>2) drives in raid1 array? Shouldn't make a difference. Probably not a widely used setup either? > How about data written without a concept of "commits", if "wrong" > drive will be chosen -- will it contain some old data in it, while > another drive contained new data but was declared "non fresh" at > reconstruction?
Unless one drive was failed, the difference between the two drives will never be more than one "stripe". The persistent superblock which is updated at disk failure ensures that if the system fails while running degraded or during recovery it will kick any non-fresh (failed) disks from the array when restarting, and run in degraded mode. > And speaking of the previous question, is there any difference here > between md device and single disk, which also does various write > reordering and stuff like that? -- I mean, does md layer increase > probability to see old data after reboot caused by a power loss > (for example) if an app (or whatever) was writing (or even when > the filesystem reported the write is complete) some new data during > the power loss? I don't think md is worse than single drive. But I cannot back that up absolutely. - Morten ---- A: No. Q: Should I include quotations after my reply? ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] 2005-01-04 11:57 ` Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] Michael Tokarev 2005-01-04 12:40 ` Morten Sylvest Olsen @ 2005-01-04 12:44 ` Peter T. Breuer 2005-01-04 14:22 ` Maarten 1 sibling, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 12:44 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > How it all fits together? > Which drive will be declared "fresh"? I'd like details of the event count too. No, I haven't been able to figure it out from the code either. In this case "ask an author" is indicated. :). > How about several (>2) drives in raid1 array? > How about data written without a concept of "commits", if "wrong" > drive will be chosen -- will it contain some old data in it, while > another drive contained new data but was declared "non fresh" at > reconstruction? To answer a question of yours which I seem to have missed quoting here, standard software raid only acks the user (does end_request) when ALL the i/os corresponding to mirrored requests have finished. This is precisely the condition Stephen wants for ext3, and it is satisfied. However, the last time I asked Hans Reiser what his conditions were for reiserfs, he told me that he required write order to be preserved, which is a different condition. It's not precisely stronger as it is, but it becomes precisely stronger than Stephen's when you add in some extra "normal" hypotheses about the rest of the universe it lives in. However, the media underneath raid is free to lie. In many respects, it is likely to lie! Hardware disks, for example, ack back the write when they have buffered it, not when they have written it (and manufacturers claim there is always enough capacitive energy in the disk electronics to get the buffer written to disk when you cut the power, before the disk spins down - to which I say, "oh yeah?"). If there is another software layer between you and the hardware then bets are off. And you can also patch raid to do async writes, as I have - that is, respond with an ack on the first component write, not the last. This requires extra logic to account the pending list, and makes the danger window larger than with standard raid, but it does not create it. The bonus is halved latency. Newer raid code attempts to solve latency on read, by the way, by always choosing the disk to read from on which it thinks the heads are closest to where they need to be. That is probably a bogus calculation. > And speaking of the previous question, is there any difference here > between md device and single disk, which also does various write > reordering and stuff like that? Raid substitutes its own make_request, which does NOT do request aggregation, as far as I can see. So it works like a single disk with aggregation disabled. This is right, but it also wants to switch off write aggregation on the underlying device if it can - it probably can, by substituting its own max_whatever functions for those predicates that calculate when to stop aggregating requests, but that would be a layering violation. One might request from Linus a generic way of asking a device to control aggregation (which implies reordering).
> -- I mean, does md layer increase > probability to see old data after reboot caused by a power loss > (for example) if an app (or whatever) was writing (or even when > the filesystem reported the write is complete) some new data during > the power loss? It does not introduce extra buffering (beyond maybe one request) except inasmuch as it IS a buffering layer - the kernel will accumulate requests to it, call its request function, and it will send them to the mirror devices, where they will accumulate, until the kernel calls their request functions ... It might try and force processing of the mirrored requests as each is generated. It could. I don't think it does. Anyway, strictly speaking, the answer to your question is "yes". It does not decrease the probability, and therefore it increases it. The question is by how much, and that is unanswerable. > Alot of questions.. but I think it's really worth to understand > how it all works. Agree. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
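The completion rule Peter refers to above ("acks the user only when ALL the i/os corresponding to mirrored requests have finished") amounts to a simple pending counter. The sketch below is only an illustration of that bookkeeping, not the actual md/raid1 code:

# Sketch of the completion rule being described: the caller's write is
# acknowledged only once every mirror leg has acknowledged it.  This is a
# toy counter, not md's real implementation.
import threading

class MirroredWrite:
    def __init__(self, n_mirrors, on_complete):
        self.pending = n_mirrors
        self.lock = threading.Lock()
        self.on_complete = on_complete       # plays the role of end_request

    def leg_done(self):
        with self.lock:
            self.pending -= 1
            last = (self.pending == 0)
        if last:
            self.on_complete()               # ack only after the last leg

w = MirroredWrite(2, lambda: print('write acknowledged to caller'))
w.leg_done()     # first mirror finishes: nothing reported yet
w.leg_done()     # second mirror finishes: caller is acknowledged

The async variant Peter describes would call on_complete when the first leg finishes rather than the last, which is where the larger danger window he mentions comes from.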
* Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] 2005-01-04 12:44 ` Peter T. Breuer @ 2005-01-04 14:22 ` Maarten 2005-01-04 14:56 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Maarten @ 2005-01-04 14:22 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 13:44, Peter T. Breuer wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: Hm, Peter, you did it again. At the very end of an admittedly interesting discussion you come out with the baseless assumptions and conclusions. Just when I was prepared to give you the benefit of the doubt... > > Anyway, strictly speaking, the answer to your question is "yes". It > does not decrease the probability, and therefore it increases it. The > question is by how much, and that is unanswerable. You continue to amaze me. If it does not decrease, it automatically increases ?? What happened to the "stays equal" possibility ? Do you exclusively use ">" and "<" instead of "=" in your math too ? Maybe the increase is zero. Oh wait, it could even be negative, right ? Just as with probability. So it possibly has an increase of, say, -0.5 ? (see how easy it is to confuse people ?) Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] 2005-01-04 14:22 ` Maarten @ 2005-01-04 14:56 ` Peter T. Breuer 0 siblings, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:56 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 13:44, Peter T. Breuer wrote: > > Michael Tokarev <mjt@tls.msk.ru> wrote: > > Hm, Peter, you did it again. At the very end of an admittedly interesting > discussion you come out with the baseless assumptions and conclusions. > Just when I was prepared to give you the benefit of the doubt... :-(. > > Anyway, strictly speaking, the answer to your question is "yes". It > > does not decrease the probability, and therefore it increases it. The > > question is by how much, and that is unanswerable. > > You continue to amaze me. If it does not decrease, it automatically > increases ?? Yes. > What happened to the "stays equal" possibility ? It's included in the "automatically increases". But anyway, it's negligible. Any particular precise outcome (such as "stays precisely the same") is negligibly likely in a continuous universe. Probability distributions are only stated to "almost everywhere" equivalence, since they are fundamentally just measures on the universe, so we can't even talk about "=", properly speaking. > Do you exclusively use ">" and "<" instead of "=" in your math too ? No. I use >= and <=, since I said "increases" and "decreases". > Maybe the increase is zero. Exactly. > Oh wait, it could even be negative, right ? Just No, I said "increases". I would have said "strictly increases" or "properly increases" if I had meant "<" and not "<=". But I didn't bother to distinguish since the distinction is unimportant, and unmeasurable (in the formal sense), and besides I wouldn't ever distinguish between < and <= in such situations. > as with probability. So it possibly has an increase of, say, -0.5 ? > (see how easy it is to confuse people ?) No. I am very exact! Automatically, I may add. But anyway, it doesn't matter, since the possibility of the probabilities being unaffected is zero in any situation where there is a real causal mechanism acting to influence them, with a continuous range of outcomes (hey, computers are random, right?). So you may deduce (correctly) that in all likelihood the probability that we were speaking of is _strictly_ increased by the mechanism we were discussing. If you care. Whatever it was. Or is. Really! I do expect a certain minimum of numericity! :(. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:40 ` Peter T. Breuer 2005-01-04 11:57 ` Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc] Michael Tokarev @ 2005-01-04 14:03 ` David Greaves 2005-01-04 14:07 ` Peter T. Breuer 1 sibling, 1 reply; 92+ messages in thread From: David Greaves @ 2005-01-04 14:03 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >Then I guess you have helped clarify to yourself what type of errors >fall in which class! Apparently errors caused by drive failure fall in >the class of "undetectable error" for you! > >But in any case, you are wrong, because it is quite possible for an >error to spontaneously arise on a disk which WOULD be detected by fsck. >What does fsck detect normally if it is not that! > > It checks the filesystem metadata - not the data held in the filesystem. David ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:03 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves @ 2005-01-04 14:07 ` Peter T. Breuer 2005-01-04 14:43 ` David Greaves 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:07 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Peter T. Breuer wrote: > > >Then I guess you have helped clarify to yourself what type of errors > >fall in which class! Apparently errors caused by drive failure fall in > >the class of "undetectable error" for you! > > > >But in any case, you are wrong, because it is quite possible for an > >error to spontaneously arise on a disk which WOULD be detected by fsck. > >What does fsck detect normally if it is not that! > > > It checks the filesystem metadata - not the data held in the filesystem. So you should deduce that your test (if fsck be it) won't detect errors in the file data, but only errors in the filesystem metadata. So? Is there some problem here? (yes, and one could add an md5sum per block to a fs, but I don't know a fs that does). Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
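For what it's worth, the "md5sum per block" idea is easy to prototype in user space; here is a minimal sketch (mine, purely illustrative -- it is not a feature of any existing filesystem or of md, and the device path, block size and table location are assumptions). It records a digest for each block of a device and can later re-scan the device and report blocks whose contents have silently changed; note that it can only say "this block differs from last time", not which version is correct:

    import hashlib, json, sys

    BLOCK = 4096   # assumed check granularity; nothing magic about 4 KiB

    def digests(device):
        """Yield (block number, md5 hex digest) for each BLOCK-sized chunk."""
        with open(device, "rb") as dev:
            n = 0
            while True:
                buf = dev.read(BLOCK)
                if not buf:
                    break
                yield n, hashlib.md5(buf).hexdigest()
                n += 1

    def build(device, table):
        # Store the digest table somewhere that is NOT on the device being checked.
        with open(table, "w") as out:
            json.dump({str(n): d for n, d in digests(device)}, out)

    def verify(device, table):
        # Re-scan and report every block whose current digest no longer matches.
        with open(table) as f:
            wanted = json.load(f)
        for n, d in digests(device):
            if wanted.get(str(n)) != d:
                print("block %d differs from recorded digest" % n)

    if __name__ == "__main__":
        # usage: blocksum.py build|verify /dev/md0 /root/md0.sums  (paths are examples)
        cmd, device, table = sys.argv[1:4]
        (build if cmd == "build" else verify)(device, table)

On a live filesystem most mismatches would of course just be legitimate writes made since the table was built, which is exactly why such a check belongs inside the filesystem rather than being bolted on from outside.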
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:07 ` Peter T. Breuer @ 2005-01-04 14:43 ` David Greaves 2005-01-04 15:12 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: David Greaves @ 2005-01-04 14:43 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter, Can I make a serious attempt to sum up your argument as: Disks suffer from random *detectable* corruption events on (or after) write (eg media or transient cache being hit by a cosmic ray, cpu fluctuations during write, e/m or thermal variations). Disks suffer from random *undetectable* corruption events on (or after) write (eg media or transient cache being hit by a cosmic ray, cpu fluctuations during write, e/m or thermal variations) Raid disks have more 'corruption-susceptible' data capacity per useable data capacity and so the probability of a corruption event is higher. Since a detectable error is detected it can be retried and dealt with. This leaves the fact that essentially, raid disks are less reliable than non-raid disks wrt undetectable corruption events. However, we need to carry out risk analysis to decide if the increase in susceptibility to certain kinds of corruption (cosmic rays) is acceptable given the reduction in susceptibility to other kinds (bearing or head failure). David tentative definitions: detectable = noticed by normal OS I/O. ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:43 ` David Greaves @ 2005-01-04 15:12 ` Peter T. Breuer 2005-01-04 16:54 ` David Greaves 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 15:12 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Disks suffer from random *detectable* corruption events on (or after) > write (eg media or transient cache being hit by a cosmic ray, cpu > fluctuations during write, e/m or thermal variations). Well, and also people hitting the off switch (or the power going off) during a write sequence to a mirror, but after one of a pair of mirror writes has gone to disk, but before the other of the pair has. (If you want to say "but the fs is journalled", then consider what if the write is to the journal ...). > Disks suffer from random *undetectable* corruption events on (or after) > write (eg media or transient cache being hit by a cosmic ray, cpu > fluctuations during write, e/m or thermal variations) Yes. This is not different from what I have said. I didn't have any particular scenario in mind. But I see that you are correct in pointing out that some error possibilities are _created_ by the presence of raid that would not ordinarily be present. So there is some scaling with the number of disks that needs clarification. > Raid disks have more 'corruption-susceptible' data capacity per useable > data capacity and so the probability of a corruption event is higher. Well, the probability is larger no matter what the nature of the event. In principle, and very approximately, there are simply more places (and times!) for it to happen TO. Yes, you may say that those errors that are produced by the cpu don't scale, nor do those that are produced by software. I'd demur. If you think about each kind you have in mind you'll see that they do scale: for example, the cpu has to work twice as often to write to two raid disks as it does to write to one disk, so the opportunities for IT to get something wrong are doubled. Ditto software. And of course, since it is writing twice as often, the chances of being interrupted at an inopportune time by a power failure are also doubled. See? > Since a detectable error is detected it can be retried and dealt with. No. I made no such assumption. I don't know or care what you do with a detectable error. I only say that whatever your test is, it detects it! IF it looks at the right spot, of course. And on raid the chances of doing that are halved, because it has to choose which disk to read. > This leaves the fact that essentially, raid disks are less reliable than > non-raid disks wrt undetectable corruption events. Well, that too. There is more real estate. But this "corruption" word seems to me to imply that you think I was imagining errors produced by cosmic rays. I made no such restriction. > However, we need to carry out risk analysis to decide if the increase in > susceptibility to certain kinds of corruption (cosmic rays) is Ahh. Yes you do. No I don't! This is your own invention, and I said no such thing. By "errors", I meant anything at all that you consider to be an error. It's up to you. And I see no reason to restrict the term to what is produced by something like "cosmic rays". "People hitting the off switch at the wrong time" counts just as much, as far as I know. I would guess that you are trying to classify errors by the way their probabilities scale with number of disks.
I made no such distinction, in principle. I simply classified errors according to whether you could (in principle, also) detect them or not, whatever your test is. > acceptable given the reduction in susceptibility to other kinds (bearing > or head failure). Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
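One way to make the scaling claim concrete (my reading of the argument, not a formula given by either poster): write p for the per-disk, per-unit-time probability of a detectable error, and suppose the check on an n-way mirror reads each block from one component chosen arbitrarily. Then, roughly,

    \Pr(\text{an error exists on some component}) \approx n p ,
    \Pr(\text{the check reads a component other than the bad one}) = \frac{n-1}{n} ,
    \Pr(\text{missed detectable error}) \approx n p \cdot \frac{n-1}{n} = (n-1) p .

For a single disk that is 0, for a 2-way mirror p, and for a 3-way mirror 2p, which is the "scales as n-1" claim made later in the thread; the undetectable-error rate, by contrast, simply scales with the amount of real estate, i.e. as np.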
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:12 ` Peter T. Breuer @ 2005-01-04 16:54 ` David Greaves 2005-01-04 17:42 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: David Greaves @ 2005-01-04 16:54 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >David Greaves <david@dgreaves.com> wrote: > > >>Disks suffer from random *detectable* corruption events on (or after) >>write (eg media or transient cache being hit by a cosmic ray, cpu >>fluctuations during write, e/m or thermal variations). >> >> > >Well, and also people hitting the off switch (or the power going off) >during a write sequence to a mirror, but after one of a pair of mirror >writes has gone to disk, but before the other of the pair has. > >(If you want to say "but the fs is journalled", then consider what if >the write is to the journal ...). > > Hmm. In neither case would a journalling filesystem be corrupted. The md driver (somehow) gets to decide which half of the mirror is 'best'. If the journal uses the fully written half of the mirror then it's replayed. If the journal uses the partially written half of the mirror then it's not replayed. It's just the same as powering off a normal non-resilient device. (Is your point here back to the failure to guarantee write ordering? I thought Neil answered that?) but let's carry on... >>Disks suffer from random *undetectable* corruption events on (or after) >>write (eg media or transient cache being hit by a cosmic ray, cpu >>fluctuations during write, e/m or thermal variations) >> >> > >Yes. This is not different from what I have said. I didn't have any >particular scenario in mind. > >But I see that you are correct in pointing out that some error >possibilities are _created_ by the presence of raid that would not >ordinarily be present. So there is some scaling with the >number of disks that needs clarification. > > > >>Raid disks have more 'corruption-susceptible' data capacity per useable >>data capacity and so the probability of a corruption event is higher. >> >> > >Well, the probability is larger no matter what the nature of the event. >In principle, and very approximately, there are simply more places (and >times!) for it to happen TO. > > Exactly what I meant. >Yes, you may say that those errors that are produced by the cpu don't >scale, nor do those that are produced by software. > No, I don't say that. > I'd demur. If you >think about each kind you have in mind you'll see that they do scale: >for example, the cpu has to work twice as often to write to two raid >disks as it does to write to one disk, so the opportunities for >IT to get something wrong are doubled. Ditto software. And of course, >since it is writing twice as often, the chances of being interrupted at >an inopportune time by a power failure are also doubled. > > I agree - obvious really. >See? > > yes > > > >>Since a detectable error is detected it can be retried and dealt with. >> >> > >No. I made no such assumption. I don't know or care what you do with a >detectable error. I only say that whatever your test is, it detects it! >IF it looks at the right spot, of course. And on raid the chances of >doing that are halved, because it has to choose which disk to read. > > I did when I defined detectable.... tentative definitions: detectable = noticed by normal OS I/O.
ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) And a detectable error occurs on the underlying non-raid device - so the chances are not halved since we're talking about write errors which go to both disks. Detectable read errors are retried until they succeed - if they fail then I submit that a "write (or after)" corruption occurred. Hmm. It also occurs to me that undetectable errors are likely to be temporary - nothing's broken but a bit flipped during the write/store process (or the power went before it hit the media). Detectable errors are more likely to be permanent (since most detection algorithms probably have a retry). >>This leaves the fact that essentially, raid disks are less reliable than >>non-raid disks wrt undetectable corruption events. >> >> > >Well, that too. There is more real estate. > >But this "corruption" word seems to me to imply that you think I was >imagining errors produced by cosmic rays. I made no such restriction. > > No, I was attempting to convey "random, undetectable, small, non-systematic" (ie I can't spot cosmic rays hitting the disk - and even if I could, only a very few would cause damage) vs significant physical failure "drive smoking and horrid graunching noise" (smoke and noise being valid detection methods!). They're only the same if you have no process for dealing with errors. >>However, we need to carry out risk analysis to decide if the increase in >>susceptibility to certain kinds of corruption (cosmic rays) is >> >> > >Ahh. Yes you do. No I don't! This is your own invention, and I said no >such thing. By "errors", I meant anything at all that you consider to be >an error. It's up to you. And I see no reason to restrict the term to >what is produced by something like "cosmic rays". "People hitting the >off switch at the wrong time" counts just as much, as far as I know. > > You're talking about causes - I'm talking about classes of error. (I live in telco-land so most datacentres I know have more chance of suffering cosmic ray damage than Joe Random user pulling the plug - but conceptually these events are the same). Hitting the power off switch doesn't cause a physical failure - it causes inconsistency in the data. I introduce risk analysis to justify accepting the 'real estate undetectable corruption vulnerability' risk increase of raid versus the ability to cope with detectable errors. >I would guess that you are trying to classify errors by the way their >probabilities scale with number of disks. > Nope - detectable vs undetectable. > I made no such distinction, >in principle. I simply classified errors according to whether you could >(in principle, also) detect them or not, whatever your test is. > > Also, it strikes me that raid can actually find undetectable errors by doing a bit-comparison scan. Non-resilient devices with only one copy of each bit can't do that. raid 6 could even fix undetectable errors. A detectable error on non-resilient media means you have no faith in the (possibly corrupt) data. An undetectable error on non-resilient media means you have faith in the (possibly corrupt) data. Raid ultimately uses non-resilient media and propagates and uses this faith to deliver data to you. David ^ permalink raw reply [flat|nested] 92+ messages in thread
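The bit-comparison scan David mentions is also easy to sketch in user space (again mine and purely illustrative; the component device names are examples, and on a live array the filesystem would need to be quiesced or the array read-only, otherwise in-flight writes show up as spurious differences):

    import sys

    CHUNK = 1 << 20   # compare a megabyte at a time

    def compare(dev_a, dev_b):
        """Report the first differing byte offset inside every differing chunk."""
        with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
            offset = 0
            while True:
                buf_a = a.read(CHUNK)
                buf_b = b.read(CHUNK)
                if not buf_a and not buf_b:
                    break
                if buf_a != buf_b:
                    diffs = [i for i, (x, y) in enumerate(zip(buf_a, buf_b)) if x != y]
                    where = diffs[0] if diffs else min(len(buf_a), len(buf_b))
                    print("components differ at byte %d" % (offset + where))
                offset += max(len(buf_a), len(buf_b))

    if __name__ == "__main__":
        # e.g. comparemirrors.py /dev/sda1 /dev/sdb1  (example component devices)
        compare(sys.argv[1], sys.argv[2])

As the rest of the thread goes on to argue, a two-way comparison can only tell you that the halves differ, not which half is right; with three copies, or RAID-6 style dual parity, you can at least out-vote a single bad copy.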
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:54 ` David Greaves @ 2005-01-04 17:42 ` Peter T. Breuer 2005-01-04 19:12 ` David Greaves 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 17:42 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > >(If you want to say "but the fs is journalled", then consider what if > >the write is to the journal ...). > Hmm. > In neither case would a journalling filesystem be corrupted. A journalled file system is always _consistent_. That does not mean it is correct! > The md driver (somehow) gets to decide which half of the mirror is 'best'. Yep - and which is correct? > If the journal uses the fully written half of the mirror then it's replayed. > If the journal uses the partially written half of the mirror then it's > not replayed. Which is correct? > It's just the same as powering off a normal non-resilient device. Well, I see what you mean - yes, it is the same in terms of the total event space. It's just that with a single disk, the possible outcomes are randomized only over time, as you repeat the experiment. Here you have randomization of outcomes over space as well, depending on which disk you test (or how you interleave the test across the disks). And the question remains - which outcome is correct? Well, I'll answer that. Assuming that the fs layer is only notified when BOTH journal writes have happened, and tcp signals can be sent off-machine or something like that, then the correct result is the rollback, not the completion, as the world does not expect there to have been a completion given the data it has got. It's as I said. One always wants to roll back. So one doesn't want the journal to bother with data at all. > (Is your point here back to the failure to guarantee write ordering? I > thought Neil answered that?) I don't see what that has to do with anything (Neil said that write ordering is not preserved, but that writes are not acked until they have occurred - which would allow write order to be preserved if you were interested in doing so; you simply have to choose "synchronous write"). > >No. I made no such assumption. I don't know or care what you do with a > >detectable error. I only say that whatever your test is, it detects it! > >IF it looks at the right spot, of course. And on raid the chances of > >doing that are halved, because it has to choose which disk to read. > I did when I defined detectable.... tentative definitions: > detectable = noticed by normal OS I/O. ie CRC sector failure etc > undetectable = noticed by special analysis (fsck, md5sum verification etc) A detectable error is one you detect with whatever your test is. If your test is fsck, then that's the kind of error that is detected by the detection that you do ... the only condition I imposed for the analysis was that the test be conducted on the raid array, not on its underlying components. > And a detectable error occurs on the underlying non-raid device - so the > chances are not halved since we're talking about write errors which go > to both disks. Detectable read errors are retried until they succeed - > if they fail then I submit that a "write (or after)" corruption occurred. I don't understand you here - you seem to be confusing hardware mechanisms with ACTUAL errors/outcomes. It is the business of your hardware to do something for you: how and what it does is immaterial to the analysis. The question is whether that something ends up being CORRECT or INCORRECT, in terms of YOUR wishes. Whether the hardware considers something an error or not and what it does about it is immaterial here. It may go back in time and ask your grandmother what is your favorite colour, as far as I care - all that is important is what ENDS UP on the disk, and whether YOU consider that an error or not. So you are on some wild goose chase of your own here, I am afraid! > It also occurs to me that undetectable errors are likely to be temporary You are again on a trip of your own :(. Undetectable errors are errors you cannot detect with your test, and that is all! There is no implication. > - nothing's broken but a bit flipped during the write/store process (or > the power went before it hit the media). Detectable errors are more > likely to be permanent (since most detection algorithms probably have a > retry). I think that for some reason you are considering that a test (a detection test) is carried out at every moment of time. No. Only ONE test is ever carried out. It is the test you apply when you do the observation: the experiment you run decides at that single point whether the disk (the raid array) has errors or not. In practical terms, you do it usually when you boot the raid array, and run fsck on its file system. OK? You simply leave an experiment running for a while (leave the array up, let monkeys play on it, etc.) and then you test it. That test detects some errors. However, there are two types of errors - those you can detect with your test, and those you cannot detect. My analysis simply gave the probabilities for those on the array, in terms of basic parameters for the probabilities per individual disk. I really do not see why people make such a fuss about this! > >>However, we need to carry out risk analysis to decide if the increase in > >>susceptibility to certain kinds of corruption (cosmic rays) is > >> > > > >Ahh. Yes you do. No I don't! This is your own invention, and I said no > >such thing. By "errors", I meant anything at all that you consider to be > >an error. It's up to you. And I see no reason to restrict the term to > >what is produced by something like "cosmic rays". "People hitting the > >off switch at the wrong time" counts just as much, as far as I know. > > > > > You're talking about causes - I'm talking about classes of error. No, I'm talking about classes of error! You're talking about causes. :) > > Hitting the power off switch doesn't cause a physical failure - it > causes inconsistency in the data. I don't understand you - it causes errors just like cosmic rays do (and we can even set out and describe the mechanisms involved). The word "failure" is meaningless to me here. > >I would guess that you are trying to classify errors by the way their > >probabilities scale with number of disks. > > > Nope - detectable vs undetectable. Then what's the problem? An undetectable error is one you cannot detect via your test. Those scale with real estate. A detectable error is one you can spot with your test (on the array, not its components). The missed detectable errors scale as n-1, where n is the number of disks in the array. Thus a single disk suffers from no missed detectable errors, and a 2-disk raid array does. That's all. No fuss, no muss! > Also, it strikes me that raid can actually find undetectable errors by > doing a bit-comparison scan. No, it can't, by definition. Undetectable errors are undetectable. If you change your test, you change the class of errors that are undetectable. That's all. > Non-resilient devices with only one copy of each bit can't do that. > raid 6 could even fix undetectable errors. Then they are not "undetectable". The analysis is not affected by your changing the definition of what is in the undetectable class of error and what is not. It stands. I have made no assumption at all on what they are. I simply pointed out how the probabilities scale for a raid array. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 17:42 ` Peter T. Breuer @ 2005-01-04 19:12 ` David Greaves 0 siblings, 0 replies; 92+ messages in thread From: David Greaves @ 2005-01-04 19:12 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >A journalled file system is always _consistent_. That does not mean it >is correct! > > To my knowledge no computers have the philosophical wherewithal to provide that service ;) If one is rude enough to stab a journalling filesystem in the back as it tries to save your data it promises only to be consistent when it is revived - it won't provide application correctness. I think we agree on that. >>The md driver (somehow) gets to decide which half of the mirror is 'best'. >> >> >Yep - and which is correct? > > Both are 'correct' - they simply represent different points in the series of system calls made before the power went. >Which is correct? > > <grumble> ditto >And the question remains - which outcome is correct? > > same answer I'm afraid. >Well, I'll answer that. Assuming that the fs layer is only notified >when BOTH journal writes have happened, and tcp signals can be sent >off-machine or something like that, then the correct result is the >rollback, not the completion, as the world does not expect there to >have been a completion given the data it has got. > >It's as I said. One always wants to roll back. So one doesn't want the >journal to bother with data at all. > <cough>bullshit</cough> ;) I write a, b, c and d to the filesystem. We begin our story when a, b and c all live on the fs device (raid or not), all synced up and consistent. I start to write d: it hits journal mirror A, it hits journal mirror B, it finalises on journal mirror B, I yank the plug. The mirrors are inconsistent. The filesystem is consistent. I reboot. Scenario 1) the md device comes back using A; the journal isn't finalised - it's ignored; the filesystem contains a, b and c. Is that correct? Scenario 2) the md device comes back using B; the journal is finalised - it's rolled forward; the filesystem contains a, b, c and d. Is that correct? Both are correct. So, I think that deals with correctness and journalling - now on to errors... >>>No. I made no such assumption. I don't know or care what you do with a >>>detectable error. I only say that whatever your test is, it detects it! >>>IF it looks at the right spot, of course. And on raid the chances of >>>doing that are halved, because it has to choose which disk to read. >>> >>> >>I did when I defined detectable.... tentative definitions: >>detectable = noticed by normal OS I/O. ie CRC sector failure etc >>undetectable = noticed by special analysis (fsck, md5sum verification etc) >> >> > >A detectable error is one you detect with whatever your test is. If >your test is fsck, then that's the kind of error that is detected by the >detection that you do ... the only condition I imposed for the analysis >was that the test be conducted on the raid array, not on its underlying >components. > > well, if we're going to get anywhere here we need to be clear about things. There are all kinds of errors - raid and redundancy will help with some and not others. An md device does have underlying components and by refusing to allow tests to compare them you remove one of the benefits of raid - redundancy. It may make it easier to model mathematically - but then the model is wrong. We need to make sure we're talking about bits on a device: md reads devices and it writes them. We need to understand what an error is - stop talking bollocks about "whatever the test is". This is *not* a math problem - it's simply not well enough defined yet. Let's get back to reality to decide what to model. I proposed definitions and tests (the ones used in the real world where we don't run fsck) and you've ignored them. I'll repeat them: detectable = noticed by normal OS I/O. ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) I'll add 'component device comparison' to the special analysis list. No error is truly undetectable - if it were then it wouldn't matter, would it? >>- nothing's broken but a bit flipped during the write/store process (or >>the power went before it hit the media). Detectable errors are more >>likely to be permanent (since most detection algorithms probably have a >>retry). >> >> > >I think that for some reason you are considering that a test (a >detection test) is carried out at every moment of time. No. Only ONE >test is ever carried out. It is the test you apply when you do the >observation: the experiment you run decides at that single point whether >the disk (the raid array) has errors or not. In practical terms, you do >it usually when you boot the raid array, and run fsck on its file system. > >OK? >You simply leave an experiment running for a while (leave the array up, >let monkeys play on it, etc.) and then you test it. That test detects >some errors. However, there are two types of errors - those you can >detect with your test, and those you cannot detect. My analysis simply >gave the probabilities for those on the array, in terms of basic >parameters for the probabilities per individual disk. > >I really do not see why people make such a fuss about this! > > We care about our data and raid has some vulnerabilities to corruption. We need to understand these to fix them - your analysis is woolly and unhelpful, and although it may have certain elements that are mathematically correct, your model has flaws that mean the conclusions are not applicable. >>>>However, we need to carry out risk analysis to decide if the increase in >>>>susceptibility to certain kinds of corruption (cosmic rays) is >>>> >>>> >>>> >>>Ahh. Yes you do. No I don't! This is your own invention, and I said no >>>such thing. By "errors", I meant anything at all that you consider to be >>>an error. It's up to you. And I see no reason to restrict the term to >>>what is produced by something like "cosmic rays". "People hitting the >>>off switch at the wrong time" counts just as much, as far as I know. >>> >>> >>> >>> >>You're talking about causes - I'm talking about classes of error. >> >> > >No, I'm talking about classes of error! You're talking about causes. :) > > No, by comparing the risk between classes of error (detectable and not) I'm talking about classes of error - by arguing about cosmic rays and power switches you _are_ talking about causes. Personally I think there is a massive difference between the risk of detectable errors and undetectable ones. Many orders of magnitude. >>Hitting the power off switch doesn't cause a physical failure - it >>causes inconsistency in the data. >> >> >I don't understand you - it causes errors just like cosmic rays do (and >we can even set out and describe the mechanisms involved). The word >"failure" is meaningless to me here. > > yes, you appear to have selectively quoted and ignored what I said a line earlier: > (I live in telco-land so most datacentres I know have more chance of suffering cosmic ray damage than Joe Random user pulling the plug - but conceptually these events are the same). When that happens I begin to think that further discussion is meaningless. >>>I would guess that you are trying to classify errors by the way their >>>probabilities scale with number of disks. >>> >>> >>> >>Nope - detectable vs undetectable. >> >> > >Then what's the problem? An undetectable error is one you cannot detect >via your test. Those scale with real estate. A detectable error is one >you can spot with your test (on the array, not its components). The >missed detectable errors scale as n-1, where n is the number of disks in >the array. > >Thus a single disk suffers from no missed detectable errors, and a >2-disk raid array does. > >That's all. > >No fuss, no muss! > > and so obviously wrong! An md device does have underlying components and by refusing to allow tests to compare them you remove one of the benefits of raid - redundancy. >>Also, it strikes me that raid can actually find undetectable errors by >>doing a bit-comparison scan. >> >> > >No, it can't, by definition. Undetectable errors are undetectable. If >you change your test, you change the class of errors that are >undetectable. > >That's all. > > > >>Non-resilient devices with only one copy of each bit can't do that. >>raid 6 could even fix undetectable errors. >> >> > >Then they are not "undetectable". > > They are. Read my definition. They are not detected in normal operation with some kind of event notification/error return code; hence undetectable. However bit comparison with known good copies, or md5 sums, or with a mirror can spot such bit flips. They are still 'undetectable' in normal operation. Be consistent in your terminology. >The analysis is not affected by your changing the definition of what is >in the undetectable class of error and what is not. It stands. I have >made no assumption at all on what they are. I simply pointed out how >the probabilities scale for a raid array. > > What analysis - you are waving vague and changing definitions about and talking about grandma's favourite colour David PS any dangling sentences are because I just found so many inconsistencies that I gave up. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 23:19 ` Peter T. Breuer @ 2005-01-04 0:45 ` maarten 2005-01-04 10:14 ` Peter T. Breuer 1 sibling, 1 reply; 92+ messages in thread From: maarten @ 2005-01-04 0:45 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > Just for laughs, I calculated this chance also for a three-way raid-1 > > setup > > Let us (randomly) assume there is a 10% chance of a disk failure. > > No, call it "p". That is the correct name. And I presume you mean "an > error", not "a failure". You presume correctly. > > We therefore have eight possible scenarios: > > Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic > algebra is not something I wish to suffer through ... Maybe not. Your way of explaining may make sense to a math expert; I tried to explain it in a form other humans might comprehend, and that was on purpose. Your way may be correct, or it may not be, I'll leave that up to other people. To me, it looks like you complicate it and obfuscate it, like someone can code a one-liner in perl which is completely correct yet cannot be read by anyone but the author... In other words, you try to impress me with your leet math skills but my explanation was both easier to read and potentially reached a far bigger audience. Now excuse me if my omitting "p" in my calculation made you lose your concentration... or something. Further comments to be found below. > There is no need for you to consider these scenarios. The probability > is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the > cube term). If you're going to prove something in calculations, you DO NOT 'forget' a tiny probability. This is not science, it's math. Who is to say p will always be 0.1 ? In another scenario in another calculation p might be as high as 0.9 ! > > Scenarios G and H are special, the chances > > of that occurring are calculated separately. > > No, they are NOT special. one of them is the chance that everything is > OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is > the completely forgettable probability p^3 that all three are bad at that > spot. Again, you cannot go around setting (1-p)^3 to 1-3p. P is a variable which is not known to you (therefore it is a variable) thus might as well be 0.9. Is 0.1^3 the same to you as 1-2.7 ? Not really huh, is it ? > This is excruciatingly poor baby math! Oh, well then my math seems on par with your admin skills... :-p > Or approx 1-p. Which is approx the same number as what I said. > It should be p! It is one minus your previous result. > > Sigh ... 0 (1-3p) + 1/3 3p = p > > > Which, again, is exactly the same chance a single disk will get > > corrupted, as we assumed above in line one is 10%. Ergo, using raid-1 > > does not make the risks of bad data creeping in any worse. Nor does it > > make it better either. > > All false. And baby false at that. Annoying! Are your reading skills lacking ? I stated that the chance of reading bad data was 0.1, which is equal to p, so we're in agreement it is p (0.1). > Look, the chance of an undetected detectable failure occurring is > 0 (1-3p) + 2/3 3p > > = 2p > > and it grows with the number n of disks, as you may expect, being > proportional to n-1. I see no proof whatsoever of that. Is that your proof, that single line ? Do you comment your code as badly as you do your math ?
Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
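For readers trying to follow the two styles of calculation: the "eight scenarios" for a three-way mirror collapse into a binomial distribution (my restatement, using the thread's p for the per-disk corruption probability),

    \Pr(k \text{ of the 3 copies corrupted}) = \binom{3}{k}\, p^k (1-p)^{3-k}, \qquad k = 0, 1, 2, 3 ,

so with p = 0.1 the four cases have probabilities 0.729, 0.243, 0.027 and 0.001, summing to 1. Peter's "0 (1-3p) + 1/3 3p = p" is then just the expected chance of reading a corrupted copy when exactly one of the three is bad and the read goes to a copy chosen at random, with the (small) multi-corruption terms dropped; his "2p" is the chance that an error exists somewhere but the read lands on a good copy, so that a check misses it.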
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:45 ` maarten @ 2005-01-04 10:14 ` Peter T. Breuer 2005-01-04 13:24 ` Maarten 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 10:14 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > Just for laughs, I calculated this chance also for a three-way raid-1 > > > setup > > > > Let us (randomly) assume there is a 10% chance of a disk failure. > > > > No, call it "p". That is the correct name. And I presume you mean "an > > error", not "a failure". > > You presume correctly. > > > > We therefore have eight possible scenarios: > > > > Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic > > algebra is not something I wish to suffer through ... > > Maybe not. Your way of explaining may make sense to a math expert; I tried to It would make sense to a 16 year old, since that's about where you get to be certified as competent in differential calculus and probability theory, if my memory of my high school math courses is correct. This is pre-university stuff by a looooooong way. The problem is that I never have a 9-year old child available when I need one ... > explain it in a form other humans might comprehend, and that was on purpose. If that's your definition of a human, I'm not sure I want to see them! > Your way may be correct, or it may not be, I'll leave that up to other people. What do you see as incorrect? > To me, it looks like you complicate it and obfuscate it, like someone can No, I simplify and make it clear. > Now excuse me if my omitting "p" in my calculation made you lose your > concentration... or something. Further comments to be found below. It does, because what you provide is a sort of line-noise instead of just "p". Not abstracting away from the detail to the information content behind it is perhaps a useful trait in a sysadmin. > > There is no need for you to consider these scenarios. The probability > > is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the > > cube term). > > If you're going to prove something in calculations, you DO NOT 'forget' a tiny You forget it because it is tiny. As tiny as you or I could wish to make it. Puhleeze. This is just Poisson distributions. > probability. This is not science, it's math. Therefore you forget it. All of differential calculus works like that. Forget the square term - it vanishes. All terms of the series beyond the first can be ignored as you go to the limiting situation. > Who is to say p will always be 0.1 ? Me. Or you. But it will always be far less. Say in the 1/10^40 range for a time interval of one second. You can look up such numbers for yourself at manufacturers' sites - I vaguely recall they appear on their spec sheets. > In another scenario in another calculation p might be as high as 0.9 ! This is a probability PER UNIT TIME. Choose the time interval to make it as small as you like. > > > Scenarios G and H are special, the chances > > > of that occurring are calculated separately. > > > > No, they are NOT special. one of them is the chance that everything is > > OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is > > the completely forgettable probability p^3 that all three are bad at that > > spot. > > Again, you cannot go around setting (1-p)^3 to 1-3p. Of course I can.
I think you must have failed differential calculus! The derivative of (1-p)^3 near p=0 is -3. That is to say that the approximation series for (1-p)^3 is 1 - 3p + o(p). And by o(p) I mean a term that when divided by p tends to zero as p tends to 0. In other words, something that you can forget. > P is a variable which is > not known to you (therefore it is a variable) It's a value. That I call it "p" does not make it variable. > thus might as well be 0.9. Is Your logic fails here - it is exactly as small as I wish it to be, because I get to choose the interval of time (the "scale") involved. > 0.1^3 the same to you as 1-2.7 ? Not really huh, is it ? If your jaw were to drop any lower it would drag on the ground :(. This really demonstrates amazing ignorance of very elementary high school math. Perhaps you've forgotten it all! Then how do you move your hand from point A to point B? How do you deal with the various moments of inertia involved, and the feedback control, all under the affluence of incohol and gravity too? Maybe it's a Kalman filter. You try it with the other hand first, and follow that with the hand you want, compensating for the differences you see. > > This is excruciatingly poor baby math! > > Oh, well then my math seems on par with your admin skills... :-p Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
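The disputed approximation itself is just the start of the binomial expansion; writing it out (standard algebra, not a claim about what p actually is for any real disk),

    (1-p)^3 = 1 - 3p + 3p^2 - p^3 ,

the error made by truncating to 1 - 3p is 3p^2 - p^3, which is negligible relative to the 3p term only when p is small: for p = 0.001 the truncation is off by about 3 x 10^-6, while for p = 0.9 it is off by about 1.7, which is Maarten's point. The real disagreement in the thread is therefore over whether the per-unit-time p may be taken to be small, not over the algebra.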
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 10:14 ` Peter T. Breuer @ 2005-01-04 13:24 ` Maarten 2005-01-04 14:05 ` Peter T. Breuer 0 siblings, 1 reply; 92+ messages in thread From: Maarten @ 2005-01-04 13:24 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > > maarten <maarten@ultratux.net> wrote: > It would make sense to a 16 year old, since that's about where you get > to be certified as competent in differential calculus and probability > theory, if my memory of my high school math courses is correct. This is > pre-university stuff by a looooooong way. Oh wow. So you deduced I did not study math at university ? Well, that IS an eye-opener for me. I was unaware studying math was a requirement to engage in conversation on the linux-raid mailinglist ? Or is this not the list I think it is ? > The problem is that I never have a 9-year old child available when I > need one ... Um, check again... he's sitting right there with you I think. > You forget it because it is tiny. As tiny as you or I could wish to > make it. Puhleeze. This is just Poisson distributions. > > Therefore you forget it. All of differential calculus works like that. > Forget the square term - it vanishes. All terms of the series beyond > the first can be ignored as you go to the limiting situation. And that is precisely what false assumption you're making ! We ARE not going to the limiting situation. We are discussing the probabilities in failures of media. You cannot assume we will be talking about harddisks, and neither is the failure rate in harddisk anywhere near limit zero. Drive manufacturers might want you to believe that through publishing highly theoretical MTBFs, but that doesn't make it so that any harddrive has a life expectancy of 20+ years, as the daily facts prove all the time. You cannot assume p to be vanishingly small. Maybe p really is the failure rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it is the failure rate of wet floppydisks. You cannot make assumptions about p. The nice thing in math is that you can make great calculations when you assume a variable is limit zero or limit infinite. The bad thing is, you cannot assume that things in real life act like predefined math variables. > > Who is to say p will always be 0.1 ? > > Me. Or you. But it will always be far less. Say in the 1/10^40 range > for a time interval of one second. You can look up such numbers for > yourself at manufacturers' sites - I vaguely recall they appear on their > spec sheets. Yes, and p will be in the range of 1.0 for time intervals of 10^40 seconds. Got another wisecrack ? Of course p will approach zero when you make time interval t approach zero !! And yes, judging a time of 1 second as a realistic time interval to measure a disk drive's failure rate over time certainly qualifies as making t limit zero. Goddamnit, why am I even discussing this with you. Are you a troll ?? > Your logic fails here - it is exactly as small as I wish it to be, > because I get to choose the interval of time (the "scale") involved. No, you don't. You're as smart with numbers as drive manufacturers are, letting people believe it is okay to sell a drive with an MTBF of 300,000 hours, yet with a one-year warranty. I say if you trust your own MTBF put your money where your mouth is and extend the warranty to something believable. You do the same thing here. I can also make such calculations: I can safely say that at this precise second (note I say second) you are not thinking clearly. I now can prove that you never are thinking clearly, simply by making time interval t of one second limit zero, and hey, whaddayaknow, your intellect goes to zero too. Neat trick huh ? > If your jaw were to drop any lower it would drag on the ground :(. This > really demonstrates amazing ignorance of very elementary high school > math. That is because I came here as a linux admin, not a math whiz. I think we have already established that you do not surpass my admin skills, in another branch of this thread, yes ? (boy is that ever an understatement !) A branch which you, wisely, left unanswered after at least two people besides myself pointed out to you how fscked up (pun intended ;) your server rooms and / or procedures are. So now you concentrate on the math angle, where you can shine your Cambridge medals (whatever those are still worth, in light of this) and outshine "baby math" people all you want. I got news for you: I may not be fluent anymore in math terminology, but I certainly have the intellect and intelligence to detect and expose a bullshitter. > > Perhaps you've forgotten it all! Then how do you move your hand from > point A to point B? How do you deal with the various moments of inertia > involved, and the feedback control, all under the affluence of incohol > and gravity too? Well that's simple. I actually obey the law. Such as the law of gravity, and the law that says any outcome of a probability calculation cannot be other than between zero and one, inclusive. You clearly do not. (though I still think you obey the law of gravity, since obviously you're still here) Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 13:24 ` Maarten @ 2005-01-04 14:05 ` Peter T. Breuer 2005-01-04 15:31 ` Maarten 0 siblings, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:05 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > > > maarten <maarten@ultratux.net> wrote: > > > It would make sense to a 16 year old, since that's about where you get > > to be certified as competent in differential calculus and probability > > theory, if my memory of my high school math courses is correct. This is > > pre-university stuff by a looooooong way. > > Oh wow. So you deduced I did not study math at university ? Well, I deduced that you did not get to the level expected of a 16 year old. > Well, that IS an eye-opener for me. I was unaware studying math was a One doesn't "study" math, one _does_ math, just as one _does_ walking down the street, talking, and opening fridge doors. Your competency at it gets certified in school and uni, that's all. > requirement to engage in conversation on the linux-raid mailinglist ? Looks like it, or else one gets bogged down in inane conversations at about the level of "what is an editor". Look - a certain level of mathematical competence is required in the technical world. You cannot get away without it. Being able to do math to the level expected of an ordinary 16 year old is certainly expected by me in order to be able to talk, walk, and eat pizza with you as an ordinary human being. As to the Poisson distribution, I know it forms part of the university syllabus even for laboratory techs first degrees, because the head tech here, who has been doing his degree in tech stuff here while he admins, used to keep coming and asking me things like "how do I calculate the expected time between two collisions given that packets are emitted from both ends according to a Poisson distribution with mean mu". And that's ordinary 16 year-old math stuff. Stands to reason they would teach it in a university at a crummy place like this. > Or is this not the list I think it is ? It's not the list you think it is, I think. > > The problem is that I never have a 9-year old child available when I > > need one ... > > Um, check again... he's sitting right there with you I think. OK! Now he can explain this man page to me, which they say a 9-year old child can understand (must be the solaris man page for "passwd" again). > > Therefore you forget it. All of differential calculus works like that. > > Forget the square term - it vanishes. All terms of the series beyond > > the first can be ignored as you go to the limiting situation. > > And that is precisely what false assumption you're making ! We ARE not going There is no false assumption. This is "precisely" what you are getting wrong. I do not want to argue with you, I just want you to GET IT. > to the limiting situation. Yes we are. > We are discussing the probabilities in failures of > media. PER UNIT TIME, for gawd's sake. Choose a small unit. One small enough to please you. Then make it "vanishingly small", slowly, please. Then we can all take a rest while you get it. > You cannot assume we will be talking about harddisks, and neither is I don't. > the failure rate in harddisk anywhere near limit zero. Eh? It's tiny per tiny unit of time. 
As you would expect, naively. > Drive manufacturers > might want you to believe that through publishing highly theoretical MTBFs, If the MTBF is one year (and we are not discussing failure, but error), then the probability of a failure PER DAY is 1/365, or about 0.003. That would make the square of that probability PER DAY about 0.00001, or negligible on the scale of the linear term. This _is_ mathematics. And if you want to consider the probability PER HOUR, then the probability of failure is about 0.0001. Per minute it is about 0.000002. Per second it is about 0.00000003. The square term is about 0.000000000000001, or completely negligible. And that is failure, not error. But we don't care. Just take a time unit that is pleasingly small, and consider the probability per THAT unit. And please keep the unit to yourself. > but that doesn't make it so that any harddrive has a life expectancy of 20+ > years, as the daily facts prove all the time. It does mean it. It means precisely that (given certain experimental conditions). If you want to calculate the MTBF in a real dusty noisy environment, I would say it is about ten years. That is, 10% chance of failure per year. If they say it is 20 years and not 10 years, well I believe that too, but they must be keeping the monkeys out of the room. > You cannot assume p to be vanishingly small. I don't have to - make it so. Then yes, I can. > Maybe p really is the failure > rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it is You simply are spouting nonsense. Please cease. It is _painful_. Like hearing somebody trying to sing when they cannot sing, or having to admire amateur holiday movies. P is vanishingly small when you make it so. Do so! It doesn't require anything but a choice on your part to choose a sufficiently small unit of time to scale it to. > the failure rate of wet floppydisks. You cannot make assumptions about p. I don't. You on the other hand ... > The nice thing in math is that you can make great calculations when you > assume a variable is limit zero or limit infinite. The bad thing is, you Complete and utter bullshit. You ought to be ashamed. > cannot assume that things in real life act like predefined math variables. Rest of crank math removed. One can't reason with people who simply don't have the wherewithal to recognise that the problem is inside them. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
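The conversion being used here can be written down once and for all (standard exponential-lifetime bookkeeping, not anything specific to disks): if failures arrive at a constant rate with mean time between failures M, the probability of seeing one in an interval \Delta t is

    p(\Delta t) = 1 - e^{-\Delta t / M} \approx \frac{\Delta t}{M} \qquad (\Delta t \ll M) ,

so for M = 1 year, p is about 1/365, i.e. roughly 0.003 per day, and roughly 3 x 10^-8 per second; the squared terms that keep being argued about are of order (\Delta t / M)^2, which is why they vanish as the interval shrinks while the linear term does not.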
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:05 ` Peter T. Breuer @ 2005-01-04 15:31 ` Maarten 2005-01-04 16:21 ` Peter T. Breuer 2005-01-04 19:57 ` Mikael Abrahamsson 0 siblings, 2 replies; 92+ messages in thread From: Maarten @ 2005-01-04 15:31 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 15:05, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote: > > > maarten <maarten@ultratux.net> wrote: > > > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > > > > maarten <maarten@ultratux.net> wrote: > > Well, that IS an eye-opener for me. I was unaware studying math was a > > One doesn't "study" math, one _does_ math, just as one _does_ walking > down the street, talking, and opening fridge doors. Your competency > at it gets certified in school and uni, that's all. I know a whole mass of people who can't calculate what chance the toss of a coin has. Or who don't know how to verify their change is correct. So it seems math is not an essential skill, like walking and talking are. I'll not even go into gambling, which is immensely popular. I'm sure there are even mathematicians who gamble. How do you figure that ?? > > but that doesn't make it so that any harddrive has a life expectancy of > > 20+ years, as the daily facts prove all the time. > > It does mean it. It means precisely that (given certain experimental > conditions). If you want to calculate the MTBF in a real dusty noisy > environment, I would say it is about ten years. That is, 10% chance of > failure per year. > > If they say it is 20 years and not 10 years, well I believe that too, > but they must be keeping the monkeys out of the room. Nope, not 10 years, not 20 years, not even 40 years. See this Seagate sheet below where they go on record with a whopping 1,200,000 hours MTBF. That translates to 137 years. Now can you please state here and now that you actually believe that figure ? Because it would show that you have indeed fully and utterly lost touch with reality. No sane human being would take Seagate at their word, seeing as we all experience many, many more drive failures within the first 10 years, let alone 20, to even remotely support that outrageous MTBF claim. All this goes to show -again- that you can easily make statistics which do not resemble anything remotely possible in real life. Seagate determines MTBF by setting up 1,200,000 disks, running them for one hour, applying some magic extrapolation wizardry which should (but clearly doesn't) properly account for aging, and hey presto, we've designed a drive with a statistical average life expectancy of 137 years. Hurray. Any reasonable person will ignore that MTBF as gibberish, and many people would probably even state that NONE of those drives will still work after 137 years. (too bad there's no-one to collect the prize money) So, the trick Seagate does is akin to your trick of defining t as small as you like and [then] proving that p goes to zero. Well newsflash, you can't determine anything useful from running 1000 drives for one hour, and probably even less from running 3,600,000 drives for one second. The idea alone is preposterous. http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,551,00.html > > Maybe p really is the failure > > rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it > > is > > You simply are spouting nonsense. Please cease. > It is _painful_. Like > hearing somebody trying to sing when they cannot sing, or having to > admire amateur holiday movies. Nope. I want you to provide a formula which shows how likely a failure is. It is entirely my prerogative to test that formula with media with a massive failure rate. I want to build a raid-1 array out of 40 pieces of 5.25" 25-year-old floppy drives, and who's stopping me? What is my expected failure rate ? > Rest of crank math removed. One can't reason with people who simply > don't have the wherewithal to recognise that the problem is inside > them. This sentence could theoretically equally well apply to you, couldn't it ? Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:31 ` Maarten @ 2005-01-04 16:21 ` Peter T. Breuer 2005-01-04 20:55 ` maarten 2005-01-04 19:57 ` Mikael Abrahamsson 1 sibling, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 16:21 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > I'll not even go into gambling, which is immensely popular. I'm sure there > are even mathematicians who gamble. How do you figure that ?? I know plenty who do. They win. A friend of mine made his living at the institute of advanced studies at princeton for two years after his grant ran out by winning at blackjack in casinos all over the states. (never play him at poker! I used to lose all my matchsticks ..) > > If they say it is 20 years and not 10 years, well I believe that too, > > but they must be keeping the monkeys out of the room. > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate sheet > below where they go on record with a whopping 1200.000 hours MTBF. That > translates to 137 years. I believe that too. They REALLY have kept the monkeys well away. They're only a factor of ten out from what I think it is, so I certainly believe them. And they probably discarded the ones that failed burn-in too. > Now can you please state here and now that you > actually believe that figure ? Of course. Why wouldn't I? They are stating something like 1% lossage per year under perfect ideal conditions, no dust, no power spikes, no a/c overloads, etc. I'd easily belueve that. > Cause it would show that you have indeed > fully and utterly lost touch with reality. No sane human being would take > seagate for their word seen as we all experience many many more drive > failures within the first 10 years, Of course we do. Why wouldn't we? That doesn't make their figures wrong! > let alone 20, to even remotely support > that outrageous MTBF claim. The number looks believable to me. Do they reboot every day? I doubt it. It's not outrageous. Just optimistic for real-world conditions. (And yes, I have ten year old disks, or getting on for it, and they still work). > All this goes to show -again- that you can easily make statistics which do not No, it means that statistics say what they say, and I understand them fine, thanks. > resemble anything remotely possible in real life. Seagate determines MTBF by > setting up 1.200.000 disks, running them for one hour, applying some magic > extrapolation wizardry which should (but clearly doesn't) properly account > for aging, and hey presto, we've designed a drive with a statistical average > life expectancy of 137 years. Hurray. That's a fine technique. It's perfectly OK. I suppose they did state the standard deviation of their estimator? > Any reasonable person will ignore that MTBF as gibberish, No they wouldn't - it looks a perfectly reasonable figure to me, just impossibly optimisitic for the real world, which contains dust, water vapour, mains spikes, reboots every day, static electrickery, and a whole load of other gubbins that doesn't figure in their tests at all. > and many people > would probably even state as much as that NONE of those drives will still > work after 137 years. (too bad there's no-one to collect the prize money) They wouldn't expect them to. If the mtbf is 137 years, then of a batch of 1000, approx 0.6 and a bit PERCENT would die per year. Now you get to multiply. 99.3^n % is ... 
well, anyway, it isn't linear, but they would all be expected to die out by 137y. Anyone got some logarithms? > So, the trick seagate does is akin to your trick of defining t as small as you Nonsense. Please stop this bizarre crackpottism of yours. I don't have any numerical disabilities, and if you do, that's your problem, and it should give you a guide where you need to work to improve. > Nope. I want you to provide a formula which shows how likely a failure is. That's your business. But it doesn't seem likely that you'll manage it. > It is entirely my prerogative to test that formula with media with a massive > failure rate. I want to build a raid-1 array out of 40 pieces of 5.25" > 25-year old floppy drives, and who's stopping me. > What is my expected failure rate ? Oh, about 20 times the failure rate with one floppy. If the mtbf for one floppy is x (so the probability of failure is p = 1/x per unit time), then the raid will fail after two floppies die, which is expected to be at APPROX 1/(40p) + 1/(39p) = x(1/40 + 1/39) or approximately x/20 units of time from now (I should really tell you the expected time to the second event in a poisson distro, but you can do that for me ..., I simply point you to the crude calculation above as being roughly good enough). It will last one twentieth as long as a single floppy (thanks to the redundancy). > > Rest of crank math removed. One can't reason with people who simply > > don't have the wherewithal to recognise that the problem is inside > > them. > > This sentence could theoretically equally well apply to you, couldn't it ? !! Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
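Peter's x/20 figure can be checked with a few lines of Python. This is only a sketch under his stated assumptions (exponential, constant-rate floppy lifetimes, and an array counted as dead at the second drive failure); the function name is made up for illustration.

    # Expected time to the k-th failure among n independent drives whose
    # lifetimes are exponential with mean `mtbf`: the sum of the spacings
    # 1/(n*rate) + 1/((n-1)*rate) + ...
    def expected_time_to_kth_failure(n_drives, mtbf, k):
        rate = 1.0 / mtbf
        return sum(1.0 / ((n_drives - i) * rate) for i in range(k))

    x = 1.0  # MTBF of a single floppy, in arbitrary units
    # 1/40 + 1/39 ~= 0.0506 * x, i.e. roughly x/20, as in the message above
    print(expected_time_to_kth_failure(40, x, 2))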
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:21 ` Peter T. Breuer @ 2005-01-04 20:55 ` maarten 2005-01-04 21:11 ` Peter T. Breuer 2005-01-04 21:38 ` Peter T. Breuer 0 siblings, 2 replies; 92+ messages in thread From: maarten @ 2005-01-04 20:55 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate > > sheet below where they go on record with a whopping 1200.000 hours MTBF. > > That translates to 137 years. > > I believe that too. They REALLY have kept the monkeys well away. > They're only a factor of ten out from what I think it is, so I certainly > believe them. And they probably discarded the ones that failed burn-in > too. > > > Now can you please state here and now that you > > actually believe that figure ? > > Of course. Why wouldn't I? They are stating something like 1% lossage > per year under perfect ideal conditions, no dust, no power spikes, no > a/c overloads, etc. I'd easily believe that. No spindle will take 137 years of abuse at the incredibly high speed of 10000 rpm and not show enough wear so that the heads will either collide with the platters or read on adjacent tracks. Any mechanic can tell you this. I don't care what kind of special diamond bearings you use, it's just not feasible. We could even start a debate of how much decay we would see in the silicon junctions in the chips, but that is not useful nor on-topic. Let's just say that the transistor has barely existed for 50 years and it is utter nonsense to try to say anything meaningful about what 137 years will do to semiconductors and their molecular structures over that vast a timespan. Remember, it was not too long ago they said CDs were indestructible (by time elapsed, not by force, obviously). And look what they say now. I don't see where you come up with 1% per year. Remember that MTBF means MEAN time between failures, so for every single drive that dies in year one, one other drive has to double its life expectancy to twice 137, which is 274 years. If your reasoning is correct with one drive dying per year, the remaining bunch after 50 years will have to survive another 250(!) years, on average. ...But wait, you're still not convinced, eh ? Also, I'm not used to big data centers buying disks by the container, but from what I've heard no-one can actually say that they lose as little as 1 drive a year for any hundred drives bought. Those figures are (much) higher. You yourself said in a previous post you expected 10% per year, and that is WAY off the 1% mark you now state 'believable'. How come ? > > Cause it would show that you have indeed > > fully and utterly lost touch with reality. No sane human being would > > take seagate for their word seen as we all experience many many more > > drive failures within the first 10 years, > > Of course we do. Why wouldn't we? That doesn't make their figures > wrong! Yes it does. By _definition_ even. It clearly shows that one cannot account for tens, nay hundreds, of years wear and tear by just taking a very small sample of drives and having them tested for a very small amount of time. Look, _everybody_ knows this. No serious admin will not change their drives after five years as a rule, or 10 years at the most. And that is not simply due to Moore's law. The failure rate just gets too high, and economics dictate that they must be decommissioned.
After "only" 10 years...! > > let alone 20, to even remotely support > > that outrageous MTBF claim. > > The number looks believable to me. Do they reboot every day? I doubt Of course they don't. They never reboot. MTBF is not measured in adverse conditions. Even so, neither do disks in a data centre... > it. It's not outrageous. Just optimistic for real-world conditions. > (And yes, I have ten year old disks, or getting on for it, and they > still work). Some of 'em do, yes. Not all of them. (to be fair, the MTBF in those days was much lower than now (purportedly)). > > All this goes to show -again- that you can easily make statistics which > > do not > > No, it means that statistics say what they say, and I understand them > fine, thanks. Uh-huh. So explain to me why drive manufacturers do not give a 10 year warranty. I say because they know full well that they would go bankrupt if they did since not 8% but rather 50% or more would return in that time. > > resemble anything remotely possible in real life. Seagate determines > > MTBF by setting up 1.200.000 disks, running them for one hour, applying > > some magic extrapolation wizardry which should (but clearly doesn't) > > properly account for aging, and hey presto, we've designed a drive with a > > statistical average life expectancy of 137 years. Hurray. > > That's a fine technique. It's perfectly OK. I suppose they did state > the standard deviation of their estimator? Call them and find out; you're the math whiz. And I'll say it again: if some statistical technique yields wildly different results than the observable, verifiable real world does, then there is something wrong with said technique, not with the real world. The real world is our frame of reference, not some dreamed-up math model which attempts to describe the world. And if they do collide, a math theory gets thrown out, not the real world observations instead...! > > Any reasonable person will ignore that MTBF as gibberish, > > No they wouldn't - it looks a perfectly reasonable figure to me, just > impossibly optimistic for the real world, which contains dust, water > vapour, mains spikes, reboots every day, static electrickery, and a > whole load of other gubbins that doesn't figure in their tests at all. Test labs have dust, water vapours and mains spikes too, albeit as little as possible. They're testing on earth, not on some utopian other parallel world. Good colos do a good job of eliminating most adverse effects. In any case, dust is not a great danger to disks (but it is to fans), heat is. Especially quick heat buildup, hence power cycles are amongst the worst. Drives don't really like the expansion of materials that occurs when temperatures rise, nor the extra friction that a higher temperature entails. > > and many people > > would probably even state as much as that NONE of those drives will still > > work after 137 years. (too bad there's no-one to collect the prize money) > > They wouldn't expect them to. If the mtbf is 137 years, then of a batch > > of 1000, approx 0.7 and a bit PERCENT would die per year. Now you get > > to multiply. 99.3^n % is ... well, anyway, it isn't linear, but they > > would all be expected to die out by 137y. Anyone got some logarithms? Look up what the word "mean" from mtbf means, and recompute. Maarten -- When I answered where I wanted to go today, they just hung up -- Unknown ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:55 ` maarten @ 2005-01-04 21:11 ` Peter T. Breuer 2005-01-04 21:38 ` Peter T. Breuer 1 sibling, 0 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 21:11 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote: > > Maarten <maarten@ultratux.net> wrote: > > > > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate > > > sheet below where they go on record with a whopping 1200.000 hours MTBF. > > > That translates to 137 years. > > > > I believe that too. They REALLY have kept the monkeys well away. > > They're only a factor of ten out from what I think it is, so I certainly > > believe them. And they probably discarded the ones that failed burn-in > > too. > > > > > Now can you please state here and now that you > > > actually believe that figure ? > > > > Of course. Why wouldn't I? They are stating something like 1% lossage > > per year under perfect ideal conditions, no dust, no power spikes, no > > a/c overloads, etc. I'd easily belueve that. > > No spindle will take 137 years of abuse at the incredibly high speed of 10000 > rpm and not show enough wear so that the heads will either collide with the Nor does anyone say it will! That's the mtbf, that's all. It's a parameter in a statistical distribrution. The inverse of the probability of failure per unit time. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
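For reference, the "about 1% per year" figure that keeps coming up converts from the quoted MTBF like this, under the constant-failure-rate (exponential) model Peter is describing. A minimal sketch of the conversion only, not anything taken from a vendor data sheet:

    import math

    def annual_failure_probability(mtbf_hours):
        # P(failure within one year) for an exponential lifetime with the
        # quoted MTBF; about 8766 hours in an average year.
        hours_per_year = 24 * 365.25
        return 1.0 - math.exp(-hours_per_year / mtbf_hours)

    # 1,200,000 hour MTBF -> ~0.0073, i.e. about 0.7% per year under ideal
    # conditions, which is the "something like 1%" figure in this thread.
    print(annual_failure_probability(1200000))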
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:55 ` maarten 2005-01-04 21:11 ` Peter T. Breuer @ 2005-01-04 21:38 ` Peter T. Breuer 2005-01-04 23:29 ` Guy 1 sibling, 1 reply; 92+ messages in thread From: Peter T. Breuer @ 2005-01-04 21:38 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > I don't see where you come up with 1% per year. Because that is 1/137 approx (hey, is that Planck's constant or something...) > Remember that MTBF means MEAN > time between failures, I.e. it's the inverse of the probability of failure per unit time, in a Poisson distribution. A Poisson distribution only has one parameter and that's it! The standard deviation is that too. No, I don't recall the third moment offhand. > so for every single drive that dies in year one, one > other drive has to double its life expectancy to twice 137, which is 274 Complete nonsense. Please go back to remedial statistics. > years. If your reasoning is correct with one drive dying per year, the Who said that? I said the probability of failure is 1% per year. Not one drive per year! If you have a hundred drives, you expect about one death in the first year. > remaining bunch after 50 years will have to survive another 250(!) years, on > average. ...But wait, you're still not convinced, eh ? Complete and utter disgraceful nonsense! Did you even get as far as the 11-year-old standard in your math? > Also, I'm not used to big data centers buying disks by the container, but from > what I've heard no-one can actually say that they lose as little as 1 drive a > year for any hundred drives bought. Those figures are (much) higher. Of course not - I would say 10% myself. A few years ago it was 20%, but I believe that recently the figure may have fallen as low as 5%. That's perfectly consistent with their spec. > You yourself said in a previous post you expected 10% per year, and that is > WAY off the 1% mark you now state 'believable'. How come ? Because it is NOT "way off the 1% mark". It is close to it! Especially when you bear in mind that real disks are exposed to a much more stressful environment than the manufacturers' testing labs. Heck, we can't even get FANs that last longer than 3 to 6 months in the atmosphere here (dry, thin, dusty, heat reaching 46C in summer, dropping below zero in winter). Is the problem simply "numbers" with you? > > Of course we do. Why wouldn't we? That doesn't make their figures > > wrong! > > Yes it does. By _definition_ even. No it doesn't. > It clearly shows that one cannot account > for tens, nay hundreds, of years wear and tear by just taking a very small > sample of drives and having them tested for a very small amount of time. Nor does anyone suggest that one should! Where do you get this from? Of course their figures don't reflect your environment, or mine. If you want to duplicate their figures, you have to duplicate their environment! Ask them how, if you're interested. > Look, _everybody_ knows this. No serious admin will not change their drives > after five years as a rule, Well, three years is when we change, but that's because everything is changed every three years, since it depreciates to zero in that time, in accounting terms. But I have ten year old disks working fine (says he, wincing at the seagate fireballs and barracudas screaming ..). > or 10 years at the most. And that is not simply > due to Moore's law.
The failure rate just gets too high, and economics > dictate that they must be decommissioned. After "only" 10 years...! Of course! So? I really don't see why you think that is anything to do with the mtbf, which is only the single parameter telling you what the scale of the poisson distribution for moment-to-moment failure is! I really don't get why you don't get this! Don't you know what the words mean? Then it's no wonder that whatever you say around the area makes very little sense, and why you have the feeling that THEY are saying nonsense, rather than that you are UNDERSTANDING nonsense, which is the case! Please go and learn some stats! > > No, it means that statistics say what they say, and I understand them > > fine, thanks. > > Uh-huh. So explain to me why drive manufacturers do not give a 10 year > warranty. Because if they did they would have to replace 100% of their disks. If there is a 10y mtbf in the real world (as I estimate), then very few of them would make it to ten years. > I say because they know full well that they would go bankrupt if > they did since not 8% but rather 50% or more would return in that time. No, the mtbf in our conditions is somewhere like 10y. That means that almost none would make it to ten years. 10% would die each year. 90% would remain. After 5 years 59% would remain. After 10 years 35% would remain. > > That's a fine technique. It's perfectly OK. I suppose they did state > > the standard deviation of their estimator? > > Call them and find out; you're the math whiz. It doesn't matter. It's good enough as a guide. > And I'll say it again: if some statistical technique yields wildly different > results than the observable, verifiable real world does, then there is But it doesn't! > something wrong with said technique, not with the real world. They are not trying to estimate the mtbf in YOUR world, but in THEIRS. Those are different. If you want to emulate them, so be it! I don't. > The real world is our frame of reference, not some dreamed-up math model which There is nothing wrong with their model. It doesn't reflect your world. > attempts to describe the world. And if they do collide, a math theory gets > thrown out, not the real world observations instead...! You are horribly confused! Please do not try and tell real statisticians that YOU do not understand the model, and that therefore THEY should change them. You can simply understand them. They are only applying acceleration techniques. They take 1000 disks and run them for a year - if 10 die, then they know the mtbf is about 100y. That does not mean that disks will last a 100y! It means that 10 in every thousand will die within one year. That seems to be your major confusion. And that's only for starters. They then have to figure out how the mtbf changes with time! But really, you don't care about that, since it's only the mtbf during the first five years that you care about, as you said. So what are you on about? > > They wouldn't expect them to. If the mtbf is 137 years, then of a batch > > of 1000, approx 0.7 and a bit PERCENT would die per year. Now you get > > to multiply. 99.3^n % is ... well, anyway, it isn't linear, but they > > would all be expected to die out by 137y. Anyone got some logarithms? > > Look up what the word "mean" from mtbf means, and recompute. I know what it means - you don't. It is the inverse of the probability of failure in any moment of time.
A strange way of stating that parameter, but then I guess it's just that people are more used to seeing it expressed in ohms than in mhos. Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
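Peter's 59% and 35% figures are just compound decay at a fixed 10% annual loss rate. A minimal sketch of that calculation, under the same no-wear-out assumption the rest of the thread is arguing about:

    def surviving_fraction(annual_loss, years):
        # Fraction of a batch still running after `years`, assuming a fixed
        # fraction `annual_loss` of the survivors dies each year.
        return (1.0 - annual_loss) ** years

    print(surviving_fraction(0.10, 5))   # ~0.59 -> "After 5 years 59% would remain"
    print(surviving_fraction(0.10, 10))  # ~0.35 -> "After 10 years 35% would remain"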
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Peter T. Breuer @ 2005-01-04 23:29 ` Guy 0 siblings, 0 replies; 92+ messages in thread From: Guy @ 2005-01-04 23:29 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid I think in this example, if you had 137 disks, you should expect an average of 1 failed drive per year. But I would bet that after 5 years you would have many more than 5 failed disks! Guy
^ permalink raw reply [flat|nested] 92+ messages in thread
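Guy's one-drive-a-year expectation is the same constant-rate arithmetic scaled to a fleet; a sketch, again ignoring the wear-out that he expects to dominate after a few years:

    def expected_failures_per_year(n_disks, mtbf_years):
        # Constant-rate model: each disk fails with probability ~1/mtbf_years
        # per year, so expected failures scale linearly with fleet size.
        return n_disks / mtbf_years

    print(expected_failures_per_year(137, 137.0))  # 1.0 -> about one failed drive per year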
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:31 ` Maarten 2005-01-04 16:21 ` Peter T. Breuer @ 2005-01-04 19:57 ` Mikael Abrahamsson 2005-01-04 21:05 ` maarten 1 sibling, 1 reply; 92+ messages in thread From: Mikael Abrahamsson @ 2005-01-04 19:57 UTC (permalink / raw) To: linux-raid On Tue, 4 Jan 2005, Maarten wrote: > failures within the first 10 years, let alone 20, to even remotely support > that outrageous MTBF claim. One should note that environment seriously affects MTBF, even on non-movable parts, and probably even more on movable parts. I've talked to people in the reliability business, and they use models that say that MTBF for a part at 20 C as opposed to 40 C can differ by a factor of 3 or 4, or even more. A lot of people skimp on cooling and then get upset when their drives fail. I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will have less than 1/10th of that at 55-60C. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 92+ messages in thread
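Mikael's factor of 3 or 4 between 20 C and 40 C is in the range an Arrhenius-style temperature model gives. The sketch below assumes an illustrative activation energy of 0.5 eV, which is not a figure quoted anywhere in this thread or by any drive vendor here:

    import math

    K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV per kelvin

    def arrhenius_acceleration(t_cool_c, t_hot_c, activation_ev=0.5):
        # Failure-rate acceleration factor between two operating temperatures
        # under the Arrhenius model; 0.5 eV is an assumed, illustrative value.
        t_cool = t_cool_c + 273.15
        t_hot = t_hot_c + 273.15
        return math.exp((activation_ev / K_BOLTZMANN_EV) * (1.0 / t_cool - 1.0 / t_hot))

    print(arrhenius_acceleration(20.0, 40.0))  # ~3.5: close to Mikael's factor of 3-4
    print(arrhenius_acceleration(25.0, 60.0))  # ~7.7: most of the 10x he guesses at 55-60C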
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:57 ` Mikael Abrahamsson @ 2005-01-04 21:05 ` maarten 2005-01-04 21:26 ` Alvin Oga 2005-01-04 21:46 ` Guy 0 siblings, 2 replies; 92+ messages in thread From: maarten @ 2005-01-04 21:05 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 20:57, Mikael Abrahamsson wrote: > On Tue, 4 Jan 2005, Maarten wrote: > > failures within the first 10 years, let alone 20, to even remotely > > support that outrageous MTBF claim. > > One should note that environment seriously affects MTBF, even on > non-movable parts, and probably even more on movable parts. Yes. Heat especially above all else. > I've talked to people in the reliability business, and they use models > that say that MTBF for a part at 20 C as opposed to 40 C can differ by a > factor of 3 or 4, or even more. A lot of people skimp on cooling and then > get upset when their drives fail. > > I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will > have less than 1/10th of that at 55-60C. Yes. I know that full well. Therefore my server drives are mounted directly behind two monstrous 12cm fans... I don't take no risks. :-) Still, two western digitals have died within the first or second year in that enclosure. So much for MTBF vs. real world expectancy I guess. It should be public knowledge by now that heat is the number 1 killer for harddisks. However, you still see PC cases everywhere where disks are sandwiched together and with no possible airflow at all. Go figure... Maarten -- ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:05 ` maarten @ 2005-01-04 21:26 ` Alvin Oga 2005-01-04 21:46 ` Guy 1 sibling, 0 replies; 92+ messages in thread From: Alvin Oga @ 2005-01-04 21:26 UTC (permalink / raw) To: maarten; +Cc: linux-raid On Tue, 4 Jan 2005, maarten wrote: > Yes. I know that full well. Therefore my server drives are mounted directly > behind two monstrous 12cm fans... I don't take no risks. :-) exactly... lots of air for the drives ( treat it like a cpu ) in that it should be kept as cool as possible > Still, two western digitals have died within the first or second year in that > enclosure. So much for MTBF vs. real world expectancy I guess. wd is famous for various reasons .. > It should be public knowledge by now that heat is the number 1 killer for > harddisks. However, you still see PC cases everywhere where disks are > sandwiched together and with no possible airflow at all. Go figure... it's a conspiracy, to get you/us to buy new disks when the old one dies but if we all kept a 3" fan cooling each disk ... inside the pcs, there'd be fewer disk failures - and equal amounts of fresh cooler air coming in as hot air going out c ya alvin ^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:05 ` maarten 2005-01-04 21:26 ` Alvin Oga @ 2005-01-04 21:46 ` Guy 1 sibling, 0 replies; 92+ messages in thread From: Guy @ 2005-01-04 21:46 UTC (permalink / raw) To: 'maarten', linux-raid I have a PC with 2 disks; they are much too hot to touch for more than a second. The system has been like that for 3-4 years. I have no idea how they lasted so long! One is an IBM, the other a Seagate. Both are 18 Gig SCSI disks. The Seagate is 10,000 RPM. As you said: "Go figure..."! :) Guy ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten 2005-01-03 19:52 ` maarten @ 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 23:05 ` Guy ` (2 more replies) 2005-01-03 21:36 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy 2 siblings, 3 replies; 92+ messages in thread From: Peter T. Breuer @ 2005-01-03 20:22 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > The chance of a PSU blowing up or lightning striking is, reasonably, much less > than an isolated disk failure. If this simple fact is not true for you Oh? We have about 20 a year. Maybe three of them are planned. But those are the worst ones! - the electrical department's method of "testing" the lines is to switch off the rails then pulse them up and down. Surge tests or something. When we can we switch everything off beforehand. But then we also get to deal with the amateur contributions from the city power people. Yes, my PhD is in electrical engineering. Have I sent them sarcastic letters explaining how to test lines using a dummy load? Yes. Does the physics department also want to place them in a vat of slowly reheating liquid nitrogen? Yes. Does it make any difference? No. I should have kept the letter I got back when I asked them exactly WHAT it was they thought they had been doing when they sent round a pompous letter explaining how they had been up all night "helping" the town power people to get back on line, after an outage took out the half-million or so people round here. Waiting for the phonecall saying "you can turn it back on now", I think. That letter was a riot. I plug my stuff into the ordinary mains myself. It fails less often than the "secure circuit" plugs we have that are meant to be wired to their smoking giant UPS that apparently takes half the city output to power up. > personally, you really ought to reevaluate the quality of your PSU (et al) > and / or the buildings' defenses against a lightning strike... I don't think so. You can argue with the guys with the digger tool and a weekend free. > > However, I don't see how you can expect to replace a failed disk > > without taking down the system. For that reason you are expected to be > > running "spare disks" that you can virtually insert hot into the array > > (caveat, it is possible with scsi, but you will need to rescan the bus, > > which will take it out of commission for some seconds, which may > > require you to take the bus offline first, and it MAY be possible with > > recent IDE buses that purport to support hotswap - I don't know). > > I think the point is not what actions one has to take at time T+1 to replace > the disk, but rather whether at time T, when the failure first occurs, the > system survives the failure or not. > > > (1) how likely is it that a disk will fail without taking down the system > > (2) how likely is it that a disk will fail > > (3) how likely is it that a whole system will fail > > > > I would say that (2) is about 10% per year. I would say that (3) is > > about 1200% per year. It is therefore difficult to calculate (1), which > > is your protection scenario, since it doesn't show up very often in the > > stats! > > I don't understand your math. For one, percentage is measured from 0 to 100, No, it's measured from 0 to infinity. Occasionally from negative infinity to positive infinity. 
Did I mention that I have two degrees in pure mathematics? We can discuss nonstandard interpretations of Peano's axioms then. > not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that > something will occur ? Yep. Approximately. Otherwise known as the expectation that twelve events will occur per year. One a month. I would have said "one a month" if I had not been being precise. > But besides that, I'd wager that from your list number (3) has, by far, the > smallest chance of occurring. Except of course, that you would lose, since not only did I SAY that it had the highest chance, but I gave a numerical estimate for it that is 120 times as high as that I gave for (1). > Choosing between (1) and (2) is more difficult, Well, I said it doesn't matter, because everything is swamped by (3). > my experiences with IDE disks are definitely that it will take the system > down, but that is very biased since I always used non-mirrored swap. It's the same principle. There exists a common mode for failure. Bayesian calculations then tell you that there is a strong likelihood of the whole system coming down in conjunction with the disk coming down. > I sure can understand a system dying if it loses part of its memory... > > > > ** A disk failing is the most common failure a system can have (IMO). > > I fully agree. > > > Not in my experience. See above. I'd say each disk has about a 10% > > failure expectation per year. Whereas I can guarantee that an > > unexpected system failure will occur about once a month, on every > > important system. There you are. I said it again. > Whoa ! What are you running, windows perhaps ?!? ;-) No. Ordinary hardware. > No but seriously, joking aside, you have 12 system failures per year ? At a very minimum. Almost none of them caused by hardware. Hey, I even took down my own home server by accident over new year! Spoiled its 222-day uptime. > I would not be alone in thinking that figure is VERY high. My uptimes It isn't. A random look at servers tells me:
bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48
balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05
dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00
guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04
itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01
itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16
lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08
lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23
nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02
oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00
piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00
trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00
violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00
xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00
xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67
(one net). Looks like a major power outage 77 days ago, and a smaller event 24 and 20 days ago. The event at 20 days ago looks like sysadmins. Both Trombon and Nbd survived it and they're on separate (different) UPSs. The servers which are up 77 days are on a huge UPS that Lmserv2 and Itserv2 should also be on, as far as I know. So somebody took them off the UPS within ten minutes of each other. Looks like maintenance moving racks. OK, not once every month, more like between once every 20 days and once every 77 days, say once every 45 days. > generally are in the three-digit range, and most *certainly* not in the low > 2-digit range. Well, they have no chance to be here.
There are several planned power outages a year for the electrical department to do their silly tricks with. When that happens they take the weekend over it. > > If you think about it that is quite likely, since a system is by > > definition a complicated thing. And then it is subject to all kinds of > > horrible outside influences, like people rewiring the server room in > > order to reroute cables under the floor instead of through the ceiling, > > and the maintenance people spraying the building with insecticide, > > everywhere, or just "turning off the electricity in order to test it" > > (that happens about four times a year here - hey, I remember when they > > tested the giant UPS by turning off the electricity! Wrong switch. > > Bummer). > > If you have building maintenance people and other random staff that can access > your server room unattended and unmonitored, you have far worse problems than > making decisions about raid levels. IMNSHO. Oh, they most certainly can't access the server rooms. The techs would have done that on their own, but they would (obviously) have needed to move the machines for that, and turn them off. Ah. But yes, the guy with the insecticide has the key to everywhere, and is probably a gardener. I've seen him at it. He sprays all the corners of the corridors, along the edge of the wall and floor, then does the same inside the rooms. The point is that most foul-ups are created by the humans, whether technoid or gardenoid, or hole-diggeroid. > By your description you could almost be the guy the joke with the recurring 7 > o'clock system crash is about (where the cleaning lady unplugs the server > every morning in order to plug in her vacuum cleaner) ;-) Oh, the cleaning ladies do their share of damage. They are required BY LAW to clean the keyboards. They do so by picking them up in their left hand at the lower left corner, and rubbing a rag over them. Their left hand is where the ctl and alt keys are. The solution is not to leave a keyboard in the room. Use a whaddyamacallit switch and attach one keyboard to that whenever one needs to access anything. Also use thwapping great power cables one inch thick that they cannot move. And I won't mention the learning episodes with the linux debugger monitor activated by pressing "pause". Once I watched the lady cleaning my office. She SPRAYED the back of the monitor! I YELPED! I tried to explain to her about voltages, and said that she wouldn't clean her tv at home that way - oh yes she did! > > Yes, you can try and keep these systems out of harm's way on a > > colocation site, or something, but by then you are at professional > > level paranoia. For "home systems", whole system failures are far more > > common than disk failures. > > Don't agree. You may not agree, but you would be rather wrong in persisting in that idea in the face of evidence that you can easily accumulate yourself, like the figures I randomly checked above. > Not only do disk failures occur more often than full system > failures, No they don't - by about 12 to 1. > disk failures are also much more time-consuming to recover from. No they aren't - we just put in another one, and copy the standard image over it (or in the case of a server, copy its twin, but then servers don't blow disks all that often, but when they do they blow ALL of them as well, as whatever blew one will blow the others in due course - likely heat).
> Compare changing a system board or PSU with changing a drive and finding, > copying and verifying a backup (if you even have one that's 100% up to date) We have. For one thing we have identical pairs of servers, absolutely equal, md5summed and checked. The identity-dependent scripts on them check who they are on and do the appropriate thing depending on who they find they are on. And all the clients are the same, as clients. Checked daily. > > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > > 10 or so disk failures and no other failures. > > > > Well, let's see. If each system has 2 disks, then that would be 25% per > > disk per year, which I would say indicates low quality IDE disks, but > > is about the level I would agree with as experiential. > > The point here was, disk failures being more common than other failures... But they aren't. If you have only 25% chance of failure per disk per year, then that makes system outages much more likely, since they happen at about one per month (here!). If it isn't faulty scsi cables, it will be overheating cpus. Dust in the very dry air here kills all fan bearings within 6 months to one year. My defense against that is to heavily underclock all machines. > > > No way! I hate tapes. I backup to other disks. > > Then for your sake, I hope they're kept offline, in a safe. No, they're kept online. Why? What would be the point of having them in a safe? Then they'd be unavailable! The general scheme is that sites cross-backup each other. > > > > ** My computer room is for development and testing, no customer access. > > > > Unfortunately, the admins do most of the sabotage. > > Change admins. Can't. They're as good as they get. Hey, *I* even do the sabotage sometimes. I'm probably only about 99% accurate, and I can certainly write a hundred commands in a day. > I could understand an admin making typing errors and such, but then again that > would not usually lead to a total system failure. Of course it would. You try working remotely to upgrade the sshd and finally killing off the old one, only to discover that you kill the wrong one and lock yourself out, while the deadman script on the server tries fruitlessly to restart a misconfigured server, and then finally decides after an hour to give up and reboot as a last resort, then can't bring itself back up because of something else you did that you were intending to finish but didn't get the opportunity to. > Some daemon not working, > sure. Good admins review or test their changes, And sometimes miss the problem. > for one thing, and in most > cases any such mistake is rectified much simpler and faster than a failed > disk anyway. Except maybe for lilo errors with no boot media available. ;-\ Well, you can go out to the site in the middle of the night to reboot! Changes are made out of working hours so as not to disturb the users. > > Yes you did. You can see from the quoting that you did. > > Or the quoting got messed up. That is known to happen in threads. Shrug. > > > but it may be more current than 1 or > > > more of the other disks. But this would be similar to what would happen > > > to a non-RAID disk (some data not written). > > > > No, it would not be similar. You don't seem to understand the > > mechanism. The mechanism for corruption is that there are two different > > versions of the data available when the system comes back up, and you > > and the raid system don't know which is more correct. Or even what it > > means to be "correct".
Maybe the earlier written data is "correct"! > > That is not the whole truth. To be fair, the mechanism works like this: > With raid, you have a 50% chance the wrong, corrupted, data is used. > Without raid, thus only having a single disk, the chance of using the > corrupted data is 100% (obviously, since there is only one source) That is one particular spin on it. > Or, much more elaborate: > > Let's assume the chance of a disk corruption occurring is 50%, ie. 0.5 There's no need to. Call it "p". > With raid, you always have a 50% chance of reading faulty data IF one of the > drives holds faulty data. That is, the probability of a corruption, once it has occurred, THEN being detected is 0.5. However, the probability that it occurred is 2p, not p, since there are two disks (forget the tiny p^2 possibility). So we have p = probability of corruption occurring AND it being detected. > For the drives itself, the chance of both disks > being wrong is 0.5x0.5=0.25(scenario A). Similarly, 25 % chance both disks > are good (scenario B). The chance of one of the disks being wrong is 50% > (scenarios C & D together). In scenarios A & B the outcome is certain. In > scenarios C & D the chance of the raid choosing the false mirror is 50%. > Accumulating those chances one can say that the chance of reading false data > is: > in scenario A: 100% p^2 > in scenario B: 0% 0 > scenario C: 50% 0.5p > scenario D: 50% 0.5p > Doing the math, the outcome is still (200% divided by four)= 50%. Well, it's p + p^2. But I said to neglect the square term. > Ergo: the same as with a single disk. No change. Except that it is not the case. With a single disk you are CERTAIN to detect the problem (if it is detectable) when you run the fsck at reboot. With a RAID1 mirror you are only 50% likely to detect the detectable problem, because you may choose to read the "wrong" (correct :) disk at the crucial point in the fsck. Then you have to hope that the right disk fails next, when it fails, or else you will be left holding the detectably wrong, unchecked data. So in the scenario of a single detectable corruption: A: probability of a detectable error occurring and NOT being detected on a single-disk system is zero B: probability of a detectable error occurring and NOT being detected on a two-disk system is p Cute, no? You could have deduced that from your figures too, but you were all fired up about the question of a detectable error occurring AND being detected to think about it occurring AND NOT being detected. Even though that is what interests us! "silent corruption". > > > In contrast, on a single disk they have a 100% chance of detection (if > > > you look!) and a 100% chance of occuring, wrt normal rate. > > > ** Are you talking about the disk drive detecting the error? > > No, you have a zero chance of detection, since there is nothing to compare TO. That is not the case. You have every chance in the world of detecting it - you know what fsck does. If you like we can consider detectable and undetectable errors separately. > Raid-1 at least gives you a 50/50 chance to choose the right data. With a > single disk, the chance of reusing the corrupted data is 100% and there is no > mechanism to detect the odd 'tumbled bit' at all. False. > > You wouldn't necessarily know which of the two data sources was > > "correct". > > No, but you have a theoretical choice, and a 50% chance of being right. > Not so without raid, where you get no choice, and a 100% chance of getting the > wrong data, in the case of a corruption. Review the calculation.
Peter ^ permalink raw reply [flat|nested] 92+ messages in thread
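The bookkeeping at the end of Peter's message can be written out directly. This sketch follows his simplified model only: p is the per-copy probability of a detectable corruption, a check reads each block from one copy chosen at random, and the p^2 term (both copies corrupt) is dropped, as he says.

    def silent_corruption_probability(p, copies):
        # Probability that a detectable corruption exists on some copy but a
        # single read-based check misses it because it happened to read a
        # clean copy. Higher-order terms (several copies corrupt) are ignored.
        exists = copies * p                       # ~2p for a two-way mirror
        missed_if_exists = (copies - 1) / copies  # chance the check reads the other copy
        return exists * missed_if_exists

    print(silent_corruption_probability(1e-6, 1))  # 0.0  -> single disk: a check that looks always sees it
    print(silent_corruption_probability(1e-6, 2))  # 1e-6 -> two-way mirror: "occurs and is NOT detected" = p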
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:22 ` Peter T. Breuer @ 2005-01-03 23:05 ` Guy 2005-01-04 0:08 ` maarten 2005-01-04 8:57 ` I'm glad I don't live in Spain (was Re: ext3 journal on software raid) David L. Smith-Uchida 2 siblings, 0 replies; 92+ messages in thread From: Guy @ 2005-01-03 23:05 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Peter said: "Except that it is not the case. With a single disk you are CERTAIN to detect the problem (if it is detectable) when you run the fsck at reboot." Guy says: As a guess, fsck checks less than 1% of the disk. No user data is checked. So, virtually all errors would go undetected. But a RAID system could detect the errors. And yes, RAID6 could correct a single disk error. Even multi-disk errors, as long as only 1 error per stripe/block. Your data center has problems well beyond the worst stories I have ever heard! My home systems tend to have much better uptime than any of your systems.
5:05pm up 33 days, 16:31, 1 user, load average: 0.12, 0.03, 0.01
17:05:20 up 28 days, 15:06, 1 user, load average: 0.03, 0.04, 0.00
Both were re-booted by me, not by some strange failure. When I re-booted the first one, it had over 7 months of uptime. At work I had 2 systems with over 2 years of uptime, and one of them made it to 3 years (yoda). The 3-year system was connected to the Internet and was used for customer demos. So, very low usage, but the 2-year system (lager) was an internal email server, test server, router, had Informix, ... I must admit, I rebooted the 2-year system by accident! But it was a proper reboot, not a crash. The Y2K patches did not require a re-boot. This is from an email I sent 10/26/2000: "Subject: Happy birthday Yoda! 6:13pm up 730 days, 23:59, 1 user, load average: 0.10, 0.12, 0.12 6:14pm up 731 days, 1 user, load average: 0.08, 0.12, 0.12" It was a leap year! So 1 extra day. Sent 7/5/2001: "Kathy, Yoda will be 1000 days up Jul 22 @ 18:14, that's a Sunday. Guy" We did give yoda a 3-year birthday party. It was not my idea. I told my wife about the party, my wife baked cupcakes. Again, not my idea! We moved to a different building, so everything had to be shut down. I sent this email 6/20/2002: "The final up-time report! Yoda wins with 3.64 years! Yoda will be shut down in a few minutes!
comet 4:29pm up 6 days, 49 mins, 15 users, load average: 1.31, 0.95, 0.53
trex 4:31pm up 38 days, 16:34, 13 users, load average: 0.04, 0.02, 0.02
yoda 4:31pm up 1332 days, 22:17, 0 users, load average: 0.17, 0.13, 0.12
falcon 4:31pm up 38 days, 16:29, 14 users, load average: 0.00, 0.00, 0.01
right 4:31pm up 38 days, 15:42, 7 users, load average: 2.28, 1.86, 1.68
saturn 4:31pm up 16 days, 6:14, 1 user, load average: 0.01, 0.02, 0.02
citi1 4:31pm up 63 days, 23:20, 7 users, load average: 0.27, 0.23, 0.21
lager 4:46pm up 606 days, 1:47, 5 users, load average: 0.00, 0.00, 0.00
Guy" As you can see, some of us don't have the major problems you describe. Oh, somewhere you said you have on-line disk backups. This is very bad. If you had a lightning strike, or a fire, it could take out all of your disks at the same time. Your backup copies should be off-site, tape or disk, it does not matter. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T.
Breuer Sent: Monday, January 03, 2005 3:22 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten <maarten@ultratux.net> wrote: > The chance of a PSU blowing up or lightning striking is, reasonably, much less > than an isolated disk failure. If this simple fact is not true for you Oh? We have about 20 a year. Maybe three of them are planned. But those are the worst ones! - the electrical department's method of "testing" the lines is to switch off the rails then pulse them up and down. Surge tests or something. When we can we switch everything off beforehand. But then we also get to deal with the amateur contributions from the city power people. Yes, my PhD is in electrical engineering. Have I sent them sarcastic letters explaining how to test lines using a dummy load? Yes. Does the physics department also want to place them in a vat of slowly reheating liquid nitrogen? Yes. Does it make any difference? No. I should have kept the letter I got back when I asked them exactly WHAT it was they thought they had been doing when they sent round a pompous letter explaining how they had been up all night "helping" the town power people to get back on line, after an outage took out the half-million or so people round here. Waiting for the phonecall saying "you can turn it back on now", I think. That letter was a riot. I plug my stuff into the ordinary mains myself. It fails less often than the "secure circuit" plugs we have that are meant to be wired to their smoking giant UPS that apparently takes half the city output to power up. > personally, you really ought to reevaluate the quality of your PSU (et al) > and / or the buildings' defenses against a lightning strike... I don't think so. You can argue with the guys with the digger tool and a weekend free. > > However, I don't see how you can expect to replace a failed disk > > without taking down the system. For that reason you are expected to be > > running "spare disks" that you can virtually insert hot into the array > > (caveat, it is possible with scsi, but you will need to rescan the bus, > > which will take it out of commission for some seconds, which may > > require you to take the bus offline first, and it MAY be possible with > > recent IDE buses that purport to support hotswap - I don't know). > > I think the point is not what actions one has to take at time T+1 to replace > the disk, but rather whether at time T, when the failure first occurs, the > system survives the failure or not. > > > (1) how likely is it that a disk will fail without taking down the system > > (2) how likely is it that a disk will fail > > (3) how likely is it that a whole system will fail > > > > I would say that (2) is about 10% per year. I would say that (3) is > > about 1200% per year. It is therefore difficult to calculate (1), which > > is your protection scenario, since it doesn't show up very often in the > > stats! > > I don't understand your math. For one, percentage is measured from 0 to 100, No, it's measured from 0 to infinity. Occasionally from negative infinity to positive infinity. Did I mention that I have two degrees in pure mathematics? We can discuss nonstandard interpretations of Peano's axioms then. > not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that > something will occur ? Yep. Approximately. Otherwise known as the expectaion that twelve events will occur per year. One a month. 
I would have said "one a month" if I had not been being precise. > But besides that, I'd wager that from your list number (3) has, by far, the > smallest chance of occurring. Except of course, that you would lose, since not only did I SAY that it had the highest chance, but I gave a numerical estimate for it that is 120 times as high as that I gave for (1). > Choosing between (1) and (2) is more difficult, Well, I said it doesn't matter, because everything is swamped by (3). > my experiences with IDE disks are definitely that it will take the system > down, but that is very biased since I always used non-mirrored swap. It's the same principle. There exists a common mode for failure. Bayesian calculations then tell you that there is a strong liklihood of the whole system coming down in conjunction with the disk coming down. > I sure can understand a system dying if it loses part of its memory... > > > > ** A disk failing is the most common failure a system can have (IMO). > > I fully agree. > > > Not in my experience. See above. I'd say each disk has about a 10% > > failure expectation per year. Whereas I can guarantee that an > > unexpected system failure will occur about once a month, on every > > important system. There you are. I said it again. > Whoa ! What are you running, windows perhaps ?!? ;-) No. Ordinary hardware. > No but seriously, joking aside, you have 12 system failures per year ? At a very minimum. Almost none of them caused by hardware. Hey, I even took down my own home server by accident over new year! Spoiled its 222 day uptime. > I would not be alone in thinking that figure is VERY high. My uptimes It isn't. A random look at servers tells me: bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48 balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05 dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00 guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04 itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01 itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16 lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08 lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23 nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02 oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00 piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00 trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00 violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00 xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00 xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67 (one net). Looks like a major power outage 77 days ago, and a smaller event 24 and 20 days ago. The event at 20 days ago looks like sysadmins. Both Trombon and Nbd survived it and tey're on separate (different) UPSs. The servers which are up 77 days are on a huge UPS that Lmserv2 and Itserv2 should also be on, as far as I know. So somebody took them off the UPS wihin ten minutes of each other. Looks like maintenance moving racks. OK, not once every month, more like between onece every 20 days and once every 77 days, say once every 45 days. > generally are in the three-digit range, and most *certainly* not in the low > 2-digit range. Well, they have no chance to be here. There are several planned power outs a year for the electrical department to do their silly tricks with. When that happens they take the weekend over it. > > If you think about it that is quite likely, since a system is by > > definition a complicated thing. 
And then it is subject to all kinds of > > horrible outside influences, like people rewiring the server room in > > order to reroute cables under the floor instead of through the ceiling, > > and the maintenance people spraying the building with insecticide, > > everywhere, or just "turning off the electricity in order to test it" > > (that happens about four times a year here - hey, I remember when they > > tested the giant UPS by turning off the electricity! Wrong switch. > > Bummer). > > If you have building maintenance people and other random staff that can access > your server room unattended and unmonitored, you have far worse problems than > making decisions about raid levels. IMNSHO. Oh, they most certainly can't access the server rooms. The techs would have done that on their own, but they would (obviously) have needed to move the machines for that, and turn them off. Ah. But yes, the guy with the insecticide has the key to everywhere, and is probably a gardener. I've seen him at it. He sprays all the corners of the corridors, along the edge of the wall and floor, then does the same inside the rooms. The point is that most foul-ups are created by the humans, whether technoid or gardenoid, or hole-diggeroid. > By your description you could almost be the guy the joke with the recurring 7 > o'clock system crash is about (where the cleaning lady unplugs the server > every morning in order to plug in her vacuum cleaner) ;-) Oh, the cleaning ladies do their share of damage. They are required BY LAW to clean the keyboards. They do so by picking them up in their left hand at the lower left corner, and rubbing a rag over them. Their left hand is where the ctl and alt keys are. Solution is not to leave the keyboard in the room. Use a whaddyamacallit switch and attach one keyboard to that whenever one needs to access anything. Also use thwapping great power cables one inch thick that they cannot move. And I won't mention the learning episodes with the linux debugger monitor activated by pressing "pause". Once I watched the lady cleaning my office. She SPRAYED the back of the monitor! I YELPED! I tried to explain to her about voltages, and said that she wouldn't clean her tv at home that way - oh yes she did! > > Yes, you can try and keep these systems out of harm's way on a > > colocation site, or something, but by then you are at professional > > level paranoia. For "home systems", whole system failures are far more > > common than disk failures. > > Don't agree. You may not agree, but you would be rather wrong in persisting in that idea in the face of evidence that you can easily accumulate yourself, like the figures I randomly checked above. > Not only do disk failures occur more often than full system > failures, No they don't - by about 12 to 1. > disk failures are also much more time-consuming to recover from. No they aren't - we just put in another one, and copy the standard image over it (or in the case of a server, copy its twin, but then servers don't blow disks all that often, but when they do they blow ALL of them as well, as whatever blew one will blow the others in due course - likely heat). > Compare changing a system board or PSU with changing a drive and finding, > copying and verifying a backup (if you even have one that's 100% up to date) We have. For one thing we have identical pairs of servers, absolutely equal, md5summed and checked. The identity-dependent scripts on them check who they are on and do the appropriate thing depending on who they find they are on.
And all the clients are the same, as clients. Checked daily. > > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > > 10 or so disk failures and no other failures. > > > > Well, let's see. If each system has 2 disks, then that would be 25% per > > disk per year, which I would say indicates low quality IDE disks, but > > is about the level I would agree with as experiential. > > The point here was, disk failures being more common than other failures... But they aren't. If you have only 25% chance of failure per disk per year, then that makes system outages much more likely, since they happen at about one per month (here!). If it isn't faulty scsi cables, it will be overheating cpus. Dust in the very dry air here kills all fan bearings within 6 months to one year. My defense against that is to heavily underclock all machines. > > > No way! I hate tapes. I backup to other disks. > > Then for your sake, I hope they're kept offline, in a safe. No, they're kept online. Why? What would be the point of having them in a safe? Then they'd be unavailable! The general scheme is that sites cross-backup each other. > > > > ** My computer room is for development and testing, no customer access. > > > > Unfortunately, the admins do most of the sabotage. > > Change admins. Can't. They're as good as they get. Hey, *I* even do the sabotage sometimes. I'm probably only abut 99% accurate, and I can certainly write a hundred commands in a day. > I could understand an admin making typing errors and such, but then again that > would not usually lead to a total system failure. Of course it would. You try working remotely to upgrade the sshd and finally killing off the old one, only to discover that you kill the wrong one and lock yourself out, while the deadman script on the server tries fruitlessly to restart a misconfigured server, and then finally decides after an hour to give up and reboot as a last resort, then can't bring itself back up because of something else you did that you were intending to finish but didn't get the opportunity to. > Some daemon not working, > sure. Good admins review or test their changes, And sometimes miss the problem. > for one thing, and in most > cases any such mistake is rectified much simpler and faster than a failed > disk anyway. Except maybe for lilo errors with no boot media available. ;-\ Well, you can go out to the site in the middle of the night to reboot! Changes are made out of working hours so as not to disturb the users. > > Yes you did. You can see from the quoting that you did. > > Or the quoting got messed up. That is known to happen in threads. Shrug. > > > but it may be more current than 1 or > > > more of the other disks. But this would be similar to what would happen > > > to a non-RAID disk (some data not written). > > > > No, it would not be similar. You don't seem to understand the > > mechanism. The mechanism for corruption is that there are two different > > versions of the data available when the system comes back up, and you > > and the raid system don't know which is more correct. Or even what it > > means to be "correct". Maybe the earlier written data is "correct"! > > That is not the whole truth. To be fair, the mechanism works like this: > With raid, you have a 50% chance the wrong, corrupted, data is used. > Without raid, thus only having a single disk, the chance of using the > corrupted data is 100% (obviously, since there is only one source) That is one particular spin on it. 
> Or, much more elaborate: > > Let's assume the chance of a disk corruption occurring is 50%, i.e. 0.5 There's no need to. Call it "p". > With raid, you always have a 50% chance of reading faulty data IF one of the > drives holds faulty data. That is, the probability of corruption occurring and it THEN being detected is 0.5. However, the probability that it occurred is 2p, not p, since there are two disks (forget the tiny p^2 possibility). So we have p = probability of corruption occurring AND it being detected. > For the drives themselves, the chance of both disks > being wrong is 0.5 x 0.5 = 0.25 (scenario A). Similarly, 25% chance both disks > are good (scenario B). The chance of one of the disks being wrong is 50% > (scenarios C & D together). In scenarios A & B the outcome is certain. In > scenarios C & D the chance of the raid choosing the false mirror is 50%. > Accumulating those chances one can say that the chance of reading false data > is: > in scenario A: 100% [p^2] > in scenario B: 0% [0] > scenario C: 50% [0.5p] > scenario D: 50% [0.5p] > Doing the math, the outcome is still (200% divided by four) = 50%. Well, it's p + p^2. But I said to neglect the square term. > Ergo: the same as with a single disk. No change. Except that it is not the case. With a single disk you are CERTAIN to detect the problem (if it is detectable) when you run the fsck at reboot. With a RAID1 mirror you are only 50% likely to detect the detectable problem, because you may choose to read the "wrong" (correct :) disk at the crucial point in the fsck. Then you have to hope that the right disk fails next, when it fails, or else you will be left holding the detectably wrong, unchecked data. So in the scenario of a single detectable corruption: A: probability of a detectable error occurring and NOT being detected on a single disk system is zero B: probability of a detectable error occurring and NOT being detected on a two disk system is p Cute, no? You could have deduced that from your figures too, but you were too fired up about the question of a detectable error occurring AND being detected to think about it occurring AND NOT being detected. Even though that is what interests us! "silent corruption". > > > In contrast, on a single disk they have a 100% chance of detection (if > > > you look!) and a 100% chance of occurring, wrt normal rate. > > > ** Are you talking about the disk drive detecting the error? > > No, you have a zero chance of detection, since there is nothing to compare TO. That is not the case. You have every chance in the world of detecting it - you know what fsck does. If you like we can consider detectable and undetectable errors separately. > Raid-1 at least gives you a 50/50 chance to choose the right data. With a > single disk, the chance of reusing the corrupted data is 100% and there is no > mechanism to detect the odd 'tumbled bit' at all. False. > > You wouldn't necessarily know which of the two data sources was > > "correct". > > No, but you have a theoretical choice, and a 50% chance of being right. > Not so without raid, where you get no choice, and a 100% chance of getting the > wrong data, in the case of a corruption. Review the calculation. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 92+ messages in thread
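[Editor's note: the back-of-the-envelope figures derived in the message above - a detectable error never slips past a check on a single disk, but slips past with probability of roughly p on a two-way mirror - are easy to sanity-check with a small simulation. The sketch below only illustrates the probability argument in this exchange; it models nothing that md actually does, and the chosen p is arbitrary.]

    import random

    def one_trial(p, ndisks):
        """One detectable-corruption experiment: did an error occur, and did the
        checker (which reads from one randomly chosen copy) happen to see it?"""
        corrupted = [random.random() < p for _ in range(ndisks)]
        if not any(corrupted):
            return None                        # no error occurred at all
        copy_read = random.randrange(ndisks)   # the copy the check happens to read
        return corrupted[copy_read]            # True = detected, False = silent

    def silent_rate(p, ndisks, trials=200_000):
        """Fraction of trials in which an error occurred but went undetected."""
        silent = sum(1 for _ in range(trials) if one_trial(p, ndisks) is False)
        return silent / trials

    if __name__ == "__main__":
        p = 0.01   # arbitrary per-disk corruption probability
        print("single disk, silent corruption rate:", silent_rate(p, 1))  # ~0
        print("two-way mirror, silent rate:        ", silent_rate(p, 2))  # ~p

With p = 0.01 the second figure comes out near 0.01, i.e. near p, matching the "zero versus p" conclusion above (the p^2 case where both copies are corrupt is detected anyway and is lost in the noise).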
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 23:05 ` Guy @ 2005-01-04 0:08 ` maarten 2005-01-04 8:57 ` I'm glad I don't live in Spain (was Re: ext3 journal on software raid) David L. Smith-Uchida 2 siblings, 0 replies; 92+ messages in thread From: maarten @ 2005-01-04 0:08 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 21:22, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > The chance of a PSU blowing up or lightning striking is, reasonably, much > > less than an isolated disk failure. If this simple fact is not true for > > you > > Oh? We have about 20 a year. Maybe three of them are planned. But > those are the worst ones! - the electrical department's method of > "testing" the lines is to switch off the rails then pulse them up and > down. Surge tests or something. When we can we switch everything off > beforehand. But then we also get to deal with the amateur contributions > from the city power people. It goes on and on below, but this your first paragraph is already striking(!) You actually say that the planned outages are worse than the others! OMG. Who taught you how to plan ? Isn't planning the act of anticipating things, and acting accordingly so as to minimize the impact ? So your planning is so bad that the planned maintenance is actually worse than the impromptu outages. I... I am speechless. Really. You take the cake. But from the rest of your post it also seems you define a "total system failure" as something entirely different as the rest of us (presumably). You count either planned or unplanned outages as failures, whereas most of us would call that downtime, not system failure, let alone "total". If you have a problematic UPS system, or mentally challenged UPS engineers, that does not constitute a failure IN YOUR server. Same for a broken network. Total system failures is where the single computer system we're focussing on goes down or is unresponsive. You can't say "your server" is down when all that is happening is someone pulled the UTP from your remote console...! > Yes, my PhD is in electrical engineering. Have I sent them sarcastic > letters explaining how to test lines using a dummy load? Yes. Does the > physics department also want to place them in a vat of slowly reheating > liquid nitrogen? Yes. Does it make any difference? No. I don't know what you're on about, nor do I really care. I repeat: your UPS or powercompany failing does not constitute a _server_ failure. It is downtime. Downtime != system failure, (although the reverse obviously is). We shall forthwith define a system failure as a state where there are _repairs_ neccessary to the server, for it to start working again. Not just the reconnection of mains plugs. Okay with that ? > > I don't understand your math. For one, percentage is measured from 0 to > > 100, > > No, it's measured from 0 to infinity. Occasionally from negative > infinity to positive infinity. Did I mention that I have two degrees in > pure mathematics? We can discuss nonstandard interpretations of Peano's > axioms then. Sigh. Look up what "per cent" means (it's Latin). Also, since you seem to pride yourself on your leet math skills, remember that your professor said that chance can be between 0 (false) and 1 (true). Two or 12 cannot be an outcome of any probability calculation. > > But besides that, I'd wager that from your list number (3) has, by far, > > the smallest chance of occurring. 
> > Except of course, that you would lose, since not only did I SAY that it > had the highest chance, but I gave a numerical estimate for it that is > 120 times as high as that I gave for (1). Then your data center cannot seriously call itself that. Or your staff cannot call themselves capable. Choose whatever suits you. 12 outages a year... Bwaaah. Even a random home windows box has less outages than that(!). > > Choosing between (1) and (2) is more difficult, > > Well, I said it doesn't matter, because everything is swamped by (3). Which I disagreed with. I stated (3) is normally the _least_ likely. > > my experiences with IDE disks are definitely that it will take the system > > down, but that is very biased since I always used non-mirrored swap. > > It's the same principle. There exists a common mode for failure. > Bayesian calculations then tell you that there is a strong liklihood of > the whole system coming down in conjunction with the disk coming down. Nope, there isn't. Bayesian or not, hotswap drives on hardware raid cards prove you wrong, day in day out. So either you're talking about linux with md specifically, or you should wake up and smell the coffee. > > > Not in my experience. See above. I'd say each disk has about a 10% > > > failure expectation per year. Whereas I can guarantee that an > > > unexpected system failure will occur about once a month, on every > > > important system. > > There you are. I said it again. You quote yourself and you agree with that. Now why doesn't that surprise me ? > Hey, I even took down my own home server by accident over new year! > Spoiled its 222 day uptime. Your user error hardly counts as total system failure, don't you think ? > > I would not be alone in thinking that figure is VERY high. My uptimes > > It isn't. A random look at servers tells me: > > bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48 > balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05 > dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00 > guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04 > itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01 > itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16 > lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08 > lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23 > nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02 > oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00 > piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00 > trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00 > violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00 > xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00 > xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67 > > (one net). Looks like a major power outage 77 days ago, and a smaller > event 24 and 20 days ago. The event at 20 days ago looks like > sysadmins. Both Trombon and Nbd survived it and tey're on separate > (different) UPSs. The servers which are up 77 days are on a huge UPS > that Lmserv2 and Itserv2 should also be on, as far as I know. So > somebody took them off the UPS wihin ten minutes of each other. Looks > like maintenance moving racks. Okay, once again: your loss of power has nothing to do with a server failure. You can't say that your engine died and needs repair just because you forgot to fill the gas tank. You just add gas and away you go. No repair. No damage. Just downtime. Inconvenient as it may be, but that is not relevant. > Well, they have no chance to be here. There are several planned power > outs a year for the electrical department to do their silly tricks > with. 
When that happens they take the weekend over it. First off, since that is planned, it is _your_ job to be there beforehand and properly shutdown all those systems proir to losing the power. Secondly, reevaluate your UPS setup...!!! How is it even possible we're discussing such obvious measures. UPS'es are there for a reason. If your upstream UPS systems are unreliable, then add your own UPSes, one per server if need be. It really isn't rocket science... > > If you have building maintenance people and other random staff that can > > access your server room unattended and unmonitored, you have far worse > > problems than making decicions about raid lavels. IMNSHO. > > Oh, they most certainly can't access the server rooms. The techs would > have done that on their own, but they would (obviously) have needed to > move the machines for that, and turn them off. Ah . But yes, the guy > with the insecticide has the key to everywhere, and is probably a > gardener. I've seen him at it. He sprays all the corners of the > corridors, along the edge of the wall and floor, then does the same > inside the rooms. Oh really. Nice. Do you even realize that since your gardener or whatever can access everything, and will spray stuff around indiscriminately, he could very well incinerate your server room (or the whole building for that matter) It's really very simple. You tell him that he has two options: A) He agrees to only enter the server rooms in case of immediate emergency and will refrain from entering the room without your supervision in all other cases. You let him sign a paper stating as much. or B) You will change the lock on the server room thus disallowing all access. You agree you will personally carry out all 'maintenance' in that room. > The point is that most foul-ups are created by the humans, whether > technoid or gardenoid, or hole-diggeroid. And that is exactly why you should make sure their access is limited ! > > By your description you could almost be the guy the joke with the > > recurring 7 o'clock system crash is about (where the cleaning lady > > unplugs the server every morning in order to plug in her vacuum cleaner) > > ;-) > > Oh, the cleaning ladies do their share of damage. They are required BY > LAW to clean the keyboards. They do so by picking them up in their left > hand at the lower left corner, and rubbing a rag over them. Whoa, what special country are you at ? In my neck of the woods, I can disallow any and all cleaning if I deem it is hazardous to the cleaner and / or the equipment. Next, you'll start telling me that they clean your backup tapes and/or enclosures with a rag and soap and that you are required by law to grant them that right...? Do you think they have cleaning crews in nuclear facilities ? If so, do you think they are allowed (by law, no less) to go even near the control panels that regulate the reactor process ? (nope, I didn't think you did) > Their left hand is where the ctl and alt keys are. > > Solution is not to leave keyboard in the room. Use a whaddyamacallit > switch and attach one keyboard to that whenever one needs to access > anything.. Also use thwapping great power cables one inch thck that > they cannot move. Oh my. Oh my. Oh my. I cannot believe you. Have you ever heard of locking the console, perhaps ?!? You know, the state where nothing else than typing your password will do anything ? You can do that _most_certainly_ with KVM switches, in case your OS is too stubborn to disregard the various three finger combinations we all know. 
> And I won't mention the learning episodes with the linux debugger monitor > activated by pressing "pause". man xlock. man vlock. djeez... is this newbie time now ? > Once I watched the lady cleaning my office. She SPRAYED the back of the > monitor! I YELPED! I tried to explain to her about voltages, and said > that she would't clean her tv at home that way - oh yes she did! Exactly my point. My suggestion to you (if simply explaining doesn't help): Call the cleaner over to an old unused 14" CRT. Spray a lot of water-based, or better, flammable stuff into and onto the back of it. Wait for the smoke or the sparks to come flying...! stand back and enjoy. ;-) > You may not agree, but you would be rather wrong in persisting in that > idea in face of evidence that you can easily accumulate yourself, like > the figures I randomly checked above. Nope. However, I will admit that -in light of everything you said- your environment is very unsafe, very unreliable and frankly just unfit to house a data center worth its name. I'm sure others will agree with me. You can't just go around saying that 12 power outages per year are _normal_ and expected. You can't pretend something very very wrong is going on at your site. I've experienced 1 (count 'em: one) power outage in our last colo in over four years, and boy did my management give them (the colo facility) hell over it ! > > Not only do disk failures occur more often than full system > > failures, > > No they don't - by about 12 to 1. Only in your world, yes. > > disk failures are also much more time-consuming to recover from. > > No they aren't - we just put in another one, and copy the standard > image over it (or in the case of a server, copy its twin, but then > servers don't blow disks all that often, but when they do they blow > ALL of them as well, as whatever blew one will blow the others in due > course - likely heat). If you had used a colo, you wouldn't have dust lead to a premature fan failure (in my experience). There is no smoking in colo facilities expressly for that reason (and the fire hazard, obviously). But even then, you could remotely monitor the fan health, and /or the temperature. I still stand by my statement: disks are more time consuming than other failures to repair. Motherboards don't need data being restored to them. Much less finding out how complete the data backup was, and verifying that all works again as expected. > > Compare changing a system board or PSU with changing a drive and finding, > > copying and verifying a backup (if you even have one that's 100% up to > > date) > > We have. For one thing we have identical pairs of servers, abslutely > equal, md5summed and checked. The idenity-dependent scripts on them > check who they are on and do the approprate thing depending on who they > find they are on. Good for you. Well planned. It just amazes me now more than ever that the rest of the setup seems so broken / unstable. On the other hand, with 12 power outages yearly, you most definitely need two redundant servers. > > The point here was, disk failures being more common than other > > failures... > > But they aren't. If you have only 25% chance of failure per disk per > year, then that makes system outages much more likely, since they > happen at about one per month (here!). With the emphasis on your word "(here!)", yes. > If it isn't faulty scsi cables, it will be overheating cpus. Dust in > the very dry air here kills all fan bearings within 6 months to one > year. 
Colo facilities have a strict no smoking rule, and air filters to clean what enters. I can guarantee you that a good fan in a good colo will live 4++ years. Excuse me but dry air, my ***. Clean air is not dependent on dryness. It is dependent on cleanness. > My defense against that is to heavily underclock all machines. Um, yeah. Great thinking. Do you underclock the PSU also, and the disks ? Maybe you could run a scsi 15000 rpm drive at 10000, see what that gives ? Sorry for getting overly sarcastic here, but there really is no end to the stupidities, is there ? > > > No way! I hate tapes. I backup to other disks. > > > > Then for your sake, I hope they're kept offline, in a safe. > > No, they're kept online. Why? What would be the point of having them in > a safe? Then they'd be unavailable! I'll give you a few pointers then: If your disks are online instead of in a safe, they are vulnerable to: * Intrusions / viruses * User / admin error (you yourself stated how often this happens!) * Fire * Lightning strike * Theft > > Change admins. > > Can't. They're as good as they get. Hey, *I* even do the sabotage > sometimes. I'm probably only abut 99% accurate, and I can certainly > write a hundred commands in a day. Every admin makes mistakes. But most see it before it has dire consequences. > > I could understand an admin making typing errors and such, but then again > > that would not usually lead to a total system failure. > > Of course it would. You try working remotely to upgrade the sshd and > finally killing off the old one, only to discover that you kill the > wrong one and lock yourself out, while the deadman script on the server Yes, been there done that... > tries fruitlessly to restart a misconfigured server, and then finally > decides after an hour to give up and reboot as a last resort, then > can't bring itself back up because of something else you did that you > were intending to finish but didn't get the opportunity to. This will happen only once (if you're good), maybe twice (if you're adequate) but if it happens to you three times or more, then you need to find a different line of work, or start drinking less and paying more attention at your work. I'm not kidding. The good admin is not he who never makes mistakes, but he who (quickly) learns from it. > > Some daemon not working, > > sure. Good admins review or test their changes, > > And sometimes miss the problem. Yes, but apache not restarting due to a typo hardly constitutes a system failure. Come on now! > > for one thing, and in most > > cases any such mistake is rectified much simpler and faster than a failed > > disk anyway. Except maybe for lilo errors with no boot media available. > > ;-\ > > Well, you can go out to the site in the middle of the night to reboot! > Changes are made out of working hours so as not to disturb the users. Sometimes, depending on the SLA the client has. In any case, I do tend to schedule complex, error-prone work for when I am at the console. Look, any way you want to turn it, messing with reconfiguring bootmanagers when not at the console is asking for trouble. If you have no other recourse, test it first with a local machine with the exact same setup. For instance, I learned from my sshd error to always start a second sshd on port 2222 prior to killing off the main one. You could also have a 'screen' session running with a sleep 600 followed by some rescue command. Be creative. Be cautious (or paranoid). Learn. > > That is not the whole truth. 
To be fair, the mechanism works like this: > > With raid, you have a 50% chance the wrong, corrupted, data is used. > > Without raid, thus only having a single disk, the chance of using the > > corrupted data is 100% (obviously, since there is only one source) > > That is one particular spin on it. It is _the_ spin on it. > > Ergo: the same as with a single disk. No change. > > Except that it is not the case. With a single disk you are CERTAIN to > detect the problem (if it is detectable) when you run the fsck at > reboot. With a RAID1 mirror you are only 50% likely to detect the > detectable problem, because you may choose to read the "wrong" (correct > :) disk at the crucial point in the fsck. Then you have to hope that > the right disk fails next, when it fails, or else you will be left holding > the detectably wrong, unchecked data. First off, fsck doesn't usually run at reboot. Just the journal is replayed. Only when severe errors are there, there will be a forced fsck. You're not telling me that you fsck your 600 gigabyte arrays upon each reboot, yes ? It will give you multiple hours added downtime if you do. Secondly, if you _are_ that paranoid about it that you indeed do a fsck, what is keeping you from breaking the mirror, fsck the underlying physical devices and reassemble if all is okay. Added benefit: if all is not well, you get to choose which half of the mirror you decide to keep. Problem solved. And third, I am not too convinced the error detection is able to detect all errors. For starters, if a crash occurred while disk one was completely written but disk two had not yet begun, both checksums would be correct, so no fsck would notice. Secondly, I doubt that the checksum mechanism is that good. It's just a trivial checksum, it's bound to overlook some errors. And finally: If you would indeed end up with the "detectably wrong, unchecked data", you can still run an fsck on it, just as with the single disk. The fsck will repair it (or not), just as with the single disk you would've had. In any case, seen as you do 12 reboots a year :-P the chances a very very slim you hit the wrong ("right") half of the disk at all those 12 times, so you'll surely notice the corruption at some point. Note that despite all this I am all for an enhancement to mdadm providing a method to check the parity for correctness. But this is besides the point. > > No, you have a zero chance of detection, since there is nothing to > > compare TO. > > That is not the case. You have every chance in the world of detecting > it - you know what fsck does. Well, when have you last fsck'ed a terabyte size array without an immediate need for it ? I know I haven't -> my time is too valueable to wait half a day, or more, for that fsck to finish. Maarten ^ permalink raw reply [flat|nested] 92+ messages in thread
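[Editor's note: the wish expressed above for "a method to check the parity for correctness", and the earlier suggestion to compare the two halves of a RAID1, can be approximated by hand. The sketch below is only a rough illustration with made-up device names; the array must be idle (stopped, or mounted read-only) or in-flight writes will show up as false mismatches, and the md superblock region near the end of each component is not excluded, so a difference there is expected on a real array. It needs root to read the devices.]

    import sys

    CHUNK = 64 * 1024  # compare the two components in 64 KiB pieces

    def compare_components(dev_a, dev_b):
        """Read both RAID1 component devices in lockstep and count differing chunks."""
        mismatches = 0
        offset = 0
        with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
            while True:
                chunk_a = a.read(CHUNK)
                chunk_b = b.read(CHUNK)
                if not chunk_a and not chunk_b:
                    break
                if chunk_a != chunk_b:
                    mismatches += 1
                    print("mismatch in chunk starting at byte", offset)
                offset += CHUNK
        return mismatches

    if __name__ == "__main__":
        # e.g.:  compare_mirror.py /dev/hda1 /dev/hdc1   (hypothetical component names)
        print("differing chunks:", compare_components(sys.argv[1], sys.argv[2]))

Later md versions did eventually grow a built-in consistency check, but at the time of this thread a manual comparison along these lines was about all that was available.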
* I'm glad I don't live in Spain (was Re: ext3 journal on software raid) 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 23:05 ` Guy 2005-01-04 0:08 ` maarten @ 2005-01-04 8:57 ` David L. Smith-Uchida 2 siblings, 0 replies; 92+ messages in thread From: David L. Smith-Uchida @ 2005-01-04 8:57 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid I've been following this discussion with varying degrees of incredulity and finally thought to look up just _where_ Peter works that might have such incompetence involved. His email address is at the University of Madrid in Spain. I'm glad I don't have to deal with it (not that Japan is perfect - but the power usually works here). Peter, you may be a math whiz, but your situation and experience are very different from those of many other people here. Please stop trying to say that because you have power failures 12 times a year everyone else does. We don't. Most people have working power systems and experience disk failures because disks are funny spinny things and wear out. Many of us work in environments where when the power people do funny things they get fired and replaced with people who do not do funny things. You have my sympathies but please stop trying to tell everyone that they should expect 12 power failures a year. Peter T. Breuer wrote: [Peter's message of 2005-01-03 20:22, quoted here in full, snipped; see it earlier in this thread.]
^ permalink raw reply [flat|nested] 92+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten 2005-01-03 19:52 ` maarten 2005-01-03 20:22 ` Peter T. Breuer @ 2005-01-03 21:36 ` Guy 2005-01-04 0:15 ` maarten 2 siblings, 1 reply; 92+ messages in thread From: Guy @ 2005-01-03 21:36 UTC (permalink / raw) To: 'maarten', linux-raid Maarten said: "Doing the math, the outcome is still (200% divided by four)= 50%. Ergo: the same as with a single disk. No change." Guy said: "I bet a non-mirror disk has similar risk as a RAID1." Guy and Maarten agree, but Maarten does a better job of explaining it! :) I also agree with most of what Maarten said below, but not mirroring swap??? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 03, 2005 12:47 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > Guy <bugzilla@watkins-home.com> wrote: > > "Also sprach Guy:" > 1) lightning strikes rails, or a/c goes out and room full of servers > overheats. All lights go off. > > 2) when sysadmin arrives to sort out the smoking wrecks, he finds > that 1 in 3 random disks are fried - they're simply the points > of failure that died first, and they took down the hardware with > them. > > 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware > to piece together the raid arrays from the surviving disks, and > hastily does a copy to somewhere very safe and distant, while > an assistant holds off howling hordes outside the door with > a shutgun. > > In this scenario, a disk simply acts as the weakest link in a fuse > chain, and the whole chain goes down. But despite my dramatisation it > is likely that a hardware failure will take out or damage your hardware! > Ide disks live on an electric bus conected to other hardware. Try a > shortcircuit and see what happens. You can't even yank them out while > the bus is operating if you want to keep your insurance policy. The chance of a PSU blowing up or lightning striking is, reasonably, much less than an isolated disk failure. If this simple fact is not true for you personally, you really ought to reevaluate the quality of your PSU (et al) and / or the buildings' defenses against a lightning strike... > However, I don't see how you can expect to replace a failed disk > without taking down the system. For that reason you are expected to be > running "spare disks" that you can virtually insert hot into the array > (caveat, it is possible with scsi, but you will need to rescan the bus, > which will take it out of commission for some seconds, which may > require you to take the bus offline first, and it MAY be possible with > recent IDE buses that purport to support hotswap - I don't know). I think the point is not what actions one has to take at time T+1 to replace the disk, but rather whether at time T, when the failure first occurs, the system survives the failure or not. > (1) how likely is it that a disk will fail without taking down the system > (2) how likely is it that a disk will fail > (3) how likely is it that a whole system will fail > > I would say that (2) is about 10% per year. I would say that (3) is > about 1200% per year. 
It is therefore difficult to calculate (1), which > is your protection scenario, since it doesn't show up very often in the > stats! I don't understand your math. For one, percentage is measured from 0 to 100, not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that something will occur ? But besides that, I'd wager that from your list number (3) has, by far, the smallest chance of occurring. Choosing between (1) and (2) is more difficult, my experiences with IDE disks are definitely that it will take the system down, but that is very biased since I always used non-mirrored swap. I sure can understand a system dying if it loses part of its memory... > > ** A disk failing is the most common failure a system can have (IMO). I fully agree. > Not in my experience. See above. I'd say each disk has about a 10% > failure expectation per year. Whereas I can guarantee that an > unexpected system failure will occur about once a month, on every > important system. Whoa ! What are you running, windows perhaps ?!? ;-) No but seriously, joking aside, you have 12 system failures per year ? I would not be alone in thinking that figure is VERY high. My uptimes generally are in the three-digit range, and most *certainly* not in the low 2-digit range. > If you think about it that is quite likely, since a system is by > definition a complicated thing. And then it is subject to all kinds of > horrible outside influences, like people rewiring the server room in > order to reroute cables under the floor instead of through he ceiling, > and the maintenance people spraying the building with insecticide, > everywhere, or just "turning off the electricity in order to test it" > (that happens about four times a year here - hey, I remember when they > tested the giant UPS by turning off the electricity! Wrong switch. > Bummer). If you have building maintenance people and other random staff that can access your server room unattended and unmonitored, you have far worse problems than making decicions about raid lavels. IMNSHO. By your description you could almost be the guy the joke with the recurring 7 o'clock system crash is about (where the cleaning lady unplugs the server every morning in order to plug in her vacuum cleaner) ;-) > Yes, you can try and keep these systems out of harms way on a > colocation site, or something, but by then you are at professional > level paranoia. For "home systems", whole system failures are far more > common than disk failures. Don't agree. Not only do disk failures occur more often than full system failures, disk failures are also much more time-consuming to recover from. Compare changing a system board or PSU with changing a drive and finding, copying and verifying a backup (if you even have one that's 100% up to date) > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > 10 or so disk failures and no other failures. > > Well, let's see. If each system has 2 disks, then that would be 25% per > disk per year, which I would say indicates low quality IDE disks, but > is about the level I would agree with as experiential. The point here was, disk failures being more common than other failures... > No way! I hate tapes. I backup to other disks. Then for your sake, I hope they're kept offline, in a safe. > > ** My computer room is for development and testing, no customer access. > > Unfortunately, the admins do most of the sabotage. Change admins. 
I could understand an admin making typing errors and such, but then again that would not usually lead to a total system failure. Some daemon not working, sure. Good admins review or test their changes, for one thing, and in most cases any such mistake is rectified far more simply and quickly than a failed disk anyway. Except maybe for lilo errors with no boot media available. ;-\

> Yes you did. You can see from the quoting that you did.

Or the quoting got messed up. That is known to happen in threads.

> > but it may be more current than 1 or more of the other disks. But this would be similar to what would happen to a non-RAID disk (some data not written).
>
> No, it would not be similar. You don't seem to understand the mechanism. The mechanism for corruption is that there are two different versions of the data available when the system comes back up, and you and the raid system don't know which is more correct. Or even what it means to be "correct". Maybe the earlier written data is "correct"!

That is not the whole truth. To be fair, the mechanism works like this: with raid, you have a 50% chance that the wrong, corrupted data is used. Without raid, thus only having a single disk, the chance of using the corrupted data is 100% (obviously, since there is only one source).

Or, in more detail: let's assume the chance of a disk corruption occurring is 50%, i.e. 0.5. With raid, you always have a 50% chance of reading faulty data IF one of the drives holds faulty data. For the drives themselves, the chance of both disks being wrong is 0.5 x 0.5 = 0.25 (scenario A). Similarly, there is a 25% chance both disks are good (scenario B). The chance of exactly one of the disks being wrong is 50% (scenarios C & D together). In scenarios A & B the outcome is certain. In scenarios C & D the chance of the raid choosing the false mirror is 50%. Accumulating those chances, one can say that the chance of reading false data is:
  in scenario A: 100%
  in scenario B: 0%
  in scenario C: 50%
  in scenario D: 50%
Doing the math, the outcome is still (200% divided by four) = 50%. Ergo: the same as with a single disk. No change.

> > In contrast, on a single disk they have a 100% chance of detection (if you look!) and a 100% chance of occurring, wrt normal rate.
> > ** Are you talking about the disk drive detecting the error?

No, you have a zero chance of detection, since there is nothing to compare TO. Raid-1 at least gives you a 50/50 chance to choose the right data. With a single disk, the chance of reusing the corrupted data is 100% and there is no mechanism to detect the odd 'tumbled bit' at all.

> > How?
> > ** Compare the 2 halves of the RAID1, or check the parity of RAID5.
>
> You wouldn't necessarily know which of the two data sources was "correct".

No, but you have a theoretical choice, and a 50% chance of being right. Not so without raid, where you get no choice, and a 100% chance of getting the wrong data, in the case of a corruption.

Maarten
--
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 92+ messages in thread
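A minimal sketch of the scenario arithmetic above, under Maarten's illustrative assumption that each mirror half independently holds corrupt data with probability p and that a RAID-1 read picks either half at random (the function names are made up for this sketch):

# Compare the chance of reading corrupt data from one disk vs. a 2-way mirror,
# assuming each copy is independently corrupt with probability p and the
# mirror read picks a random half.

def p_read_bad_single(p):
    # one disk: you read whatever it holds
    return p

def p_read_bad_raid1(p):
    both_bad  = p * p               # scenario A: any read is bad
    both_good = (1 - p) * (1 - p)   # scenario B: any read is good
    one_bad   = 2 * p * (1 - p)     # scenarios C and D: 50/50 which half you hit
    return both_bad * 1.0 + both_good * 0.0 + one_bad * 0.5

for p in (0.5, 0.1, 0.01):
    print(p, p_read_bad_single(p), p_read_bad_raid1(p))

For any p the two results coincide (p*p + 2*p*(1-p)*0.5 = p), which is the "no change" conclusion above; p = 0.5 is just the special case where all four scenarios are equally likely.

And a rough, hypothetical illustration of "compare the 2 halves of the RAID1": read both component devices chunk by chunk and count differing chunks. The device paths are only examples; the array should be idle (or stopped) while you compare, reading raw devices needs root, and the md superblock region near the end of each component may legitimately differ.

# compare_mirrors.py <device-a> <device-b>  -- hypothetical helper, not an md tool
import sys

CHUNK = 64 * 1024  # compare in 64 KiB pieces

def count_mismatches(dev_a, dev_b):
    mismatches = 0
    with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                mismatches += 1
    return mismatches

if __name__ == "__main__":
    # e.g.: python compare_mirrors.py /dev/hda1 /dev/hdc1   (example device names)
    print(count_mismatches(sys.argv[1], sys.argv[2]))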
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-03 21:36 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy
@ 2005-01-04  0:15 ` maarten
  2005-01-04 11:21 ` Michael Tokarev
  0 siblings, 1 reply; 92+ messages in thread
From: maarten @ 2005-01-04 0:15 UTC (permalink / raw)
To: linux-raid

On Monday 03 January 2005 22:36, Guy wrote:
> Maarten said:
> "Doing the math, the outcome is still (200% divided by four) = 50%. Ergo: the same as with a single disk. No change."
>
> Guy said:
> "I bet a non-mirror disk has similar risk as a RAID1."
>
> Guy and Maarten agree, but Maarten does a better job of explaining it! :)
>
> I also agree with most of what Maarten said below, but not mirroring swap???

Yeah... bad choice in hindsight. But there once was a time, a long long time ago, when the software-raid howto explicitly stated that running swap on raid was a bad idea, and that by telling the kernel all swap partitions had the same priority, the kernel itself would already 'raid' the swap, i.e. divide equally between the swap spaces. I'm sure you can read it back somewhere.

Now we know better, and we realize that that will indeed load-balance between the various swap partitions, but it will not provide redundancy at all. Oh well, new insights, huh? ;-)

Maarten

^ permalink raw reply [flat|nested] 92+ messages in thread
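For reference, the setup that old howto advice describes would look something like this in /etc/fstab (the device names are only examples): with equal pri= values the kernel stripes swap pages across the partitions, which balances load but, as noted above, gives no redundancy if either disk dies.

# two plain partitions on different disks, equal priority: fast, not redundant
/dev/hda2   none   swap   sw,pri=1   0   0
/dev/hdc2   none   swap   sw,pri=1   0   0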
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-04  0:15 ` maarten
@ 2005-01-04 11:21 ` Michael Tokarev
  0 siblings, 0 replies; 92+ messages in thread
From: Michael Tokarev @ 2005-01-04 11:21 UTC (permalink / raw)
To: linux-raid

maarten wrote:
> On Monday 03 January 2005 22:36, Guy wrote:
>
>> Maarten said:
>> "Doing the math, the outcome is still (200% divided by four) = 50%. Ergo: the same as with a single disk. No change."
>>
>> Guy said:
>> "I bet a non-mirror disk has similar risk as a RAID1."
>>
>> Guy and Maarten agree, but Maarten does a better job of explaining it! :)
>>
>> I also agree with most of what Maarten said below, but not mirroring swap???
>
> Yeah... bad choice in hindsight.
> But, there once was a time, a long long time ago, that the software-raid howto explicitly stated that running swap on raid was a bad idea, and that by

In 2.2, and probably in early 2.4, there indeed was a problem with having swap on a raid (md) array: "random" system lockups, especially during array recovery. Those problems were fixed long ago. But I think the howto in question is talking about something different...

> telling the kernel all swap partitions had the same priority, the kernel itself would already 'raid' the swap, i.e. divide equally between the swap spaces. I'm sure you can read it back somewhere.
> Now we know better, and we realize that that will indeed load-balance between the various swap partitions, but it will not provide redundancy at all.
> Oh well, new insights, huh? ;-)

...that is, the howto is talking about a raid0 setup (striping), and yes, there's no "r" in "raid0" really (but there IS an "anti-r", as a raid0 array is LESS reliable than a single drive). That is to say: instead of placing swap on a raid0 array, let the swap code itself perform the striping - the swap code "knows better" about its needs. This still applies to recent kernels.

But here, we aren't talking about *reliable* swap, we're talking about *fast* swap (raid1 aka reliable vs raid0 aka fast). There's no code in the swap subsystem to mirror swap space, but there IS such code in md. Hence, if you want reliability, use raid1 arrays for swap space. At the same time, if you want speed *too*, use multiple raid1 arrays with equal priority as swap areas (dunno how the current raid10 code compares to "swap striping" on top of raid1 arrays, but that probably makes very little difference).

I.e., nothing is wrong with the howto, which is talking about fast swap (sure, it would be good to mention reliability too), and nothing is wrong with having raid arrays as swap (especially now that the abovementioned bugs have been fixed). Or, "nothing new"... ;)

I learned to place swap on raid1 arrays instead of striping it (as the howto suggests) the hard way, going through the full cycle of recovering damaged data after the system got foobared when one component (stripe) of the swap space was lost, and I don't want to repeat that recovery again. ;)

> Maarten

/mjt

^ permalink raw reply [flat|nested] 92+ messages in thread
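A sketch of the arrangement described above, with hypothetical device names: each swap area sits on its own raid1 array, so a dead disk no longer takes the swap (and the system) with it, and giving the arrays equal priority lets the swap code stripe across them for speed as well.

# /dev/md1 and /dev/md2 are assumed to be raid1 arrays built from partitions
# on different physical disks; equal priority makes the swap code stripe
# across the two mirrored areas
/dev/md1   none   swap   sw,pri=5   0   0
/dev/md2   none   swap   sw,pri=5   0   0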