From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Greaves
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
Date: Tue, 04 Jan 2005 19:12:54 +0000
Message-ID: <41DAEAB6.8000101@dgreaves.com>
References: <200501030916.j039Gqe23568@inv.it.uc3m.es>
 <200501031846.42950.maarten@ultratux.net>
 <200501032052.21459.maarten@ultratux.net>
 <16857.55609.534526.297577@cse.unsw.edu.au>
 <16857.64086.362458.177296@cse.unsw.edu.au>
 <41DAA243.3060202@dgreaves.com>
 <41DAAB7D.2030400@dgreaves.com>
 <41DACA39.4020700@dgreaves.com>
 <751ra2-amt.ln1@news.it.uc3m.es>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <751ra2-amt.ln1@news.it.uc3m.es>
Sender: linux-raid-owner@vger.kernel.org
To: "Peter T. Breuer"
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

Peter T. Breuer wrote:

>A joournalled file system is always _consistent_. That does no mean it
>is correct!

To my knowledge no computers have the philosophical wherewithal to
provide that service ;)

If one is rude enough to stab a journalling filesystem in the back as it
tries to save your data it promises only to be consistent when it is
revived - it won't provide application correctness.

I think we agree on that.

>>The md driver (somehow) gets to decide which half of the mirror is 'best'.
>
>Yep - and which is correct?

Both are 'correct' - they simply represent different points in the
series of system calls made before the power went.

>Which is correct?

ditto

>And the question remains - which outcome is correct?

same answer I'm afraid.

>Well, I'll answer that. Assuming that the fs layer is only notified
>when BOTH journal writes have happened, and tcp signals can be sent
>off-machine or something like that, then the correct result is the
>rollback, not the completion, as the world does not expect there to
>have been a completion given the data it has got.
>
>It's as I said. One always wants to rollback. So one doesn't want the
>journal to bother with data at all.

bullshit ;)

I write a,b,c and d to the filesystem
we begin our story when a,b and c all live on the fs device (raid or
not), all synced up and consistent.
I start to write d
it hits journal mirror A
it hits journal mirror B
it finalises on journal mirror B
I yank the plug
The mirrors are inconsistent
The filesystem is consistent
I reboot

scenario 1) the md device comes back using A
the journal isn't finalised - it's ignored
the filesystem contains a,b and c
Is that correct?

scenario 2) the md device comes back using B
the journal is finalised - it's rolled forward
the filesystem contains a,b,c and d
Is that correct?

Both are correct.

So, I think that deals with correctness and journalling - now on to errors...

>>>No. I made no such assumption. I don't know or care what you do with a
>>>detectable error. I only say that whatever your test is, it detects it!
>>>IF it looks at the right spot, of course. And on raid the chances of
>>>doing that are halved, because it has to choose which disk to read.
>>
>>I did when I defined detectable.... tentative definitions:
>>detectable = noticed by normal OS I/O. ie CRC sector failure etc
>>undetectable = noticed by special analysis (fsck, md5sum verification etc)
>
>A detectable error is one you detect with whatever your test is. If
>your test is fsck, then that's the kind of error that is detected by the
>detection that you do ... the only condition I imposed for the analysis
>was that the test be conducted on the raid array, not on its underlying
>components.

well, if we're going to get anywhere here we need to be clear about things.
There are all kinds of errors - raid and redundancy will help with some
and not others.

An md device does have underlying components and to refuse to allow
tests to compare them you remove one of the benefits of raid - redundancy.
It may make it easier to model mathematically - but then the model is wrong.

We need to make sure we're talking about bits on a device.
md reads devices and it writes them.

We need to understand what an error is - stop talking bollocks about
"whatever the test is". This is *not* a math problem - it's simply not
well enough defined yet.

Let's get back to reality to decide what to model.

I proposed definitions and tests (the ones used in the real world where
we don't run fsck) and you've ignored them.

I'll repeat them:
detectable = noticed by normal OS I/O. ie CRC sector failure etc
undetectable = noticed by special analysis (fsck, md5sum verification etc)

I'll add 'component device comparison' to the special analysis list.

No error is truly undetectable - if it were then it wouldn't matter would it?

>>- nothing's broken but a bit flipped during the write/store process (or
>>the power went before it hit the media). Detectable errors are more
>>likely to be permanent (since most detection algorithms probably have a
>>retry).
>
>I think that for some reason you are considering that a test (a
>detection test) is carried out at every moment of time. No. Only ONE
>test is ever carried out. It is the test you apply when you do the
>observation: the experiment you run decides at that single point wether
>the disk (the raid array) has errors or not. In practical terms, you do
>it usualy when you boot the raid array, and run fsck on its file system.
>
>OK?
>You simply leave an experiment running for a while (leave the array up,
>let monkeys play on it, etc.) and then you test it. That test detects
>some errors. However, there are two types of errors - those you can
>detect with your test, and those you cannot detect. My analysis simply
>gave the probabilities for those on the array, in terms of basic
>parameters for the probabilities per an individual disk.
>
>I really do not see why people make such a fuss about this!
>

We care about our data and raid has some vulnerabilities to corruption.
We need to understand these to fix them - your analysis is woolly and
unhelpful and, although it may have certain elements that are
mathematically correct, your model has flaws that mean that the
conclusions are not applicable.

>>>>However, we need to carry out risk analysis to decide if the increase in
>>>>susceptibility to certain kinds of corruption (cosmic rays) is
>>>
>>>Ahh. Yes you do. No I don't! This is your own invention, and I said no
>>>such thing. By "errors", I meant anything at all that you consider to be
>>>an error. It's up to you. And I see no reason to restrict the term to
>>>what is produced by something like "cosmic rays". "People hitting the
>>>off switch at the wrong time" counts just as much, as far as I know.
>>
>>You're talking about causes - I'm talking about classes of error.
>
>No, I'm talking about classes of error! You're talking about causes. :)

No, by comparing the risk between classes of error (detectable and not)
I'm talking about classes of error - by arguing about cosmic rays and
power switches you _are_ talking about causes.

Personally I think there is a massive difference between the risk of
detectable errors and undetectable ones. Many orders of magnitude.

>>Hitting the power off switch doesn't cause a physical failure - it
>>causes inconsistency in the data.
>
>I don't understand you - it causes errors just like cosmic rays do (and
>we can even set out and describe the mechanisms involved). The word
>"failure" is meaningless to me here.

yes, you appear to have selectively quoted and ignored what I said a
line earlier:

> (I live in telco-land so most datacentres I know have more chance of
> suffering cosmic ray damage than Joe Random user pulling the plug - but
> conceptually these events are the same).

When that happens I begin to think that further discussion is meaningless.
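
To pin down the two classes of error as defined earlier in this message,
here is a rough sketch in Python. Everything in it is an assumption made
for illustration: write_sector, normal_read, flip_bit and mirrors_agree
are hypothetical names, a dict stands in for a sector, and zlib.crc32
stands in for whatever checksum the hardware really uses. It is not how
md or any drive firmware works.

```python
import zlib

def write_sector(data: bytes) -> dict:
    # Toy sector: payload plus a checksum computed when it is stored.
    return {"data": data, "crc": zlib.crc32(data)}

def normal_read(sector: dict) -> bytes:
    # "Normal OS I/O": the read fails loudly on a checksum mismatch.
    if zlib.crc32(sector["data"]) != sector["crc"]:
        raise OSError("CRC sector failure")
    return sector["data"]

def flip_bit(data: bytes, bit: int = 0) -> bytes:
    # Return a copy of data with one bit inverted.
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)

def mirrors_agree(a: dict, b: dict) -> bool:
    # "Special analysis": compare the component devices directly.
    return a["data"] == b["data"]

good = b"payload"
mirror_a = write_sector(good)

# Detectable: the media is damaged after the checksum was stored,
# so a normal read returns an error.
damaged = dict(mirror_a, data=flip_bit(good))
try:
    normal_read(damaged)
    detectable_seen = False
except OSError:
    detectable_seen = True

# Undetectable (in normal operation): the bit flipped before the
# checksum was computed, so the sector is self-consistent but wrong.
mirror_b = write_sector(flip_bit(good))
silent_wrong_read = normal_read(mirror_b) != good
found_by_comparison = not mirrors_agree(mirror_a, mirror_b)
```

In this toy model the second error never produces an error return code,
and only the comparison of the two mirror halves notices it.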
>>>I would guess that you are trying to classify errors by the way their
>>>probabilities scale with number of disks.
>>
>>Nope - detectable vs undetectable.
>
>Then what's the problem? An undetectable error is one you cannot detect
>via your test. Those scale with real estate. A detectible error is one
>you can spot with your test (on the array, not its components). The
>missed detectible errors scale as n-1, where n is the number of disks in
>the array.
>
>Thus a single disk suffers from no missed detectible errors, and a
>2-disk raid array does.
>
>That's all.
>
>No fuss, no muss!

and so obviously wrong!

An md device does have underlying components and to refuse to allow
tests to compare them you remove one of the benefits of raid - redundancy.

>>Also, it strikes me that raid can actually find undetectable errors by
>>doing a bit-comparison scan.
>
>No, it can't, by definition. Undetectible errors are undetectible. If
>you change your test, you change the class of errors that are
>undetectible.
>
>That's all.
>
>>Non-resilient devices with only one copy of each bit can't do that.
>>raid 6 could even fix undetectable errors.
>
>Then they are not "undetectible".

They are. Read my definition. They are not detected in normal operation
with some kind of event notification/error return code; hence undetectable.
However bit comparison with known good or md5 sums or with a mirror can
spot such bit flips.
They are still 'undetectable' in normal operation.

Be consistent in your terminology.

>The analisis in not affected by your changing the definition of what is
>in the undetectible class of error and what is not. It stands. I have
>made no assumption at all on what they are. I simply pointed out how
>the probabilities scale for a raid array.
>

What analysis - you are waving vague and changing definitions about and
talking about grandma's favourite colour

David

PS any dangling sentences are because I just found so many
inconsistencies that I gave up.
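
Going back to the journal example near the top of this message, the two
reboot scenarios can be written as a toy replay in Python. This is only
a sketch under stated assumptions: recover, the "pending" and
"finalised" fields and the set-of-letters filesystem are invented for
illustration and say nothing about how ext3 or md lay out their data.

```python
def recover(fs: set, journal: dict) -> set:
    # Replay the journal only if its transaction was finalised;
    # an unfinalised journal is ignored.
    if journal["finalised"]:
        return fs | set(journal["pending"])
    return set(fs)

# a, b and c are already synced to the filesystem on both mirrors.
fs_on_disk = {"a", "b", "c"}

# The write of d reached both journal mirrors, but was only
# finalised on mirror B before the plug was pulled.
journal_a = {"pending": ["d"], "finalised": False}
journal_b = {"pending": ["d"], "finalised": True}

scenario_1 = recover(fs_on_disk, journal_a)  # md comes back using A
scenario_2 = recover(fs_on_disk, journal_b)  # md comes back using B
```

Under these assumptions scenario_1 holds a, b and c and scenario_2 holds
a, b, c and d. Either way the filesystem is consistent; the two results
just sit at different points in the series of writes.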