From: maarten <maarten@ultratux.net>
To: linux-raid@vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
Date: Tue, 4 Jan 2005 20:02:32 +0100
Message-ID: <200501042002.32292.maarten@ultratux.net>
In-Reply-To: <895qa2-0qa.ln1@news.it.uc3m.es>
On Tuesday 04 January 2005 10:46, Peter T. Breuer wrote:
> Andy Smith <andy@strugglers.net> wrote:
> >
> > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote:
> > > On Monday January 3, ewan.grantham@gmail.com wrote:
> > Except that Peter says that the ext3 journals should be on separate
> > non-mirrored devices and the reason this is not mentioned in any
> > documentation (md / ext3) is that everyone sees it as obvious.
>
> It's not obvious to anyone, where by "it" I mean whether or not you
> "should" put a journal on the same raid device. There are pros and
> cons. I would not. My reasoning is that I don't want data in the
> journal to be subject to the same kinds of creeping invisible corruption
> on reboot and resync that raid is subject to. But you can achieve that
[ I'll attempt to address all the issues that have come up in this entire
thread so far here... please bear with me. ]
@Peter:
I still need you to clarify what can cause such creeping corruption.
There are several possible cases:
1) A bit flipped on the platter or the drive firmware had a 'thinko'.
This will be signalled by the CRC / ECC on the drive. You can't flip a bit
unnoticed. In fact, bits get 'flipped' constantly, hence the highly
sophisticated error correction code in modern drives. If the ECC can't
rectify such an error, the drive will report a read error to the OS.
Obviously, the raid or FS code handles this error in the usual way; this is
what we call a bad sector, and we have routines that handle that perfectly.
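To make case 1) concrete, here is a minimal sketch of "a flipped bit gets
noticed". It is an illustration only: zlib.crc32 stands in for the much
stronger ECC that real drive firmware uses, and store_sector, read_sector and
flip_bit are hypothetical helper names, not any real drive or kernel API.

```python
import zlib
from typing import Tuple


def store_sector(payload: bytes) -> Tuple[bytes, int]:
    """Return the payload together with its CRC32, like a very simplified sector."""
    return payload, zlib.crc32(payload)


def read_sector(payload: bytes, checksum: int) -> bytes:
    """Return the payload, or raise IOError if the checksum no longer matches."""
    if zlib.crc32(payload) != checksum:
        # This is the point where a drive would report a read error to the OS.
        raise IOError("read error: checksum mismatch (bad sector)")
    return payload


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Return a copy of the payload with exactly one bit inverted."""
    damaged = bytearray(payload)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)
```

Reading back flip_bit(data, n) with the original checksum raises IOError for
any single bit n, which is the "you can't flip a bit unnoticed" part. This
sketch only detects; it does not correct.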
2) An incomplete write due to a crash.
This can't happen on the drive itself, as the onboard cache will ensure
everything that's in there gets written to the platter. I have no reason to
doubt what the manufacturer promises here, but it is easy to check if one
really wants to; just issue a couple thousand cycles of well timed <write
block, kill power to drive> commands, and verify if it all got written.
(If not: start a class action suit against the manufacturer)
Another possibility is it happening in a higher layer, the raid code or the FS
code. Let's examine this further. The raid code does not promise that that
can't happen ("MD raid is no substitute for a UPS"). But, the FS helps here.
In the case of a journaled FS, the first thing that must be written is the
delta. Then the data, then the delta is removed again. From this we can
trivially deduce that indeed a journaled FS will not(*) tolerate write
reordering, as that is the only way data could get written without there first
being a journal delta on disk. So at least that part is correct indeed(!)
So in fact, a journaled FS will either have to rely on lower layers *not*
reordering writes, or will have to wait for the ACK on the journal delta
before issuing the actual_data write command(!).
(*) unless it waits for the ACK mentioned above.
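The "wait for the ACK" option can be sketched in userspace terms. This is only
an analogy under stated assumptions: os.fsync plays the role of the ACK,
ordinary files stand in for the journal and the data block, and write_durably
and ordered_update are hypothetical names. Whether fsync really reaches the
platter depends on the drive's cache settings, which this sketch does not
address.

```python
import os


def write_durably(path: str, payload: bytes) -> None:
    """Write payload to path and return only after fsync reports completion."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)  # small payloads only; no partial-write loop here
        os.fsync(fd)           # the "ACK": block until the kernel reports the flush
    finally:
        os.close(fd)


def ordered_update(journal_path: str, data_path: str, payload: bytes) -> None:
    """Journal delta first, then the data, then remove the delta again."""
    write_durably(journal_path, b"about to write: " + payload)
    write_durably(data_path, payload)  # issued only after the delta was ACKed
    os.remove(journal_path)            # a real implementation would sync this too
```

Because each step blocks on its own flush, the layers below are free to
reorder whatever they like within a step without the data ever getting ahead
of its journal delta.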
Further, we thus can split up the write in separate actions:
A) the time during which the journal delta gets written
B) the time during which the data gets written
C) the time during which the journal delta gets removed.
Now at what point do or did we crash ? If it is at A) the data is consistent,
no matter whether the delta got written or not. If it is at B) the data
block is in an unknown state and the journal reflects that, so the journal
code rolls back. If it is at C) the data is again consistent. Depending on
what sense the journal delta makes, there can be a rollback, or not. In
either case, the data still remains fully consistent.
It's really very simple, no ?
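For those who prefer to see it run, a toy model of stages A) through C)
follows. It is a sketch under assumptions: the "disk" is a dict holding one
data block and one journal slot, the delta is modelled as the old value so
that recovery is a plain rollback as described above, and all names are made
up for the illustration. It follows the description in this mail, not ext3's
actual on-disk journal.

```python
from typing import Optional

TORN = "<torn>"  # stands for a half-written data block


def journaled_write(disk: dict, new_value: str, crash_at: Optional[str] = None) -> None:
    """Perform one write in stages A, B, C; stop early to simulate a crash."""
    # Stage A: write the journal delta (here: the old value, for rollback).
    if crash_at == "A":
        return                  # crashed before the delta reached the disk
    disk["journal"] = disk["data"]
    # Stage B: write the data block itself.
    if crash_at == "B":
        disk["data"] = TORN     # block is left in an unknown state
        return
    disk["data"] = new_value
    # Stage C: remove the journal delta.
    if crash_at == "C":
        return                  # data is written, delta is still present
    disk["journal"] = None


def recover(disk: dict) -> None:
    """Roll back to the journaled value if a delta is still present."""
    if disk["journal"] is not None:
        disk["data"] = disk["journal"]
        disk["journal"] = None


def outcome(crash_at: Optional[str]) -> str:
    """Crash at the given stage, recover, and return the resulting data block."""
    disk = {"data": "old", "journal": None}
    journaled_write(disk, "new", crash_at)
    recover(disk)
    return disk["data"]


if __name__ == "__main__":
    for stage in ("A", "B", "C", None):
        print(stage, "->", outcome(stage))
```

In this model a crash at A, B or C leaves "old" after recovery and an
uninterrupted write leaves "new"; the torn block never survives recovery.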
Now to get to the real point of the discussion. What changes when we have a
mirror ? Well, if you think hard about that: NOTHING. What Peter tends to
forget is that there is no magical mixup of drive 1's journal with drive 2's
data (yep, THAT would wreak havoc!).
At any point in time -whether mirror 1 is chosen as true or mirror 2 gets
chosen does not matter as we will see- the metadata+data on _that_ mirror by
definition will be one of the cases A through C outlined above. IT DOES NOT
MATTER that mirror one might be at stage B and mirror two at stage C. We use
but one mirror, and we read from that and the FS rectifies what it needs to
rectify.
This IS true because the raid code at boot time sees that the shutdown was not
clean, and will sync the mirrors. At this point, the FS layer has not even
come into play. Only when the resync has finished does the FS get to examine
its journal. -> !! At this point the mirrors are already in sync again !! <-
If, for whatever reason, the raid code would NOT have seen the unclean
shutdown, _then_ you may have a point, since in that special case it would be
possible for the journal entry from mirror one (crashed during stage C) to get
used to evaluate the data block on mirror two (being in stage B). In those
cases, bad things may happen obviously.
If I'm not mistaken, this is what happens when one has to assemble --force an
array that has had issues. But as far as I can see, that is the only time...
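The mirror argument and its one exception can be put in a toy model as well.
Assumptions, stated loudly: each mirror is a dict with one data block and one
journal slot, the delta is the old value (so recovery is a rollback), the
resync is taken to finish before the FS looks at its journal as described
above, and in the mixed case mirror one is assumed to have got as far as
removing its delta. All names are hypothetical; none of this is the md code.

```python
import copy

TORN = "<torn>"  # stands for a half-written data block

# Possible per-mirror states at the moment of the crash.
STAGE_B = {"journal": "old", "data": TORN}   # crashed while writing the data
STAGE_C = {"journal": "old", "data": "new"}  # data written, delta not yet removed
DONE = {"journal": None, "data": "new"}      # delta removal reached the disk


def recover(disk: dict) -> str:
    """Roll back if a delta is present; return the resulting data block."""
    if disk["journal"] is not None:
        disk["data"] = disk["journal"]
        disk["journal"] = None
    return disk["data"]


def boot(mirror_one: dict, mirror_two: dict, source_is_one: bool = True) -> str:
    """Unclean shutdown was noticed: resync first, then let the FS recover."""
    m1, m2 = copy.deepcopy(mirror_one), copy.deepcopy(mirror_two)
    if source_is_one:
        m2 = copy.deepcopy(m1)   # mirror one overwrites mirror two
    else:
        m1 = copy.deepcopy(m2)   # mirror two overwrites mirror one
    recover(m1)
    recover(m2)
    assert m1 == m2              # journal and data always come from one mirror
    return m1["data"]


def mixed_read(journal_from: dict, data_from: dict) -> str:
    """No resync happened: journal read from one mirror, data from the other."""
    mixed = {"journal": journal_from["journal"], "data": data_from["data"]}
    return recover(mixed)
```

With mirror one in DONE and mirror two in STAGE_B, boot() gives "new" or
"old" depending on which mirror is the resync source, and both are consistent.
mixed_read(DONE, STAGE_B) returns the torn block, because the journal says
there is nothing to roll back. That is the special case, and only that.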
Am I making sense so far ? (Peter, this is not addressed to you, as I already
know your answer beforehand: it'd be "baby raid tech talk", correct ?)
So. What possible scenarios have I overlooked until now...?
Oh yeah, the possibility number 3).
3) The inconsistent write comes from a bug in the CPU, RAM, code or such.
As Neil already pointed out, you gotta trust your CPU to work right otherwise
all bets are off. But even if this could happen, there is no blaming the FS
or the raid code, as the faulty request was carried out as directed. The
drives may not be in sync, but neither the drive, the raid code nor the FS
knows this (and cannot reasonably know!) If a bit in RAM gets flipped in
between two writes there is nothing except ECC ram that's going to help you.
Last possible theoretical case: the bug is actually IN the raid code. Well, in
this case, the error will most certainly be reproducible. I cannot speak
for the code as I have neither written nor reviewed it (nor would I be able
to...) but this really seems far-fetched. Lots of people use and test the
code; it would have been spotted at some point.
Does this make any sense to anybody ? (I sure hope so...)
Maarten
--