linux-raid.vger.kernel.org archive mirror
From: Michael Tokarev <mjt@tls.msk.ru>
To: linux-raid@vger.kernel.org
Subject: Which drive gets read in case of inconsistency? [was: ext3 journal on software raid etc]
Date: Tue, 04 Jan 2005 14:57:57 +0300
Message-ID: <41DA84C5.3070403@tls.msk.ru>
In-Reply-To: <ms4qa2-0qa.ln1@news.it.uc3m.es>

Peter T. Breuer wrote:
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
[]
>>If there is a system crash before correct, consistent data is written,
>>then on restart, disk B will not be read at all until disk A has been
> 
> Why do you think so? I know of no mechanism in RAID that records to
> which of the two disks paired data has been written and to which it has
> not!
> 
> Please clarify - this is important. If you are thinking of the "event
> count" that is stamped on the superblocks, that is only updated from
> time to time as far as I know! Can you please specify (for my
> curiosity) exactly when it is updated? That would be useful to know.

Yes, this is still the darkest corner of the whole raid stuff for me.
I just looked at the code again and re-read it several times, but the
code is a bit.. large to understand in a relatively short time.  This
very question has bothered me for quite some time now.  How does the md
code "know" which drive has "more recent" data on it after a system
crash (power loss, whatever) that happens after one drive has completed
the write but before the other has?  The "event counter" isn't updated
on every write (that would be very expensive in both time and disk
health: too much seeking and too many writes to the single block where
the superblock is located).
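
To make the problem concrete, here is a toy model in Python.  It is
only a sketch of the scenario described above, not the md driver's
real data structures: the Mirror class, the freshest() helper and the
event count value 7 are all made up for illustration.  The only
assumption taken from the discussion is that the event counter is
stamped on the superblocks and is not touched by an ordinary data
write.

```python
# Toy model only: NOT the md driver's real data structures.
from dataclasses import dataclass, field


@dataclass
class Mirror:
    events: int = 0                                # hypothetical superblock event counter
    blocks: dict = field(default_factory=dict)     # block number -> data


def freshest(mirrors):
    """Return the indices of the mirrors with the highest event count."""
    top = max(m.events for m in mirrors)
    return [i for i, m in enumerate(mirrors) if m.events == top]


a, b = Mirror(), Mirror()

# Array assembled cleanly: both superblocks carry the same event count.
a.events = b.events = 7

# An ordinary data write reaches mirror A, then power is lost before it
# reaches mirror B.  No superblock update happens for this write.
a.blocks[100] = "new"

# After restart, comparing event counts cannot tell A and B apart,
# although block 100 differs between them.
candidates = freshest([a, b])
print(candidates, a.blocks.get(100), b.blocks.get(100))
```

In this model both mirrors come out as equally "fresh", so something
other than the event count has to decide which copy of block 100 is
read.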

For me, and I'm just thinking about how it could be done, the only
possible solution in this case is to choose a "random" drive and declare
it "up-to-date", although it will not necessarily be really up-to-date.
Or, maybe, write to the "first" drive first and to the "second" next, and
assume the first drive has the data written before the second (no
guarantee here because of reordering, differences in drive speed etc,
but it is, sort of, a valid assumption).

Speaking of a reasonable filesystem (journalling isn't relevant here;
the key word is "reasonable", that is, a system that makes complex
operations atomic) and filesystem metadata, choosing a "random"
drive as up-to-date makes some sense: at least the metadata will
be consistent.  It will not necessarily be up to date.  For example, it
is still possible to lose some mail file which has been acknowledged
by the filesystem AND by the smtp server, because choosing the
"wrong" (not most recent) drive "rolled back" that file operation.
But it will still be consistent (I'm not talking about data consistency
and integrity, that's another long story).

Or, maybe it's better to ask the question slightly (?) differently:
recalling "write barriers" etc and raid1 (for simplicity), will the raid
code acknowledge a write only after ALL drives have been written to?
And thus, with a reasonable filesystem (again), will a filesystem
operation (at least on metadata) succeed ONLY after the md layer
reports that ALL disks have the data written?  (This way, it really makes
no difference which drive, fresh or not, is considered up to
date after a poweroff in the middle of some write, *at least* for
filesystem metadata, and for applications that implement the "commit"
concept needed to correctly implement "reasonable" metadata
operations).
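
The "acknowledge only after ALL drives" idea can be sketched as
another toy model.  Whether md actually behaves this way is exactly
the open question here, so this shows only what would follow IF it
did.  Raid1Model, write(), crash_after and acked_everywhere() are
hypothetical names invented for the sketch.

```python
# Toy model only: assumes a raid1 write is acknowledged strictly after
# every mirror has it.  This is the hypothesis being asked about, not a
# statement of what the md code does.
class Raid1Model:
    def __init__(self, n_mirrors):
        self.mirrors = [dict() for _ in range(n_mirrors)]
        self.acked = {}                  # block -> data reported as written

    def write(self, block, data, crash_after=None):
        """Write to each mirror in turn.

        crash_after=k simulates power loss after k mirrors were written.
        Returns True only if the write was acknowledged.
        """
        for i, mirror in enumerate(self.mirrors):
            if crash_after is not None and i >= crash_after:
                return False             # crashed: never acknowledged
            mirror[block] = data
        self.acked[block] = data         # acknowledged after ALL mirrors
        return True


def acked_everywhere(model):
    """True if every acknowledged write is present on every mirror.

    Simplification: assumes no in-flight overwrite of an already
    acknowledged block.
    """
    return all(mirror.get(block) == data
               for mirror in model.mirrors
               for block, data in model.acked.items())


model = Raid1Model(3)
ok = model.write(1, "committed")                       # reaches all 3 mirrors
crashed = model.write(2, "in-flight", crash_after=1)   # reaches mirror 0 only

print(ok, crashed, acked_everywhere(model))
print([mirror.get(2) for mirror in model.mirrors])
```

Under that assumption, whichever mirror is later declared up to date
holds every acknowledged write.  The mirrors can differ only in writes
that were still in flight at the crash, which nobody was told had
completed.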

How does it all fit together?
Which drive will be declared "fresh"?
How about several (>2) drives in a raid1 array?
How about data written without a concept of "commits": if the "wrong"
drive is chosen, will it contain some old data, while
another drive contained new data but was declared "non fresh" at
reconstruction?
And speaking of the previous question, is there any difference here
between an md device and a single disk, which also does various write
reordering and stuff like that?  I mean, does the md layer increase
the probability of seeing old data after a reboot caused by a power loss
(for example) if an app (or whatever) was writing some new data during
the power loss (or even when the filesystem had reported the write
complete)?

A lot of questions.. but I think it's really worth understanding
how it all works.

Thanks.

/mjt

