From: Michael Tokarev <mjt@tls.msk.ru>
To: linux-raid@vger.kernel.org
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
Date: Mon, 03 Jan 2005 15:11:03 +0300
Message-ID: <41D93657.7090908@tls.msk.ru>
In-Reply-To: <2rana2-0if.ln1@news.it.uc3m.es>
Peter T. Breuer wrote:
[]
>>Let's focus on the personal machine of mine for now since it uses
>>Linux software RAID and therefore on-topic here. It has /boot on a
>>small RAID-1,
>
> This is always a VERY bad idea. /boot and /root want to be on as simple
> and uncomplicated a system as possible. Moreover, they never change, so
> what is the point of having a real time mirror for them? It would be
> sufficient to copy them every day (which is what I do) at file system
> level to another partition, if you want a spare copy for emergencies.
Raid1 (mirror) is the most "trivial" raid level out there, especially
bearing in mind that the underlying devices -- all of them -- contain
(or should, in theory -- modulo the "50% chance of any difference
going unnoticed" etc) an exact copy of the filesystem. Also, root (and
/boot -- I for one keep both /boot and root in a single small filesystem)
does change -- not that often, but often enough that the "newaliases problem"
(when you "forgot" to back it up after a change) happens from time to time.
After several years of experience with a lot of systems (and a lot of various
disk failure scenarios too: when you have many systems, you have a good
chance of seeing a failure ;), I now use a very simple and (so far) reliable
approach, which I have explained on this list before. You have several
(we use 2, 3 or 4) disks which are the same (or almost: e.g. some 36Gb
disks are really 35Gb or 37Gb; if they differ, the "extra" space
on the larger disk isn't used); root and /boot are on a small raid1 partition
which is mirrored on *every* disk; swap is on raid1; the rest (/usr,
/home, /var etc) are on raid5 arrays (maybe also raid0 for some "scratch"
space). This way, you have "equal" drives, and *any* drive, including
the boot one, may fail at any time and the system will continue working
as if all were working, including across a reboot (except for a (very rare,
in fact) failure scenario where your boot disk has a failed MBR or other
sectors required to boot, but "the rest" of that disk is working,
in which case you'll need physical presence to bring the machine up).
All the drives are "symmetrical", usage patterns for all drives are
the same, and due to usage of raid arrays, load is spread among them
quite nicely. You're free to reorder the drives in any way you want,
to replace any of them (maybe rearranging the rest if you're
replacing the boot drive) and so on.
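A rough sketch of the arithmetic behind this layout, in Python. The
function name and its parameters are made up for illustration. It assumes
one root and one swap partition per disk, each set mirrored across all
disks, and the remainder in a single raid5 array (a plain mirror when
there are only two disks, which is an assumption of this sketch):

```python
def usable_space(disk_sizes_gb, root_gb, swap_gb):
    """Estimate usable space (in Gb) for the symmetric layout.

    Hypothetical helper: root and swap are raid1 across every disk,
    everything else goes into one raid5 array.
    """
    n = len(disk_sizes_gb)
    if n < 2:
        raise ValueError("need at least two disks for any redundancy")
    # "Extra" space on a larger disk isn't used, so the smallest disk rules.
    per_disk = min(disk_sizes_gb)
    rest = per_disk - root_gb - swap_gb
    if rest <= 0:
        raise ValueError("root and swap do not fit on the smallest disk")
    # raid5 gives (n - 1) members' worth of data; with two disks this
    # sketch assumes a mirror instead, which holds one member's worth.
    data_gb = rest * (n - 1) if n >= 3 else rest
    return {
        "root_gb": root_gb,    # one usable copy, n physical copies
        "swap_gb": swap_gb,    # same for swap
        "data_gb": data_gb,
        "copies_of_root": n,
    }
```

For example, usable_space([36, 35, 37], 1, 1) treats every disk as 35Gb,
leaves 33Gb per disk for the raid5 array, and so reports 66Gb of data space.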
Yes, root fs does not change often, and yes it is small enough
(I use 1Gb, or 512Mb, or even 256Mb for root fs - not a big deal
to allocate that space on each of 2 or 3 or 4 or 5 disks). So
it isn't really relevant how fast the filesystem is on writes,
and hence it's ok to place it on a raid1 composed of 5 components.
The stuff just works, it is very simple to administer/support,
and does all the "backups" automatically. In case of some problem
(yes, I dislike any additional layers for critical system components,
as any layer may fail to start during boot etc), you can easily
bring the system up by booting off the underlying root-raid partition
to repair the system -- all the utilities are there. Moreover, you can
boot from one disk (without raid) and try to repair the root fs on
another drive (if things are really screwed up), and when you're
done, bring the raid up on that repaired partition and add the other
drives to the array.
To summarize: having /boot and root on raid1 is a very *good* idea. ;)
It has saved our data a lot of times in the past few years already.
If you're worried about "silent data corruption" due to different
data being read from different components of the raid array.. Well,
first of all, we have never seen that yet (and we have quite a good
"testcase") (and no, I'm not saying it's impossible, of course). On a
rarely-changed filesystem, with real drives which do no silent remapping
of unreadable blocks to a new place with the data on them becoming all-0s,
without drives with uncontrollable write caching (quite common for
IDE drives) and things like that, and with real memory (ECC I mean),
where you *know* what you're writing to each disk (yes, there's also
another possible cause of a problem: software errors aka bugs ;),
that case with different data on different drives becomes quite..
rare. In order to be really sure, one can mount -o remount,ro /
and just compare all components of the root raid, periodically.
When there are more than 2 components in that array, it should be
easy to determine which drive is "lying" in case of any difference.
I do a similar procedure on my systems during boot.
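The comparison can be sketched as a block-by-block majority vote. This is
only an illustration in Python, not the procedure actually run at boot:
the function name and the 4096-byte block size are assumptions, the
component contents are passed in as bytes rather than read from devices,
and any per-device md metadata area would have to be left out of the
comparison since it may legitimately differ between components.

```python
from collections import Counter

def find_lying_components(images, block_size=4096):
    """Compare mirror components block by block.

    images: dict mapping a component name to its contents (bytes).
    Returns a list of (block_index, suspects), where suspects is the list
    of names that disagree with the majority, or None when there is no
    strict majority (e.g. a two-way mirror that simply differs).
    """
    names = sorted(images)
    length = min(len(images[name]) for name in names)
    report = []
    for offset in range(0, length, block_size):
        blocks = {name: images[name][offset:offset + block_size]
                  for name in names}
        counts = Counter(blocks.values())
        if len(counts) == 1:
            continue  # all components agree on this block
        winner, votes = counts.most_common(1)[0]
        if votes * 2 <= len(names):
            suspects = None  # no strict majority: cannot tell who is lying
        else:
            suspects = [name for name in names if blocks[name] != winner]
        report.append((offset // block_size, suspects))
    return report
```

With three or more components a single bad copy is outvoted and named;
with only two, a difference is detected but the result is None, since
neither side can be trusted over the other.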
>>There is nowhere that is not software RAID to put the journals, so
>
> Well, you can make somewhere. You only require an 8MB (one cylinder)
> partition.
Note scsi disks in linux only support up to 14 partitions, which
sometimes isn't sufficient even without additional partitions for
the journal. When you have a large number of disks (so that the
"fully-symmetrical" layout I described above becomes impractical),
you can use one set of drives for data and another set of drives
for the journal for that data. When you only have 4 (or fewer) drives...
And yes I'm aware of mdp devices (partitions inside the raid
arrays).. but that's just another layer "which may fail": if a
raid5 array won't start, I can at least reconstruct the filesystem
image by reading chunks of data from the appropriate places on
all drives and try to recover that image; with any additional
structure inside the array (and the lack of "loopP" aka partitioned
loop devices) it becomes more and more tricky to recover any
data (from this point of view, raid1 is the nicest raid level ;)
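To give an idea of what "reading chunks from the appropriate places" means,
here is a minimal Python sketch. It assumes the left-symmetric raid5 layout,
member images given in array slot order, data starting at offset 0 of each
member, and a known chunk size; all of these are assumptions that would have
to be checked against the real array. The function names are invented for
the example. One missing member (None) is rebuilt by XOR of the others.

```python
def xor_chunks(chunks):
    """XOR equally sized byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def reassemble_raid5(members, chunk_size):
    """Rebuild the array contents from raid5 member images.

    members: list of bytes objects in slot order; at most one may be None.
    Assumes the left-symmetric layout: parity moves one disk to the left
    on each stripe, and data starts on the disk right after parity.
    """
    n = len(members)
    if n < 3:
        raise ValueError("raid5 needs at least three members")
    if sum(1 for m in members if m is None) > 1:
        raise ValueError("cannot recover with more than one member missing")
    stripes = min(len(m) for m in members if m is not None) // chunk_size
    image = bytearray()
    for stripe in range(stripes):
        start = stripe * chunk_size
        chunks = [None if m is None else m[start:start + chunk_size]
                  for m in members]
        parity_disk = (n - 1) - (stripe % n)
        for d in range(n - 1):
            disk = (parity_disk + 1 + d) % n
            chunk = chunks[disk]
            if chunk is None:
                # Missing data chunk = XOR of every surviving chunk
                # in the stripe (the other data chunks plus parity).
                chunk = xor_chunks([c for c in chunks if c is not None])
            image += chunk
    return bytes(image)
```

With raid1 none of this is needed, since every component already is the
filesystem image.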
Again: instead of using a partition for the journal, use (another?)
raid array. This way, the system will keep working if the drive which
contains the journal fails. Note the above about swap: in all my
systems, swap is also on raid (raid1 in this case). At first
glance, that may look like nonsense: having swap on raid. But we have had
enough cases where, due to a failed drive, swap became corrupt
(unreadable, really), and the system went haywire, *damaging*
other data which was unaffected by the disk failure! With
swap on raid1, the system continues working if any drive
fails, which is good. (Older kernels, esp. the 2.2.* series,
had several problems with swap on raid, but that has been fixed
now; there were other bugs fixed too (incl. bugs in ext3fs)
so there should be no such damage to other data due to
unreadable swap.. hopefully. But I can't trust my systems
anymore after seeing (2 times in 4 years) what can happen with
the data...)
[]
And I also want to "re-reply" to your first message in this
thread, where I was saying that "it's a nonsense that raid does
not preserve write ordering". Of course I meant not write ordering
but working write barriers (as Neil pointed out, the md subsystem does
not implement write barriers directly but the concept is "emulated"
by the linux block subsystem). Write barriers should be sufficient to
implement journalling safely.
/mjt