From: Michael Tokarev <mjt@tls.msk.ru>
To: Moshe Yudkowsky <moshe@pobox.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Date: Sun, 03 Feb 2008 23:28:30 +0300
Message-ID: <47A623EE.4050305@msgid.tls.msk.ru>
In-Reply-To: <47A612BE.5050707@pobox.com>
Moshe Yudkowsky wrote:
> I've been reading the draft and checking it against my experience.
> Because of local power fluctuations, I've just accidentally checked my
> system: My system does *not* survive a power hit. This has happened
> twice already today.
>
> I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
> one spare). This partition is on /dev/sd[abcd]1.
>
> I've used grub to install grub on all three running disks:
>
> grub --no-floppy <<EOF
> root (hd0,1)
> setup (hd0)
> root (hd1,1)
> setup (hd1)
> root (hd2,1)
> setup (hd2)
> EOF
>
> (To those reading this thread to find out how to recover: According to
> grub's "map" option, /dev/sda1 maps to hd0,1.)
I usually install all the drives identically in this regard -
each one set up to be treated as the first BIOS disk (disk 0x80).
As was already pointed out in this thread, not all BIOSes are able
to boot off a second or third disk, so if your first disk (sda)
fails, your only option is to put sdb in place of sda and boot
from it - and for that to work, grub on sdb needs to think it is
the first boot drive too.
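A sketch of what I mean, using grub-legacy's `device` command to make each disk appear as the first BIOS drive while installing (device names and the partition number are illustrative - adjust for your layout; grub numbers partitions from 0):

```shell
# Install grub onto each member disk, telling grub during setup
# to treat that disk as the first BIOS drive (hd0), so any of
# them can later be moved into the sda slot and still boot.
for disk in /dev/sda /dev/sdb /dev/sdc; do
  grub --no-floppy --batch <<EOF
device (hd0) $disk
root (hd0,0)
setup (hd0)
EOF
done
```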
By the way, lilo works more easily and more reliably here.
You just install a standard MBR (lilo ships one too) which
simply boots from the active partition, install lilo onto the
raid array, and tell it NOT to do anything fancy with raid
at all (raid-extra-boot none). But for this to work, you
have to have identical partitions with identical offsets -
at least for the boot partitions.
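As a sketch - device names, paths and labels below are illustrative, not taken from a real config:

```shell
# Write lilo's standard MBR (which just boots the partition
# marked active) to each disk:
lilo -M /dev/sda
lilo -M /dev/sdb

# Relevant /etc/lilo.conf lines for this setup:
#   boot=/dev/md0            # lilo's boot sector goes into the array
#   raid-extra-boot=none     # do NOT touch the members' MBRs
#   image=/boot/vmlinuz
#       label=linux
#       root=/dev/md0
```

Because the boot sector lands at the same offset inside every raid1 member, each disk ends up bootable via the generic MBR - hence the requirement for identical partition offsets.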
> After the power hit, I get:
>
>> Error 16
>> Inconsistent filesystem mounted
But did it actually mount it?
> I then tried to boot up on hda1,1, hdd2,1 -- none of them worked.
Which is in fact expected, given the above. You have 3 identical
copies (thanks to raid) of your boot filesystem, all 3 equally
broken. Whichever disk you boot off - hda, hdb or hdc - the same
/boot raid array gets assembled.
> The culprit, in my opinion, is the reiserfs file system. During the
> power hit, the reiserfs file system of /boot was left in an inconsistent
> state; this meant I had up to three bad copies of /boot.
I've never seen any problem with ext[23] wrt unexpected power loss
so far, running several hundred different systems, some since 1998,
some since 2000. Sure, there were several inconsistencies, and
sometimes (maybe once or twice) some minor data loss (only a few
newly created files were lost), but the most serious outcome was
finding a few items in lost+found after an fsck - and that was ext2;
I've never seen even that with ext3.
What's more, I tried hard to "force" a power failure at an
"unexpected" time, by doing massive write operations and cutting
power in the middle - I was never able to trigger any problem this
way, at all.
In any case, even if an ext[23] filesystem is somewhat damaged, it
can still be mounted - access to some files may return I/O errors
(in the parts that are really damaged), but the rest will work.
On the other hand, I had several immediate issues with reiserfs.
That was a long time ago, when the filesystem was first included in
the mainline kernel, so it doesn't reflect the current situation.
Yet even at that stage, reiserfs had been declared "stable" by its
authors. The issues were trivially triggerable by cutting the power
at an "unexpected" time, and fsck didn't help on several occasions.
So I tend to avoid reiserfs - due to my own experience, and due to
numerous problems elsewhere.
> Recommendations:
>
> 1. I'm going to try adding a data=journal option to the reiserfs file
> systems, including the /boot. If this does not work, then /boot must be
> ext3 in order to survive a power hit.
By the way, if your /boot is a separate filesystem (i.e., there's
nothing else on it), I see absolutely zero reason for it to get
corrupted. /boot is modified VERY rarely (only when installing a
kernel), and only while it's being modified is there any chance for
it to be damaged. The rest of the time it's constant, and a power
cut should not hurt it at all. If reiserfs shows such behaviour
even on a filesystem that isn't being modified, that's one more
reason to avoid it.
> 2. We discussed what should be on the RAID1 bootable portion of the
> filesystem. True, it's nice to have the ability to boot from just the
> RAID1 portion. But if that RAID1 portion can't survive a power hit,
> there's little sense. It might make a lot more sense to put /boot on its
> own tiny partition.
Hehe.
/boot doesn't matter much, really. A separate /boot has
traditionally served 3 purposes:
1) to work around the BIOS 1024-cylinder limit (long gone with LBA);
2) to allow the rest of the system to sit on a filesystem/raid/lvm/
   etc. that the bootloader doesn't support. For example, lilo
   didn't support reiserfs (and still doesn't with tail packing
   enabled), so if you wanted reiserfs for your root fs, you put
   /boot on a separate ext2 filesystem. The same goes for raid: you
   can put the rest of the system on a raid5 array (unsupported by
   grub/lilo), and in order to boot, create a small raid1 (or any
   other supported level) for /boot;
3) to keep it as non-volatile as possible - an area of the disk
   which never changes (except in a few very rare cases). For
   example, if the first sector of a disk fails, the disk becomes
   unbootable, so the fewer writes we do to that area, the better.
   This mattered mostly before sector relocation became standard.
Currently, points 1 and 3 are mostly moot. Point 2 still stands,
but it does not prevent us from "joining" /boot and / together,
for easier repair if one is ever needed.
Speaking of repairs: as I already mentioned, I always use a small
(256M..1G) raid1 array for my root partition, including /boot,
/bin, /etc, /sbin, /lib and so on (/usr, /home and /var are on
their own filesystems). And I've already run into the following
scenarios:
a) The raid does not start - either operator error (most of the
cases) or disk failure (mdadm was unable to read the superblocks).
This is handled by booting off any single component device
(passing root=/dev/hda1 to the bootloader).
Sadly, many initrd/initramfs setups in use today - I'd say all
but mine - don't let you pass additional arguments (or rather,
don't recognize those arguments properly). For example, early
redhat stuff used a hardcoded root= argument and didn't parse the
corresponding root= kernel parameter, so it was not possible to
change which root got mounted. None of the current initramfs
builders I'm aware of allows passing raid options on the kernel
command line - for example, instead of a hardcoded
md1=$UUID_OF_THE_ARRAY, I sometimes pass md1=/dev/sda1,/dev/sdc1
(omitting the failed sdb), and my initrd assembles that instead of
the hardcoded array... very handy (but it's best not to end up in
a situation where it might be handy ;)
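From a rescue shell, the equivalent of that kernel argument can be done by hand with mdadm - again, device names here are illustrative:

```shell
# Assemble md1 from explicitly listed components, skipping the
# failed sdb1; --run starts the array even though it's degraded:
mdadm --assemble --run /dev/md1 /dev/sda1 /dev/sdc1
```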
b) A damaged filesystem. As I mentioned above, this happened once
or twice over all those years. Here, I boot off one component
device (without assembling the raid), read-only. That gives me all
the tools to check the root (and other) filesystems - by examining,
and even *modifying* (running fsck for real), the other component(s)
of the raid1. At this stage it's easy to screw things up: once I've
modified only one component of the raid1 and then assemble the
array, I'll be reading random data - one read served from the
modified component, the next from the original, and so on. So this
situation needs extreme care - as does any unbootable system where
the root filesystem is seriously damaged.
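A sketch of this step, assuming a two-disk raid1 of sda1 and sdb1 with ext3 (names illustrative):

```shell
# Booted with root=/dev/sda1, read-only, raid NOT assembled.
# Repair the OTHER component for real, leaving the component
# we're running from untouched:
fsck.ext3 -f /dev/sdb1
```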
So basically, if I have a 2-component raid1 for root, I can mount
the (damaged) first component read-only and try to repair the
second using fsck, and see whether things work from there. And if
I really managed to fix the 2nd component, I assemble the raid
again - by rebooting and specifying md1=/dev/sdb1 (only the 2nd
component, which I just fsck'ed and fixed) - and resync sda1 into
it later...
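That final resync can be sketched as follows (device names are an example only):

```shell
# Running on the repaired component (sdb1) as a degraded array,
# add the stale component back; md resyncs it from the good copy:
mdadm /dev/md1 --add /dev/sda1
cat /proc/mdstat    # watch the recovery progress
```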
And so on... ;)
That's basically 2 cases covering everything.
/mjt