* Concept problem with RAID1?
From: Jim Klimov @ 2006-03-24 10:16 UTC
To: linux-raid
Hello linux-raid,
In this rather long letter to the community of Linux RAID developers
I would like to express some ideas on how to make the basic RAID1
layer more reliable.
I support a number of cheap-hardware servers for a campus network,
and last year we decided to transfer our servers to mirrored hard
disk disk partitions to improve reliability. In fact, reliability seems
to have become worse, at least on one server. We have problems with
data going corrupt while the submirrors are still considered valid and
are not kicked out of the RAID metadevice for a rebuild/resync. Matters
get worse when the problems are not detected by any layer below RAID
and don't trigger an error, so the corrupt data looks valid to RAID.
Most of our servers run a system based on linux-2.6.15; some older
ones (the problematic one in particular) run linux-2.4.31 or 2.4.32
on hardware with the VIA Apollo 133 (694/694T) chipset, dual P-III or
Tualatin CPUs, some with additional Promise PCI IDE controllers.
As far as I can tell, certain events can corrupt data on the hard
disk. We know that, which is why we built RAID1 :)
The events in question are bad shutdowns and power surges, when the
hardware may feel free to write a random signal to the platter, and
kernel panics which occur for unclear reasons and sometimes leave the
server frozen in the middle of an operation. On top of that, some of
our disks are old and their data may decay over time :) We have a UPS,
so the former problem is rare, but the latter ones do occur.
I understand that rolling in newer server hardware might help, but I
want Linux to be reliable too :) I feel that the raid driver puts too
much trust in data held on disks that are not very trustworthy in the
first place. It is also wrong to assume that hardware errors will
always be detected by SMART and remapped: really old disks may have
run out of spare sectors, for example.
The problem is that one of the disks in the mirror may carry a newer
timestamp, so when the system restarts, the metadevice is rebuilt from
it onto the disk with the older timestamp. The source disk may in fact
hold corrupt data in locations where the destination disk had valid
data, and vice versa, so data loss occurs and propagates.
I have even seen a case where the metadevice of the root file system
was force-rebuilt from two inconsistent halves: binaries read in fine
at one moment, came back as non-executable trash a bit later, and were
executable again after that.
So far the only reliable way to repair a system has been to leave it
in single-user mode for several hours to resync all of its mirrored
partitions, then fsck all the metadevices and reboot to multiuser.
Otherwise random problems can happen. With nearly a workday of
downtime, that is not exactly a reliable server...
I hope I have conveyed just how bad the problem is. I also think I
have an idea for a solution which could make mirroring a reliable
option at the cost of space and/or write speed. In my case server IO
is light (mostly collecting syslogs from smart switches), so IO speed
doesn't matter much.
Alas, I am more of an administrator than a programmer, so I hope the
gurus who made this fine Linux OS can conjure up a solution faster and
better than I could :)
1) Idea #1 - make writes transactional: write to submirror-1, verify,
update its timestamp, write to submirror-2, verify, update, and so on.
Since the write-verify step takes a relatively long time, the kernel
might want to pick the more available device as the master for the
current write (in the case of parallel writes to different mirrors on
the same hardware).
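Roughly, I mean an ordering like the following toy sketch (Python,
with made-up image files standing in for the submirrors and a simple
counter standing in for the superblock timestamp; the real md driver
would of course do this in the kernel and would have to bypass the
page cache for the verify read):

import os
import time

# Made-up files standing in for two submirror block devices.
SUBMIRRORS = ["/tmp/submirror0.img", "/tmp/submirror1.img"]
for dev in SUBMIRRORS:
    if not os.path.exists(dev):
        with open(dev, "wb") as f:
            f.truncate(1 << 20)        # 1 MiB toy "submirror"

# Toy stand-in for the per-device superblock timestamp.
timestamps = {dev: 0 for dev in SUBMIRRORS}

def transactional_write(offset, data):
    """Write, read back and verify, and only then bump the timestamp,
    so that after a crash the freshest *verified* copy wins."""
    for dev in SUBMIRRORS:
        with open(dev, "r+b") as f:
            f.seek(offset)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # push the block out to the device
            f.seek(offset)
            if f.read(len(data)) != data:
                raise IOError("verify failed on %s at offset %d" % (dev, offset))
        timestamps[dev] = time.time_ns()   # only verified copies advance

The point is only the ordering: a device's timestamp is advanced after
the copy on that device has been verified, never before.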
2) Idea #2 - when a mirror has more than two halves, perhaps it is
better to check the data on all of the halves (if their timestamps are
close) and rebuild not simply from the most recently updated
submirror, but from the data that is identical on the majority of
submirrors?
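In rough pseudo-code (Python, block size and rebuild loop invented
just to show the voting rule, not the actual raid1 resync code):

from collections import Counter

def majority_block(copies):
    """Return the block content seen on most submirrors; on a tie, fall
    back to the first copy (where a timestamp or an admin would decide)."""
    block, votes = Counter(copies).most_common(1)[0]
    return block if votes > len(copies) // 2 else copies[0]

def voted_rebuild(image_paths, block_size=4096):
    """Rebuild a repaired image by voting block by block across 3+ halves."""
    handles = [open(p, "rb") for p in image_paths]
    try:
        repaired = bytearray()
        while True:
            blocks = [h.read(block_size) for h in handles]
            if not blocks[0]:
                break
            repaired += majority_block(blocks)
        return bytes(repaired)
    finally:
        for h in handles:
            h.close()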
3) Idea #3 - follow the Sun. Metadevices in Solaris rely on
"metadevice state database replicas", which hold information about all
the metadevices in the system - paths, timestamps, etc. - and can be
kept on several devices, not only the ones that make up the metadevice
itself (in Linux we could, for instance, keep a spare replica on a CF
disk-on-chip). When there is a problem with the disks, Solaris checks
how many up-to-date replicas it has; if it has a quorum (more than 50%
of them, and at least 3 replicas), it rebuilds the metadevices
automatically. Otherwise it waits for an administrator to decide which
half of the mirror is more correct, because the data is considered
more important than the downtime.
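The quorum rule itself is simple; roughly (Python, with made-up record
fields - this only encodes the rule as I described it, it is not
actual Solaris code):

def have_quorum(replicas):
    """replicas are toy records like {"device": "/dev/hda3", "serial": 42},
    where the serial number grows with every metadevice state change.
    Rebuild automatically only with at least 3 replicas and more than
    50% of them up to date; otherwise wait for the administrator."""
    if len(replicas) < 3:
        return False
    newest = max(r["serial"] for r in replicas)
    up_to_date = sum(1 for r in replicas if r["serial"] == newest)
    return 2 * up_to_date > len(replicas)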
4) Here's an idea which stands somewhat apart from the others: build
mirrors on top of a meta-layer of blocks with short CRCs (and perhaps
per-block timestamps). When rebuilding a device from submirrors which
each had random corrupted trash in a few blocks (e.g. noise from a
landing HDD head), we could compare which parts of the two (or more)
submirrors are valid and keep only consistent data in the repaired
device.
A shortcoming of this approach is that we would no longer be able to
mount the raw data of a partition as we can now with a mirror half. We
would have to assemble at least a metadevice of CRC'd blocks and mount
that. In particular, support for such devices would have to be built
into the OS loaders (lilo, grub)...
I believe that several hardware RAID controllers follow similar logic.
For example, HP SmartArray mirrors are some 5% smaller than the disks
they are built from, according to diagnostics.
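As a toy model of the layout I have in mind (Python; the 4096-byte
payload plus 4-byte CRC32 trailer format is invented purely for
illustration, a real implementation would of course live in the block
layer):

import struct
import zlib

DATA = 4096                 # payload bytes per block
SLOT = DATA + 4             # plus a 4-byte CRC32 trailer

def write_block(f, index, payload):
    assert len(payload) == DATA
    f.seek(index * SLOT)
    f.write(payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF))

def read_block(f, index):
    """Return the payload, or None if the stored CRC does not match."""
    f.seek(index * SLOT)
    slot = f.read(SLOT)
    if len(slot) < SLOT:
        return None
    payload, (crc,) = slot[:DATA], struct.unpack("<I", slot[DATA:])
    return payload if zlib.crc32(payload) & 0xFFFFFFFF == crc else None

def mirrored_read(submirror_files, index):
    """Take the first copy whose CRC checks out, whichever submirror it is on."""
    for f in submirror_files:
        payload = read_block(f, index)
        if payload is not None:
            return payload
    raise IOError("block %d is corrupt on every submirror" % index)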
5) And a small optimisation idea from Sun Solaris as well: could we
define rebuild priorities for our metadevices? For example, when my
system rebuilds its mirrors, I want it to finish the small system
partitions first (in case it fails again soon) and only then move on
to the large partitions and rebuild them for hours. Currently it seems
to pick them in a somewhat random order.
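The policy I want could be as trivial as "smallest arrays first",
something like this (device names and sizes made up):

def rebuild_order(array_sizes):
    """array_sizes maps an md device to its size in KiB; resync smallest
    first, so critical system partitions come back long before the big
    data partitions."""
    return sorted(array_sizes, key=array_sizes.get)

# rebuild_order({"md0 /boot": 102400, "md1 /": 8388608, "md2 /home": 209715200})
# -> ["md0 /boot", "md1 /", "md2 /home"]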
Ideas #1 and #2 could be options for the current raid1 driver, ideas
#3 and #5 are general LVM/RAID concepts, and idea #4 is probably best
implemented as a separate LVM device type (though raid1 and the
bootloaders would have to take it into account for rebuilds).
I hope these ideas can help make an already good Linux even better :)
--
Best regards,
Jim Klimov mailto:klimov@2ka.mipt.ru
* Re: Concept problem with RAID1?
From: PFC @ 2006-03-24 16:07 UTC
To: Jim Klimov, linux-raid
I think you would like something like this: an LVM (or dm,
device-mapper) layer which sits between the RAID layer and the
physical disks. This layer computes checksums as data is written to
the physical disks and checks data read back against those checksums.
The problem is: where do you store the checksums? You need rather
fast, non-volatile memory for this. An interesting option would be a
USB key. However, flash memory has a limited number of write cycles,
so some scheme would be needed to avoid always overwriting the same
spot on the flash.
Interesting...
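A dumb round-robin scheme might already help: write checksum-journal
record number N into slot N modulo the number of slots, each record
carrying its own sequence number and CRC so the newest valid one can
be found again by scanning at startup. A rough sketch (Python; the
slot geometry and the path to the flash device are invented):

import os
import struct
import zlib

SLOT_SIZE = 4096            # one journal record per slot on the flash key
NUM_SLOTS = 1024            # invented geometry, ~4 MiB of the device

def append_record(path, sequence, payload):
    """Spread consecutive writes over the device instead of hammering
    one spot; `path` is the (pre-existing) flash device or image file."""
    record = struct.pack("<QI", sequence, zlib.crc32(payload) & 0xFFFFFFFF) + payload
    assert len(record) <= SLOT_SIZE
    with open(path, "r+b") as f:
        f.seek((sequence % NUM_SLOTS) * SLOT_SIZE)
        f.write(record.ljust(SLOT_SIZE, b"\x00"))
        f.flush()
        os.fsync(f.fileno())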