* A Word of Warning about Linux Software Raid
@ 2006-08-11 18:09 Craig Shelley
2006-08-11 19:34 ` Adrian Ulrich
0 siblings, 1 reply; 6+ messages in thread
From: Craig Shelley @ 2006-08-11 18:09 UTC
To: reiserfs-list
Hi all,
I have a little story that taught me some very important lessons
about Linux software RAID1 (mirroring).
A local power outage caused my system to turn off in a very rough way.
The power didn't go off cleanly; instead it toggled on and off a few
times in quick succession before finally staying off.
When the power was restored, my reiser4 partitions were damaged and
required some attention with fsck.reiser4.
Ever since this event, reiser4 warnings have often been displayed on the
console on unmount when shutting down/rebooting. Each time I saw the
messages, I ran fsck.reiser4, which sometimes found and fixed errors.
Not knowing which partition was causing the problem was a bit annoying,
since I have 4 reiser4 partitions.
Yesterday, running fsck.reiser4 left the system unable to boot. Further
runs of fsck.reiser4 would sometimes find more errors and then, a few
minutes later, find none. At this point I began to wonder whether my
SATA controller had become faulty, since the hardware appeared to behave
differently from one run to the next.
Eventually the problem was diagnosed to be caused by the data on the two
mirrored disks not being identical. It seems that the kernel does not
check the integrity of the data on mirrored raid, and returns a "mix" of
data from each disk as it is accessed. Over time, bad shutdowns and
crashes lead to differences between the data on the two mirrored disks,
and this can eventually have catastrophic consequences.
I re-synced the disks using the following commands: (let me know if
there is a nicer way)
prometheus:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1]
md0 : active raid1 hdc1[1] hda1[0]
4883648 blocks [2/2] [UU]
...
prometheus:~# mdadm --manage --fail /dev/md0 /dev/hdc1
mdadm: set /dev/hdc1 faulty in /dev/md0
prometheus:~# mdadm --manage --remove /dev/md0 /dev/hdc1
mdadm: hot removed /dev/hdc1
prometheus:~# mdadm --manage --add /dev/md0 /dev/hdc1
mdadm: hot added /dev/hdc1
prometheus:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1]
md0 : active raid1 hdc1[2] hda1[0]
4883648 blocks [2/1] [U_]
[====>................] recovery = 22.4% (1098368/4883648)
finish=3.0min speed=20364K/sec
...
fsck.reiser4 could then be run to properly fix the errors.
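The "[2/1] [U_]" status in the second listing is what marks an array that is running with a member missing or still being rebuilt. As a rough sketch only, a script could scan an mdstat-style file for such arrays. The function name degraded_arrays is made up for illustration, and the sketch assumes the array name and the status flags appear on the lines shown in the listing above:

```shell
#!/bin/sh
# Sketch: print the name of every md array whose status flags contain
# an underscore (e.g. "[U_]"), i.e. one with a missing or rebuilding member.
# degraded_arrays is a hypothetical helper, not part of mdadm.
degraded_arrays() {
    # $1: mdstat-format file to scan (defaults to /proc/mdstat)
    awk '
        # Remember the array name from lines like "md0 : active raid1 ..."
        /^md[^ ]* :/ { name = $1 }
        # Status flags use only U and _; an underscore marks a missing member
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads /proc/mdstat; passing a saved copy of the listing lets the same check be run offline.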
I checked several other systems that I admin, and after re-syncing the
mirrored partitions on each system, errors were found on their
filesystems.
It would be nice if, in the same way that the kernel can hot-add disks
to the mirror and copy the data across in the background, it could
also be told to run a background consistency check on the raid array,
and report/fix errors as it goes.
Are there any tools to do this or similar?
Although this is not a reiser4 issue, I thought it was important that I
make everyone aware of it.
Regards,
--
Craig Shelley
EMail: craig@microtron.org.uk
Jabber: shell@jabber.earth.li
* Re: A Word of Warning about Linux Software Raid
2006-08-11 18:09 A Word of Warning about Linux Software Raid Craig Shelley
@ 2006-08-11 19:34 ` Adrian Ulrich
2006-08-12 12:33 ` Philippe Gramoullé
0 siblings, 1 reply; 6+ messages in thread
From: Adrian Ulrich @ 2006-08-11 19:34 UTC
To: reiserfs-list
> Eventually the problem was diagnosed to be caused by the data on the two
> mirrored disks not being identical.
I guess you didn't disable the write-cache of your hard drives?
With write-cache enabled (and no UPS) it's somewhat unfair to
blame 'md'..
> It seems that the kernel does not check the integrity of the data on mirrored raid,
It does if you tell the kernel to do so:
# echo check > /sys/block/md0/md/sync_action
Quote from 'Documentation/md.txt'
> check - A full check of redundancy was requested and is
> happening. This reads all block and checks
> them. A repair may also happen for some raid
> levels.
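A minimal sketch of how this could be wrapped in a script, for what it's worth. The function names are made up for illustration, and it assumes the kernel also exposes a mismatch_cnt file next to sync_action and that sync_action reads back "idle" when nothing is running; older kernels may not offer either:

```shell
#!/bin/sh
# Sketch: start an md consistency check and inspect its result.
# Each function takes the array's md sysfs directory as $1 and falls
# back to the md0 path used in this thread. All names are hypothetical.

start_check() {
    # Ask the kernel to read every block and compare the copies.
    echo check > "${1:-/sys/block/md0/md}/sync_action"
}

check_running() {
    # Succeeds while sync_action reports anything other than "idle".
    [ "$(cat "${1:-/sys/block/md0/md}/sync_action")" != "idle" ]
}

report_mismatches() {
    # Print the mismatch counter left behind by the last check.
    echo "mismatch_cnt: $(cat "${1:-/sys/block/md0/md}/mismatch_cnt")"
}
```

A caller would run start_check as root, poll check_running until it fails, and then look at report_mismatches; a non-zero count suggests the mirror halves differ somewhere.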
Regards,
Adrian
--
A. Top posters
Q. What's the most annoying thing on Usenet?
* Re: A Word of Warning about Linux Software Raid
2006-08-11 19:34 ` Adrian Ulrich
@ 2006-08-12 12:33 ` Philippe Gramoullé
2006-08-13 11:20 ` Justin Piszcz
2006-08-13 21:02 ` Craig Shelley
0 siblings, 2 replies; 6+ messages in thread
From: Philippe Gramoullé @ 2006-08-12 12:33 UTC
To: reiserfs-list
Hello,
On Fri, 11 Aug 2006 21:34:50 +0200
Adrian Ulrich <reiser4@blinkenlights.ch> wrote:
| > It seems that the kernel does not check the integrity of the data on mirrored raid,
|
| It does if you tell the kernel to do so:
|
| # echo check > /sys/block/md0/md/sync_action
I have just upgraded from an older kernel to be able to use this functionality, but:
boogie:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid4] [raid6] [multipath]
md0 : active raid0 sde1[0] sdk1[6] sdj1[5] sdi1[4] sdh1[3] sdg1[2] sdf1[1]
501773440 blocks 64k chunks
unused devices: <none>
boogie:~# echo check > /sys/block/md0/md/sync_action
-bash: /sys/block/md0/md/sync_action: Permission denied
boogie:~# mount | grep sys
sysfs on /sys type sysfs (rw)
What am I missing here?
Thanks,
Philippe
* Re: A Word of Warning about Linux Software Raid
2006-08-12 12:33 ` Philippe Gramoullé
@ 2006-08-13 11:20 ` Justin Piszcz
2006-08-13 11:59 ` Philippe Gramoullé
2006-08-13 21:02 ` Craig Shelley
1 sibling, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2006-08-13 11:20 UTC
To: Philippe Gramoullé; +Cc: reiserfs-list
On Sat, 12 Aug 2006, Philippe Gramoullé wrote:
>
> Hello,
>
> On Fri, 11 Aug 2006 21:34:50 +0200
> Adrian Ulrich <reiser4@blinkenlights.ch> wrote:
>
> | > It seems that the kernel does not check the integrity of the data on mirrored raid,
> |
> | It does if you tell the kernel to do so:
> |
> | # echo check > /sys/block/md0/md/sync_action
>
> I have just upgraded from an older kernel to be able to use such a functionality but :
>
> boogie:~# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid5] [raid4] [raid6] [multipath]
> md0 : active raid0 sde1[0] sdk1[6] sdj1[5] sdi1[4] sdh1[3] sdg1[2] sdf1[1]
> 501773440 blocks 64k chunks
>
> unused devices: <none>
>
> boogie:~# echo check > /sys/block/md0/md/sync_action
> -bash: /sys/block/md0/md/sync_action: Permission denied
>
> boogie:~# mount | grep sys
> sysfs on /sys type sysfs (rw)
>
> What am i missing here ?
>
> Thanks,
>
> Philippe
>
There is no parity with raid0, so there should be nothing to check?
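A script could look at the array's level before attempting a check. This is only a sketch: it assumes the "level" file in the array's md sysfs directory, has_redundancy is a made-up helper name, and the list of levels treated as redundant is an assumption rather than anything taken from the kernel source:

```shell
#!/bin/sh
# Sketch: decide whether an md array has any redundancy worth checking.
# has_redundancy is a hypothetical helper; the level list is an assumption.
has_redundancy() {
    # $1: the array's md sysfs directory (defaults to the md0 path)
    case "$(cat "${1:-/sys/block/md0/md}/level")" in
        raid1|raid4|raid5|raid6|raid10) return 0 ;;  # mirrored or parity levels
        *) return 1 ;;                               # raid0, linear, anything else
    esac
}
```

With this in place, something like "has_redundancy && echo check > /sys/block/md0/md/sync_action" would simply skip a raid0 array instead of failing on the write.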
* Re: A Word of Warning about Linux Software Raid
2006-08-13 11:20 ` Justin Piszcz
@ 2006-08-13 11:59 ` Philippe Gramoullé
0 siblings, 0 replies; 6+ messages in thread
From: Philippe Gramoullé @ 2006-08-13 11:59 UTC
To: Justin Piszcz; +Cc: reiserfs-list
Hello Justin,
On Sun, 13 Aug 2006 07:20:32 -0400 (EDT)
Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
| There is no parity with raid0, so there should be nothing to check?
Yes, I forgot that my RAID1 is done in hardware, and that the RAID0 is actually done
in software (yes, funny setup, I know :)
Thanks,
Philippe
* Re: A Word of Warning about Linux Software Raid
2006-08-12 12:33 ` Philippe Gramoullé
2006-08-13 11:20 ` Justin Piszcz
@ 2006-08-13 21:02 ` Craig Shelley
1 sibling, 0 replies; 6+ messages in thread
From: Craig Shelley @ 2006-08-13 21:02 UTC
To: reiserfs-list
On Sat, 2006-08-12 at 14:33 +0200, Philippe Gramoullé wrote:
> Hello,
>
> On Fri, 11 Aug 2006 21:34:50 +0200
> Adrian Ulrich <reiser4@blinkenlights.ch> wrote:
>
> | > It seems that the kernel does not check the integrity of the data on mirrored raid,
> |
> | It does if you tell the kernel to do so:
> |
> | # echo check > /sys/block/md0/md/sync_action
>
> I have just upgraded from an older kernel to be able to use such a functionality but :
I too have just upgraded, from kernel 2.6.14.3 to 2.6.17.8 and from
Debian sarge to etch, and things are much nicer. Debconf even asked how
often/when I would prefer the disks to be checked.
I also set up smartmontools, since this newer kernel supports SMART
on SATA.
Thanks for pointing this out.
Regards,
--
Craig Shelley
EMail: craig@microtron.org.uk
Jabber: shell@jabber.earth.li
Thread overview: 6+ messages
2006-08-11 18:09 A Word of Warning about Linux Software Raid Craig Shelley
2006-08-11 19:34 ` Adrian Ulrich
2006-08-12 12:33 ` Philippe Gramoullé
2006-08-13 11:20 ` Justin Piszcz
2006-08-13 11:59 ` Philippe Gramoullé
2006-08-13 21:02 ` Craig Shelley