* A Word of Warning about Linux Software Raid
@ 2006-08-11 18:09 Craig Shelley
From: Craig Shelley @ 2006-08-11 18:09 UTC (permalink / raw)
  To: reiserfs-list

Hi all,

I have a little story that made me learn some very important lessons
about Linux Software Raid1 (Mirroring).

A local power outage caused my system to turn off in a very rough way.
The power didn't cleanly go off, instead it toggled on and off a few
times quickly before finally staying off.

When the power was restored my reiser4 partitions were a bit poorly, and
required some attention with fsck.reiser4.

Ever since this event, reiser4 warnings have often been displayed on the
console on unmount when shutting down/rebooting. Each time I saw the
messages, I ran fsck.reiser4 which sometimes resulted in errors being
found and fixed. Not knowing what partition was causing the problem was
a bit annoying since I have 4 reiser4 partitions.

Yesterday, after running fsck.reiser4, the system would no longer boot.
Further runs of fsck.reiser4 would sometimes find more errors, and a few
minutes later would find none at all. At this point I began to wonder
if my SATA controller had gone faulty, since the hardware appeared to
behave differently from one moment to the next.

Eventually the problem was diagnosed: the data on the two mirrored disks
was not identical. It seems that the kernel does not compare the two
copies when reading from mirrored raid; each read is served from one
disk or the other, so what you get back is a "mix" of data from both
disks. Over time bad shutdowns/crashes lead to differences between the
data on the two mirrored disks, and this can eventually have
catastrophic consequences.
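
One rough way to confirm that two mirror members have drifted apart is
to compare them byte for byte. The following is only a sketch:
mirror_halves_match is a hypothetical helper, it relies on the -n option
of GNU cmp, and it assumes the array is not being written to while the
comparison runs. The byte limit is there so that the comparison can stop
before any per-disk raid metadata, which is expected to differ.

```shell
# Hypothetical helper: succeed only if the first LIMIT bytes of two
# mirror members are byte-for-byte identical.
mirror_halves_match() {
    member_a=$1
    member_b=$2
    limit=$3
    # cmp -s prints nothing and reports the result via its exit status;
    # -n (GNU cmp) stops the comparison after LIMIT bytes.
    cmp -s -n "$limit" "$member_a" "$member_b"
}
```

For the md0 array shown further down this might be called as
mirror_halves_match /dev/hda1 /dev/hdc1 $((4883648 * 1024)), assuming
the block count reported in /proc/mdstat is in units of 1 KiB.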


I re-synced the disks using the following commands:  (let me know if
there is a nicer way)

prometheus:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      4883648 blocks [2/2] [UU]
...

prometheus:~# mdadm --manage --fail /dev/md0 /dev/hdc1
mdadm: set /dev/hdc1 faulty in /dev/md0
prometheus:~# mdadm --manage --remove /dev/md0 /dev/hdc1
mdadm: hot removed /dev/hdc1
prometheus:~# mdadm --manage --add /dev/md0 /dev/hdc1
mdadm: hot added /dev/hdc1

prometheus:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1]
md0 : active raid1 hdc1[2] hda1[0]
      4883648 blocks [2/1] [U_]
      [====>................]  recovery = 22.4% (1098368/4883648) finish=3.0min speed=20364K/sec
...


fsck.reiser4 could then be run to properly fix the errors.
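
The three mdadm steps above could be wrapped in a small function so the
same sequence can be repeated on other arrays. This is a sketch only:
run and resync_member are hypothetical names, and the DRY_RUN switch is
an addition for previewing the commands. Note that the array runs
degraded while the rebuild is in progress, and the member that stays in
the array is treated as the good copy.

```shell
# Hypothetical wrapper around the fail/remove/add sequence.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

resync_member() {
    array=$1
    member=$2
    # Stop at the first step that fails so a half-finished sequence
    # is noticed rather than silently continued.
    run mdadm --manage --fail "$array" "$member" &&
    run mdadm --manage --remove "$array" "$member" &&
    run mdadm --manage --add "$array" "$member"
}
```

With DRY_RUN=1 set, resync_member /dev/md0 /dev/hdc1 only prints the
three mdadm command lines, which is a cheap way to check the device
names before running it for real.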

I checked several other systems that I admin, and after re-syncing the
mirrored partitions on each system, errors were found on their
filesystems. 

It would be nice if, in the same way that the kernel can hot-add a disk
to the mirror and copy the data across in the background, it could also
be told to run a background consistency check on the raid array and
report/fix errors as it goes.
Are there any tools to do this or similar?
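
For what it is worth, more recent md drivers expose a sync_action file
and a mismatch_cnt file in the array's sysfs directory, and writing
"check" to sync_action requests a background pass of this kind. Whether
a given kernel provides these files is an assumption to verify on each
system. The helper names below are hypothetical, and the directory is
passed in as an argument rather than hard-coded.

```shell
# Hypothetical helpers. The argument is the array's md sysfs directory,
# e.g. /sys/block/md0/md on kernels that provide it.
md_start_check() {
    md_sysfs=$1
    # Writing "check" asks md to read every member in the background
    # and count the regions where the copies disagree.
    echo check > "$md_sysfs/sync_action"
}

md_mismatch_count() {
    md_sysfs=$1
    # A non-zero value after the check has finished suggests that the
    # mirror members are not identical.
    cat "$md_sysfs/mismatch_cnt"
}
```

Progress of the check shows up in /proc/mdstat in the same way as the
recovery above; the count is only meaningful once the pass has finished.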

Although this is not a reiser4 issue, I thought it was important that I
make everyone aware of it.

Regards,

-- 
Craig Shelley
EMail: craig@microtron.org.uk
Jabber: shell@jabber.earth.li
