* raid1-diseaster on reboot: old version overwrites new version
From: peter pilsl @ 2005-04-02 15:43 UTC (permalink / raw)
To: linux-raid
Two days ago I had a severe server crash due to RAID problems. The whole
thing started with a (homemade) DoS attack on the server. The server
was brought to its knees and had to be reset. After the reboot the server
was working fine and background reconstruction of the mirrors started.
About 30 minutes later the first anomalies occurred: applications
reported missing libraries, filesystem errors (reiserfs) and so on.
It took a while until I recognized what was going on:
the / partition was on a raid1 array, /dev/md2, based on two partitions: hda6 and hdc6.
For some reason the array had apparently been out of sync for over a year, and
hdc6 held an old copy that was now progressively overwriting hda6 and
changing the content of / while the array was running.
I booted from a live CD and discovered that hdc6 was an exact copy from
spring 2004 (easily determined from the content and timestamps of various files
across the system) and that hda6 was not mountable. I ran reiserfsck and had
it rebuild the tree on hda6, but it was too late. All current data was gone.
I had a backup, the server is up again, and my head is still on my shoulders,
but this leaves me with a lot of questions:
* How can the raid be out of sync? I monitor /proc/mdstat at a
5-minute interval and log the content to files. The output was
definitely like:
md2 : active raid1 hdc6[0] hda6[1]
5120000 blocks [2/2] [UU]
over the last year, without a single exception. I just checked the entries
in my watchdog and verified that the watchdog works by removing one
disk. It definitely barks.
* How can the old version overwrite the new version when a raid is out
of sync? This is like a nightmare (and I remember having such a thing happen before).
* What did I do wrong?
The only explanation I can think of is that I had the wrong entry in my
lilo.conf: I had root=/dev/hda6 there instead of root=/dev/md2.
So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
which was started but never had any data written to it. Is this a
possible explanation?
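
The mdstat check described above can be sketched roughly as follows. This is
only an illustration, not the actual watchdog: the function names
array_status and degraded_arrays are made up, the HEALTHY sample is the
output quoted earlier in this message, and the DEGRADED sample is
hypothetical. As far as I understand md, the [n/m] counts and the U/_ flags
only describe which members the driver considers active; nothing in this
output compares the data on the members or shows that a filesystem is
mounted from a raw member partition.

```python
import re

# Sample taken from the mdstat output quoted in this message.
HEALTHY = """md2 : active raid1 hdc6[0] hda6[1]
      5120000 blocks [2/2] [UU]
"""

# Hypothetical sample: the same array with one member missing.
DEGRADED = """md2 : active raid1 hdc6[0]
      5120000 blocks [2/1] [U_]
"""


def array_status(mdstat_text):
    """Return {array: (raid_disks, working_disks, flags)} parsed from mdstat text."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"(md\d+)\s*:", line)
        if head:
            # Remember which array the following status line belongs to.
            current = head.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if counts and current is not None:
            status[current] = (
                int(counts.group(1)),
                int(counts.group(2)),
                counts.group(3),
            )
    return status


def degraded_arrays(mdstat_text):
    """Names of arrays with fewer working members than configured."""
    return sorted(
        name
        for name, (total, working, flags) in array_status(mdstat_text).items()
        if working < total or "_" in flags
    )


if __name__ == "__main__":
    print(array_status(HEALTHY))
    print(degraded_arrays(DEGRADED))
```

A check like this reports nothing for the HEALTHY sample, which matches the
logs described above: the array looked fine the whole time.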
kernel 2.4.24
raidtools-0.90
thnx for any advice,
peter
--
mag. peter pilsl
goldfisch.at
IT-management
tel +43 699 1 3574035
fax +43 699 4 3574035
pilsl@goldfisch.at
* Re: raid1-diseaster on reboot: old version overwrites new version
From: Gordon Henderson @ 2005-04-02 17:27 UTC (permalink / raw)
To: linux-raid
On Sat, 2 Apr 2005, peter pilsl wrote:
> The only explanation to me is, that I had the wrong entry in my
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
> which was started, but never had any data written to it. Is this a
> possible explanation?
It's possible, but I think the root= parameter needs to correspond to
what's in the /etc/fstab file. I'd check that too if it's still possible.
I've no experience with reiser though.
Gordon
* Re: raid1-diseaster on reboot: old version overwrites new version
From: Tim Moore @ 2005-04-02 17:35 UTC (permalink / raw)
To: linux-raid
peter pilsl wrote:
> The only explanation to me is, that I had the wrong entry in my
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
> which was started, but never had any data written to it. Is this a
> possible explanation?
No. The lilo.conf entry just tells the kernel where root is located.
Can you publish your /etc/fstab and fdisk -l output?
* Re: raid1-diseaster on reboot: old version overwrites new version
From: peter pilsl @ 2005-04-02 18:10 UTC (permalink / raw)
To: Tim Moore; +Cc: linux-raid
Tim Moore wrote:
>
>
> peter pilsl wrote:
>
>> The only explanation to me is, that I had the wrong entry in my
>> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
>> So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
>> which was started, but never had any data written to it. Is this a
>> possible explanation?
>
>
> No. The lilo.conf entry just tells the kernel where root is located.
>
> Can you publish your /etc/fstab and fdisk -l output?
>
Thanks. Below are my fstab, the fdisk output for both drives involved,
and my raidtab (which reminds me to change the swap from raid to plain
single partitions).
---------------fstab-------------------
# cat /etc/fstab
/dev/md2 / reiserfs noatime,notail 1 1
/dev/md0 /boot ext2 noatime 1 2
/dev/md3 /var reiserfs noatime,notail 1 2
/dev/md1 swap swap defaults 0 0
/dev/md4 /data reiserfs noatime,notail 1 2
/dev/md5 /backup_cust reiserfs noatime,notail 1 2
/dev/md6 /data2 reiserfs noatime,notail 1 2
/dev/hdc8 /opt_noraid reiserfs noatime,notail 1 2
/dev/hdd7 /opt reiserfs noatime,notail 1 2
none /dev/pts devpts mode=0620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0
#/dev/hdb /mnt/cdrom auto user,iocharset=iso8859-1,exec,codepage=850,ro,noauto 0 0
#/dev/fd0 /mnt/floppy auto user,iocharset=iso8859-1,sync,exec,codepage=850,noauto 0 0
---------------fdisk-------------------
# fdisk -l /dev/hda
Disk /dev/hda: 255 heads, 63 sectors, 7297 cylinders
Units = cylinders of 16065 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 1 3 24066 fd Linux raid autodetect
/dev/hda2 4 67 514080 fd Linux raid autodetect
/dev/hda3 68 7297 58074975 5 Extended
/dev/hda5 68 705 5124703+ fd Linux raid autodetect
/dev/hda6 706 1343 5124703+ fd Linux raid autodetect
/dev/hda7 1344 5168 30724281 fd Linux raid autodetect
/dev/hda8 5169 6443 10241406 fd Linux raid autodetect
/dev/hda9 6444 7297 6859723+ fd Linux raid autodetect
# fdisk -l /dev/hdc
Disk /dev/hdc: 16 heads, 63 sectors, 232581 cylinders
Units = cylinders of 1008 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hdc1 1 48 24160+ fd Linux raid autodetect
/dev/hdc2 49 1069 514584 fd Linux raid autodetect
/dev/hdc3 1070 232581 116682048 5 Extended
/dev/hdc5 1070 11238 5125144+ fd Linux raid autodetect
/dev/hdc6 11239 21407 5125144+ fd Linux raid autodetect
/dev/hdc7 21408 82368 30724312+ fd Linux raid autodetect
/dev/hdc8 82369 232581 75707320+ 83 Linux
---------------raidtab-------------------
# cat /etc/raidtab
# /boot
raiddev /dev/md0
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 3
device /dev/hdc1
raid-disk 0
device /dev/hda1
raid-disk 1
device /dev/hdd1
raid-disk 2
# swap
raiddev /dev/md1
raid-level 0
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hdc2
raid-disk 0
device /dev/hda2
raid-disk 1
# /
raiddev /dev/md2
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hdc6
raid-disk 0
device /dev/hda6
raid-disk 1
# /var
raiddev /dev/md3
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hda5
raid-disk 0
device /dev/hdc5
raid-disk 1
# /data
raiddev /dev/md4
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hda7
raid-disk 0
device /dev/hdc7
raid-disk 1
# /back_customer
raiddev /dev/md5
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hdd5
raid-disk 0
device /dev/hda8
raid-disk 1
# /data2
raiddev /dev/md6
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hdd6
raid-disk 0
device /dev/hda9
raid-disk 1
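
For anyone who wants to cross-check a device name against a raidtab like the
one above, here is a minimal sketch. It assumes only the raiddev and device
keywords seen in this file; the function names parse_raidtab and array_of
are made up for illustration, and the sample text is the md2 stanza from
above.

```python
# Sample: the md2 stanza from the raidtab posted above.
SAMPLE_RAIDTAB = """# /
raiddev /dev/md2
raid-level 1
chunk-size 64k
persistent-superblock 1
nr-raid-disks 2
device /dev/hdc6
raid-disk 0
device /dev/hda6
raid-disk 1
"""


def parse_raidtab(text):
    """Return {array: [member devices]} from raidtab-style text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Drop comments, then split the remaining "keyword value" pair.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        if fields[0] == "raiddev":
            current = fields[1]
            arrays[current] = []
        elif fields[0] == "device" and current is not None:
            arrays[current].append(fields[1])
    return arrays


def array_of(device, arrays):
    """Return the array that lists `device` as a member, or None."""
    for array, devices in arrays.items():
        if device in devices:
            return array
    return None


if __name__ == "__main__":
    arrays = parse_raidtab(SAMPLE_RAIDTAB)
    # /dev/hda6 is the device named in the lilo.conf root= entry.
    print(array_of("/dev/hda6", arrays))
```

With the stanza above, /dev/hda6 (the device from the lilo.conf root= entry)
is reported as a member of /dev/md2, while a plain partition such as
/dev/hdc8 belongs to no array.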
thnx for your help,
peter
--
mag. peter pilsl
goldfisch.at
IT-management
tel +43 699 1 3574035
fax +43 699 4 3574035
pilsl@goldfisch.at
* Re: raid1-diseaster on reboot: old version overwrites new version
From: Neil Brown @ 2005-04-02 22:31 UTC (permalink / raw)
To: peter pilsl; +Cc: linux-raid
On Saturday April 2, pilsl@goldfisch.at wrote:
>
> * What did I do wrong?
>
> The only explanation to me is, that I had the wrong entry in my
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
> which was started, but never had any data written to it. Is this a
> possible explanation?
Yep, this completely explains everything.
/ was *not* on /dev/md2, it was on /dev/hda6 which also happened to be
a part of an unused raid1 array.
After a crash, the raid1 array did a resync copying from hdc6 to
hda6. Very sad. Very good that you had backups.
2.6 won't let you do this: you cannot have a partition in a raid array
and mounted as a filesystem at the same time.
NeilBrown
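
The situation described in this reply (a partition that is both an array
member and mounted directly) could be spotted on an older kernel with a
check along these lines. This is a rough sketch under stated assumptions:
the function names are made up, the mdstat sample is the one from the
original post, and the mounts line is hypothetical. On some kernels the root
filesystem may appear in /proc/mounts as /dev/root or rootfs rather than by
its partition name, which this sketch would miss.

```python
import re

# mdstat sample from the original post.
SAMPLE_MDSTAT = """md2 : active raid1 hdc6[0] hda6[1]
      5120000 blocks [2/2] [UU]
"""

# Hypothetical /proc/mounts content for the misconfigured system.
SAMPLE_MOUNTS = """/dev/hda6 / reiserfs rw,noatime,notail 0 0
/dev/md3 /var reiserfs rw,noatime,notail 0 0
none /proc proc rw 0 0
"""


def md_members(mdstat_text):
    """Map each member name (e.g. 'hda6') to its array name (e.g. 'md2')."""
    members = {}
    for line in mdstat_text.splitlines():
        head = re.match(r"(md\d+)\s*:\s*(.*)$", line)
        if not head:
            continue
        array, rest = head.groups()
        # Member tokens look like 'hdc6[0]'.
        for dev in re.findall(r"(\S+?)\[\d+\]", rest):
            members[dev] = array
    return members


def mounted_members(mdstat_text, mounts_text):
    """Return (device, mount point, array) for every mounted array member."""
    members = md_members(mdstat_text)
    clashes = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith("/dev/"):
            continue
        name = fields[0][len("/dev/"):]
        if name in members:
            clashes.append((fields[0], fields[1], members[name]))
    return clashes


if __name__ == "__main__":
    for device, mount_point, array in mounted_members(SAMPLE_MDSTAT, SAMPLE_MOUNTS):
        print(f"{device} is mounted on {mount_point} but is a member of {array}")
```

On a live system the two strings would be read from /proc/mdstat and
/proc/mounts instead of the samples.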
* Re: raid1-diseaster on reboot: old version overwrites new version
From: Doug Ledford @ 2005-04-04 19:39 UTC (permalink / raw)
To: Tim Moore; +Cc: linux-raid
On Sat, 2005-04-02 at 09:35 -0800, Tim Moore wrote:
>
> peter pilsl wrote:
> > The only explanation to me is, that I had the wrong entry in my
> > lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> > So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
> > which was started, but never had any data written to it. Is this a
> > possible explanation?
>
> No. The lilo.conf entry just tells the kernel where root is located.
Yes, as Neil posted, this exactly explains the issue. If /dev/hda6 is
part of a raid1 array, and you write to it instead of /dev/md2, then
those writes are never sent to /dev/hdc6 and the two devices get out of
sync. Plus, standard initrd setups and the like are written to
accommodate users passing in arbitrary root= options on the kernel
command line to override the default root partition, and in those
situations the root partition must be taken from the command line and
not from fstab in order for this to work. So, whether it's lilo or grub
or whatever, the root= line on your kernel command line is *the*
authority when it comes to what will be mounted as the root partition
you actually use.
> Can you publish your /etc/fstab and fdisk -l output?
Keep in mind the root partition is already mounted in ro mode by the
time fstab is available and the rc.sysinit script merely remounts it rw.
Again, the command line is the authority.
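
Since the command line and fstab can disagree, as they did here, a simple
sanity check is to compare the two. The following is a minimal sketch, not
an established tool: the function names are made up, the command line is a
hypothetical example, and the fstab excerpt is taken from the one posted in
this thread. It assumes a single root= token given as a device path; a boot
loader may pass the root device in another form (for example as a device
number), which this sketch does not handle.

```python
# Hypothetical kernel command line for the misconfigured system.
SAMPLE_CMDLINE = "auto BOOT_IMAGE=linux ro root=/dev/hda6"

# Excerpt from the fstab posted in this thread.
SAMPLE_FSTAB = """/dev/md2 / reiserfs noatime,notail 1 1
/dev/md0 /boot ext2 noatime 1 2
/dev/md3 /var reiserfs noatime,notail 1 2
"""


def root_from_cmdline(cmdline):
    """Return the value of the first root= token, or None if absent."""
    for token in cmdline.split():
        if token.startswith("root="):
            return token[len("root="):]
    return None


def root_from_fstab(fstab_text):
    """Return the device that fstab lists for the / mount point, or None."""
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[1] == "/":
            return fields[0]
    return None


def check_root(cmdline, fstab_text):
    """Return (cmdline root, fstab root, whether they agree)."""
    boot_root = root_from_cmdline(cmdline)
    fstab_root = root_from_fstab(fstab_text)
    return boot_root, fstab_root, boot_root == fstab_root


if __name__ == "__main__":
    boot_root, fstab_root, same = check_root(SAMPLE_CMDLINE, SAMPLE_FSTAB)
    if not same:
        print(f"kernel was told root={boot_root}, but fstab lists {fstab_root} for /")
```

On a running system the inputs would come from /proc/cmdline and /etc/fstab.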
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford