raid1-diseaster on reboot: old version overwrites new version

linux-raid.vger.kernel.org archive mirror

* raid1-diseaster on reboot: old version overwrites new version
@ 2005-04-02 15:43 peter pilsl
  2005-04-02 17:27 ` Gordon Henderson
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: peter pilsl @ 2005-04-02 15:43 UTC (permalink / raw)
  To: linux-raid

Two days ago I had a severe server crash due to raid problems. The whole 
thing started with a (homemade) DoS attack on the server. The server 
went to its knees and needed to be reset. After the reboot the server 
was working fine and background reconstruction of the mirrors started.
About 30 minutes later the first anomalies occurred. Applications 
reported missing libraries, fs errors (reiserfs) and so on.
It took a while until I recognized what was going on:

the /-partition was on a raid1 - /dev/md2 - based on two disks : hda6+hdc6.

For some reason the raid seems to have been out of sync for over a year, 
and hdc6 held an old copy that was now successively overwriting hda6 and 
changing the content of / while the raid was running.
I booted with a live CD and discovered that hdc6 was an exact copy from 
spring 2004 (easily found out from the content and timestamps of various 
files across the system) and hda6 was not mountable. I ran reiserfsck and 
had the tree rebuilt on hda6, but it was too late. All current data was gone.

I had a backup, the server is up again and my head is still on my 
shoulders, but it leaves me with a lot of questions:

* How can the raid be out of sync? I monitor /proc/mdstat at a 
5-minute interval and log the content to files. The output was 
definitely like:

md2 : active raid1 hdc6[0] hda6[1]
       5120000 blocks [2/2] [UU]

over the last year without a single exception. I just tested the entries 
in my watchdog and checked functionality of the watchdog by removing one 
disk. It definitely barks.

* How can, in the case of an unsynced raid, the old version overwrite the 
new version? This is like a nightmare (and I remember having such a thing 
before).

* What did I do wrong?

The only explanation to me is that I had the wrong entry in my 
lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2.
So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
which was started, but never had any data written to it. Is this a 
possible explanation?
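
A status line of [UU] only says that both members are attached to the
array; it does not say which device the filesystem is actually mounted
from. As a rough illustration of the kind of check such a watchdog
performs, here is a minimal Python sketch that parses /proc/mdstat-style
text and flags arrays with a missing member. The function names and the
embedded sample are assumptions made for illustration, not the actual
watchdog used on this server.

```python
import re

# Sample text in the format quoted above; a real check would read /proc/mdstat.
SAMPLE = """md2 : active raid1 hdc6[0] hda6[1]
       5120000 blocks [2/2] [UU]
"""


def parse_mdstat(text):
    """Return {array: {"members": [...], "status": "UU"}} for each md array."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+)\s*:\s*active\s+\S+\s+(.*)$", line)
        if head:
            current = head.group(1)
            # Member names look like "hdc6[0]"; keep only the device part.
            members = re.findall(r"(\w+)\[\d+\]", head.group(2))
            arrays[current] = {"members": members, "status": None}
            continue
        # The status field is the bracket holding only U and _ characters.
        status = re.search(r"\[([U_]+)\]", line)
        if status and current:
            arrays[current]["status"] = status.group(1)
            current = None
    return arrays


def degraded(arrays):
    """Names of arrays whose status shows a missing member ("_")."""
    return [name for name, info in arrays.items()
            if info["status"] and "_" in info["status"]]


print(parse_mdstat(SAMPLE))
print("degraded:", degraded(parse_mdstat(SAMPLE)))
```

For the output quoted above this reports no degraded array, which matches
what the logs showed for the whole year.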

kernel 2.4.24
raidtools-0.90

thnx for any advice,
peter

-- 
mag. peter pilsl
goldfisch.at
IT-management
tel +43 699 1 3574035
fax +43 699 4 3574035
pilsl@goldfisch.at


* Re: raid1-diseaster on reboot: old version overwrites new version
  2005-04-02 15:43 raid1-diseaster on reboot: old version overwrites new version peter pilsl
@ 2005-04-02 17:27 ` Gordon Henderson
  2005-04-02 17:35 ` Tim Moore
  2005-04-02 22:31 ` Neil Brown
  2 siblings, 0 replies; 6+ messages in thread
From: Gordon Henderson @ 2005-04-02 17:27 UTC (permalink / raw)
  To: linux-raid

On Sat, 2 Apr 2005, peter pilsl wrote:

> The only explanation to me is that I had the wrong entry in my
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2,
> which was started, but never had any data written to it. Is this a
> possible explanation?

It's possible - but I think the root= parameter needs to correspond to
what's in the /etc/fstab file - I'd check that too if it's still possible.
I've no experience with reiser though.

Gordon


* Re: raid1-diseaster on reboot: old version overwrites new version
  2005-04-02 15:43 raid1-diseaster on reboot: old version overwrites new version peter pilsl
  2005-04-02 17:27 ` Gordon Henderson
@ 2005-04-02 17:35 ` Tim Moore
  2005-04-02 18:10   ` peter pilsl
  2005-04-04 19:39   ` Doug Ledford
  2005-04-02 22:31 ` Neil Brown
  2 siblings, 2 replies; 6+ messages in thread
From: Tim Moore @ 2005-04-02 17:35 UTC (permalink / raw)
  To: linux-raid



peter pilsl wrote:
> The only explanation to me is that I had the wrong entry in my 
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
> which was started, but never had any data written to it. Is this a 
> possible explanation?

No.  The lilo.conf entry just tells the kernel where root is located.

Can you publish your /etc/fstab and fdisk -l output?

-- 



* Re: raid1-diseaster on reboot: old version overwrites new version
  2005-04-02 17:35 ` Tim Moore
@ 2005-04-02 18:10   ` peter pilsl
  2005-04-04 19:39   ` Doug Ledford
  1 sibling, 0 replies; 6+ messages in thread
From: peter pilsl @ 2005-04-02 18:10 UTC (permalink / raw)
  To: Tim Moore; +Cc: linux-raid

Tim Moore wrote:
> 
> 
> peter pilsl wrote:
> 
>> The only explanation to me is that I had the wrong entry in my 
>> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
>> So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
>> which was started, but never had any data written to it. Is this a 
>> possible explanation?
> 
> 
> No.  The lilo.conf entry just tells the kernel where root is located.
> 
> Can you publish your /etc/fstab and fdisk -l output?
> 

thnx. Following are my fstab, the fdisk output for both involved drives 
and my raidtab (which reminds me to change the swap from raid to 
single partitions).

---------------fstab-------------------

# cat /etc/fstab

/dev/md2 / reiserfs noatime,notail 1 1
/dev/md0 /boot ext2 noatime 1 2
/dev/md3 /var reiserfs noatime,notail 1 2
/dev/md1 swap swap defaults 0 0
/dev/md4 /data reiserfs noatime,notail 1 2

/dev/md5 /backup_cust reiserfs noatime,notail 1 2
/dev/md6 /data2 reiserfs noatime,notail 1 2
/dev/hdc8 /opt_noraid reiserfs noatime,notail 1 2

/dev/hdd7 /opt reiserfs noatime,notail 1 2

none /dev/pts devpts mode=0620 0 0
none /dev/shm tmpfs defaults 0 0
none /proc proc defaults 0 0

#/dev/hdb /mnt/cdrom auto user,iocharset=iso8859-1,exec,codepage=850,ro,noauto 0 0
#/dev/fd0 /mnt/floppy auto user,iocharset=iso8859-1,sync,exec,codepage=850,noauto 0 0



---------------fdisk-------------------

# fdisk -l /dev/hda

Disk /dev/hda: 255 heads, 63 sectors, 7297 cylinders
Units = cylinders of 16065 * 512 bytes

    Device Boot    Start       End    Blocks   Id  System
/dev/hda1             1         3     24066   fd  Linux raid autodetect
/dev/hda2             4        67    514080   fd  Linux raid autodetect
/dev/hda3            68      7297  58074975    5  Extended
/dev/hda5            68       705   5124703+  fd  Linux raid autodetect
/dev/hda6           706      1343   5124703+  fd  Linux raid autodetect
/dev/hda7          1344      5168  30724281   fd  Linux raid autodetect
/dev/hda8          5169      6443  10241406   fd  Linux raid autodetect
/dev/hda9          6444      7297   6859723+  fd  Linux raid autodetect

# fdisk -l /dev/hdc

Disk /dev/hdc: 16 heads, 63 sectors, 232581 cylinders
Units = cylinders of 1008 * 512 bytes

    Device Boot    Start       End    Blocks   Id  System
/dev/hdc1             1        48     24160+  fd  Linux raid autodetect
/dev/hdc2            49      1069    514584   fd  Linux raid autodetect
/dev/hdc3          1070    232581 116682048    5  Extended
/dev/hdc5          1070     11238   5125144+  fd  Linux raid autodetect
/dev/hdc6         11239     21407   5125144+  fd  Linux raid autodetect
/dev/hdc7         21408     82368  30724312+  fd  Linux raid autodetect
/dev/hdc8         82369    232581  75707320+  83  Linux

---------------raidtab-------------------

# cat /etc/raidtab
# /boot
raiddev       /dev/md0
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 3
     device    /dev/hdc1
     raid-disk 0
     device    /dev/hda1
     raid-disk 1
     device    /dev/hdd1
     raid-disk 2

# swap
raiddev       /dev/md1
raid-level    0
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hdc2
     raid-disk 0
     device    /dev/hda2
     raid-disk 1

# /
raiddev       /dev/md2
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hdc6
     raid-disk 0
     device    /dev/hda6
     raid-disk 1

# /var
raiddev       /dev/md3
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hda5
     raid-disk 0
     device    /dev/hdc5
     raid-disk 1

#  /data
raiddev       /dev/md4
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hda7
     raid-disk 0
     device    /dev/hdc7
     raid-disk 1


#  /back_customer
raiddev       /dev/md5
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hdd5
     raid-disk 0
     device    /dev/hda8
     raid-disk 1


# /data2
raiddev       /dev/md6
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hdd6
     raid-disk 0
     device    /dev/hda9
     raid-disk 1
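
To see at a glance which array a given partition belongs to, a raidtab
like the one above can be reduced to a simple mapping. The following is a
minimal Python sketch that assumes only the raiddev, raid-level and device
keywords matter; parse_raidtab and array_for are hypothetical helper names,
and the sample reproduces just the md2 stanza.

```python
# The md2 stanza from the raidtab above, used as sample input.
SAMPLE_RAIDTAB = """
# /
raiddev       /dev/md2
raid-level    1
chunk-size    64k
persistent-superblock 1
nr-raid-disks 2
     device    /dev/hdc6
     raid-disk 0
     device    /dev/hda6
     raid-disk 1
"""


def parse_raidtab(text):
    """Map each raiddev to its raid level and list of member devices."""
    arrays = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split()
        if len(parts) < 2:
            continue
        key, value = parts[0], parts[1]
        if key == "raiddev":
            current = value
            arrays[current] = {"level": None, "devices": []}
        elif current is None:
            continue  # ignore keywords that appear before any raiddev
        elif key == "raid-level":
            arrays[current]["level"] = value
        elif key == "device":
            arrays[current]["devices"].append(value)
    return arrays


def array_for(arrays, partition):
    """Return the array that lists the partition as a member, or None."""
    for name, info in arrays.items():
        if partition in info["devices"]:
            return name
    return None


arrays = parse_raidtab(SAMPLE_RAIDTAB)
print(arrays)
print("/dev/hda6 belongs to", array_for(arrays, "/dev/hda6"))
```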



thnx for your help,

peter



-- 
mag. peter pilsl
goldfisch.at
IT-management
tel +43 699 1 3574035
fax +43 699 4 3574035
pilsl@goldfisch.at


* Re: raid1-diseaster on reboot: old version overwrites new version
  2005-04-02 15:43 raid1-diseaster on reboot: old version overwrites new version peter pilsl
  2005-04-02 17:27 ` Gordon Henderson
  2005-04-02 17:35 ` Tim Moore
@ 2005-04-02 22:31 ` Neil Brown
  2 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2005-04-02 22:31 UTC (permalink / raw)
  To: peter pilsl; +Cc: linux-raid

On Saturday April 2, pilsl@goldfisch.at wrote:
> 
> * What did I do wrong?
> 
> The only explanation to me is that I had the wrong entry in my 
> lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
> which was started, but never had any data written to it. Is this a 
> possible explanation?

Yep, this completely explains everything.
/ was *not* on /dev/md2, it was on /dev/hda6 which also happened to be
a part of an unused raid1 array.

After a crash, the raid1 array did a resync copying from hdc6 to
hda6.  Very sad.  Very good that you had backups.

2.6 won't let you do this: you cannot have a partition in a raid array
and mounted as a filesystem at the same time.
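
On a kernel that does not enforce this, the same condition can be looked
for from user space. The following is a minimal Python sketch, assuming the
array membership is already known (for example from raidtab) and that
mounts are listed in /proc/mounts format. The helper names and the sample
mount table are hypothetical; on a real system the root entry may appear
under another name such as /dev/root, in which case the root= value from
the kernel command line would have to be checked instead.

```python
# Array membership as described in the thread (md2 and md3 only).
ARRAYS = {
    "/dev/md2": ["/dev/hdc6", "/dev/hda6"],
    "/dev/md3": ["/dev/hda5", "/dev/hdc5"],
}

# Hypothetical mount table in /proc/mounts format for the misconfigured state.
MOUNTS = """/dev/hda6 / reiserfs rw,noatime,notail 0 0
/dev/md3 /var reiserfs rw,noatime,notail 0 0
none /proc proc rw 0 0
"""


def mounted_devices(mounts_text):
    """Return {device: mountpoint} for /dev/* entries in the mount table."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            result[fields[0]] = fields[1]
    return result


def members_mounted_directly(array_members, mounts_text):
    """List (member, mountpoint, array) for members mounted outside md."""
    mounted = mounted_devices(mounts_text)
    conflicts = []
    for array, members in array_members.items():
        for dev in members:
            if dev in mounted:
                conflicts.append((dev, mounted[dev], array))
    return conflicts


for dev, where, array in members_mounted_directly(ARRAYS, MOUNTS):
    print(f"{dev} is mounted on {where} but is also a member of {array}")
```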

NeilBrown


* Re: raid1-diseaster on reboot: old version overwrites new version
  2005-04-02 17:35 ` Tim Moore
  2005-04-02 18:10   ` peter pilsl
@ 2005-04-04 19:39   ` Doug Ledford
  1 sibling, 0 replies; 6+ messages in thread
From: Doug Ledford @ 2005-04-04 19:39 UTC (permalink / raw)
  To: Tim Moore; +Cc: linux-raid

On Sat, 2005-04-02 at 09:35 -0800, Tim Moore wrote:
> 
> peter pilsl wrote:
> > The only explanation to me is that I had the wrong entry in my 
> > lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
> > So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
> > which was started, but never had any data written to it. Is this a 
> > possible explanation?
> 
> No.  The lilo.conf entry just tells the kernel where root is located.

Yes, as Neil posted, this exactly explains the issue.  If /dev/hda6 is
part of a raid1 array, and you write to it instead of /dev/md2, then
those writes are never sent to /dev/hdc6 and the two devices get out of
sync.  Plus, standard initrd setups and the like are written to
accommodate users passing in arbitrary root= options on the kernel
command line to override the default root partition, and in those
situations the root partition must be taken from the command line and
not from fstab in order for this to work.  So, whether it's lilo or grub
or whatever, the root= line on your kernel command line is *the*
authority when it comes to what will be mounted as the root partition
you actually use.

> Can you publish your /etc/fstab and fdisk -l output?

Keep in mind the root partition is already mounted in ro mode by the
time fstab is available and the rc.sysinit script merely remounts it rw.
Again, the command line is the authority.
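
Since the two sources can silently disagree, a simple consistency check is
to compare them. Below is a minimal Python sketch with hypothetical helper
names and sample strings. It assumes the boot loader passes root= as a
device path and that a later root= token overrides an earlier one; some
boot loaders may pass the device as a number instead, which this sketch
does not handle.

```python
# Hypothetical kernel command line, as it might appear in /proc/cmdline.
CMDLINE = "auto BOOT_IMAGE=linux ro root=/dev/hda6"

# Root entry as posted earlier in the thread.
FSTAB = """/dev/md2 / reiserfs noatime,notail 1 1
/dev/md0 /boot ext2 noatime 1 2
"""


def root_from_cmdline(cmdline):
    """Return the value of the last root= token, or None if absent."""
    found = None
    for token in cmdline.split():
        if token.startswith("root="):
            found = token[len("root="):]  # assumption: later tokens win
    return found


def root_from_fstab(fstab_text):
    """Return the device listed for the / mount point, or None."""
    for raw in fstab_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2 and fields[1] == "/":
            return fields[0]
    return None


def check_root(cmdline, fstab_text):
    """Return (booted_root, fstab_root, whether_they_match)."""
    booted = root_from_cmdline(cmdline)
    listed = root_from_fstab(fstab_text)
    return booted, listed, booted == listed


booted, listed, same = check_root(CMDLINE, FSTAB)
if not same:
    print(f"root mismatch: kernel uses {booted}, fstab lists {listed}")
```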

-- 
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
