linux-raid.vger.kernel.org archive mirror
* Re: raid1: All my data completely vanished into the void
  2005-12-28  4:40 Mitchell Laks
@ 2005-12-28  4:33 ` Mike Hardy
  2005-12-28  5:17   ` Mitchell Laks
  2005-12-28  4:50 ` Ross Vandegrift
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Hardy @ 2005-12-28  4:33 UTC (permalink / raw)
  To: Mitchell Laks, linux-raid


Mitchell Laks wrote:
> Hi,
> 
> I just set up a new server running stable Debian Sarge, fresh install,
> mdadm 1.9.0,  linux kernel 2.6.8-2 
> and all my (test) data has vanished into the ether. 
> 
> What did I do wrong? I don't want this to happen in real life :).


[ description of sane-looking md setup snipped ]

> Did I screw up everything by not unmounting the devices? What happened?

Failure to unmount the devices should have resulted, at worst, in a few
lost writes, but a journalling filesystem would have kept things
consistent.

Failure to nicely stop the arrays could also have degraded the arrays if
they were being written to as the machine went down, but they should
still be there.

In short: it *definitely* should *not* have come back with what you saw.
I have no clue why this happened; all I can say is that I use md on
around 10 business-critical production machines in exactly the way you
describe, and I have not seen this (though I've seen the failure modes
I described!).

Are you positively sure that nothing else weird could have been going
on? Some layer that remaps drive names? funky hardware? write caching
capable of holding 3GB before writing out? Write caching of some sort is
actually my best guess.
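
If you want to rule the drive cache in or out, something like this might
help (a rough sketch only -- hdparm's cache ioctls may or may not reach
SATA drives through libata on /dev/sd*, so treat the output as a hint):

hdparm -W /dev/sda     # query the drive's write-cache setting
hdparm -W0 /dev/sda    # temporarily turn write caching off for a test
sync                   # flush dirty buffers before the next shutdown

The kernel also logs what it thinks the cache mode is at probe time
("drive cache: write back" vs. "write through" in dmesg).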

If there's no extra data to explain this, I'm at a loss.

Anyone else?

-Mike


* raid1: All my data completely vanished into the void
@ 2005-12-28  4:40 Mitchell Laks
  2005-12-28  4:33 ` Mike Hardy
  2005-12-28  4:50 ` Ross Vandegrift
  0 siblings, 2 replies; 10+ messages in thread
From: Mitchell Laks @ 2005-12-28  4:40 UTC (permalink / raw)
  To: linux-raid

Hi,

I just set up a new server running stable Debian Sarge, fresh install,
mdadm 1.9.0,  linux kernel 2.6.8-2 
and all my (test) data has vanished into the ether. 

What did I do wrong? I don't want this to happen in real life :).

I have 6x 400Gb Western Digital sata drives attached to  sata controllers. One 
pair of channels  is on the motherboard using VIA VT8237 controller 
and 2 highpoint 1520 sata cards give 4 other channels.

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf

I partitioned each hard drive completely as 1 partition, type fd
(i.e. fdisk, then "n p 1 enter t fd w q"):

/dev/sda1
/dev/sdb1
/dev/sdc1
/dev/sdd1
/dev/sde1
/dev/sdf1

I then created 3 raid1 arrays
mdadm -Cv -n2 -l1 /dev/md0 /dev/sda1 /dev/sdb1
mdadm -Cv -n2 -l1 /dev/md1 /dev/sdc1 /dev/sdd1
mdadm -Cv -n2 -l1 /dev/md2 /dev/sde1 /dev/sdf1

I then formatted
mkfs.ext3 /dev/md0
mkfs.ext3 /dev/md1
mkfs.ext3 /dev/md2

I then mounted
mount /dev/md0 /home/big0
mount /dev/md1 /home/big1
mount /dev/md2 /home/big2

I then waited until all syncing was done.
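
(For anyone following along: resync progress shows up in /proc/mdstat,
e.g.

cat /proc/mdstat               # shows "resync = NN.N% ..." while rebuilding
watch -n 5 cat /proc/mdstat    # or poll it every few seconds

and the arrays are in sync once those progress lines disappear.)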

Then I copied 3 GB of data to the 3 arrays from another machine over the
network via rsync.


I did df -h and indeed the data was there. I cd'd into the directories and
the data was there.

Then I did 
shutdown -h now

however I did not!! (?evil) umount the different devices 
/dev/md0 /dev/md1 /dev/md2 before doing that.
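
For reference, I assume the polite sequence would have been something like
this (a sketch; a clean shutdown's init scripts should do the equivalent
anyway):

umount /home/big0 /home/big1 /home/big2
mdadm -S /dev/md0    # --stop the arrays
mdadm -S /dev/md1
mdadm -S /dev/md2
shutdown -h now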

Now I reboot into Linux.

I try to assemble the arrays:
mdadm -A /dev/md0 /dev/sda1 /dev/sdb1

and I get nothing. mdadm --detail --scan gives me nothing at all either.

I do fdisk -l

and there is no trace of any of the partitions. It is as if there are no
partitions /dev/sdX1, where X is a-f.
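
A lower-level check, in case it helps: an intact DOS partition table lives
in the first 512 bytes of the disk and ends with the 55 aa signature at
offset 0x1fe, so dumping sector 0 raw shows whether anything is left there
(just a sketch -- I have not actually run this yet):

dd if=/dev/sda bs=512 count=1 2>/dev/null | hexdump -C | tail -3
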
I do 
fdisk /dev/sda

I get message:

Device contains neither a valid DOS partition table, nor Sun, SGI or OSF 
disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.


The number of cylinders for this disk is set to 48641.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Typing p (print the partition table) shows nothing!!

??????????????????????????????

Did I screw up everything by not unmounting the devices? What happened?

  
Thanks,
Mitchell Laks


* Re: raid1: All my data completely vanished into the void
  2005-12-28  4:40 Mitchell Laks
  2005-12-28  4:33 ` Mike Hardy
@ 2005-12-28  4:50 ` Ross Vandegrift
  1 sibling, 0 replies; 10+ messages in thread
From: Ross Vandegrift @ 2005-12-28  4:50 UTC (permalink / raw)
  To: Mitchell Laks; +Cc: linux-raid

On Tue, Dec 27, 2005 at 11:40:48PM -0500, Mitchell Laks wrote:
> I then created 3 raid1 arrays
> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda1 /dev/sdb1
> mdadm -Cv -n2 -l1 /dev/md1 /dev/sdc1 /dev/sdd1
> mdadm -Cv -n2 -l1 /dev/md2 /dev/sde1 /dev/sdf1

Are you 100% sure that you didn't do:

mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb, etc, etc?
(ie, note lack of subdevice for partition!!!)

That's the only thing I can imagine that would cause this:

> Device contains neither a valid DOS partition table, nor Sun, SGI or OSF 
> disklabel

It sounds like you created md devices out of whole disks instead of
partitions and overwrote the partition information.  I don't *think*
this should be a problem, but I don't 100% know...

Try this to see if it finds your array:

mdadm --assemble /dev/md0 /dev/sda /dev/sdb
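
You can also ask mdadm where it actually sees superblocks -- if --examine
reports one on the whole disk but nothing on the partition device (assuming
the /dev/sdX1 nodes even still exist), that would confirm it:

mdadm -E /dev/sda     # examine the whole disk for an md superblock
mdadm -E /dev/sda1    # compare against the partition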

> however I did not!! (?evil) umount the different devices 
> /dev/md0 /dev/md1 /dev/md2 before doing that.

When you shut down cleanly, your system's scripts almost certainly did
that for you! :-)

-- 
Ross Vandegrift
ross@lug.udel.edu

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37


* Re: raid1: All my data completely vanished into the void
  2005-12-28  4:33 ` Mike Hardy
@ 2005-12-28  5:17   ` Mitchell Laks
  0 siblings, 0 replies; 10+ messages in thread
From: Mitchell Laks @ 2005-12-28  5:17 UTC (permalink / raw)
  To: linux-raid

On Tuesday 27 December 2005 11:33 pm, you wrote:

> In short: it *definitely* should *not* have come back with what you saw.
> I have no clue why this happened; all I can say is that I use md on
> around 10 business-critical production machines in exactly the way you
> describe, and I have not seen this (though I've seen the failure modes
> I described!).
I myself have been using md, mdadm for more than a year in multiple machines. 
I am at a loss. The only difference is now SATA drives with SATA controllers.

>
> Are you positively sure that nothing else weird could have been going
> on? Some layer that remaps drive names? funky hardware? write caching
> capable of holding 3GB before writing out? Write caching of some sort is
> actually my best guess.
9.8 GB actually! The same 3.2 GB was sent via rsync to each of the 3 RAID1
arrays, to really give them a workout.

The only kicker is that this setup uses SATA disks with the kernel SATA
drivers as modules. I have always used PATA drives before.

 

I get no kernel error messages at all.
dmesg from the reboot tells me the following.

First, for the modules that come with Linux, and the devices:

libata version 1.02 loaded.
sata_via version 0.20
ACPI: PCI interrupt 0000:00:0f.0[B] -> GSI 20 (level, low) -> IRQ 185
sata_via(0000:00:0f.0): routed to hard irq line 10
ata1: SATA max UDMA/133 cmd 0xC000 ctl 0xB802 bmdma 0xA800 irq 185
ata2: SATA max UDMA/133 cmd 0xB400 ctl 0xB002 bmdma 0xA808 irq 185
ata1: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 
88:407f
ata1: dev 0 ATA, max UDMA/133, 781422768 sectors: lba48
ata1: dev 0 configured for UDMA/133
scsi1 : sata_via
ata2: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 
88:407f
ata2: dev 0 ATA, max UDMA/133, 781422768 sectors: lba48
ata2: dev 0 configured for UDMA/133
scsi2 : sata_via
  Vendor: ATA       Model: WDC WD4000YR-01P  Rev: 01.0
 Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
SCSI device sde: drive cache: write back
 /dev/scsi/host1/bus0/target0/lun0: unknown partition table
Attached scsi disk sde at scsi1, channel 0, id 0, lun 0
  Vendor: ATA       Model: WDC WD4000YR-01P  Rev: 01.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdf: 781422768 512-byte hdwr sectors (400088 MB)
SCSI device sdf: drive cache: write back
 /dev/scsi/host2/bus0/target0/lun0: unknown partition table
Attached scsi disk sdf at scsi2, channel 0, id 0, lun 0

Notice the "unknown partition table" messages above.

Then the dmesg output for the devices attached to the Highpoint RocketRAID
1520, which uses the proprietary hpt37x2.ko kernel module that I added to
the initrd and to /lib/modules/`uname -r`/kernel/drivers/ide or wherever:

hpt37x2: no version for "scsi_remove_host" found: kernel tainted.
HPT37x2 RAID Controller driver

SCSI device sda: 781422767 512-byte hdwr sectors (400088 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 /dev/scsi/host0/bus0/target0/lun0: p1
SCSI device sda: 781422767 512-byte hdwr sectors (400088 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 /dev/scsi/host0/bus0/target0/lun0: p1
SCSI device sdb: 781422767 512-byte hdwr sectors (400088 MB)
sdb: asking for cache data failed
sdb: assuming drive cache: write through
 /dev/scsi/host0/bus0/target1/lun0: p1
SCSI device sdb: 781422767 512-byte hdwr sectors (400088 MB)
sdb: asking for cache data failed
sdb: assuming drive cache: write through
 /dev/scsi/host0/bus0/target1/lun0: p1
SCSI device sdc: 781422767 512-byte hdwr sectors (400088 MB)
sdc: asking for cache data failed
sdc: assuming drive cache: write through
 /dev/scsi/host0/bus0/target2/lun0: p1
SCSI device sdc: 781422767 512-byte hdwr sectors (400088 MB)
sdc: asking for cache data failed
sdc: assuming drive cache: write through
 /dev/scsi/host0/bus0/target2/lun0: p1
SCSI device sdd: 781422767 512-byte hdwr sectors (400088 MB)
sdd: asking for cache data failed
sdd: assuming drive cache: write through
 /dev/scsi/host0/bus0/target3/lun0: p1
SCSI device sdd: 781422767 512-byte hdwr sectors (400088 MB)
sdd: asking for cache data failed
sdd: assuming drive cache: write through
 /dev/scsi/host0/bus0/target3/lun0: p1

> If there's no extra data to explain this, I'm at a loss.

I wish I knew what else to check. I just bought 16 SATA drives for 2 servers!

>
> Anyone else?
>
> -Mike


* Re: raid1: All my data completely vanished into the void
@ 2005-12-28  5:30 Mitchell Laks
  2005-12-28  5:59 ` Sebastian Kuzminsky
  2005-12-28 21:52 ` Mark Hahn
  0 siblings, 2 replies; 10+ messages in thread
From: Mitchell Laks @ 2005-12-28  5:30 UTC (permalink / raw)
  To: linux-raid

On Tuesday 27 December 2005 11:50 pm, Ross Vandegrift wrote:

> Are you 100% sure that you didn't do:
>
> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb, etc, etc?
> (ie, note lack of subdevice for partition!!!)
> It sounds like you created md devices out of whole disks instead of
> partitions and overwrote the partition information.  I don't *think*
> this should be a problem, but I don't 100% know...
>
> Try this to see if it finds your array:
>
> mdadm --assemble /dev/md0 /dev/sda /dev/sdb

That works! In fact it does indeed find all the data! I then mount it as
/home/big2 and the data I left there before is still there.

You are a genius!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

What an idiot I am. 

What does doing

mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb

do to the partition tables??? 
(And why can I still access the data if I messed up the partitions??? very 
odd).
Can you point me at an explanation of the effects of what I did?

Every way a person can screw up,  it will happen...

Mitchell


* Re: raid1: All my data completely vanished into the void
  2005-12-28  5:30 raid1: All my data completely vanished into the void Mitchell Laks
@ 2005-12-28  5:59 ` Sebastian Kuzminsky
  2005-12-28  6:44   ` Daniel Pittman
  2005-12-28 21:52 ` Mark Hahn
  1 sibling, 1 reply; 10+ messages in thread
From: Sebastian Kuzminsky @ 2005-12-28  5:59 UTC (permalink / raw)
  To: linux-raid

Mitchell Laks <mlaks@verizon.net> wrote:
> What does doing
> 
> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb
> 
> do to the partition tables??? 
> (And why can I still access the data if I messed up the partitions??? very 
> odd).
> Can you point me at an explanation of the effects of what I did?

I'd expect that command to overwrite the partition table with the
MD metadata, or at least put the partition table at risk of being
overwritten later.

No problem, as long as you're aware of it.  That's how I usually set
my RAID disks up - no partition table, just use the whole disk raw.
LVM is a better solution to the partitioning problem.
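
E.g., roughly (a sketch only -- the volume group and LV names here are made
up, and the sizes are whatever you need):

pvcreate /dev/md0
vgcreate bigvg /dev/md0
lvcreate -L 100G -n big0 bigvg
mkfs.ext3 /dev/bigvg/big0
mount /dev/bigvg/big0 /home/big0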


-- 
Sebastian Kuzminsky


* Re: raid1: All my data completely vanished into the void
  2005-12-28  5:59 ` Sebastian Kuzminsky
@ 2005-12-28  6:44   ` Daniel Pittman
  2005-12-28  8:23     ` Max Waterman
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Pittman @ 2005-12-28  6:44 UTC (permalink / raw)
  To: linux-raid

Sebastian Kuzminsky <seb@highlab.com> writes:
> Mitchell Laks <mlaks@verizon.net> wrote:
>> What does doing
>> 
>> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb
>> 
>> do to the partition tables??? 
>> (And why can I still access the data if I messed up the partitions??? very 
>> odd).
>> Can you point me at an explanation of the effects of what I did?
>
> I'd expect that command to overwrite the partition table with the
> MD metadata, or at least put the partition table at risk of being
> overwritten later.

Nope: the MD metadata lives at the end of the disk, not the start, so
your partition table would still be there when the filesystem wrote over
the first block of the disk...

...and, if the partition table lived through that, I guess the
filesystem doesn't use (or respect) that block itself.
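
If you're curious you can eyeball it: with the old 0.90 superblock format
the metadata should sit at the last 64K-aligned offset at least 64K from
the end of the device, starting with the md magic a92b4efc (which shows up
as "fc 4e 2b a9" on a little-endian box).  A rough sketch, assuming
blockdev --getsize reports 512-byte sectors:

SECTORS=$(blockdev --getsize /dev/sda)
SB=$(( (SECTORS & ~127) - 128 ))    # superblock start, in sectors
dd if=/dev/sda bs=512 skip=$SB count=1 2>/dev/null | hexdump -C | head -2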

           Daniel



* Re: raid1: All my data completely vanished into the void
  2005-12-28  6:44   ` Daniel Pittman
@ 2005-12-28  8:23     ` Max Waterman
  2005-12-28 10:34       ` Daniel Pittman
  0 siblings, 1 reply; 10+ messages in thread
From: Max Waterman @ 2005-12-28  8:23 UTC (permalink / raw)
  To: linux-raid

Daniel Pittman wrote:
> Sebastian Kuzminsky <seb@highlab.com> writes:
>> Mitchell Laks <mlaks@verizon.net> wrote:
>>> What does doing
>>>
>>> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb
>>>
>>> do to the partition tables??? 
>>> (And why can I still access the data if I messed up the partitions??? very 
>>> odd).
>>> Can you point me at an explanation of the effects of what I did?
>> I'd expect that command to overwrite the partition table with the
>> MD metadata, or at least put the partition table at risk of being
>> overwritten later.
> 
> Nope: the MD metadata lives at the end of the disk, not the start, so
> your partition table would still be there when the filesystem wrote over
> the first block of the disk...
> 
> ...and, if the partition table lived through that, I guess the
> filesystem doesn't use (or respect) that block itself.

...but, just so I understand: by using the whole disk (i.e. /dev/sda
and not /dev/sda1, etc.), you're telling md to make the whole disk
available to your filesystem (or whatever), including the space normally
used to store the partition table, and so any partition table that
happens to be on the disk(s) is likely to be overwritten.

right?

Max.



* Re: raid1: All my data completely vanished into the void
  2005-12-28  8:23     ` Max Waterman
@ 2005-12-28 10:34       ` Daniel Pittman
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Pittman @ 2005-12-28 10:34 UTC (permalink / raw)
  To: linux-raid

Max Waterman <davidmaxwaterman+gmane@fastmail.co.uk> writes:
> Daniel Pittman wrote:
>> Sebastian Kuzminsky <seb@highlab.com> writes:
>>> Mitchell Laks <mlaks@verizon.net> wrote:
>>>> What does doing
>>>>
>>>> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb
>>>>
>>>> do to the partition tables??? (And why can I still access the data
>>>> if I messed up the partitions??? very odd).
>>>> Can you point me at an explanation of the effects of what I did?
>>> I'd expect that command to overwrite the partition table with the
>>> MD metadata, or at least put the partition table at risk of being
>>> overwritten later.
>> Nope: the MD metadata lives at the end of the disk, not the start, so
>> your partition table would still be there when the filesystem wrote over
>> the first block of the disk...
>> ...and, if the partition table lived through that, I guess the
>> filesystem doesn't use (or respect) that block itself.
>
> ...but, just so I understand: by using the whole disk (i.e. /dev/sda
> and not /dev/sda1, etc.), you're telling md to make the whole disk
> available to your filesystem (or whatever), including the space normally
> used to store the partition table, and so any partition table that
> happens to be on the disk(s) is likely to be overwritten.
>
> right?

Yeah, pretty much.  You lose around 128KB from the end of the disk, but
the MD device should start the data at sector zero, right where the
partition table is.

          Daniel



* Re: raid1: All my data completely vanished into the void
  2005-12-28  5:30 raid1: All my data completely vanished into the void Mitchell Laks
  2005-12-28  5:59 ` Sebastian Kuzminsky
@ 2005-12-28 21:52 ` Mark Hahn
  1 sibling, 0 replies; 10+ messages in thread
From: Mark Hahn @ 2005-12-28 21:52 UTC (permalink / raw)
  To: Mitchell Laks; +Cc: linux-raid

> What does doing
> mdadm -Cv -n2 -l1 /dev/md0 /dev/sda /dev/sdb
> do to the partition tables??? 

Overwrites them.  Then again, there's nothing critical about
partition tables - they're just a convention for slicing up
the disk, not necessary in any sense.

In fact, it can be quite handy to avoid partitions - I have one
cluster where /dev/hda is an ext3 filesystem, mainly to avoid the
BIOS bug of trying to boot the disk (in preference to PXE) even
if there are no active partitions.

The only downside to your "messed up" config is that the kernel won't
autostart the raids.
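
But it's easy enough to have the init scripts assemble them instead: list
the arrays in mdadm.conf (on Debian that's /etc/mdadm/mdadm.conf), either
by hand or from a scan while the arrays are running.  Something like this
(a sketch -- check the appended lines before trusting them):

echo "DEVICE /dev/sd[abcdef]" >> /etc/mdadm/mdadm.conf
mdadm --detail --scan >> /etc/mdadm/mdadm.conf    # appends the ARRAY lines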


