Re: Root Drive Mirroring and LVM.

linux-raid.vger.kernel.org archive mirror

* Re: Root Drive Mirroring and LVM.
       [not found] <868055F9-5045-11D8-8066-00039382032A@mindspring.com>
@ 2004-01-27  8:01 ` Atro Tossavainen
  2004-01-27  9:32   ` Sven Luther
  2004-01-28  1:42   ` Neil Brown
  0 siblings, 2 replies; 19+ messages in thread
From: Atro Tossavainen @ 2004-01-27  8:01 UTC (permalink / raw)
  To: linuxppc-dev, linux-raid; +Cc: tas


Sorry about the crossposting.

I wrote on the Yellow Dog Linux list when somebody asked about software
RAID on YDL about my experiences with it:

>> The one really big gotcha is that the Macintosh partitioning scheme
>> can't tell the Linux kernel that certain partitions are to be
>> considered "Linux RAID autodetect" (as in x86 using the DOS partition
>> table type 0xfd). This means that you can't boot a Mac Linux system
>> directly from RAID because the kernel won't be able to autostart the
>> RAID devices. You have to work around this by creating an initial RAM
>> disk that uses the raidstart command to start your metadevices, then
>> swaps the initrd out of the way and proceeds to start the real system.

to which Tim Seufert replied on the same list:

> Hmmm.  That would seem to be a lack in the Linux RAID code, since the
> Macintosh partition table has a vastly more flexible partition type
> field than DOS: instead of a single byte it's a string.  It would mean
> breaking from the convention of using the "Apple_SVR2_UNIX" type for
> Linux partitions, but that really is just a convention as far as I know.

Perhaps the PPC Linux developers and the Linux RAID developers should
get together on this and make some decisions so as to make it happen.

--
Atro Tossavainen (Mr.)               / The Institute of Biotechnology at
Systems Analyst, Techno-Amish &     / the University of Helsinki, Finland,
+358-9-19158939  UNIX Dinosaur     / employs me, but my opinions are my own.
<URL:http://www.helsinki.fi/%7Eatossava/> NO FILE ATTACHMENTS

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/


* Re: Root Drive Mirroring and LVM.
  2004-01-27  8:01 ` Root Drive Mirroring and LVM Atro Tossavainen
@ 2004-01-27  9:32   ` Sven Luther
  2004-01-29  1:01     ` Tom Vier
  2004-01-28  1:42   ` Neil Brown
  1 sibling, 1 reply; 19+ messages in thread
From: Sven Luther @ 2004-01-27  9:32 UTC (permalink / raw)
  To: Atro.Tossavainen; +Cc: linuxppc-dev, linux-raid, tas


On Tue, Jan 27, 2004 at 10:01:17AM +0200, Atro Tossavainen wrote:
>
> Sorry about the crossposting.
>
> I wrote on the Yellow Dog Linux list when somebody asked about software
> RAID on YDL about my experiences with it:
>
> >> The one really big gotcha is that the Macintosh partitioning scheme
> >> can't tell the Linux kernel that certain partitions are to be
> >> considered "Linux RAID autodetect" (as in x86 using the DOS partition
> >> table type 0xfd). This means that you can't boot a Mac Linux system
> >> directly from RAID because the kernel won't be able to autostart the
> >> RAID devices. You have to work around this by creating an initial RAM
> >> disk that uses the raidstart command to start your metadevices, then
> >> swaps the initrd out of the way and proceeds to start the real system.
>
> to which Tim Seufert replied on the same list:
>
> > Hmmm.  That would seem to be a lack in the Linux RAID code, since the
> > Macintosh partition table has a vastly more flexible partition type
> > field than DOS: instead of a single byte it's a string.  It would mean
> > breaking from the convention of using the "Apple_SVR2_UNIX" type for
> > Linux partitions, but that really is just a convention as far as I know.
>
> Perhaps the PPC Linux developers and the Linux RAID developers should
> get together on this and make some decisions so as to make it happen.

Seems OK to me. Also, I guess there are other partition table formats,
like the Amiga partition table that the Pegasos boxes mostly use, which has
a 32-bit identifier for partition types. I guess it is the task of the
RAID code to do some per-partition-table-type checking for this RAID
autodetect magic.

Friendly,

Sven Luther


* Re: Root Drive Mirroring and LVM.
  2004-01-27  8:01 ` Root Drive Mirroring and LVM Atro Tossavainen
  2004-01-27  9:32   ` Sven Luther
@ 2004-01-28  1:42   ` Neil Brown
  2004-01-28  8:15     ` Sven Luther
  2004-02-04 18:23     ` linas
  1 sibling, 2 replies; 19+ messages in thread
From: Neil Brown @ 2004-01-28  1:42 UTC (permalink / raw)
  To: Atro.Tossavainen; +Cc: linuxppc-dev, linux-raid, tas


On Tuesday January 27, atossava@cc.helsinki.fi wrote:
> Sorry about the crossposting.
>
> I wrote on the Yellow Dog Linux list when somebody asked about software
> RAID on YDL about my experiences with it:
>
> >> The one really big gotcha is that the Macintosh partitioning scheme
> >> can't tell the Linux kernel that certain partitions are to be
> >> considered "Linux RAID autodetect" (as in x86 using the DOS partition
> >> table type 0xfd). This means that you can't boot a Mac Linux system
> >> directly from RAID because the kernel won't be able to autostart the
> >> RAID devices. You have to work around this by creating an initial RAM
> >> disk that uses the raidstart command to start your metadevices, then
> >> swaps the initrd out of the way and proceeds to start the real system.
>

This is not entirely true.  Certainly an initial RAM disk is one
solution and is (I think) the preferred long-term solution. However,
you can also boot from RAID with kernel parameters like:

   md=0,/dev/hda1,/dev/hdc1 root=/dev/md0

where '0' indicates which md device to assemble (md0 in this case), and the
remaining comma-separated words are the devices to assemble it from.
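
To make the parameter format concrete, here is a minimal Python sketch (not kernel code) that splits an `md=` argument of the form shown above into the array number and its component devices. The function name `parse_md_param` is made up for illustration, and the sketch assumes only the simple `md=<n>,<dev>,<dev>,...` form used in this message.

```python
def parse_md_param(arg):
    """Split 'md=<n>,<dev0>,<dev1>,...' into (md device name, component list)."""
    key, _, value = arg.partition("=")
    if key != "md" or not value:
        raise ValueError(f"not an md= parameter: {arg!r}")
    # The first field is the md number; everything after it is a component device.
    number, *devices = value.split(",")
    if not number.isdigit() or not devices:
        raise ValueError(f"expected md=<n>,<dev>,...: {arg!r}")
    return f"/dev/md{int(number)}", devices


if __name__ == "__main__":
    print(parse_md_param("md=0,/dev/hda1,/dev/hdc1"))
```

The point of the format is that the admin, not a scan, names the members of the array.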


> to which Tim Seufert replied on the same list:
>
> > Hmmm.  That would seem to be a lack in the Linux RAID code, since the
> > Macintosh partition table has a vastly more flexible partition type
> > field than DOS: instead of a single byte it's a string.  It would mean
> > breaking from the convention of using the "Apple_SVR2_UNIX" type for
> > Linux partitions, but that really is just a convention as far as I know.
>
> Perhaps the PPC Linux developers and the Linux RAID developers should
> get together on this and make some decisions so as to make it happen.
>

I personally think auto-detect is the wrong approach and have no
desire to extend it to other partition types (I cannot remove it from
DOS partitions as that would break backward compatibility).
Just use "md=..."

NeilBrown


* Re: Root Drive Mirroring and LVM.
  2004-01-28  1:42   ` Neil Brown
@ 2004-01-28  8:15     ` Sven Luther
  2004-02-04 18:23     ` linas
  1 sibling, 0 replies; 19+ messages in thread
From: Sven Luther @ 2004-01-28  8:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Atro.Tossavainen, linuxppc-dev, linux-raid, tas

On Wed, Jan 28, 2004 at 12:42:18PM +1100, Neil Brown wrote:
> 
> On Tuesday January 27, atossava@cc.helsinki.fi wrote:
> > Sorry about the crossposting.
> >
> > I wrote on the Yellow Dog Linux list when somebody asked about software
> > RAID on YDL about my experiences with it:
> >
> > >> The one really big gotcha is that the Macintosh partitioning scheme
> > >> can't tell the Linux kernel that certain partitions are to be
> > >> considered "Linux RAID autodetect" (as in x86 using the DOS partition
> > >> table type 0xfd). This means that you can't boot a Mac Linux system
> > >> directly from RAID because the kernel won't be able to autostart the
> > >> RAID devices. You have to work around this by creating an initial RAM
> > >> disk that uses the raidstart command to start your metadevices, then
> > >> swaps the initrd out of the way and proceeds to start the real system.
> >
> 
> This is not entirely true.  Certainly an initial-ram-disk is one
> solution and is (I think) the preferred long-term solution. However
> you can also boot from raid with kernel-parameters like:
> 
>    md=0,/dev/hda1,/dev/hdc1 root=/dev/md0
> 
> where '0' indicated which md device (md0 in this case), and the
> remaining words are the devices to assemble it from.

This is said to be broken on the latest 2.4.x kernels, though. I haven't
tried it myself, since I only have a single drive and don't really want
to mess with it.

> > to which Tim Seufert replied on the same list:
> >
> > > Hmmm.  That would seem to be a lack in the Linux RAID code, since the
> > > Macintosh partition table has a vastly more flexible partition type
> > > field than DOS: instead of a single byte it's a string.  It would mean
> > > breaking from the convention of using the "Apple_SVR2_UNIX" type for
> > > Linux partitions, but that really is just a convention as far as I know.
> >
> > Perhaps the PPC Linux developers and the Linux RAID developers should
> > get together on this and make some decisions so as to make it happen.
> >
> 
> I personally think auto-detect is the wrong approach and have no
> desire to extend it to other partition types (I cannot remove it from
> DOS partitions as that breaks back-compatability).
> Just use "md=..."

Ok.

Friendly,

Sven Luther


* Re: Root Drive Mirroring and LVM.
  2004-01-27  9:32   ` Sven Luther
@ 2004-01-29  1:01     ` Tom Vier
  2004-01-29  7:22       ` Sven Luther
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Vier @ 2004-01-29  1:01 UTC (permalink / raw)
  To: Sven Luther; +Cc: linuxppc-dev, linux-raid

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Tue, Jan 27, 2004 at 10:32:13AM +0100, Sven Luther wrote:
> Seems ok for me. Also, i guess that there are other partition types,
> like the amiga partitition table the pegasos boxes mostly use, which has
> a 32bit identifier for partition types. I guess it is the task of the
> RAID code to have some per partition type checking for this RAID autodetect
> magic.

it's done in fs/partitions/. it could be made anything, as long as it's put
in the raid docs so people know.

here's a patch of mine for alpha, to give you an idea.

-- 
Tom Vier <tmv@comcast.net>
DSA Key ID 0xE6CB97DA

[-- Attachment #2: part-fstype.diff --]
[-- Type: text/plain, Size: 1388 bytes --]

diff -urN linux-2.4.10-ac7-patched-build/fs/partitions/osf.c linux-2.4.10-ac7-patched-build-osf/fs/partitions/osf.c
--- linux-2.4.10-ac7-patched-build/fs/partitions/osf.c	Sat Oct  6 13:25:48 2001
+++ linux-2.4.10-ac7-patched-build-osf/fs/partitions/osf.c	Sat Oct  6 13:25:20 2001
@@ -7,6 +7,7 @@
  *  Re-organised Feb 1998 Russell King
  */
 
+#include <linux/config.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/kernel.h>
@@ -17,6 +18,10 @@
 #include "check.h"
 #include "osf.h"
 
+#if CONFIG_BLK_DEV_MD
+extern void md_autodetect_dev(kdev_t dev);
+#endif
+
 int osf_partition(struct gendisk *hd, struct block_device *bdev,
 		unsigned long first_sector, int current_minor)
 {
@@ -74,10 +79,16 @@
 	for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {
 		if ((current_minor & mask) == 0)
 		        break;
-		if (le32_to_cpu(partition->p_size))
-			add_gd_partition(hd, current_minor,
-				first_sector+le32_to_cpu(partition->p_offset),
-				le32_to_cpu(partition->p_size));
+		if (le32_to_cpu(partition->p_size)) {
+				add_gd_partition(hd, current_minor,
+					first_sector+le32_to_cpu(partition->p_offset),
+					le32_to_cpu(partition->p_size));
+#if CONFIG_BLK_DEV_MD
+				if (partition->p_fstype == LINUX_RAID_PARTITION) {
+					md_autodetect_dev(MKDEV(hd->major,current_minor));
+				}
+#endif
+		}
 		current_minor++;
 	}
 	printk("\n");
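
The decision the patch adds can be sketched outside the kernel. The Python below is only an illustration of the loop's logic, not the kernel API: `scan_partitions` is a hypothetical name, partitions are modelled as `(size, fstype)` tuples, and `LINUX_RAID_PARTITION` is assumed to be 0xfd, the DOS type value quoted at the start of the thread. The minor-number mask check from the original loop is left out.

```python
LINUX_RAID_PARTITION = 0xFD  # assumption: same value as the DOS "RAID autodetect" type


def scan_partitions(partitions, first_minor=1):
    """Return (registered, autodetect) lists of minor numbers.

    `partitions` is a list of (size, fstype) tuples in label order.
    """
    registered, autodetect = [], []
    minor = first_minor
    for size, fstype in partitions:
        if size:
            # Non-empty slot: register it, as add_gd_partition() does in the patch.
            registered.append(minor)
            if fstype == LINUX_RAID_PARTITION:
                # Flagged slot: hand it to RAID autodetect as well.
                autodetect.append(minor)
        # Empty slots are skipped but still consume a minor number.
        minor += 1
    return registered, autodetect
```

For example, `scan_partitions([(100, 0xFD), (0, 0), (50, 0x83)])` registers minors 1 and 3 and flags only minor 1 for autodetect.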


* Re: Root Drive Mirroring and LVM.
  2004-01-29  1:01     ` Tom Vier
@ 2004-01-29  7:22       ` Sven Luther
  0 siblings, 0 replies; 19+ messages in thread
From: Sven Luther @ 2004-01-29  7:22 UTC (permalink / raw)
  To: Tom Vier; +Cc: Sven Luther, linuxppc-dev, linux-raid

On Wed, Jan 28, 2004 at 08:01:01PM -0500, Tom Vier wrote:
> On Tue, Jan 27, 2004 at 10:32:13AM +0100, Sven Luther wrote:
> > Seems ok for me. Also, i guess that there are other partition types,
> > like the amiga partitition table the pegasos boxes mostly use, which has
> > a 32bit identifier for partition types. I guess it is the task of the
> > RAID code to have some per partition type checking for this RAID autodetect
> > magic.
> 
> it's done in fs/partitions/. it could be made anything, as long as it's put
> in the raid docs so people know.

Ok, thanks. In the Amiga partition table, partition types are 32-bit ints,
but can also be represented as 4 chars, so maybe `RAID` would be a good
type for that? I will try patching the kernel and adding support in
libparted for recognizing such partitions, and try it out.
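
As a quick illustration of the 4-chars-as-32-bit-int idea, the Python sketch below packs a tag into an integer and back. It assumes big-endian byte order, and `RAID` is only the tag proposed in this message, not an existing standard; the function names are made up for the example.

```python
import struct


def type_from_chars(tag):
    """Pack a 4-character tag into a 32-bit big-endian integer."""
    raw = tag.encode("ascii")
    if len(raw) != 4:
        raise ValueError("tag must be exactly 4 ASCII characters")
    return struct.unpack(">I", raw)[0]


def chars_from_type(value):
    """Unpack a 32-bit big-endian integer back into its 4-character tag."""
    return struct.pack(">I", value).decode("ascii")


if __name__ == "__main__":
    print(hex(type_from_chars("RAID")))  # 0x52414944
```

A partition-table parser could then compare the stored type against that single integer constant.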

Does it make sense to do software RAID on two partitions of the same
disk, just for testing purposes?

> here's a patch of mine for alpha, to give you an idea.

Ok, thanks.

Friendly,

Sven Luther


* Re: Root Drive Mirroring and LVM.
  2004-01-28  1:42   ` Neil Brown
  2004-01-28  8:15     ` Sven Luther
@ 2004-02-04 18:23     ` linas
  2004-02-04 22:32       ` Mark Hahn
  2004-03-23  7:27       ` Atro Tossavainen
  1 sibling, 2 replies; 19+ messages in thread
From: linas @ 2004-02-04 18:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: Atro.Tossavainen, linuxppc-dev, linux-raid, tas

On Wed, Jan 28, 2004 at 12:42:18PM +1100, Neil Brown wrote:
> solution and is (I think) the preferred long-term solution. However
> you can also boot from raid with kernel-parameters like:
> 
>    md=0,/dev/hda1,/dev/hdc1 root=/dev/md0
> 
> where '0' indicated which md device (md0 in this case), and the
> remaining words are the devices to assemble it from.
> 
> I personally think auto-detect is the wrong approach and have no
> desire to extend it to other partition types (I cannot remove it from
> DOS partitions as that breaks back-compatability).
> Just use "md=..."

What's wrong with autodetect?  It's saved my butt a number of times.

My recurring nightmare involves a failed IDE controller.  If the
failed controller is on the motherboard, then plugging in a store-bought
IDE controller causes the BIOS to "randomly" renumber the hard drives.
(Yes, sorry, this is a PC issue, but I would guess that similar issues
lurk in Open Firmware, etc.)

Thus, any sort of mounting that explicitly references /dev/hdanything
is guaranteed to screw up (sometimes catastrophically) when trying
to recover from failed controllers (and sometimes even failed disks,
if you are trying to bootstrap your way through by temporarily putting
the replacement disk somewhere other than its permanent home).

And I'm a comp sci geek who wrote the original Linux RAID HOWTO;
I pity the mere mortal sysadmin who has to go through this ...

So what's wrong with autodetect, again?

--linas


* Re: Root Drive Mirroring and LVM.
  2004-02-04 18:23     ` linas
@ 2004-02-04 22:32       ` Mark Hahn
  2004-02-04 22:49         ` linas
  2004-03-23  7:27       ` Atro Tossavainen
  1 sibling, 1 reply; 19+ messages in thread
From: Mark Hahn @ 2004-02-04 22:32 UTC (permalink / raw)
  To: linas; +Cc: linux-raid

> So what's wrong with autodetect, again?

my guess is: in-kernel autodetect is the problem.  
out-of-kernel detection can be much smarter, 
and can be more easily tested/replaced.




* Re: Root Drive Mirroring and LVM.
  2004-02-04 22:32       ` Mark Hahn
@ 2004-02-04 22:49         ` linas
  2004-02-04 23:07           ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: linas @ 2004-02-04 22:49 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote:
> > So what's wrong with autodetect, again?
> 
> my guess is: in-kernel autodetect is the problem.  
> out-of-kernel detection can be much smarter, 
> and can be more easily tested/replaced.

Hm, yes, that makes sense.

--linas


* Re: Root Drive Mirroring and LVM.
  2004-02-04 22:49         ` linas
@ 2004-02-04 23:07           ` Neil Brown
  2004-02-05  0:37             ` linas
                               ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Neil Brown @ 2004-02-04 23:07 UTC (permalink / raw)
  To: linas; +Cc: Mark Hahn, linux-raid

On Wednesday February 4, linas@austin.ibm.com wrote:
> On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote:
> > > So what's wrong with autodetect, again?
> > 
> > my guess is: in-kernel autodetect is the problem.  
> > out-of-kernel detection can be much smarter, 
> > and can be more easily tested/replaced.
> 
> Hm, yes, that makes sense.

Good :-)

Just to flesh out my thoughts a bit more:

 If the root filesystem is on an MD array, then I see the process of
 assembling that md array as quite similar to the process of finding
 the device for the root filesystem.

 We don't expect the kernel, or anyone else, to scan all devices
 looking for something that looks like a root filesystem, and loading
 that.  Rather we tell the kernel or boot loader exactly where to find
 the root filesystem.  And if the root filesystem moves, we get to
 explicitly tell the boot loader where it is (root=/dev/hdc1 or
 whatever).
 Assembling the root device should be handled the same way.  We tell
 the boot loader/kernel where to expect it, but can over-ride that if
 we need to:
   md=0,/dev/hdc1,/dev/hde4

 All other md arrays can, and so should, be assembled by code running
 out of the root filesystem.   This could be some program that
 assembles anything it finds after scanning all devices, or something
 a bit more focused, but it should be controllable by the sysadmin.

 It is true that in-kernel auto-detect can be controlled by fiddling
 with partition types, but the problem is that it runs *before* the
 root filesystem is mounted and so could conceivably confuse the
 assembly of the root device (if e.g. you plugged in some other device
 that also claimed to be part of /dev/md0, and it got scanned before
 your real root device).
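
 The hazard in that last paragraph can be shown with a toy model. The Python below is a deliberately simplified sketch, not the md driver's actual algorithm: each device carries a made-up `(array id, claimed md number)` pair, `autodetect_md0` lets the first device scanned decide which array becomes md0, and `explicit_md0` considers only the devices the admin listed. All names are hypothetical.

```python
def autodetect_md0(scan_order, superblocks):
    """Toy model: the first device seen that claims md0 decides which array wins."""
    winner = None
    members = []
    for dev in scan_order:
        array_id, minor = superblocks[dev]
        if minor != 0:
            continue
        if winner is None:
            winner = array_id  # scan order alone picks the array
        if array_id == winner:
            members.append(dev)
    return members


def explicit_md0(listed, superblocks):
    """Only the devices named by the admin are considered, in the order given."""
    return [dev for dev in listed if dev in superblocks]


if __name__ == "__main__":
    superblocks = {
        "/dev/hdc1": ("real", 0),
        "/dev/hde4": ("real", 0),
        "/dev/hdg1": ("stray", 0),  # foreign disk that also claims md0
    }
    print(autodetect_md0(["/dev/hdg1", "/dev/hdc1", "/dev/hde4"], superblocks))
    print(explicit_md0(["/dev/hdc1", "/dev/hde4"], superblocks))
```

 In this model the stray disk wins md0 whenever it happens to be scanned first, while the explicit list is unaffected by scan order.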

NeilBrown


* Re: Root Drive Mirroring and LVM.
  2004-02-04 23:07           ` Neil Brown
@ 2004-02-05  0:37             ` linas
  2004-02-05  1:31               ` Guy
  2004-02-05  4:45             ` badblock handling Donghui Wen
  2004-02-05 18:04             ` Root Drive Mirroring and LVM Joe Pruett
  2 siblings, 1 reply; 19+ messages in thread
From: linas @ 2004-02-05  0:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: Mark Hahn, linux-raid, linuxppc64-dev, linux-hotplug-devel

On Thu, Feb 05, 2004 at 10:07:32AM +1100, Neil Brown wrote:
> On Wednesday February 4, linas@austin.ibm.com wrote:
> > On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote:
> > > > So what's wrong with autodetect, again?
> > > 
> > > my guess is: in-kernel autodetect is the problem.  
> > > out-of-kernel detection can be much smarter, 
> > > and can be more easily tested/replaced.
> > 
> > Hm, yes, that makes sense.
> 
> 
> Good :-)
> 
> Just to flesh out my thoughts a bit more:
> 
>  If the root filesystem is on an MD array, then I see the process of
>  assembling that md array as quite similar to the process of finding
>  the device for the root filesystem.
> 
>  We don't expect the kernel, or anyone else, to scan all devices
>  looking for something that looks like a root filesystem, and loading
>  that.  Rather we tell the kernel or boot loader exactly where to find
>  the root filesystem.  And if the root filesystem moves, we get to
>  explicitly tell the boot loader where it is (root=/dev/hdc1 or
>  whatever).

Well, this is actually one of my more bitter complaints.  I'd much
much rather have some symbolic disk label, and have the kernel scan 
*all* devices for that.  I can defend this for both low-end and high-end 
machines.

On the low end, i.e. PCs and PC servers: I've dealt (multiple times) with
disk and/or controller failure.  If you have more than 3 disks, it can
get very confusing as to which cable is which.  Add to that the fact that
some BIOSes allow you to enable/disable controllers, which causes devices
to be renumbered. E.g. I have a mobo with two 'plain vanilla' IDE
connectors and two '100MHz' 80-wire IDE connectors.  Which one is
/dev/hda and which is not depends on the BIOS settings.  Worse:
if I plug in a 3rd-party IDE controller, then the numbering becomes
insane:

/dev/hda: mobo 33MHz controller
/dev/hdc: mobo 33MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller
/dev/hdi: mobo 100MHz controller
/dev/hdk: mobo 100MHz controller

But if I disable the 33MHz mobo connector in bios:

/dev/hda: mobo 100MHz controller
/dev/hdc: mobo 100MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller
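
Both listings follow from one rule: names are handed out in probe order, not by physical location. A rough Python sketch of that rule follows, assuming two channels per controller and showing only the master drive on each channel; `name_ide_masters` is a hypothetical helper, and the controller order is simply taken as input.

```python
import string


def name_ide_masters(controllers, channels_per_controller=2):
    """Map master-drive names (hda, hdc, ...) to controllers in probe order."""
    names = {}
    channel = 0
    for controller in controllers:
        for _ in range(channels_per_controller):
            # Each channel owns two letters: master at 2*channel, slave at 2*channel + 1.
            letter = string.ascii_lowercase[2 * channel]
            names[f"/dev/hd{letter}"] = controller
            channel += 1
    return names


if __name__ == "__main__":
    for order in (["mobo 33MHz", "pci card", "mobo 100MHz"],
                  ["mobo 100MHz", "pci card"]):
        for dev, ctrl in name_ide_masters(order).items():
            print(f"{dev}: {ctrl} controller")
        print()
```

Any change in which controllers are probed, or in what order, shifts every name after the change.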

Since some types of disk failures will lock up the kernel during boot,
you spend a lot of time plugging and unplugging disks and rebooting.
It gets confusing after a while.

To add to the confusion: If you RAID-1 mirror the root partition, 
but then mount it as /dev/hdk (and not /dev/md0) during a rescue 
operation (because rescue disks often don't have RAID on them), 
and then edit /etc/fstab to try to match the new config...
you end up with two root partitions (each of the mirror pair) with
different /etc/fstab's, on different cables/controllers, ...
after fiddling with that, it's a nightmare to figure out which
is which, and what's mounted how and where.

I also had a similar experience on a machine with 30-odd scsi disks.
We installed a new kernel on /dev/sdp, rebooted ... and what confusion,
since /dev/sdp was now /dev/sdg.  Worse, since the bootloader was 
unaware of this difference ... So then we made a change, and rebooted...
... again, much confusion till we worked it out.

As a cure-all, I started fiddling with ext2 disk labels.  However,
an ext2 disk label, when written on a RAID-1 device, will identify
*three* things: /dev/md0 (for example) and both halves of the mirror.
/dev/md0, /dev/hda1 and /dev/hde1 all have the same label.
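
The ambiguity is easy to see in a small Python sketch. The device-to-label table is invented for illustration, and the policy in `pick_root` (prefer an assembled md device, otherwise refuse to guess) is only one possible assumption, not something any existing tool is claimed to do.

```python
def find_by_label(labels, wanted):
    """Return every device whose filesystem label matches, sorted by name."""
    return sorted(dev for dev, label in labels.items() if label == wanted)


def pick_root(labels, wanted):
    """Prefer an assembled md device; refuse to guess among raw mirror members."""
    matches = find_by_label(labels, wanted)
    md_matches = [dev for dev in matches if dev.startswith("/dev/md")]
    if len(md_matches) == 1:
        return md_matches[0]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"label {wanted!r} is ambiguous or missing: {matches}")


if __name__ == "__main__":
    labels = {"/dev/md0": "root", "/dev/hda1": "root", "/dev/hde1": "root"}
    print(find_by_label(labels, "root"))
    print(pick_root(labels, "root"))
```

If the array is not assembled (as on a rescue disk without RAID support), the two raw halves remain and the lookup cannot tell them apart.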

reiserfs doesn't have ext2 labels....

I wanted to fool with other partition table schemes (non-DOS partition 
tables) until I realized most PC rescue disks wouldn't be able to get 
you out of that jam. 

So then I thought about using LVM to label my disks (i.e. use LVM
for one reason and one reason only: to be able to assign logical
names instead of physical names).  But think about it... it's scary:
LVM is not widespread, it is somewhat buggy, and standard Debian/Red Hat/SuSE
rescue diskettes won't have it ... etc.

>  Assembling the root device should be handled the same way.  We tell
>  the boot loader/kernel where to expect it, but can over-ride that if
>  we need to:
>    md=0,/dev/hdc1,/dev/hde4
> 
>  All other md arrays can, and so should, be assembled by code running
>  out of the root filesystem.  

*If* you mounted the "right" root filesystem.  If you have multiple
copies around, and edit fstab on one but not the others, and then
recable and reboot ... 

>  This could be some program that
>  assembles anything it finds after scanning all devices, or something
>  a bit more focused, but it should be controllable by the sysadmin.

At least once I managed to mount /usr as /var because of the confusion,
and upon reboot the init.d scripts spewed /var/lock crud into my poor
/usr filesystem ... 

I want to know that /dev/mdwhatever is /var *before* I mount it,
not after.

>  It is true that in-kernel auto-detect can be controlled by fiddling
>  with partition types, but the problem is that it runs *before* the
>  root filesystem is mounted and so could conceivably confuse the
>  assembly of the root device (if e.g. you plugged in some other device
>  that also claimed to be part of /dev/md0, and it got scanned before
>  your real root device).

Well, yes, that too, I suppose.  But as you see, explicitly
specifying it at the boot prompt works only if you type in the
right thing ...

--linas

cc.ing ppc64 because although not an architecture issue, it is a 
sysadmin issue on enterprise-class machines.

cc.ing hot-plug because this is a cold-plug issue.

Note you get similar crud for multiple ethernet cards ... 


* RE: Root Drive Mirroring and LVM.
  2004-02-05  0:37             ` linas
@ 2004-02-05  1:31               ` Guy
  0 siblings, 0 replies; 19+ messages in thread
From: Guy @ 2004-02-05  1:31 UTC (permalink / raw)
  To: linas, 'Neil Brown'
  Cc: 'Mark Hahn', linux-raid, linuxppc64-dev,
	linux-hotplug-devel

I upgraded my firmware on a 2940U2W.  That changed the order my SCSI buses
were scanned.  This changed the boot order of my disks.  I had to disable
the bios on the 2940U2W so it would not attempt to boot from the disks on
that bus.

My MB has 2 SCSI buses and I have 2 SCSI cards.

So, anything that could have prevented this would be good.


* badblock handling
  2004-02-04 23:07           ` Neil Brown
  2004-02-05  0:37             ` linas
@ 2004-02-05  4:45             ` Donghui Wen
  2004-02-05  5:26               ` Guy
                                 ` (2 more replies)
  2004-02-05 18:04             ` Root Drive Mirroring and LVM Joe Pruett
  2 siblings, 3 replies; 19+ messages in thread
From: Donghui Wen @ 2004-02-05  4:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Hi,
     I am running a server with Linux software RAID (on a 3ware controller).
From time to time, a disk is kicked out by md. This happens when the 3ware
card reports an unrecovered read error for a sector. But when I run
raidhotadd, the disk can be added back to the array with no problem.
     So my questions are:
        (1) Will md kick out a disk if it finds just ONE bad block?
        (2) Is it possible to set a threshold, so that a disk is kicked out
            only when the number of bad blocks exceeds it?
        (3) Is it possible to remap the bad blocks to spare blocks
            automatically, on the fly?

Thanks!

Donghui



* RE: badblock handling
  2004-02-05  4:45             ` badblock handling Donghui Wen
@ 2004-02-05  5:26               ` Guy
  2004-02-05  7:35               ` Martin Dohmen
  2004-02-06  8:54               ` Holger Kiehl
  2 siblings, 0 replies; 19+ messages in thread
From: Guy @ 2004-02-05  5:26 UTC (permalink / raw)
  To: 'Donghui Wen', linux-raid; +Cc: 'Neil Brown'

I have had the same problems.  Once there is a bad block on the disk, you
can't read it.  However, if you overwrite the bad block with new data (or
the same data) the disk drive will relocate the bad block to a spare block.
Once relocated, the disk is good again.  This automatic relocation of bad
blocks (or sectors) has been around since IDE disks.  I am not sure it applies
to every IDE disk, but it does to all that I know about.

What I think should happen is:
If Read error,
  re-create missing data,
  re-write the data to the "bad" disk.
  If write fails,
    fail the disk,
  else,
    go on with life.
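
The steps above can be sketched in Python against a toy two-disk mirror. This is only an illustration under stated assumptions, not how md is implemented: the Disk class, its spare-sector pool and the device names are all hypothetical.

```python
class ReadError(Exception):
    """A simulated disk could not return a sector."""


class WriteError(Exception):
    """A simulated disk could not store a sector."""


class Disk:
    """Toy in-memory disk; `bad` holds sectors that fail on read."""

    def __init__(self, name, sectors, spares=2):
        self.name = name
        self.data = list(sectors)
        self.bad = set()
        self.spares = spares   # hypothetical pool of spare sectors
        self.failed = False

    def read(self, n):
        if n in self.bad:
            raise ReadError(f"{self.name}: sector {n}")
        return self.data[n]

    def write(self, n, value):
        if n in self.bad:
            if self.spares == 0:
                raise WriteError(f"{self.name}: no spare left for sector {n}")
            self.spares -= 1   # the drive relocates the sector on write
            self.bad.discard(n)
        self.data[n] = value


def mirror_read(primary, secondary, n):
    """Read sector n, repairing `primary` from `secondary` on a read error."""
    try:
        return primary.read(n)
    except ReadError:
        value = secondary.read(n)      # re-create the missing data
        try:
            primary.write(n, value)    # re-write it to the "bad" disk
        except WriteError:
            primary.failed = True      # write failed: fail the disk
        return value                   # otherwise: go on with life
```

If the second disk cannot supply the sector either, the ReadError simply propagates to the caller. A parity array would rebuild the data from the remaining members instead of copying it from a mirror.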

In the past 2 years I have had to remove and re-add failed disks about 5-10
times.  I never had to replace a bad disk.

If something like the above could be implemented, it would save a lot of
effort.

Also, if the bad block table is almost full, then fail the drive regardless.
Maybe a threshold that can be set.  Like 90% full.
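
A minimal sketch of such a threshold check. It assumes the number of remapped sectors and the size of the relocation table are known. The function and parameter names are hypothetical, and the 90% default is just the figure suggested above:

```python
def should_fail_drive(remapped, table_size, threshold_percent=90):
    """Return True once the relocation table is at least threshold_percent full.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    if table_size <= 0:
        raise ValueError("table_size must be positive")
    if remapped < 0:
        raise ValueError("remapped must not be negative")
    return remapped * 100 >= table_size * threshold_percent
```

For example, should_fail_drive(89, 100) is False while should_fail_drive(90, 100) is True.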

Guy




* Re: badblock handling
  2004-02-05  4:45             ` badblock handling Donghui Wen
  2004-02-05  5:26               ` Guy
@ 2004-02-05  7:35               ` Martin Dohmen
  2004-02-06  4:43                 ` Donghui Wen
  2004-02-06  8:54               ` Holger Kiehl
  2 siblings, 1 reply; 19+ messages in thread
From: Martin Dohmen @ 2004-02-05  7:35 UTC (permalink / raw)
  To: 'Donghui Wen', linux-raid

I have seen the same problem, but I have also noticed that if I use the
3ware card's own raid5 instead of md, things work differently. The only thing
that seems to happen when a disk develops a bad block is this in the log:

3w-xxxx: scsi0: AEN: ERROR: Drive error: Port #6.
3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #6.

I do not know whether 3ware sets some blocks aside when the raid is created,
for use when errors occur, or whether it just writes the block again and the
disk does the reallocation by itself.

The downside of using 3ware raid5 is that performance is much lower than
with md: with 8 disks I get 90MB/s with md and 40MB/s with 3ware in write
speed.





Kind regards
TRIPNET AB

Martin Dohmen
________________________________________________________________________
Tripnet AB                Visiting address:  Phone:    031-725 25 00
Box 5071                  Åvägen 42          Fax:      031-725 25 01
402 22  GÖTEBORG          GÖTEBORG           Direct:   031-725 25 11
http://www.tripnet.se     dohmen@tripnet.se  Mobile:   0733-58 25 11



* Re: Root Drive Mirroring and LVM.
  2004-02-04 23:07           ` Neil Brown
  2004-02-05  0:37             ` linas
  2004-02-05  4:45             ` badblock handling Donghui Wen
@ 2004-02-05 18:04             ` Joe Pruett
  2 siblings, 0 replies; 19+ messages in thread
From: Joe Pruett @ 2004-02-05 18:04 UTC (permalink / raw)
  To: linux-raid

On Thu, 5 Feb 2004, Neil Brown wrote:

>  We don't expect the kernel, or anyone else, to scan all devices
>  looking for something that looks like a root filesystem, and loading
>  that.  Rather we tell the kernel or boot loader exactly where to find
>  the root filesystem.  And if the root filesystem moves, we get to
>  explicitly tell the boot loader where it is (root=/dev/hdc1 or
>  whatever).

what about:
root=LABEL=/




* Re: badblock handling
  2004-02-05  7:35               ` Martin Dohmen
@ 2004-02-06  4:43                 ` Donghui Wen
  0 siblings, 0 replies; 19+ messages in thread
From: Donghui Wen @ 2004-02-06  4:43 UTC (permalink / raw)
  To: Martin Dohmen, linux-raid

I have read some posts saying that 3ware's hardware raid does bad block
remapping, which is transparent to the file system. I was wondering whether
md does the same.

I am trying hardware raid 1+0 now. It loses half of the disk space, but if
it is stable and fast, I will go for it.

Donghui



* Re: badblock handling
  2004-02-05  4:45             ` badblock handling Donghui Wen
  2004-02-05  5:26               ` Guy
  2004-02-05  7:35               ` Martin Dohmen
@ 2004-02-06  8:54               ` Holger Kiehl
  2 siblings, 0 replies; 19+ messages in thread
From: Holger Kiehl @ 2004-02-06  8:54 UTC (permalink / raw)
  To: Donghui Wen; +Cc: linux-raid, Neil Brown

On Wed, 4 Feb 2004, Donghui Wen wrote:

> Hi,
>      I am running a server with Linux software RAID (on a 3ware controller).
> From time to time, a disk is kicked out by md. This happens when the 3ware
> card reports an unrecovered read error for a sector. But when I run
> raidhotadd, the disk can be added back to the array with no problem.
>      So my questions are:
>         (1) Will md kick out a disk if it finds just ONE bad block?
>         (2) Is it possible to set a threshold, so that a disk is kicked out
>             only when the number of bad blocks exceeds it?
>         (3) Is it possible to remap the bad blocks to spare blocks
>             automatically, on the fly?
> 
I am not an expert on this and I am only guessing, so please someone
correct me if I say something wrong.

The md block device gets an error from the underlying scsi/ide/usb/firewire/
network layer. That layer does not say exactly what type of error has
occurred, so the best thing for md to do is to kick out the device.

Holger



* Re: Root Drive Mirroring and LVM.
  2004-02-04 18:23     ` linas
  2004-02-04 22:32       ` Mark Hahn
@ 2004-03-23  7:27       ` Atro Tossavainen
  1 sibling, 0 replies; 19+ messages in thread
From: Atro Tossavainen @ 2004-03-23  7:27 UTC (permalink / raw)
  To: linas; +Cc: Neil Brown, Atro.Tossavainen, linuxppc-dev, linux-raid, tas

On Wed, 4 Feb 2004 12:23:17 -0600, linas@austin.ibm.com wrote:

> On Wed, Jan 28, 2004 at 12:42:18PM +1100, Neil Brown wrote:
>
> > I personally think auto-detect is the wrong approach and have no
> > desire to extend it to other partition types (I cannot remove it from
> > DOS partitions as that breaks backward compatibility).
> > Just use "md=..."
> 
> What's wrong with autodetect?  It's saved my butt a number of times.
> 
> My recurring nightmare involves a failed ide controller.  If the
> failed controller is on the motherboard, then plugging in a store-bought
> ide controller causes the BIOS to "randomly" renumber the hard drives.
> (yes, sorry this is a PC issue, but I would guess that similar issues
> lurk in open firmware, etc).

Indeed.  Speaking for myself only, the whole issue has only come about
because somewhere after 2.4.20 the handling of Promise add-on IDE
controllers was changed.  Previously the Promise hard drives on the
Apple Xserve were /dev/hde to /dev/hdh and the "mainboard" IDE (which
only handles the CD-ROM) was /dev/hda to /dev/hdd; now their order is
reversed.  For some reason I am unable to get the metadevice to work
after the change, so I am effectively unable to update my kernel, as
root is on md.

-- 
Atro Tossavainen (Mr.)               / The Institute of Biotechnology at
Systems Analyst, Techno-Amish &     / the University of Helsinki, Finland,
+358-9-19158939  UNIX Dinosaur     / employs me, but my opinions are my own.
< URL : http : / / www . helsinki . fi / %7E atossava / > NO FILE ATTACHMENTS


end of thread

Thread overview: 19+ messages
     [not found] <868055F9-5045-11D8-8066-00039382032A@mindspring.com>
2004-01-27  8:01 ` Root Drive Mirroring and LVM Atro Tossavainen
2004-01-27  9:32   ` Sven Luther
2004-01-29  1:01     ` Tom Vier
2004-01-29  7:22       ` Sven Luther
2004-01-28  1:42   ` Neil Brown
2004-01-28  8:15     ` Sven Luther
2004-02-04 18:23     ` linas
2004-02-04 22:32       ` Mark Hahn
2004-02-04 22:49         ` linas
2004-02-04 23:07           ` Neil Brown
2004-02-05  0:37             ` linas
2004-02-05  1:31               ` Guy
2004-02-05  4:45             ` badblock handling Donghui Wen
2004-02-05  5:26               ` Guy
2004-02-05  7:35               ` Martin Dohmen
2004-02-06  4:43                 ` Donghui Wen
2004-02-06  8:54               ` Holger Kiehl
2004-02-05 18:04             ` Root Drive Mirroring and LVM Joe Pruett
2004-03-23  7:27       ` Atro Tossavainen
