* Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-13 15:44 UTC
To: Neil Brown; +Cc: linux-raid
If you create stacked md devices, ala:
[root@pe-fc4 devel]# cat /proc/mdstat
Personalities : [raid0] [raid5] [multipath]
md_d0 : active raid5 md3[3] md2[2] md1[1] md0[0]
53327232 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md3 : active multipath sdm1[1] sdi1[0]
17775808 blocks [2/2] [UU]
md2 : active multipath sdl1[1] sdh1[0]
17775808 blocks [2/2] [UU]
md1 : active multipath sdk1[1] sdg1[0]
17775808 blocks [2/2] [UU]
md0 : active multipath sdj1[0] sdf1[1]
17775808 blocks [2/2] [UU]
unused devices: <none>
[root@pe-fc4 devel]#
and then run mdadm -E --scan, then you get this (obviously wrong)
output:
[root@pe-fc4 devel]# /sbin/mdadm -E --scan
ARRAY /dev/md0 level=multipath num-devices=2
UUID=34f4efec:bafe48ef:f1bb5b94:f5aace52
devices=/dev/sdj1,/dev/sdf1
ARRAY /dev/md1 level=multipath num-devices=2
UUID=bbaaf9fd:a1f118a9:bcaa287b:e7ac8c0f
devices=/dev/sdk1,/dev/sdg1
ARRAY /dev/md2 level=multipath num-devices=2
UUID=a719f449:1c63e488:b9344127:98a9bcad
devices=/dev/sdl1,/dev/sdh1
ARRAY /dev/md3 level=multipath num-devices=2
UUID=37b23a92:f25ffdc2:153713f7:8e5d5e3b
devices=/dev/sdm1,/dev/sdi1
ARRAY /dev/md0 level=raid5 num-devices=4
UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
devices=/dev/md3,/dev/md2,/dev/md1,/dev/md0,/dev/md3,/dev/md2,/dev/md1,/dev/md0
[root@pe-fc4 devel]#
This is with mdadm-1.11.0.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-13 17:11 UTC
To: Neil Brown; +Cc: linux-raid
On Fri, 2005-05-13 at 11:44 -0400, Doug Ledford wrote:
> If you create stacked md devices, ala:
>
> [root@pe-fc4 devel]# cat /proc/mdstat
> Personalities : [raid0] [raid5] [multipath]
> md_d0 : active raid5 md3[3] md2[2] md1[1] md0[0]
> 53327232 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>
> md3 : active multipath sdm1[1] sdi1[0]
> 17775808 blocks [2/2] [UU]
>
> md2 : active multipath sdl1[1] sdh1[0]
> 17775808 blocks [2/2] [UU]
>
> md1 : active multipath sdk1[1] sdg1[0]
> 17775808 blocks [2/2] [UU]
>
> md0 : active multipath sdj1[0] sdf1[1]
> 17775808 blocks [2/2] [UU]
>
> unused devices: <none>
> [root@pe-fc4 devel]#
>
> and then run mdadm -E --scan, then you get this (obviously wrong)
> output:
>
> [root@pe-fc4 devel]# /sbin/mdadm -E --scan
> ARRAY /dev/md0 level=multipath num-devices=2
> UUID=34f4efec:bafe48ef:f1bb5b94:f5aace52
> devices=/dev/sdj1,/dev/sdf1
> ARRAY /dev/md1 level=multipath num-devices=2
> UUID=bbaaf9fd:a1f118a9:bcaa287b:e7ac8c0f
> devices=/dev/sdk1,/dev/sdg1
> ARRAY /dev/md2 level=multipath num-devices=2
> UUID=a719f449:1c63e488:b9344127:98a9bcad
> devices=/dev/sdl1,/dev/sdh1
> ARRAY /dev/md3 level=multipath num-devices=2
> UUID=37b23a92:f25ffdc2:153713f7:8e5d5e3b
> devices=/dev/sdm1,/dev/sdi1
> ARRAY /dev/md0 level=raid5 num-devices=4
> UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
>
> devices=/dev/md3,/dev/md2,/dev/md1,/dev/md0,/dev/md3,/dev/md2,/dev/md1,/dev/md0
> [root@pe-fc4 devel]#
>
> This is with mdadm-1.11.0.
OK, this appears to extend to mdadm -Ss and mdadm -A --scan as well.
Basically, mdadm does not properly handle mixed md and mdp type devices
well, especially in a stacked configuration. I got it to work
reasonably well using this config file:
DEVICE partitions /dev/md[0-3]
MAILADDR root
ARRAY /dev/md0 level=multipath num-devices=2
UUID=34f4efec:bafe48ef:f1bb5b94:f5aace52 auto=md
ARRAY /dev/md1 level=multipath num-devices=2
UUID=bbaaf9fd:a1f118a9:bcaa287b:e7ac8c0f auto=md
ARRAY /dev/md2 level=multipath num-devices=2
UUID=a719f449:1c63e488:b9344127:98a9bcad auto=md
ARRAY /dev/md3 level=multipath num-devices=2
UUID=37b23a92:f25ffdc2:153713f7:8e5d5e3b auto=md
ARRAY /dev/md_d0 level=raid5 num-devices=4
UUID=910b1fc9:d545bfd6:e4227893:75d72fd8 auto=part
This generates a number of warnings during both assembly and stop, but
works.
One more thing: since the UUID is a good identifier, it would be nice to
have mdadm -E --scan not print a devices= part. Device names can change,
and picking up your devices via UUID regardless of that change is
preferable, IMO, to having the assembly fail.
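In the meantime, stripping the devices= clauses by hand is easy enough; a
rough sketch, assuming they stay on a line of their own as in the listing
above:
    /sbin/mdadm -E --scan | grep -v '^[[:space:]]*devices=' >> /etc/mdadm.conf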
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Neil Brown @ 2005-05-13 23:01 UTC
To: Doug Ledford; +Cc: linux-raid
On Friday May 13, dledford@redhat.com wrote:
> On Fri, 2005-05-13 at 11:44 -0400, Doug Ledford wrote:
> > If you create stacked md devices, ala:
> >
> > [root@pe-fc4 devel]# cat /proc/mdstat
> > Personalities : [raid0] [raid5] [multipath]
> > md_d0 : active raid5 md3[3] md2[2] md1[1] md0[0]
> > 53327232 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
...
> >
> > and then run mdadm -E --scan, then you get this (obviously wrong)
> > output:
> >
> > [root@pe-fc4 devel]# /sbin/mdadm -E --scan
..
> > ARRAY /dev/md0 level=raid5 num-devices=4
> > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
Yes, I expect you would. -E just looks at the superblocks, and the
superblock doesn't record whether the array is meant to be partitioned
or not. With version-1 superblocks they don't even record the
sequence number of the array they are part of. In that case, "-Es"
will report
ARRAY /dev/?? level=.....
Possibly I could utilise one of the high bits in the on-disc minor
number to record whether partitioning was used...
> OK, this appears to extend to mdadm -Ss and mdadm -A --scan as well.
> Basically, mdadm does not properly handle mixed md and mdp type devices
> well, especially in a stacked configuration. I got it to work
> reasonably well using this config file:
>
> DEVICE partitions /dev/md[0-3]
> MAILADDR root
> ARRAY /dev/md0 level=multipath num-devices=2
> UUID=34f4efec:bafe48ef:f1bb5b94:f5aace52 auto=md
> ARRAY /dev/md1 level=multipath num-devices=2
> UUID=bbaaf9fd:a1f118a9:bcaa287b:e7ac8c0f auto=md
> ARRAY /dev/md2 level=multipath num-devices=2
> UUID=a719f449:1c63e488:b9344127:98a9bcad auto=md
> ARRAY /dev/md3 level=multipath num-devices=2
> UUID=37b23a92:f25ffdc2:153713f7:8e5d5e3b auto=md
> ARRAY /dev/md_d0 level=raid5 num-devices=4
> UUID=910b1fc9:d545bfd6:e4227893:75d72fd8 auto=part
>
> This generates a number of warnings during both assembly and stop, but
> works.
What warnings are they? I would expect this configuration to work
smoothly.
>
> One more thing, since the UUID is a good identifier, it would be nice to
> have mdadm -E --scan not print a devices= part. Device names can
> change, and picking up your devices via UUID regardless of that change
> is preferable, IMO, to having it fail.
The output of "-E --scan" was never intended to be used unchanged in
mdadm.conf.
It simply provides all available information in a brief format that is
reasonably compatible with mdadm.conf. As it says in the Examples
section of mdadm.8
echo 'DEVICE /dev/hd*[0-9] /dev/sd*[0-9]' > mdadm.conf
mdadm --detail --scan >> mdadm.conf
This will create a prototype config file that describes currently
active arrays that are known to be made from partitions of IDE or SCSI
drives. This file should be reviewed before being used as it may con-
tain unwanted detail.
However I note that the doco for --examine says
If --brief is given, or --scan then multiple devices that are
components of the one array are grouped together and reported in
a single entry suitable for inclusion in /etc/mdadm.conf.
which seems to make it alright to use it directly in mdadm.conf.
Maybe the --brief version should just give minimal detail (uuid), and
--verbose be required for the device names.
So there are little things that could be done to smooth some of this
over, but the core problem seems to be that you want to use the output
of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
work.
NeilBrown
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-14 13:28 UTC
To: Neil Brown; +Cc: linux-raid
On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> On Friday May 13, dledford@redhat.com wrote:
> So there are little things that could be done to smooth some of this
> over, but the core problem seems to be that you want to use the output
> of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> work.
Actually, what I'm working on is Fedora Core 4 and the boot-up sequence.
Specifically, I'm trying to get the use of mdadm -As accepted as the
default means of starting arrays in the initrd. My choices are either
this or making the raidautorun facility handle things properly. Since,
last I knew, you marked the raidautorun facility as deprecated, that
leaves mdadm (note this also has implications for the raid shutdown
sequence; I'll bring that up later).
So, here's some of the feedback I've gotten about this and the
constraints I'm working under as a result:
1. People don't want to remake initrd images and update mdadm.conf
files with every raid change. So, the scan facility needs to
properly handle unknown arrays (it doesn't currently).
2. People don't want to have a degraded array not get started (this
isn't a problem as far as I can tell).
3. Udev throws some kinks in things because the raid startup is
handled immediately after disk initialization, and from the looks
of things /sbin/hotplug isn't always done running by the time
the mdadm command is run, so occasionally disks get missed (like
in my multipath setup, sometimes the second path isn't available
and the array gets started with only one path as a result). I
either need to add procps and grep to the initrd and run a
"while ps ax | grep hotplug | grep -v grep" loop (see the sketch
after this list), or find some other way to make sure that the
raid startup doesn't happen until after hotplug operations are
complete (sleeping is another option, but a piss poor one in my
opinion, since I don't know a reasonable sleep time that's
guaranteed to make sure all hotplug operations have completed).
However, with the advent of stacked md devices, this problem has
to be moved into the mdadm binary itself. The reason for this
(and I had this happen to me yesterday) is that mdadm started the
4 multipath arrays, but one of them wasn't done with hotplug
before it tried to start the stacked raid5 array, and as a result
it started the raid5 array with 3 out of 4 devices in degraded
mode. So, it is a requirement of stacked md devices that the
mdadm binary, if it's going to be used to start the devices, wait
for all the md devices it just started before proceeding with the
next pass of md device startup. If you are going to put that into
the mdadm binary, then you might as well use it for both the delay
between md device startups and the initial delay to wait for all
block devices to be started.
4. Currently, the default number of partitions on a partitionable
raid device is only 4. That's pretty small. I know of no way
to encode the number of partitions the device was created with
into the superblock, but really that's what we need so we can
always start these devices with the correct number of minor
numbers relative to how the array was created. I would suggest
encoding this somewhere in the superblock and then setting this
to 0 on normal arrays and non-0 for partitionable arrays and
that becomes your flag that determines whether an array is
partitionable.
5. I would like to be able to support dynamic multipath in the
initrd image. In order to do this properly, I need the ability
to create multipath arrays with non-consistent superblocks. I
would then need mdadm to scan these multipath devices when
autodetecting raid arrays. I can see having a multipath config
option that is one of three settings:
A. Off - No multipath support except for persistent
multipath arrays
B. On - Setup any detected multipath devices as non-
persistent multipath arrays, then use the multipath
device in preference to the underlying devices for
things like subsequent mounts and raid starts
C. All - Setup all devices as multipath devices in order to
support dynamically adding new paths to previously
single path devices at run time. This would be useful
on machines using fiber channel for boot disks (amongst
other useful scenarios, you can tie this
into /sbin/hotplug so that new disks get their unique ID
checked and automatically get added to existing
multipath arrays should they be just another path, that
way bringing up a new fiber channel switch and plugging
it into a controller and adding paths to existing
devices will all just work properly and seamlessly).
6. There are some other multipath issues I'd like to address, but I
can do that separately. For example, it's unclear whether the
preferred way of creating a multipath array is with 2 devices or
with 1 device and 1 spare, yet mdadm requires that you pass the -f
flag to create the 1 device + 1 spare type. Should we ever support
both active-active and active-passive multipath in the future,
that difference would seem to be the obvious way to differentiate
between the two, so I would think either should be allowed by
mdadm. But since re-adding a path that has come back to life to an
existing multipath array always puts it in as a spare, if we want
to support active-active then we also need some way to trigger a
spare->active transition on a multipath element.
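For item 3, the wait loop I have in mind is only a rough sketch like the
following (it assumes procps and grep are in the initrd, and that polling
once a second is acceptable):
    # wait until no /sbin/hotplug instances are still running, then assemble
    while ps ax | grep -v grep | grep -q hotplug; do
        sleep 1
    done
    mdadm -As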
That pretty much covers it for now.
> > On Fri, 2005-05-13 at 11:44 -0400, Doug Ledford wrote:
> > > If you create stacked md devices, ala:
> > >
> > > [root@pe-fc4 devel]# cat /proc/mdstat
> > > Personalities : [raid0] [raid5] [multipath]
> > > md_d0 : active raid5 md3[3] md2[2] md1[1] md0[0]
> > > 53327232 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> ...
> > >
> > > and then run mdadm -E --scan, then you get this (obviously wrong)
> > > output:
> > >
> > > [root@pe-fc4 devel]# /sbin/mdadm -E --scan
> ..
> > > ARRAY /dev/md0 level=raid5 num-devices=4
> > > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
>
> Yes, I expect you would. -E just looks as the superblocks, and the
> superblock doesn't record whether the array is meant to be partitioned
> or not.
Yeah, that's *gotta* be fixed. Users will string me up by my testicles
otherwise. Whether it's a matter of encoding it in the superblock or
reading the first sector and looking for a partition table, I don't care
(the partition table bit has some interesting aspects to it: if you run
fdisk on a device and then reboot, it would change from non-partitioned
to partitioned, but might have a superminor conflict; not sure whether
that would be a bug or a feature), but the days of remaking initrd images
for every little change have long since passed, and if I do something to
bring them back they'll kill me.
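The first-sector check itself is cheap, for what it's worth; a rough
sketch, assuming an MSDOS-style partition table and using the md_d0
device from my earlier example:
    # look for the 0x55aa boot-record signature at offset 510 of the array
    sig=$(dd if=/dev/md_d0 bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    if [ "$sig" = "55aa" ]; then
        echo "partition table present"
    fi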
> With version-1 superblocks they don't even record the
> sequence number of the array they are part of. In that case, "-Es"
> will report
> ARRAY /dev/?? level=.....
>
> Possibly I could utilise one of the high bits in the on-disc minor
> number to record whether partitioning was used...
As mentioned previously, recording the number of minors used in creating
it would be preferable.
> > OK, this appears to extend to mdadm -Ss and mdadm -A --scan as well.
> > Basically, mdadm does not properly handle mixed md and mdp type devices
> > well, especially in a stacked configuration. I got it to work
> > reasonably well using this config file:
> >
> > DEVICE partitions /dev/md[0-3]
> > MAILADDR root
> > ARRAY /dev/md0 level=multipath num-devices=2
> > UUID=34f4efec:bafe48ef:f1bb5b94:f5aace52 auto=md
> > ARRAY /dev/md1 level=multipath num-devices=2
> > UUID=bbaaf9fd:a1f118a9:bcaa287b:e7ac8c0f auto=md
> > ARRAY /dev/md2 level=multipath num-devices=2
> > UUID=a719f449:1c63e488:b9344127:98a9bcad auto=md
> > ARRAY /dev/md3 level=multipath num-devices=2
> > UUID=37b23a92:f25ffdc2:153713f7:8e5d5e3b auto=md
> > ARRAY /dev/md_d0 level=raid5 num-devices=4
> > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8 auto=part
> >
> > This generates a number of warnings during both assembly and stop, but
> > works.
>
> What warnings are they? I would expect this configuration to work
> smoothly.
During stop, it tried to stop all the multipath devices before trying to
stop the raid5 device, so I got 4 warnings about the md devices still
being in use, and then it stopped the raid5. You have to run mdadm a
second time to stop the multipath devices. Assembly went smoother this
time, but it wasn't contending with device startup delays.
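For what it's worth, stopping things by hand in the right order does work
(device names from the /proc/mdstat listing at the top of the thread):
    mdadm -S /dev/md_d0                             # stacked raid5 first
    mdadm -S /dev/md0 /dev/md1 /dev/md2 /dev/md3    # then the multipath arrays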
> > One more thing, since the UUID is a good identifier, it would be nice to
> > have mdadm -E --scan not print a devices= part. Device names can
> > change, and picking up your devices via UUID regardless of that change
> > is preferable, IMO, to having it fail.
>
> The output of "-E --scan" was never intended to be used unchanged in
> mdadm.conf.
> It simply provides all available information in a brief format that is
> reasonably compatible with mdadm.conf. As is says in the Examples
> section of mdadm.8
>
> echo 'DEVICE /dev/hd*[0-9] /dev/sd*[0-9]' > mdadm.conf
> mdadm --detail --scan >> mdadm.conf
> This will create a prototype config file that describes currently
> active arrays that are known to be made from partitions of IDE or SCSI
> drives. This file should be reviewed before being used as it may con-
> tain unwanted detail.
>
> However I note that the doco for --examine says
>
> If --brief is given, or --scan then multiple devices that are
> components of the one array are grouped together and reported in
> a single entry suitable for inclusion in /etc/mdadm.conf.
>
> which seems to make it alright to use it directly in mdadm.conf.
Well, if you intend to make --brief work for a default mdadm.conf
inclusion directly (which would be useful specifically for anaconda,
which is the Red Hat/Fedora install code so that after setting up the
drives we could basically just do the equivalent of
echo "DEVICE partitions" > mdadm.conf
mdadm -Es --brief >> mdadm.conf
and get things right), then I would suggest that each ARRAY line should
include the dev name (reported properly; right now /dev/md_d0 shows up
as /dev/md0), level, number of disks, uuid, and auto={md|p(number)}.
That would generate a very useful ARRAY line.
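In other words, something along these lines for the raid5 in my setup (a
hypothetical sketch of the proposed output, not what mdadm prints today;
the partition count is only an example):
    ARRAY /dev/md_d0 level=raid5 num-devices=4 auto=part8
          UUID=910b1fc9:d545bfd6:e4227893:75d72fd8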
> Maybe the -brief version should just give minimal detail (uuid), and
> --verbose be required for the device names.
>
>
>
> NeilBrown
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Luca Berra @ 2005-05-15 17:32 UTC
To: linux-raid
I would like to add some comments.
On Sat, May 14, 2005 at 09:28:35AM -0400, Doug Ledford wrote:
>Actually, what I'm working on is Fedora Core 4 and the boot up sequence.
>Specifically, I'm trying to get the use of mdadm -As acceptable as the
hint, mdassemble :)
>So, here's some of the feedback I've gotten about this and the
>constraints I'm working under as a result:
The initrd should be used to mount the root filesystem; handling other
stuff in the initrd is just asking for more trouble.
> 1. People don't want to remake initrd images and update mdadm.conf
> files with every raid change. So, the scan facility needs to
> properly handled unknown arrays (it doesn't currently).
Do people really change the UUID of the array containing / that often?
> 2. People don't want to have a degraded array not get started (this
> isn't a problem as far as I can tell).
What would be the issue causing this?
> 3. Udev throws some kinks in things because the raid startup is
I run udevstart in the initrd after loading modules and before
assembling md's. If nash were a better shell, I could use udev
instead of udevstart.
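That is, something like this in the initrd's init script (a rough sketch;
the module names are only examples):
    modprobe aic7xxx       # whatever the controller driver happens to be
    modprobe raid5
    modprobe multipath
    udevstart              # populate /dev before touching md
    mdadm -As              # or mdassemble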
> have completed). However, with the advent of stacked md
mdadm should create device files for md arrays if you ask it to; is
this what you are after, or are there other issues?
> 4. Currently, the default number of partitions on a partitionable
....
> numbers relative to how the array was created. I would suggest
> encoding this somewhere in the superblock and then setting this
> to 0 on normal arrays and non-0 for partitionable arrays and
> that becomes your flag that determines whether an array is
> partitionable.
agreed
> 5. I would like to be able to support dynamic multipath in the
> initrd image. In order to do this properly, I need the ability
I don't like using FC for boot disks, so I never tried, and I'll just
shut up.
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-16 16:46 UTC
To: Neil Brown; +Cc: linux-raid
On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> On Friday May 13, dledford@redhat.com wrote:
> > On Fri, 2005-05-13 at 11:44 -0400, Doug Ledford wrote:
> > > If you create stacked md devices, ala:
> > >
> > > [root@pe-fc4 devel]# cat /proc/mdstat
> > > Personalities : [raid0] [raid5] [multipath]
> > > md_d0 : active raid5 md3[3] md2[2] md1[1] md0[0]
> > > 53327232 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> ...
> > >
> > > and then run mdadm -E --scan, then you get this (obviously wrong)
> > > output:
> > >
> > > [root@pe-fc4 devel]# /sbin/mdadm -E --scan
> ..
> > > ARRAY /dev/md0 level=raid5 num-devices=4
> > > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
>
> Yes, I expect you would. -E just looks as the superblocks, and the
> superblock doesn't record whether the array is meant to be partitioned
> or not. With version-1 superblocks they don't even record the
> sequence number of the array they are part of. In that case, "-Es"
> will report
> ARRAY /dev/?? level=.....
>
> Possibly I could utilise one of the high bits in the on-disc minor
> number to record whether partitioning was used...
Does that mean with version 1 superblocks you *have* to have an ARRAY
line in order to get some specific UUID array started at the right minor
number? If so, people will complain about that loudly.
> which seems to make it alright to use it directly in mdadm.conf.
>
> Maybe the -brief version should just give minimal detail (uuid), and
> --verbose be required for the device names.
>
>
> So there are little things that could be done to smooth some of this
> over, but the core problem seems to be that you want to use the output
> of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> work.
Well, also keep in mind that the documentation on mdadm touts the fact
that it is meant to be run without a config file. I'm perfectly happy to
do that, but -A mode combined with --scan mode only assembles and starts
arrays in the config file. As I mentioned in a previous email, I guess
I could do something like mdadm -Es | mdadm -c - -As -device=partitions
but that just seems kinda silly given that all that is needed is for the
assemble mode to start all devices found and we already know how to find
them in -E mode. Maybe the fix is either changing -As mode to assemble
everything regardless of the config file, or adding another flag to
assemble mode to do the same.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-16 22:11 UTC
To: Neil Brown; +Cc: linux-raid
On Fri, 2005-05-13 at 13:11 -0400, Doug Ledford wrote:
> > devices=/dev/md3,/dev/md2,/dev/md1,/dev/md0,/dev/md3,/dev/md2,/dev/md1,/dev/md0
OK, this particular little oddity is caused by the mdadm.conf file's
DEVICE line containing both "partitions" and /dev/md[0-3]. Evidently
mdadm scans devices twice if they are both in /proc/partitions and
explicitly specified on the DEVICE line.
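Since active md arrays show up in /proc/partitions themselves, dropping
the explicit names avoids the double scan; a sketch of the fix:
    DEVICE partitions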
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Neil Brown @ 2005-05-20 7:00 UTC
To: Doug Ledford; +Cc: linux-raid
On Saturday May 14, dledford@redhat.com wrote:
> On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> > On Friday May 13, dledford@redhat.com wrote:
> > So there are little things that could be done to smooth some of this
> > over, but the core problem seems to be that you want to use the output
> > of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> > work.
>
> Actually, what I'm working on is Fedora Core 4 and the boot up sequence.
> Specifically, I'm trying to get the use of mdadm -As acceptable as the
> default means of starting arrays in the initrd. My choices are either
> this or make the raidautorun facility handle things properly. Since
> last I knew, you marked the raidautorun facility as deprecated, that
> leaves mdadm (note this also has implications for the raid shutdown
> sequence, I'll bring that up later).
I'd like to start with an observation that may (or may not) be helpful.
Turn-key and flexibility are to some extent opposing goals.
By this I mean that making something highly flexible means providing
lots of options. Making something "just works" tends to mean not using
a lot of those options.
Making an initrd for assembling md arrays that will do what everyone
wants in all circumstances just isn't going to work.
To illustrate: some years ago we had a RAID based external storage box
made by DEC - with whopping big 9G drives in it!!
When you plugged a drive in it would be automatically checked and, if
it was new, labeled as a spare. It would then appear in the spare set
and maybe get immediately incorporated into any degraded array.
This was a great feature, but not something I would ever consider
putting into mdadm, and definitely not into the kernel.
If you want a turnkey system, there must be a place at which you say
"This is a turn-key system, you manage it for me" and you give up some
of the flexibility that you might otherwise have had. For many this
is a good tradeoff.
So you definitely could (or should be able to) make a single initrd
script/program that gets things exactly right, providing they were
configured according to the model that is prescribed by the maker of
that initrd.
Version-1 superblocks have fields intended to be used in exactly this
way.
>
> So, here's some of the feedback I've gotten about this and the
> constraints I'm working under as a result:
>
> 1. People don't want to remake initrd images and update mdadm.conf
> files with every raid change. So, the scan facility needs to
> properly handled unknown arrays (it doesn't currently).
If this is true, then presumably people also don't want to update
/etc/fstab with every filesystem change.
Is that a fair statement? Do you automatically mount unexpected
filesystems? Is there an important difference between raid
configuration and filesystem configuration that I am missing?
> 2. People don't want to have a degraded array not get started (this
> isn't a problem as far as I can tell).
There is a converse to this. People should be made to take notice if
there is possible data corruption.
I.e., if you have a system crash while running a degraded raid5, then
silent data corruption could ensue. mdadm will currently not start
any array in this state without an explicit '--force'. This is somewhat
akin to fsck sometimes requiring human interaction. Of course, if there
is good reason to believe the data is still safe, mdadm should -- and
I believe does -- assemble the array even if degraded.
> 3. Udev throws some kinks in things because the raid startup is
> handled immediately after disk initialization and from the looks
> of things /sbin/hotplug isn't always done running by the time
> the mdadm command is run so occasionally disks get missed (like
> in my multipath setup, sometimes the second path isn't available
> and the array gets started with only one path as a result). I
> either need to add procps and grep to the initrd and run a while
> ps ax | grep hotplut | grep -v grep or find some other way to
> make sure that the raid startup doesn't happen until after
> hotplug operations are complete (sleeping is another option, but
> a piss poor one in my opinion since I don't know a reasonable
> sleep time that's guaranteed to make sure all hotplug operations
> have completed).
I've thought about this issue a bit, but as I don't actually face it I
haven't progressed very far.
I think I would like it to be easy to assemble arrays incrementally.
Every time hotplug reports a new device, if it appears to be part of a
known array, it gets attached to that array.
As soon as an array has enough disks to be accessed, it gets started
in read-only mode. No resync or reconstruct happens, it just sits and
waits.
If more devices get found that are part of the array, they get added
too. If the array becomes "full", then maybe it switches to writable.
With the bitmap stuff, it may be reasonable to switch it to writable
earlier and as more drives are found, only the blocks that have
actually been updated get synced.
At some point, you would need to be able to say "all drives have been
found, don't bother waiting any more" at which point, recovery onto a
spare might start if appropriate.
There are a number of questions in here that I haven't thought deeply
enough about yet, partly due to lack of relevant experience.
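A crude approximation is possible from a hotplug script even now; the
sketch below simply re-runs a scan-assemble whenever a new component turns
up (it does not hot-add into an already running array, and DEVNAME is
assumed to come from the hotplug environment):
    #!/bin/sh
    DEV=/dev/$DEVNAME
    # ignore devices that don't carry an md superblock
    mdadm --examine "$DEV" >/dev/null 2>&1 || exit 0
    # retry assembly of everything listed in mdadm.conf; arrays that are
    # already active are left alone, arrays that just gained their last
    # component get started
    mdadm --assemble --scan --config=/etc/mdadm.conf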
> However, with the advent of stacked md
> devices, this problem has to be moved into the mdadm binary
> itself. The reason for this (and I had this happen to me
> yesterday), is that mdadm started the 4 multipath arrays, but
> one of them wasn't done with hotplug before it tried to start
> the stacked raid5 array, and as a result it started the raid5
> array with 3 out 4 of devices in degraded mode. So, it is a
> requirement of stacked md devices that the mdadm binary, if it's
> going to be used to start the devices, wait for all md device it
> just started before proceeding with the next pass of md device
> startup. If you are going to put that into the mdadm binary,
> then you might as well as use it for both the delay between md
> device startup and the initial delay to wait for all block
> devices to be started.
I'm not sure I really understand the issue here, probably due to a
lack of experience with hotplug.
What do you mean "one of them wasn't done with hotplug before it tried
to start the stacked raid5 array", and how does this "not being done"
interfere with the starting of the raid5?
> 4. Currently, the default number of partitions on a partitionable
> raid device is only 4. That's pretty small. I know of no way
> to encode the number of partitions the device was created with
> into the superblock, but really that's what we need so we can
> always start these devices with the correct number of minor
> numbers relative to how the array was created. I would suggest
> encoding this somewhere in the superblock and then setting this
> to 0 on normal arrays and non-0 for partitionable arrays and
> that becomes your flag that determines whether an array is
> partitionable.
How does hotplug/udev create the right number of partitions for other,
non-md, devices? Can the same mechanism be used for md?
I am loath to include the number of partitions in the superblock, much
as I am loath to include the filesystem type. However see later
comments on version-1 superblocks.
> 5. I would like to be able to support dynamic multipath in the
> initrd image. In order to do this properly, I need the ability
> to create multipath arrays with non-consistent superblocks. I
> would then need mdadm to scan these multipath devices when
> autodetecting raid arrays. I can see having a multipath config
> option that is one of three settings:
> A. Off - No multipath support except for persistent
> multipath arrays
> B. On - Setup any detected multipath devices as non-
> persistent multipath arrays, then use the multipath
> device in preference to the underlying devices for
> things like persuant mounts and raid starts
> C. All - Setup all devices as multipath devices in order to
> support dynamically adding new paths to previously
> single path devices at run time. This would be useful
> on machines using fiber channel for boot disks (amongst
> other useful scenarios, you can tie this
> into /sbin/hotplug so that new disks get their unique ID
> checked and automatically get added to existing
> multipath arrays should they be just another path, that
> way bringing up a new fiber channel switch and plugging
> it into a controller and adding paths to existing
> devices will all just work properly and seamlessly).
This sounds reasonable....
I've often thought that it would be nice to be able to transparently
convert a single device into a raid1. Then a drive could be added and
synced, the old drive removed, and then the drive morphed back into a
single drive - just a different one.
I remember many moons ago Linus saying that all this distinction
between ide drives and scsi drives and md drives etc. was silly.
There should be just one major number called "disk" and anything that
was a disk should appear there.
Given that sort of abstraction, we could teach md (and dm) to splice
with "disk"s etc. It would be nice.
It would also be hell in terms of device name stability, but that is a
problem that has to be solved anyway.
If I ever do look at unifying md and dm, that is the goal I would work
towards.
But for now your proposal seems fine. Is there any particular support
needed in md or mdadm?
> 6. There are some other multipath issues I'd like to address, but I
> can do that separately (things like it's unclear whether the
> preferred way of creating a multipath array is with 2 devices or
> 1 device and 1 spare, but mdadm requires that you pass the -f
> flag to create the 1 device 1 spare type, however should we ever
> support both active-active and active-passive multipath in the
> future, then this difference would seem to be the obvious way to
> differentiate between the two, so I would think either should be
> allowed by mdadm, but since readding a path that has come back
> to life to an existing multipath array always puts it in as a
> spare, if we want to support active-active then we need some way
> to trigger a spare->active transition on a multipath element as
> well).
For active-active, I think there should be no spares. All devices
should be active. You should be able to --grow a multipath if you find
you have more paths than were initially allowed for.
For active-passive, I think md/multipath would need some serious work.
I have a suspicion that with such arrangements, if you access the
passive path, the active path turns off (is that correct?). If so, the
approach of reading the superblock from each path would be a problem.
> > > >
> > > > [root@pe-fc4 devel]# /sbin/mdadm -E --scan
> > ..
> > > > ARRAY /dev/md0 level=raid5 num-devices=4
> > > > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
> >
> > Yes, I expect you would. -E just looks as the superblocks, and the
> > superblock doesn't record whether the array is meant to be partitioned
> > or not.
>
> Yeah, that's *gotta* be fixed. Users will string me up by my testicles
> otherwise. Whether it's a matter of encode it in the superblock or read
> the first sector and look for a partition table I don't care (the
> partition table bit has some interesting aspects to it, like if you run
> fdisk on a device then reboot it would change from non partitioned to
> partitioned, but might have a superminor conflict, not sure whether that
> would be a bug or a feature), but the days of remaking initrd images for
> every little change have long since passed and if I do something to
> bring them back they'll kill me.
I would suggest that initrd should only assemble the array containing
the root filesystem, and that whether it is partitioned or not probably
won't change very often...
I would also suggest (in line with the opening comment) that if people
want auto-assemble to "just work", then they should seriously consider
making all of their md arrays the partitionable type and deprecating
the old major-9 non-partitionable arrays. That would remove any
confusion.
>
> > With version-1 superblocks they don't even record the
> > sequence number of the array they are part of. In that case, "-Es"
> > will report
> > ARRAY /dev/?? level=.....
> >
> > Possibly I could utilise one of the high bits in the on-disc minor
> > number to record whether partitioning was used...
>
> As mentioned previously, recording the number of minors used in creating
> it would be preferable.
>
(and in a subsequent email)
> Does that mean with version 1 superblocks you *have* to have an ARRAY
> line in order to get some specific UUID array started at the right minor
> number? If so, people will complain about that loudly.
What do you mean by "the right minor number"? In these days of
non-stable device numbers I would have thought that an oxy-moron.
The Version-1 superblock has a 32 character 'set_name' field which can
freely be set by userspace (it cannot currently be changed while the
array is assembled, but that should get fixed).
It is intended effectively as an equivalent to md_minor in version0.90
mdadm doesn't currently support it (as version-1 support is still
under development) but the intention is something like that:
When an array is created, the set_name could be set to something like:
fred:bob:5
which might mean:
The array should be assembled by host 'fred' and should get a
device special file at
/dev/bob
with 5 partitions created.
What minor gets assigned is irrelevant. Mdadm could use this
information when reporting the "-Es" output, and when assembling with
--config=partitions
Possibly, mdadm could also be given a helper program which parses the
set_name and returns the core information that mdadm needs (which I
guess is a yes/no for "should it be assembled", a named for the
device, a preferred minor number if any, and a number of partitions).
This would allow a system integrator a lot of freedom to do clever
things with set_name but still use mdadm.
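As a sketch of what such a helper might look like (the
host:devname:partitions layout is just the example above, not a defined
format):
    #!/bin/sh
    # parse a set_name of the form host:devname:partitions, e.g. fred:bob:5
    set_name="$1"
    host=${set_name%%:*}
    rest=${set_name#*:}
    devname=${rest%%:*}
    parts=${rest#*:}
    # only assemble arrays that claim to belong to this host
    [ "$host" = "$(hostname)" ] || exit 1
    echo "assemble=yes name=/dev/$devname partitions=$parts"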
You could conceivably even store a (short) path name in there and have
the filesystem automatically mounted if you really wanted. This is
all user-space policy and you can do whatever your device management
system wants (though you would impose limits on how users of your
system could create arrays, e.g. require version-1 superblocks).
> >
> > What warnings are they? I would expect this configuration to work
> > smoothly.
>
> During stop, it tried to stop all the multipath devices before trying to
> stop the raid5 device, so I get 4 warnings about the md device still
> being in use, then it stops the raid5. You have to run mdadm a second
> time to then stop the multipath devices. Assembly went smoother this
> time, but it wasn't contending with device startup delays.
This should be fixed in mdadm 1.9.0. From the change log:
- Make "mdadm -Ss" stop stacked devices properly, by reversing the
order in which arrays are stopped.
Is it?
>
> Well, if you intend to make --brief work for a default mdadm.conf
> inclusion directly (which would be useful specifically for anaconda,
> which is the Red Hat/Fedora install code so that after setting up the
> drives we could basically just do the equivalent of
>
> echo "DEVICE partitions" > mdadm.conf
> mdadm -Es --brief >> mdadm.conf
>
> and get things right) then I would suggest that each ARRAY line should
> include the dev name (properly, right now /dev/md_d0 shows up
> as /dev/md0), level, number of disks, uuid, and auto={md|p(number)}.
> That would generate a very useful ARRAY line.
I'll have to carefully think through the consequences of overloading a
high bit in md_minor.
I'm not at all keen on including the number of partitions in a
version-0.90 superblock, but maybe I could be convinced.....
NeilBrown
* Re: Bug report: mdadm -E oddity
From: Neil Brown @ 2005-05-20 7:08 UTC
To: Doug Ledford; +Cc: linux-raid
On Monday May 16, dledford@redhat.com wrote:
> On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> >
> > So there are little things that could be done to smooth some of this
> > over, but the core problem seems to be that you want to use the output
> > of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> > work.
>
> Well, also keep in mind that the documentation on mdadm touts the fact
> that is meant to be run without a config file. I'm perfectly happy to
> do that, but -A mode combined with --scan mode only assembles and starts
> arrays in the config file. As I mentioned in a previous email, I guess
> I could do something like mdadm -Es | mdadm -c - -As -device=partitions
> but that just seems kinda silly given that all that is needed is for the
> assemble mode to start all devices found and we already know how to find
> them in the -E mode. Maybe either changing -As mode to assemble
> everything regardless of config file or adding another flag to assemble
> mode to do the same.
It isn't meant to say that mdadm "should" be run without a config
file, just that mdadm "can" be run without a config file, in stark
contrast to raidtools, which requires a config file even for stopping an
array. Maybe I should tone down that part of the man page.
I don't like having a "blindly start everything" function for mdadm
for exactly the same reason as I don't like the in-kernel
"autodetect". It is too easy for things to go wrong.
Suppose I have two computers with raid arrays. One dies so I shut
down the other, plug all drives into the working one, and bring it up
again.
A "blindly start everything" function could get confused and pick the
wrong array to be the root filesystem, or hit other snags.
That is why in my previous email with my example for version-1 usage I
included a host name in the "set_name". Hopefully the
start-everything script would have access to a host name (via dhcp or
command line or something) and would get that scenario right.
I really DON'T WANT people to
mdadm -Es | mdadm -c - -As -device=partitions
I want them to
mdadm -Es >> /etc/mdadm.conf
edit /etc/mdadm.conf
think about what should be there
just as they
edit /etc/fstab
think about what should be there.
Is that unreasonable?
NeilBrown
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-20 11:29 UTC
To: Neil Brown; +Cc: linux-raid
Answering this email first, then I'll go back to the other one.
On Fri, 2005-05-20 at 17:08 +1000, Neil Brown wrote:
> On Monday May 16, dledford@redhat.com wrote:
> > On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> > >
> > > So there are little things that could be done to smooth some of this
> > > over, but the core problem seems to be that you want to use the output
> > > of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> > > work.
> >
> > Well, also keep in mind that the documentation on mdadm touts the fact
> > that is meant to be run without a config file. I'm perfectly happy to
> > do that, but -A mode combined with --scan mode only assembles and starts
> > arrays in the config file. As I mentioned in a previous email, I guess
> > I could do something like mdadm -Es | mdadm -c - -As -device=partitions
> > but that just seems kinda silly given that all that is needed is for the
> > assemble mode to start all devices found and we already know how to find
> > them in the -E mode. Maybe either changing -As mode to assemble
> > everything regardless of config file or adding another flag to assemble
> > mode to do the same.
>
> It isn't meant to say that mdadm "Should" be run without a config
> file. Just that mdadm "can" be run without a config file, in stark
> contrast to raidtools that requires a config file even for stopping an
> array. Maybe I should tone down that part of the man page.
>
> I don't like having a "blindly start everything" function for mdadm
> for exactly the same reason as I don't like the in-kernel
> "autodetect". It is too easy for things to go wrong.
>
> Suppose I have two computers with raid arrays. One dies so I shut
> down the other, plug all drives into the working one, and bring it up
> again.
> A "blindly start everything" function could get confused and pick the
> wrong array to be the root filesystem, or hit other snags.
Yes, it could, especially if your filesystems have identical labels and
mount is now confused about which one to mount.
> That is why in my previous email with my example for version-1 usage I
> included a host name in the "set_name". Hopefully the
> start-everything script would have access to a host name (via dhcp or
> command line or something) and would get that scenario right.
>
> I really DON'T WANT people to
> mdadm -Es | mdadm -c - -As -device=partitions
>
> I want them to
> mdadm -Es >> /etc/mdadm.conf
> edit /etc/mdadm.conf
> think about what should be there
>
> just as they
> edit /etc/fstab
> think about what should be there.
>
> Is that unreasonable?
You brought up the fstab analogy in your last email too, and I can
somewhat see that analogy, but the other side of the argument is that
people don't have to edit a disk tab to get their disk devices; they
just show up. Fstab is at a higher level. Md devices are disk devices
as far as users are concerned, and to a certain extent they just want
them to be there.
I had actually put some thought into this over the last week. One thing
I had thought about is that most modern hardware has some way of getting
at a serial number (i386 and x86_64 have cpuid serial numbers, I think
Sparc has an openfirmware serial number, etc). I thought that it might
be useful to create an arch specific serial number routine that would be
accessible by the kernel at boot time, then when creating an array with
version 0.90 superblocks you have a 128bit UUID, so what about encoding
the upper 32 bits as the serial number and only auto starting arrays
that have the right serial number in the UUID? That would avoid the
problem that you spoke of earlier with moving disks around. You could
also make mdadm capable of doing an update=serial-number operation to
permanently move an array from one machine to another. In the event
that a machine dies and you move a / array to another machine and just
want it to work, the presence of the UUID= in an ARRAY line for the /
array would override the mismatched serial number so you can start the
array for the / partition in the initrd, and then the remaining ARRAY
lines would result in the other known arrays getting started as well.
If a user wants to make things match, which wouldn't strictly be
required, then they run mdadm --update on the arrays for the new serial
number. Anyway, just a thought.
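A rough sketch of the check I have in mind (the component device name is
just an example, and where the 32-bit serial comes from is platform
specific, so assume it is already in $MACHINE_SERIAL as 8 hex digits):
    uuid=$(mdadm --examine /dev/sdf1 2>/dev/null | awk '/UUID/ {print $3}')
    case "$uuid" in
        "$MACHINE_SERIAL":*) echo "local array, ok to auto-start" ;;
        *)                   echo "foreign array, needs an explicit ARRAY line" ;;
    esac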
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: Bug report: mdadm -E oddity
From: Doug Ledford @ 2005-05-20 12:30 UTC
To: Neil Brown; +Cc: linux-raid
On Fri, 2005-05-20 at 17:00 +1000, Neil Brown wrote:
> On Saturday May 14, dledford@redhat.com wrote:
> > On Sat, 2005-05-14 at 09:01 +1000, Neil Brown wrote:
> > > On Friday May 13, dledford@redhat.com wrote:
> > > So there are little things that could be done to smooth some of this
> > > over, but the core problem seems to be that you want to use the output
> > > of "--examine --scan" unchanged in mdadm.conf, and that simply cannot
> > > work.
> >
> > Actually, what I'm working on is Fedora Core 4 and the boot up sequence.
> > Specifically, I'm trying to get the use of mdadm -As acceptable as the
> > default means of starting arrays in the initrd. My choices are either
> > this or make the raidautorun facility handle things properly. Since
> > last I knew, you marked the raidautorun facility as deprecated, that
> > leaves mdadm (note this also has implications for the raid shutdown
> > sequence, I'll bring that up later).
>
> I'd like to start with an observation that may (or may not) be helpful.
> Turn-key and flexibility are to some extent opposing goals.
>
> By this I mean that making something highly flexible means providing
> lots of options. Making something "just works" tends to mean not using
> a lot of those options.
> Making an initrd for assembling md arrays that will do what everyone
> wants in all circumstances just isn't going to work.
>
> To illustrate: some years ago we had a RAID based external storage box
> made by DEC - with whopping big 9G drives in it!!
> When you plugged a drive in it would be automatically checked and, if
> it was new, labeled as a spare. It would then appear in the spare set
> and maybe get immediately incorporated into any degraded array.
>
> This was a great feature, but not something I would ever consider
> putting into mdadm, and definitely not into the kernel.
>
> If you want a turnkey system, there must be a place at which you say
> "This is a turn-key system, you manage it for me" and you give up some
> of the flexibility that you might otherwise have said. For many this
> is a good tradeoff.
>
> So you definitely could (or should be able to) make a single initrd
> script/program that gets things exactly right providing they were
> configured according to the model that is proscribed by the maker of
> that initrd.
>
> Version-1 superblocks have fields intended to be used in exactly this
> way.
>
>
> >
> > So, here's some of the feedback I've gotten about this and the
> > constraints I'm working under as a result:
> >
> > 1. People don't want to remake initrd images and update mdadm.conf
> > files with every raid change. So, the scan facility needs to
> > properly handled unknown arrays (it doesn't currently).
>
> If this is true, then presumably people also don't want to update
> /etc/fstab with every filesystem change.
> Is that a fair statement? Do you automatically mount unexpected
> filesystems? Is there an important difference between raid
> configuration and filesystem configuration that I am missing?
>
>
> > 2. People don't want to have a degraded array not get started (this
> > isn't a problem as far as I can tell).
>
> There is a converse to this. People should be made to take notice if
> there is possible data corruption.
>
> i.e. if you have a system crash while running a degraded raid5, then
> silent data corruption could ensue. mdadm will currently not start
> any array in this state without an explicit '--force'. This is somewhat
> akin to fsck sometime requiring human interaction. Ofcourse if there
> is good reason to believe the data is still safe, mdadm should -- and
> I believe does -- assemble the array even if degraded.
Well, as I explained in my email some time back on the issue of silent
data corruption, this is where journaling saves your ass. Since the
journal has to be written before the filesystem-proper updates are
written, if the array goes down it is either in the middle of the journal
write, in which case you are throwing those blocks away anyway and so
corruption is irrelevant, or it's in the middle of the filesystem-proper
writes, and if they get corrupted you don't care because we are going to
replay the journal and rewrite them.
>
> > 3. Udev throws some kinks in things because the raid startup is
> > handled immediately after disk initialization and from the looks
> > of things /sbin/hotplug isn't always done running by the time
> > the mdadm command is run so occasionally disks get missed (like
> > in my multipath setup, sometimes the second path isn't available
> > and the array gets started with only one path as a result). I
> > either need to add procps and grep to the initrd and run a while
> > ps ax | grep hotplut | grep -v grep or find some other way to
> > make sure that the raid startup doesn't happen until after
> > hotplug operations are complete (sleeping is another option, but
> > a piss poor one in my opinion since I don't know a reasonable
> > sleep time that's guaranteed to make sure all hotplug operations
> > have completed).
>
> I've thought about this issue a bit, but as I don't actually face it I
> haven't progressed very far.
>
> I think I would like it to be easy to assemble arrays incrementally.
> Every time hotplug reports a new device, it if appears to be part of a
> known array, it gets attached to that array.
>
> As soon as an array has enough disks to be accessed, it gets started
> in read-only mode. No resync or reconstruct happens, it just sits and
> waits.
> If more devices get found that are part of the array, they get added
> too. If the array becomes "full", then maybe it switches to writable.
>
> With the bitmap stuff, it may be reasonable to switch it to writable
> earlier and as more drives are found, only the blocks that have
> actually been updated get synced.
>
> At some point, you would need to be able to say "all drives have been
> found, don't bother waiting any more" at which point, recovery onto a
> spare might start if appropriate.
> There are a number of questions in here that I haven't thought deeply
> enough about yet, partly due to lack of relevant experience.
At least at bootup, it's easier and less error prone to just let the
devices get probed and not start anything until that's done. Post
bootup, that may be a different issue.
>
> > However, with the advent of stacked md
> > devices, this problem has to be moved into the mdadm binary
> > itself. The reason for this (and I had this happen to me
> > yesterday), is that mdadm started the 4 multipath arrays, but
> > one of them wasn't done with hotplug before it tried to start
> > the stacked raid5 array, and as a result it started the raid5
> > array with 3 out 4 of devices in degraded mode. So, it is a
> > requirement of stacked md devices that the mdadm binary, if it's
> > going to be used to start the devices, wait for all md device it
> > just started before proceeding with the next pass of md device
> > startup. If you are going to put that into the mdadm binary,
> > then you might as well as use it for both the delay between md
> > device startup and the initial delay to wait for all block
> > devices to be started.
>
> I'm not sure I really understand the issue here, probably due to a
> lack of experience with hotplug.
> What do you mean "one of them wasn't done with hotplug before it tried
> to start the stacked raid5 array", and how does this "not being done"
> interfere with the starting of the raid5?
OK, I didn't actually trace this, so I can't be positive, but knowing how the
block layer *used* to do things, I think this is what's going on. Let's
say you create a device node in the udev space of /dev/sda
and /dev/sda1. You do this *before* loading the SCSI driver. You load
the SCSI driver and it finds a disk. It then calls the various init
sequences for the block device. I can't remember what it's called right
now, but IIRC you call init_onedisk or something like that for sda.
Assuming sda is not currently in use, it starts out by basically tearing
down all the sda1 - sda15 block devices, then rereading the partition
table, then rebuilding all the sda1 - sda15 devices according to what
exists in the partition table. With a persistent /dev namespace, that
was fine. If you tried to open /dev/sda1 while this was happening it
just blocked the open until the sequence was over. With udev, tearing
down all those devices and reiniting them not only makes them
unopenable, but it removes the device entry from the /dev namespace.
Now enter a stacked md device. So you pass auto=md to mdadm, and you
tell it to start md0 - md3. Great. It creates the various md
*target* /dev entries that it needs, and it scans /proc/partitions for
the constituent devices, but if those devices are in the middle of their
init/validate sequence at the time that mdadm is run, they might or
might not be present. From the time that sda was found until the
partition table has been read from disk (which could mean multiple disk
reads if extended partitions are present) and the device minors have
been set up from it, there is a window that races against the mdadm
startup sequence.
What I found was that on host scsi2, which is the first path to the 4
multipath drives, all of those paths were reliably accessible at raid
startup. However, the devices on scsi3, the last scsi controller in the
system, were occasionally unavailable. Then you have the same sort of
delay between when you start md0 - md3 and when they are all available
in the /dev namespace to be utilized in constructing the stacked array.
So even though mdadm creates the target device if needed, what I seem to
be running into is inconsistent existence of the constituent devices.
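One possible "other way" of waiting, without dragging procps into the
initrd, is to watch /proc/partitions until it stops changing before the
first assembly pass. A sketch only; the timings and the final mdadm
invocation are arbitrary:

#!/bin/sh
# Wait for the partition list to hold still for a couple of seconds, then
# start the first pass of array assembly.
prev=""
stable=0
while [ "$stable" -lt 2 ]; do
    cur=$(cat /proc/partitions)
    if [ "$cur" = "$prev" ]; then
        stable=$(( stable + 1 ))
    else
        stable=0
        prev="$cur"
    fi
    sleep 1
done
/sbin/mdadm -As --config=/etc/mdadm.conf

That still can't prove hotplug is finished, it just shrinks the window,
which is why doing the waiting inside the mdadm binary between passes is
the more attractive fix.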
>
>
> > 4. Currently, the default number of partitions on a partitionable
> > raid device is only 4. That's pretty small. I know of no way
> > to encode the number of partitions the device was created with
> > into the superblock, but really that's what we need so we can
> > always start these devices with the correct number of minor
> > numbers relative to how the array was created. I would suggest
> > encoding this somewhere in the superblock and then setting this
> > to 0 on normal arrays and non-0 for partitionable arrays and
> > that becomes your flag that determines whether an array is
> > partitionable.
>
> How does hotplug/udev create the right number of partitions for other,
> non-md, devices?
Reads the partition table.
> Can the same mechanism be used for md?
> I am loathe to include the number of partitions in the superblock much
> as I am loath to include the filesystem type. However see later
> comments on version-1 superblocks.
Possibly, but that still requires knowing whether or not the kernel
*should* read the partition table and create the minors.
>
> > 5. I would like to be able to support dynamic multipath in the
> > initrd image. In order to do this properly, I need the ability
> > to create multipath arrays with non-consistent superblocks. I
> > would then need mdadm to scan these multipath devices when
> > autodetecting raid arrays. I can see having a multipath config
> > option that is one of three settings:
> > A. Off - No multipath support except for persistent
> > multipath arrays
> > B. On - Setup any detected multipath devices as non-
> > persistent multipath arrays, then use the multipath
> > device in preference to the underlying devices for
> > things like subsequent mounts and raid starts
> > C. All - Setup all devices as multipath devices in order to
> > support dynamically adding new paths to previously
> > single path devices at run time. This would be useful
> > on machines using fiber channel for boot disks (amongst
> > other useful scenarios, you can tie this
> > into /sbin/hotplug so that new disks get their unique ID
> > checked and automatically get added to existing
> > multipath arrays should they be just another path, that
> > way bringing up a new fiber channel switch and plugging
> > it into a controller and adding paths to existing
> > devices will all just work properly and seamlessly).
>
> This sounds reasonable....
> I've often thought that it would be nice to be able to transparently
> convert a single device into a raid1. Then a drive could be added and
> synced. the old drive removed, and then the drive morphed back into a
> single drive - just a different one.
I spent several hours one day talking with Al Viro about this basic
thing. The conclusion of all that was that currently, there is no way
to "morph" a drive from one type to another. So, the only way I could
see to support dynamic multipath was to basically make the kernel treat
*all* devices as multipath devices then let hotplug/udev do the work of
adding any new devices either to A) an existing multipath array or B) a
new multipath array. The partitionable md devices are ideal for this
setup since they allow the device to act exactly the same as a multipath
device or a regular block device.
What I started to set up in the initrd was basically this: if multipath
was set to all devices, it would run scsi_id on all devices, build a
list of unique IDs, create one multipath array per unique ID with every
matching device added as a member, then move all the /dev/sd* entries
to /dev/multipath_paths/sd*, create links from /dev/sd? to /dev/md_d?,
and do the same for the detected partitions.
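In outline it looked something like the sketch below ("mdadm --build" is
what gives you a multipath array with no on-disk superblock; the device
names, array numbering and exact scsi_id invocation are assumptions, and
the /dev/multipath_paths renaming step is left out):

#!/bin/sh
# Group SCSI disks by scsi_id and build one non-persistent multipath array
# per unique id.
tmp=/tmp/mpath-ids.$$
: > "$tmp"
for dev in /dev/sd?; do
    [ -b "$dev" ] || continue
    id=$(/sbin/scsi_id -g -s "/block/${dev##*/}" 2>/dev/null) || continue
    [ -n "$id" ] && echo "$id $dev" >> "$tmp"
done

n=0
for id in $(cut -d' ' -f1 "$tmp" | sort -u); do
    paths=$(grep "^$id " "$tmp" | cut -d' ' -f2)
    count=$(echo "$paths" | wc -l)
    mdadm --build /dev/md$n --level=multipath --raid-devices="$count" $paths
    n=$(( n + 1 ))
done
rm -f "$tmp"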
Then you could also plug into the hotplug scripts a check of scsi_id
versus the scsi_id for already existing multipath devices, and if a
match was found, add the new drive into the existing array and create
the /dev/multipath_path entries and the /dev links for that device.
That way an admin won't accidentally try to access the raw path device.
> I remember many moons ago Linus saying that all this distinction
> between ide drives and scsi drives and md drive etc was silly.
> There should be just one major number called "disk" and anything that
> was a disk should appear there.
I tend to agree with this.
> Given that sort of abstraction, we could teach md (and dm) to splice
> with "disk"s etc. It would be nice.
> It would also be hell in terms of device name stability, but that is a
> problem that has to be solved anyway.
Hence why filesystem labels are nice.
> If I ever do look at unifying md and dm, that is the goal I would work
> towards.
>
> But for now your proposal seems fine. Is there any particular support
> needed in md or mdadm?
No, just clarification on the item below, aka how to properly create the
arrays.
> > 6. There are some other multipath issues I'd like to address, but I
> > can do that separately (things like it's unclear whether the
> > preferred way of creating a multipath array is with 2 devices or
> > 1 device and 1 spare, but mdadm requires that you pass the -f
> > flag to create the 1 device 1 spare type, however should we ever
> > support both active-active and active-passive multipath in the
> > future, then this difference would seem to be the obvious way to
> > differentiate between the two, so I would think either should be
> > allowed by mdadm, but since readding a path that has come back
> > to life to an existing multipath array always puts it in as a
> > spare, if we want to support active-active then we need some way
> > to trigger a spare->active transition on a multipath element as
> > well).
>
> For active-active, I think there should be no spares. All devices
> should be active. You should be able to --grow a multipath if you find
> you have more paths than were initially allowed for.
Not true. You have to keep in mind capacity planning and stuff like
that on fiber channel. Given two dual-ported controllers, two switches,
and two ports on an external raid chassis, you could actually have 4 or
8 paths to a device, but you really only want one path on each
controller active, and maybe you want them to use different switches for
load reasons. So you really need to be able to specify a map in that
case (a map for the user space tools to consult) and only have the
"right" paths active until a failover occurs. And then you very well
might want to specify the manner in which failover happens (in which
case the event mechanism would be very helpful here).
As for --grow'ing multipath, yes, that should be allowed.
> For active-passive, I think md/multipath would need some serious work.
> I have a suspicion that with such arrangements, if you access the
> passive path, the active path turns off (is that correct).
Yes.
> If so, the
> approach of reading the superblock from each path would be a problem.
Correct. But this doesn't really rely on active/passive. For dynamic
multipath you wouldn't be relying on superblocks anyway. The decision
about whether or not a device is multipath can be made without that, and
once the decision is made, there is absolutely no reason to read the
same block multiple times or write it multiple times. So, really,
ideally the multipath code should never do more than one read (or write)
per superblock access, since every path sees the same block, except when
trying to autodetect superblocks on two different devices that it
doesn't already know are multipath.
>
> > > > >
> > > > > [root@pe-fc4 devel]# /sbin/mdadm -E --scan
> > > ..
> > > > > ARRAY /dev/md0 level=raid5 num-devices=4
> > > > > UUID=910b1fc9:d545bfd6:e4227893:75d72fd8
> > >
> > > Yes, I expect you would. -E just looks as the superblocks, and the
> > > superblock doesn't record whether the array is meant to be partitioned
> > > or not.
> >
> > Yeah, that's *gotta* be fixed. Users will string me up by my testicles
> > otherwise. Whether it's a matter of encode it in the superblock or read
> > the first sector and look for a partition table I don't care (the
> > partition table bit has some interesting aspects to it, like if you run
> > fdisk on a device then reboot it would change from non partitioned to
> > partitioned, but might have a superminor conflict, not sure whether that
> > would be a bug or a feature), but the days of remaking initrd images for
> > every little change have long since passed and if I do something to
> > bring them back they'll kill me.
>
> I would suggest that initrd should only assemble the array containing
> the root filesystem, and that whether it is partitioned or not probably
> won't change very often...
>
> I would also suggest (in line with the opening comment) that if people
> want auto-assemble to "just work", then they should seriously consider
> making all for their md arrays the partitionable type and deprecating
> old major-9 non-partitionable arrays. That would remove any
> confusion.
So, Jeremy Katz and I were discussing this at length in IRC the other
day. There are some reasons why we can't just do as you suggest.
1) You can't reliably use whole disk devices as constituent raid devices
(like you can't make a raid1 partitionable array out of /dev/sda
and /dev/sdb, although this would be nice to do). This is in large part
due to two things: a) dual boot scenarios where other OSes won't know
about or acknowledge the superblock at the end of the device, but *will*
see a partition table and might do something like create a partition
that goes all the way to the end of the device, and b) ia64 and i386 both
have GPT partition tables that store partition information at the end of
the device, so we have overlap.
2) You can't reliably do a /boot partition out of a partitioned md
device because of #1. In order for, say grub, to find the filesystem
and "do the right thing" at boot time, it wants to read a partition
table and find the filesystem. If you could use whole disk devices as
partitioned md devices, that would be ideal because then the partition
table for say a root/boot raid1 array would *look* just like the
partition table on a single disk and grub wouldn't know the difference.
However, if you have to create a single, large linux raid autodetect
partition to hold the md device, and the md device is then *further*
partitioned, you now have to traverse two distinct partition tables to
get to the actual /boot partition. Oops, grub just went belly up (as
did lilo, etc.). So, in order to make booting possible, even if you want
to use a partitionable raid array for all non-boot partitions, you still
need an old style md device for the /boot partition.
> >
> > > With version-1 superblocks they don't even record the
> > > sequence number of the array they are part of. In that case, "-Es"
> > > will report
> > > ARRAY /dev/?? level=.....
> > >
> > > Possibly I could utilise one of the high bits in the on-disc minor
> > > number to record whether partitioning was used...
> >
> > As mentioned previously, recording the number of minors used in creating
> > it would be preferable.
> >
> (and in a subsequent email)
> > Does that mean with version 1 superblocks you *have* to have an ARRAY
> > line in order to get some specific UUID array started at the right minor
> > number? If so, people will complain about that loudly.
>
> What do you mean by "the right minor number"? In these days of
> non-stable device numbers I would have thought that an oxy-moron.
Well, yes and no. I'm not really referring to the minor number as in
the minor used to create the device nodes; rather, assuming the devices
are named /dev/md_d0, /dev/md_d1, etc., I was referring to what used to
be determined by the super-minor, which is the device's place in that
numerical list of md_d devices.
> The Version-1 superblock has a 32 character 'set_name' field which can
> freely be set by userspace (it cannot currently be changed while the
> array is assembled, but that should get fixed).
>
> It is intended effectively as an equivalent to md_minor in version0.90
>
> mdadm doesn't currently support it (as version-1 support is still
> under development) but the intention is something like that:
>
> When an array is created, the set_name could be set to something like:
>
> fred:bob:5
>
> which might mean:
> The array should be assembled by host 'fred' and should get a
> device special file at
> /dev/bob
> with 5 partitions created.
>
> What minor gets assigned is irrelevant. Mdadm could use this
> information when reporting the "-Es" output, and when assembling with
> --config=partitions
OK, good enough, I can see that working.
> Possibly, mdadm could also be given a helper program which parses the
> set_name and returns the core information that mdadm needs (which I
> guess is a yes/no for "should it be assembled", a named for the
> device, a preferred minor number if any, and a number of partitions).
> This would allow a system integrator a lot of freedom to do clever
> things with set_name but still use mdadm.
Yes, which is exactly the sort of thing we have to contend with. As you
said earlier, flexibility and features are not always what the customer
wants ;-) Sometimes they want it to "just work". Our job is to try and
get as many of those features accessible as possible while maintaining
that "just work" operation. If they want things other than "just work",
then they are always free to go to the command line, create things
themselves, add ARRAY lines into the /etc/mdadm.conf file, etc. (unlike
Windows where, if it doesn't "just work", you can't do it yourself,
which is why I always hated Windows and switched to linux 12 or so years
ago; when you both A) don't do what I want and B) keep me from doing it
myself, you piss me off).
> You could conceivably even store a (short) path name in there and have
> the filesystem automatically mounted if you really wanted. This is
> all user-space policy and you can do whatever your device management
> system wanted (though you would impose limits on how users of your
> system could create arrays. e.g. require version-1 super blocks).
Doesn't work that way (the imposing limits part). If we try that, then
they just download your mdadm instead of our prebuilt one and make what
they want. Trust me, they do that a lot ;-) The design goal in a
situation like that may be that we don't automatically do anything with
the stuff they manually create, but we can't allow it to break the stuff
we do automatically.
>
> > >
> > > What warnings are they? I would expect this configuration to work
> > > smoothly.
> >
> > During stop, it tried to stop all the multipath devices before trying to
> > stop the raid5 device, so I get 4 warnings about the md device still
> > being in use, then it stops the raid5. You have to run mdadm a second
> > time to then stop the multipath devices. Assembly went smoother this
> > time, but it wasn't contending with device startup delays.
>
>
> This should be fixed in mdadm 1.9.0. From the change log:
> - Make "mdadm -Ss" stop stacked devices properly, by reversing the
> order in which arrays are stopped.
>
> Is it ?
Nope. It seems to be an ordering issue. The mdadm binary assumes that
md0 is first, md1 second, etc. In my case, md0 and md_d0 got sorted
into the same place, so it tried to start md_d0 first, but that needed
md0-3 and it failed. So, no, it's still broken. It either needs to
A) not warn on missing devices and just loop through the list until no
further progress can be made (the simple way), or B) do real dependency
resolution and only start a device after its constituent devices have
been started, regardless of the ordering of the devices in the ARRAY
lines or of the ordering of device names.
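Option A is simple enough that it can even be faked around the existing
binary; a sketch (the progress test against /proc/mdstat is my own hack,
not anything mdadm provides):

#!/bin/sh
# Re-run the assembly pass until a pass starts nothing new, so md0-md3 come
# up on an early pass and md_d0 gets assembled on a later one.
prev=-1
while :; do
    mdadm -As --config=/etc/mdadm.conf >/dev/null 2>&1
    cur=$(grep -c '^md' /proc/mdstat)
    [ "$cur" -eq "$prev" ] && break      # no progress this pass, stop
    prev=$cur
done

Doing it inside mdadm would obviously be cleaner, since mdadm already
knows which arrays it failed to start and why.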
> >
> > Well, if you intend to make --brief work for a default mdadm.conf
> > inclusion directly (which would be useful specifically for anaconda,
> > which is the Red Hat/Fedora install code so that after setting up the
> > drives we could basically just do the equivalent of
> >
> > echo "DEVICE partitions" > mdadm.conf
> > mdadm -Es --brief >> mdadm.conf
> >
> > and get things right) then I would suggest that each ARRAY line should
> > include the dev name (properly, right now /dev/md_d0 shows up
> > as /dev/md0), level, number of disks, uuid, and auto={md|p(number)}.
> > That would generate a very useful ARRAY line.
>
> I'll have to carefully think through the consequences of overloading a
> high bit in md_minor.
> I'm not at all keen on including the number of partitions in a
> version-0.90 superblock, but maybe I could be convinced.....
This is what I was thinking about with the whole UUID thing in my last
email. If you encode a machine serial number into the uuid, you could
also encode the partition information. Maybe a combination of the
serial number in the high 32 bits, a magic in the next 24 bits, then
either the number of partitions or just a partition shift encoded in the
remaining 8 bits. That still leaves 64 bits of random UUID, and the
other 64 bits tie the UUID to a specific machine. Something like that
anyway.
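As a toy of that layout only (the real packing would happen inside mdadm
when it writes the superblock; every value below is made up):

# serial in the high 32 bits, 24-bit magic, partition count in the low 8 bits
serial=$(( 0x4d783a9f ))        # per-machine serial number
magic=$(( 0x6d6470 ))           # "mdp" in ASCII, a hypothetical marker
nparts=8                        # create /dev/md_dNp1 .. p8 at assembly time
word0=$serial
word1=$(( (magic << 8) | nparts ))
word2=$(( 0x1f2e3d4c )); word3=$(( 0x55aa55aa ))    # the still-random half
printf 'uuid=%08x:%08x:%08x:%08x\n' "$word0" "$word1" "$word2" "$word3"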
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 12:30 ` Doug Ledford
@ 2005-05-20 16:04 ` Paul Clements
2005-05-20 17:16 ` Peter T. Breuer
2005-05-20 17:45 ` Doug Ledford
0 siblings, 2 replies; 23+ messages in thread
From: Paul Clements @ 2005-05-20 16:04 UTC (permalink / raw)
To: Doug Ledford; +Cc: linux-raid
Hi Doug,
Doug Ledford wrote:
> On Fri, 2005-05-20 at 17:00 +1000, Neil Brown wrote:
>>There is a converse to this. People should be made to take notice if
>>there is possible data corruption.
>>
>>i.e. if you have a system crash while running a degraded raid5, then
>>silent data corruption could ensue. mdadm will currently not start
>>any array in this state without an explicit '--force'. This is somewhat
>>akin to fsck sometime requiring human interaction. Ofcourse if there
>>is good reason to believe the data is still safe, mdadm should -- and
>>I believe does -- assemble the array even if degraded.
>
>
> Well, as I explained in my email sometime back on the issue of silent
> data corruption, this is where journaling saves your ass. Since the
> journal has to be written before the filesystem proper updates are
> writting, if the array goes down it either is in the journal write, in
> which case you are throwing those blocks away anyway and so corruption
> is irrelevant, or it's in the filesystem proper writes and if they get
> corrupted you don't care because we are going to replay the journal and
> rewrite them.
I think you may be misunderstanding the nature of the data corruption
that ensues when a system with a degraded raid4, raid5, or raid6 array
crashes. Data that you aren't even actively writing can get corrupted.
For example, say we have a 3 disk raid5 and disk 3 is missing. This
means that for some stripes, we'll be writing parity and data:
disk1    disk2    {disk3}
 D1       P        {D2}
So, say we're in the middle of updating this stripe, and we're writing
D1 and P to disk when the system crashes. We may have just corrupted D2,
which isn't even active right now. This is because we'll use D1 and P to
reconstruct D2 when disk3 (or its replacement) comes back. If we wrote
D1 and not P, then when we use D1 and P to reconstruct D2, we'll get the
wrong data. Same goes if we wrote P and not D1, or some partial piece of
either or both.
There's no way for a filesystem journal to protect us from D2 getting
corrupted, as far as I know.
Note that if we lose the parity disk in a raid4, this type of data
corruption isn't possible. Also note that for some stripes in a raid5 or
raid6, this type of corruption can't happen (as long as the parity for
that stripe is on the missing disk). Also, if you have a non-volatile
cache on the array, as most hardware RAIDs do, then this type of data
corruption doesn't occur.
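To put one byte per disk on that scenario (all values made up, shell
arithmetic standing in for the real chunks):

d1_old=$(( 0x5a ))
d2=$(( 0x3c ))                      # exists only virtually; disk3 is missing
p_old=$(( d1_old ^ d2 ))            # the parity actually sitting on disk2

d1_new=$(( 0xa5 ))                  # the write that was in flight

# The crash lands after D1 hits the platter but before the matching parity
# update does:
d1_disk=$d1_new
p_disk=$p_old

d2_rebuilt=$(( d1_disk ^ p_disk ))  # what a later reconstruction produces
printf 'real D2 = %02x, rebuilt D2 = %02x\n' "$d2" "$d2_rebuilt"

The two values differ, even though nothing ever wrote to D2.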
--
Paul
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 16:04 ` Paul Clements
@ 2005-05-20 17:16 ` Peter T. Breuer
2005-05-20 18:40 ` Doug Ledford
2005-05-20 17:45 ` Doug Ledford
1 sibling, 1 reply; 23+ messages in thread
From: Peter T. Breuer @ 2005-05-20 17:16 UTC (permalink / raw)
To: linux-raid
Paul Clements <paul.clements@steeleye.com> wrote:
> disk1 disk2 {disk3}
> D1 P {D2}
> So, say we're in the middle of updating this stripe, and we're writing
> D1 and P to disk when the system crashes. We may have just corrupted D2,
> which isn't even active right now. This is because we'll use D1 and P to
> reconstruct D2 when disk3 (or its replacement) comes back. If we wrote
> D1 and not P, then when we use D1 and P to reconstruct D2, we'll get the
> wrong data. Same goes if we wrote P and not D1, or some partial piece of
> either or both.
> There's no way for a filesystem journal to protect us from D2 getting
> corrupted, as far as I know.
Surely the raid won't have acked the write, so the journal won't
consider the write done and will replay it next chance it gets. Mind
you ... owwww! If we restart the array AGAIN without D3, and the
journal is now replayed (to redo the write), then since we have already
written D1, the parity in P is all wrong relative to it, and hence we
will have virtual data in D3 which is all wrong, and hence when we come
to write the parity info P we will get it wrong. No? (I haven't done
the calculation and so there might be some idempotency here that the
casual reasoning above fails to take account of).
On the other hand, if the journal itself is what we are talking about,
being located on the raid device, all bets are off (I've said that
before, and remain to be convinced that it is not so, but it may be so
- I simply see a danger that I have not been made to feel good about ..).
Peter
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 16:04 ` Paul Clements
2005-05-20 17:16 ` Peter T. Breuer
@ 2005-05-20 17:45 ` Doug Ledford
2005-05-20 18:33 ` Peter T. Breuer
1 sibling, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2005-05-20 17:45 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid
On Fri, 2005-05-20 at 12:04 -0400, Paul Clements wrote:
> Hi Doug,
>
> Doug Ledford wrote:
> > On Fri, 2005-05-20 at 17:00 +1000, Neil Brown wrote:
>
> >>There is a converse to this. People should be made to take notice if
> >>there is possible data corruption.
> >>
> >>i.e. if you have a system crash while running a degraded raid5, then
> >>silent data corruption could ensue. mdadm will currently not start
> >>any array in this state without an explicit '--force'. This is somewhat
> >>akin to fsck sometime requiring human interaction. Ofcourse if there
> >>is good reason to believe the data is still safe, mdadm should -- and
> >>I believe does -- assemble the array even if degraded.
> >
> >
> > Well, as I explained in my email sometime back on the issue of silent
> > data corruption, this is where journaling saves your ass. Since the
> > journal has to be written before the filesystem proper updates are
> > writting, if the array goes down it either is in the journal write, in
> > which case you are throwing those blocks away anyway and so corruption
> > is irrelevant, or it's in the filesystem proper writes and if they get
> > corrupted you don't care because we are going to replay the journal and
> > rewrite them.
>
> I think you may be misunderstanding the nature of the data corruption
> that ensues when a system with a degraded raid4, raid5, or raid6 array
> crashes.
No, I understand it just fine.
> Data that you aren't even actively writing can get corrupted.
> For example, say we have a 3 disk raid5 and disk 3 is missing. This
> means that for some stripes, we'll be writing parity and data:
>
> disk1 disk2 {disk3}
>
> D1 P {D2}
>
> So, say we're in the middle of updating this stripe, and we're writing
> D1 and P to disk when the system crashes. We may have just corrupted D2,
> which isn't even active right now. This is because we'll use D1 and P to
> reconstruct D2 when disk3 (or its replacement) comes back.
Correct.
> If we wrote
> D1 and not P, then when we use D1 and P to reconstruct D2, we'll get the
> wrong data.
Absolutely correct.
> Same goes if we wrote P and not D1, or some partial piece of
> either or both.
Yep. Now, reread my original email. *WE DON'T CARE*. If this stripe
is in the filesystem proper, then whatever write we did to D1 and P will
get replayed when the journal is replayed. If this stripe was part of
the journal, then those writes were uncommitted journal entries and are
going to get thrown away (aka, they are transient, temporary data and
before it's ever used again it will be rewritten). Your only
requirement is that if the array goes down degraded, then you need to
replay the journal in that degraded state, prior to adding back in
disk3. That's it. And since the journal will be replayed even before
you get to the point of a single user login (unless the filesystem isn't
checked in fstab), and nothing automatically readds disks into a
degraded array, it's all a moot point.
> There's no way for a filesystem journal to protect us from D2 getting
> corrupted, as far as I know.
Sure it does. Since the replay happens in the same state as when the
machine crashed, namely degraded, the replay repairs the corruption
between D1 and P. It doesn't touch D2. Now when you readd disk3 into
the array, the *proper* data for D2 gets reconstructed out of D1 and P,
which are now in sync. This is why my recommendation, if you have a
big, fast software RAID4/5 array, is to use data=journal and give it a
goodly journal size (I'd use a 64MB or larger journal) and be all safe
and cozy in your combination of disk redundancy and double writes.
> Note that if we lose the parity disk in a raid4, this type of data
> corruption isn't possible. Also note that for some stripes in a raid5 or
> raid6, this type of corruption can't happen (as long as the parity for
> that stripe is on the missing disk). Also, if you have a non-volatile
> cache on the array, as most hardware RAIDs do, then this type of data
> corruption doesn't occur.
And it's not possible with normal raid4/5 if you use a journaling
filesystem and the raid layer does the only sane thing, which is to make
parity writes synchronous with regular data block writes in degraded
mode, as opposed to letting the parity be write-behind.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 17:45 ` Doug Ledford
@ 2005-05-20 18:33 ` Peter T. Breuer
2005-05-20 20:01 ` berk walker
2005-05-20 20:05 ` Paul Clements
0 siblings, 2 replies; 23+ messages in thread
From: Peter T. Breuer @ 2005-05-20 18:33 UTC (permalink / raw)
To: linux-raid
Doug Ledford <dledford@redhat.com> wrote:
> > Same goes if we wrote P and not D1, or some partial piece of
> > either or both.
> Yep. Now, reread my original email. *WE DON'T CARE*. If this stripe
> is in the filesystem proper, then whatever write we did to D1 and P will
I think Paul missed that too, but consider
a) it is the journal (placed on the same raid partition) that we have the
bad luck to be talking about; OR
b) rewriting is not necessarily idempotent, when half of it consists
of using a parity to construct what you should write.
I explained further in a reply to Paul. reassure me!
> get replayed when the journal is replayed. If this stripe was part of
> the journal, then those writes were uncommitted journal entries and are
> going to get thrown away (aka, they are transient, temporary data and
> before it's ever used again it will be rewritten).
You are saying the write to a journal on RAID will always be discarded
if incomplete. Fine. That's great. I like that (I think that should
always happen, and one should never roll forward any incomplete write,
whether to the journal or not).
> Your only
> requirement is that if the array goes down degraded, then you need to
> replay the journal in that degraded state, prior to adding back in
> disk3.
Careful ... I don't believe writes are necessarily idempotent in this
situation.
> That's it. And since the journal will be replayed even before
> you get to the point of a single user login (unless the filesystem isn't
> checked in fstab), and nothing automatically readds disks into a
> degraded array, it's all a moot point.
Well, take one moot admin, and see what he can do! But sure, fine.
> > There's no way for a filesystem journal to protect us from D2 getting
> > corrupted, as far as I know.
> Sure it does. Since the replay happens in the same state as when the
> machine crashed, namely degraded, the replay repairs the corruption
Careful with your assumptions. Prove to me that write is idempotent.
> between D1 and P. It doesn't touch D2. Now when you readd disk3 into
> the array, the *proper* data for D2 gets reconstructed out of D1 and P,
> which are now in sync. This is why my recommendation, if you have a
> big, fast software RAID4/5 array is to use journal=data and give a
> goodly journal size (I'd use a 64MB or larger journal) and be all safe
> and cozy in your combination of disk redundancy and double writes to
> keep you safe.
Peter
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 17:16 ` Peter T. Breuer
@ 2005-05-20 18:40 ` Doug Ledford
2005-05-20 19:15 ` Peter T. Breuer
0 siblings, 1 reply; 23+ messages in thread
From: Doug Ledford @ 2005-05-20 18:40 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid
On Fri, 2005-05-20 at 19:16 +0200, Peter T. Breuer wrote:
> Paul Clements <paul.clements@steeleye.com> wrote:
> > disk1 disk2 {disk3}
>
> > D1 P {D2}
>
> > So, say we're in the middle of updating this stripe, and we're writing
> > D1 and P to disk when the system crashes. We may have just corrupted D2,
> > which isn't even active right now. This is because we'll use D1 and P to
> > reconstruct D2 when disk3 (or its replacement) comes back. If we wrote
> > D1 and not P, then when we use D1 and P to reconstruct D2, we'll get the
> > wrong data. Same goes if we wrote P and not D1, or some partial piece of
> > either or both.
>
> > There's no way for a filesystem journal to protect us from D2 getting
> > corrupted, as far as I know.
>
> Surely the raid won't have acked the write, so the journal won't
> consider the write done and will replay it next chance it gets. Mind
> you ... owwww! If we restart the array AGAIN without D3, and the
> journal is now replayed(to redo the write), then since we have already
> written D1, the parity in P is all wrong relative to it, and hence we
> will have virtual data in D3 which is all wrong, and hence when we come
> to write the parity info P we will get it wrong. No? (I haven't done
> the calculation and so there might be some idempotency here that the
> casual reasoning above fails to take account of).
No. There's no need to do any parity calculations if you are writing
both D1 and P (because you have D1 and D2 as the write itself, and
therefore you are getting P from them, not from off of disk, so a full
stripe write should generate the right data *always*).
If you are attempting to do a partial stripe write, and let's say you
are writing D2 in this case (true whenever the element you are trying to
write is the missing element), then you can read all available elements,
D1 and P, generate D2, xor D2 out of P, xor in new D2 into P, write P.
But, really, that's a lot of wasted time. You're better off to just read
all available D? elements, ignore the existing parity, and generate a
new parity from all the existing D elements plus the missing D
element that you have a write for, and write that out to the P element.
Where you start to get into trouble is only with a partial stripe write
that doesn't write D2. Then you have to read D1, read P, xor D1 out of
P, xor new D1 into P, write both. Only in this case is a replay
problematic, and that's because you need the new D1 and new P writes to
be atomic. If you replay with both of those complete, then you end up
with pristine data. If you replay with only D1 complete, then you end
up xor'ing the same bit of data in and out of the P block, leaving it
unchanged and corrupting D2. If you replay with only P complete then
you get the same thing since the net result is P xor D xor D' xor D xor
D' = P. As far as I know, to solve this issue you have to do a minimal
journal in the raid device itself. For example, some raid controllers
reserve a 200MB region at the beginning of each disk for this sort of
thing. When in degraded mode, full stripe writes can be sent straight
through since they will always generate new, correct parity. Any
partial stripe writes that rewrite the missing data block are safe since
they can be regenerated from a combination of A) the data to be written
and B) the data blocks that aren't touched without relying on the parity
block and an xor calculation. Partial stripe writes that actually
require the parity generation sequence to work, aka those that don't
write to the missing element and therefore the missing data *must* be
preserved, can basically be buffered just like a journal itself does by
doing something like writing the new data into a ring buffer of writes,
waiting for completion, then starting the final writes, then when those
are done, revoking the ones in the buffer. If you crash during this
time, then you replay those writes (prior to going read/write) from the
ring buffer, which gives you the updated data on disk. If the journal
then replays the writes as well, you don't care because your parity will
be preserved.
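A one-byte-per-disk toy of the replay cases above (all values made up)
shows why only that last case needs the extra machinery:

# Consistent degraded stripe before the crash (disk3 absent, so P = D1 ^ D2):
d1_old=$(( 0x5a )); d2_old=$(( 0x3c ))
p_disk=$(( d1_old ^ d2_old ))
d1_new=$(( 0xa5 ))
d1_disk=$d1_new                        # crash: new D1 landed, matching P did not

# Replay of a write that covers D1 and D2: parity is regenerated purely from
# the data being written, so the stripe ends up consistent.
p_full=$(( d1_new ^ d2_old ))          # d2_old is the journal's copy of D2
printf 'full-stripe replay:    rebuilt D2 = %02x (want %02x)\n' \
    $(( d1_new ^ p_full )) "$d2_old"

# Replay of a write that only covers D1: the read-modify-write xors the
# on-disk D1 (already new) out of P and the new D1 back in, so P never
# changes and the virtual D2 is gone.
p_rmw=$(( p_disk ^ d1_disk ^ d1_new ))
printf 'partial-stripe replay: rebuilt D2 = %02x (want %02x)\n' \
    $(( d1_new ^ p_rmw )) "$d2_old"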
> On the other hand, if the journal itself is what we are talking about,
> being located on the raid device, all bets are off (I've said that
> before, and remain to be convinced that it is not so, but it may be so
> - I simply see a danger that I have not been made to feel good about ..).
Given this specific scenario, it *could* corrupt your journal, but only
in the case where you have some complete and some incomplete journal
transactions in the same stripe. But, then again, the journal is a ring
buffer, and you have the option of telling (at least ext3) how big your
stripe size is so that the file system layout can be optimized to that,
so it could just as easily be solved by making the ext3 journal write in
stripe-sized chunks whenever possible (for all I know, it already does,
I haven't checked). Or you could do what I mentioned above.
All of this sounds pretty heavy, with double copying of writes in two
places, but it's what you have to do when in degraded mode. In normal
mode, you just let the journal do its job and never buffer anything
because the write replays will always be correct.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 18:40 ` Doug Ledford
@ 2005-05-20 19:15 ` Peter T. Breuer
2005-05-20 21:31 ` Doug Ledford
0 siblings, 1 reply; 23+ messages in thread
From: Peter T. Breuer @ 2005-05-20 19:15 UTC (permalink / raw)
To: linux-raid
Doug Ledford <dledford@redhat.com> wrote:
> > Surely the raid won't have acked the write, so the journal won't
> > consider the write done and will replay it next chance it gets. Mind
> > you ... owwww! If we restart the array AGAIN without D3, and the
> > journal is now replayed(to redo the write), then since we have already
> > written D1, the parity in P is all wrong relative to it, and hence we
> > will have virtual data in D3 which is all wrong, and hence when we come
> > to write the parity info P we will get it wrong. No? (I haven't done
> > the calculation and so there might be some idempotency here that the
> > casual reasoning above fails to take account of).
> No. There's no need to do any parity calculations if you are writing
> both D1 and P (because you have D1 and D2 as the write itself, and
OK - you're right as far as this goes. P is the old difference between
D1 and D2. When you write anew you want P as the new difference between
D1 and D2.
However, sometimes one calculates the new P by calculating the parity
difference between (cached) old and new data, and updating P with that
info. I don't know when or if the linux raid5 algorithm does that.
> therefore you are getting P from them, not from off of disk, so a full
> stripe write should generate the right data *always*).
> If you are attempting to do a partial stripe write, and let's say you
> are writing D2 in this case (true whenever the element you are trying to
> write is the missing element), then you can read all available elements,
> D1 and P, generate D2, xor D2 out of P, xor in new D2 into P, write P.
> But, really, that's a lot of wasted time.
Depends on relative latencies. If you have the data cached in memory
it's not so silly. And I believe/guess some of your suggested op
sequence above is not needed, in the sense that it can be done in
fewer ops.
> Your better off to just read
> all available D? elements, ignore the existing parity, and generate a
> new parity off of the all the existing D elements and the missing D
> element that you have a write for and write that out to the P element.
> Where you start to get into trouble is only with a partial stripe write
> that doesn't write D2. Then you have to read D1, read P, xor D1 out of
> P, xor new D1 into P, write both. Only in this case is a replay
> problematic, and that's because you need the new D1 and new P writes to
> be atomic.
I.e. do both of D1 and P, or neither. But we are discussing precisely
the case when the crash happened after writing D1 but not having
written P (with D2 not present). I suppose we could also have thought
about P having been updated, but not D1 (it's a race).
> If you replay with both of those complete, then you end up
> with pristine data. If you replay with only D1 complete, then you end
> up xor'ing the same bit of data in and out of the P block, leaving it
> unchanged and corrupting D2.
Hmm. I thought you had discussed it above already, and concluded that we
rewrite P (correctly) from the new D1 and D2.
> If you replay with only P complete then
> you get the same thing since the net result is P xor D xor D' xor D xor
> D' = P.
Well, cross me with a salamander, but I thought that was what I was
discussing - I am all confuscicated...
> As far as I know, to solve this issue you have to do a minimal
> journal in the raid device itself.
You are aiming for atomicity? Then, yes, you need the journalling
trick.
> For example, some raid controllers
> reserve a 200MB region at the beginning of each disk for this sort of
> thing. When in degraded mode, full stripe writes can be sent straight
> through since they will always generate new, correct parity. Any
OK.
> partial stripe writes that rewrite the missing data block are safe since
> they can be regenerated from a combination of A) the data to be written
> and B) the data blocks that aren't touched without relying on the parity
> block and an xor calculation. Partial stripe writes that actually
> require the parity generation sequence to work, aka those that don't
> write to the missing element and therefore the missing data *must* be
> preserved, can basically be buffered just like a journal itself does by
> doing something like writing the new data into a ring buffer of writes,
> waiting for completion, then starting the final writes, then when those
> are done, revoking the ones in the buffer. If you crash during this
I understood journalling to be a generic technique, insensitive to
fs structure. In that case, I don't see why you need to discuss the
mechanism.
> time, then you replay those writes (prior to going read/write) from the
> ring buffer, which gives you the updated data on disk. If the journal
> then replays the writes as well, you don't care because your parity will
> be preserved.
>
> > On the other hand, if the journal itself is what we are talking about,
> > being located on the raid device, all bets are off (I've said that
> > before, and remain to be convinced that it is not so, but it may be so
> > - I simply see a danger that I have not been made to feel good about ..).
> Given this specific scenario, it *could* corrupt your journal, but only
> in the case were you have some complete and some incomplete journal
> transactions in the same stripe. But, then again, the journal is a ring
> buffer, and you have the option of telling (at least ext3) how big your
> stripe size is so that the file system layout can be optimized to that,
> so it could just as easily be solved by making the ext3 journal write in
> stripe sized chunks whenever possible (for all I know, it already does,
> I haven't checked). Or you could do what I mentioned above.
I think you are saying that setting stripe size and fs block size to 4K
always does the trick.
> All of this sounds pretty heavy, with double copying of writes in two
> places, but it's what you have to do when in degraded mode. In normal
> mode, you just let the journal do its job and never buffer anything
> because the write replays will always be correct.
Peter
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 18:33 ` Peter T. Breuer
@ 2005-05-20 20:01 ` berk walker
2005-05-20 21:00 ` Gil
2005-05-20 21:51 ` Peter T. Breuer
2005-05-20 20:05 ` Paul Clements
1 sibling, 2 replies; 23+ messages in thread
From: berk walker @ 2005-05-20 20:01 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid
Peter -
for us old folks, please expand "idempotent" in usage to reflect the
relationships to which you refer.
Thx
b-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 18:33 ` Peter T. Breuer
2005-05-20 20:01 ` berk walker
@ 2005-05-20 20:05 ` Paul Clements
1 sibling, 0 replies; 23+ messages in thread
From: Paul Clements @ 2005-05-20 20:05 UTC (permalink / raw)
To: linux-raid
Peter T. Breuer wrote:
> b) rewriting is not necessarily idempotent, when half of it consists
> of using a parity to construct what you should write.
That's right. The rewrite is not idempotent. Having lost part, or all,
of either D1 or P at the time of the crash, you no longer have any
accurate way of reconstructing D2.
And that's assuming you can even do a rewrite. By default, only metadata
is journalled (in ext3), so for plain old data writes, you're just plain
out of luck...
--
Paul
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 20:01 ` berk walker
@ 2005-05-20 21:00 ` Gil
2005-05-20 21:51 ` Peter T. Breuer
1 sibling, 0 replies; 23+ messages in thread
From: Gil @ 2005-05-20 21:00 UTC (permalink / raw)
To: berk walker; +Cc: Peter T. Breuer, linux-raid
berk walker wrote:
> Peter -
> for us old folks, please expand "idempotent" in usage to reflect the
> relationships to which you refer.
In common usage, an operation is idempotent when it is provably safe to
repeat. Another way to say it is that repeating an idempotent operation
has no further effect beyond the first application.
In the context of this conversation, Peter is asking that Doug prove
that the write can be repeated on restart without corrupting the state
of the RAID array.
--Gil
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 19:15 ` Peter T. Breuer
@ 2005-05-20 21:31 ` Doug Ledford
0 siblings, 0 replies; 23+ messages in thread
From: Doug Ledford @ 2005-05-20 21:31 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid
On Fri, 2005-05-20 at 21:15 +0200, Peter T. Breuer wrote:
> Doug Ledford <dledford@redhat.com> wrote:
> > > Surely the raid won't have acked the write, so the journal won't
> > > consider the write done and will replay it next chance it gets. Mind
> > > you ... owwww! If we restart the array AGAIN without D3, and the
> > > journal is now replayed(to redo the write), then since we have already
> > > written D1, the parity in P is all wrong relative to it, and hence we
> > > will have virtual data in D3 which is all wrong, and hence when we come
> > > to write the parity info P we will get it wrong. No? (I haven't done
> > > the calculation and so there might be some idempotency here that the
> > > casual reasoning above fails to take account of).
>
> > No. There's no need to do any parity calculations if you are writing
> > both D1 and P (because you have D1 and D2 as the write itself, and
>
> OK - you're right as far as this goes. P is the old difference between
> D1 and D2. When you write anew you want P as the new difference between
> D1 and D2.
>
> However, sometimes one calculates the new P by calculating the parity
> difference between (cached) old and new data, and updating P with that
> info. I don't know when or if the linux raid5 algorithm does that.
Still wouldn't matter. Since you are writing D2 from the initial write
command, it will still be correct and parity will still be correct.
Generally speaking, any full stripe write, whether done from cache, from
read/xor/write, or from any other mechanism, will always be right.
> > therefore you are getting P from them, not from off of disk, so a full
> > stripe write should generate the right data *always*).
>
> > If you are attempting to do a partial stripe write, and let's say you
> > are writing D2 in this case (true whenever the element you are trying to
> > write is the missing element), then you can read all available elements,
> > D1 and P, generate D2, xor D2 out of P, xor in new D2 into P, write P.
> > But, really, that's a lot of wasted time.
>
> Depends on relative latencies. If you have the data cached in memory
> it's not so silly. And I believe/guess some of your suggested op
> sequence above is not needed, in the sense that it can be done in
> fewer ops.
Correct, when writing a new D2 you can just read D1, generate P from D1
and data to be written, and write P. If you have a cached D2 and P then
you can do it faster by just doing the double xor sequence and writing
the new P.
> > Your better off to just read
> > all available D? elements, ignore the existing parity, and generate a
> > new parity off of the all the existing D elements and the missing D
> > element that you have a write for and write that out to the P element.
>
> > Where you start to get into trouble is only with a partial stripe write
> > that doesn't write D2. Then you have to read D1, read P, xor D1 out of
> > P, xor new D1 into P, write both. Only in this case is a replay
> > problematic, and that's because you need the new D1 and new P writes to
> > be atomic.
>
> I.e. do both of D1 and P, or neither. But we are discussing precisely
> the case when the crash happened after writing D1 but not having
> written P (with D2 not present). I suppose we could also have thought
> about P having been updated, but not D1 (it's a race).
No, the difference between the safe case and the problematic case is
whether the actual write command will rewrite both D1 and D2 (and
remember that the file system writes never write to P, that's a hidden
detail the file system doesn't see). Let's say that the chunk size on
the array is 64k, and you have a 3-disk array; that gives you a 128k
stripe size. If the write coming from the journal to the file system
proper is a full 128k, then you never have to worry about it because the
replay will always get it right (because the write itself is replacing
the missing D2 data with new D2 data so we don't have to generate
anything). But, if you have a 64k write aligned at the beginning of the
stripe, then D2 must be preserved. And even though the write is only
64k in size, we are going to have to write 128k to update the parity so
that future attempts to generate D2 from D1 and P will get the right
result. That's the problematic case.
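A few lines of shell make the classification concrete (offsets and
lengths are array-relative and made up; this deliberately ignores the
partial write that happens to cover the missing chunk, which is also
safe as discussed earlier):

chunk=$(( 64 * 1024 ))
data_disks=2                           # 3 disks, one of them parity
stripe=$(( chunk * data_disks ))       # 128k of data per stripe

classify() {   # classify OFFSET LENGTH
    off=$1; len=$2
    if [ $(( off % stripe )) -eq 0 ] && [ $(( len % stripe )) -eq 0 ]; then
        echo "$off/$len: full stripe write, parity regenerated, replay safe"
    else
        echo "$off/$len: partial stripe write, D1 and P must be updated atomically"
    fi
}
classify 0 $(( 128 * 1024 ))           # the safe case
classify 0 $(( 64 * 1024 ))            # 64k at the start of the stripe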
>
> > If you replay with both of those complete, then you end up
> > with pristine data. If you replay with only D1 complete, then you end
> > up xor'ing the same bit of data in and out of the P block, leaving it
> > unchanged and corrupting D2.
>
> Hmm. I thought you had discussed it above already, and concluded that we
> rewrite P (correctly) from the new D1 and D2.
Only if the file system level write was to both D1 and D2.
> > If you replay with only P complete then
> > you get the same thing since the net result is P xor D xor D' xor D xor
> > D' = P.
>
> Well, cross me with a salamander, but I thought that was what I was
> discussing - I am all confuscicated...
>
> > As far as I know, to solve this issue you have to do a minimal
> > journal in the raid device itself.
>
> You are aiming for atomicity? Then, yes, you need the journalling
> trick.
>
> > For example, some raid controllers
> > reserve a 200MB region at the beginning of each disk for this sort of
> > thing. When in degraded mode, full stripe writes can be sent straight
> > through since they will always generate new, correct parity. Any
>
> OK.
>
> > partial stripe writes that rewrite the missing data block are safe since
> > they can be regenerated from a combination of A) the data to be written
> > and B) the data blocks that aren't touched without relying on the parity
> > block and an xor calculation. Partial stripe writes that actually
> > require the parity generation sequence to work, aka those that don't
> > write to the missing element and therefore the missing data *must* be
> > preserved, can basically be buffered just like a journal itself does by
> > doing something like writing the new data into a ring buffer of writes,
> > waiting for completion, then starting the final writes, then when those
> > are done, revoking the ones in the buffer. If you crash during this
>
> I understood journalling to be a generic technique, insensitive to
> fs structure. In that case, I don't see why you need discuss the
> mechanism.
Mainly because you don't need all the same features for this kind of
simple journal that you do for an FS journal. It might even be possible
to use some advanced SCSI commands to really reduce the performance
bottleneck of a simplified block write journal built into the array
(things like the SCSI copy command for instance, which would allow you
to put a number of updated blocks into the ring buffer, then with a
single copy command move as many as 256 chunks from the buffer area to
the final destinations without using any bus transfer resources and
happening all internally in the drive). This is when I point out that
sometimes being a generic OS makes things like this *much* more
difficult. Guys working at places like EMC or NetApp get to play tricks
like this in their filers while only needing to deal with a specific
file system or raid subsystem. In the general OS you have to build a
generic, easily usable framework, which takes much more time and effort.
> > time, then you replay those writes (prior to going read/write) from the
> > ring buffer, which gives you the updated data on disk. If the journal
> > then replays the writes as well, you don't care because your parity will
> > be preserved.
> >
> > > On the other hand, if the journal itself is what we are talking about,
> > > being located on the raid device, all bets are off (I've said that
> > > before, and remain to be convinced that it is not so, but it may be so
> > > - I simply see a danger that I have not been made to feel good about ..).
>
> > Given this specific scenario, it *could* corrupt your journal, but only
> > in the case were you have some complete and some incomplete journal
> > transactions in the same stripe. But, then again, the journal is a ring
> > buffer, and you have the option of telling (at least ext3) how big your
> > stripe size is so that the file system layout can be optimized to that,
> > so it could just as easily be solved by making the ext3 journal write in
> > stripe sized chunks whenever possible (for all I know, it already does,
> > I haven't checked). Or you could do what I mentioned above.
>
> I think you are saying that setting stripe size and fs block size to 4K
> always does the trick.
Well, I'm sure that would, but that would be ugly as hell. No, I was
referring to the fact that the stride option to mke2fs (-E stride=N on
newer e2fsprogs, -R stride=N on older ones) lets you tell it the raid
chunk size in filesystem blocks, so that mke2fs can do things such as
distribute inode groups and block bitmaps across different disks in the
array. It really sucks when your array and ext3 filesystem metadata line
up such that the metadata is always on the first drive of the stripe and
your metadata updates become a serious bottleneck. I've seen raid
arrays where the first drive in the array was dealing with twice as much
read/write activity as any other drive in the array. Extending that a
little bit to A) align the journal itself to the start of a stripe and
B) commit journal writes in stripe-sized chunks where possible would help
to eliminate the need for any fancy tricks on the part of the md layer
in regards to the journal and partial stripe writes in degraded mode.
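For example, with 64k chunks and 4k filesystem blocks the stride works
out to 16, so something along these lines (the device name is only an
example, and the option spelling depends on the e2fsprogs version) lines
the filesystem up with the array and also gives the large journal
mentioned earlier:

# stride = chunk size / block size = 65536 / 4096 = 16
mke2fs -j -J size=64 -b 4096 -E stride=16 /dev/md_d0p3

(Older e2fsprogs spell the last option -R stride=16 instead.)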
> > All of this sounds pretty heavy, with double copying of writes in two
> > places, but it's what you have to do when in degraded mode. In normal
> > mode, you just let the journal do its job and never buffer anything
> > because the write replays will always be correct.
One other possibility for solving the issue is to make use of the new
bitmap stuff. A bitmap means one thing in regular mode ("you need new
parity"); make it mean something else in degraded mode. Specifically, if
an array is kicked from clean to degraded mode, flush the currently
pending writes as normal (aka, update the parity, whatever), then clear
the bitmap, then switch to degraded-reliable mode.

In degraded-reliable mode, any write to a stripe results in the parity
block for that stripe being replaced with the data from the missing data
block (or ignored if it's the parity block that's missing) and the
bitmap bit for that stripe being set (and this is why you want a
not-too-sparse bitmap). All other stripes covered by that same bitmap
bit have to read their data blocks and parity blocks, calculate their
missing data blocks, then write those missing data blocks out into the
parity spots. Once a bitmap segment has been converted, it basically
behaves like a raid0 array until you add a spare disk and reconstruction
is started.

During reconstruction, any stripe without its bitmap bit set
reconstructs the data from the other data + parity; any stripe with its
bit set copies the parity block to the reconstruction device's data
block and then generates new parity from the entire stripe and puts that
in the parity block. This kind of setup would make the time frame
immediately after the device went into degraded mode pretty damn slow,
but once the disks got the active areas converted to this modified raid0
setup, speed would be just as fast as non-degraded mode (faster
actually), and you would once again be able to rely upon replays from
the journal doing the right thing regardless of whether the replay is a
full stripe replay or not.
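As a toy of the conversion step only (one byte per chunk, a single
stripe, everything made up; a real implementation would convert whole
bitmap segments at a time and handle the missing-parity-block case as
well):

# Stripe state on entering degraded-reliable mode: disk3 missing, P = D1 ^ D2.
d1=$(( 0x5a )); p=$(( 0x66 )); bit=0

# First write to the stripe: park the reconstructed D2 in the parity slot and
# set the bitmap bit; from here on this stripe is effectively raid0.
if [ "$bit" -eq 0 ]; then
    d2=$(( d1 ^ p ))
    p=$d2
    bit=1
fi
d1=$(( 0xa5 ))                 # the write itself: no parity math, nothing to race

# Reconstruction onto a replacement disk3: copy the parked data over, then
# turn the parity slot back into real parity.
d3=$p
p=$(( d1 ^ d3 ))
printf 'D2 survived as %02x, new parity %02x\n' "$d3" "$p"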
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 20:01 ` berk walker
2005-05-20 21:00 ` Gil
@ 2005-05-20 21:51 ` Peter T. Breuer
2005-05-20 22:14 ` berk walker
1 sibling, 1 reply; 23+ messages in thread
From: Peter T. Breuer @ 2005-05-20 21:51 UTC (permalink / raw)
To: linux-raid
berk walker <berk@panix.com> wrote:
> for us old folks, please expand "idempotent" in usage to reflect the
> relationships to which you refer.
Function f is idempotent when f.f = f. I.e. Doing it twice is the same as
doing it once.
Here the question is if you do a fraction f of a write w, and then do the
whole write w again, whether you get what you are expecting. I.e. if
w.f = w
and f is the restriction of w to some set s, f = w|s.
Peter
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Bug report: mdadm -E oddity
2005-05-20 21:51 ` Peter T. Breuer
@ 2005-05-20 22:14 ` berk walker
0 siblings, 0 replies; 23+ messages in thread
From: berk walker @ 2005-05-20 22:14 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid
Peter T. Breuer wrote:
>berk walker <berk@panix.com> wrote:
>
>
>>for us old folks, please expand "idempotent" in usage to reflect the
>>relationships to which you refer.
>>
>>
>
>Function f is idempotent when f.f = f. I.e. Doing it twice is the same as
>doing it once.
>
>Here the question is if you do a fraction f of a write w, and then do the
>whole write w again, whether you get what you are expecting. I.e. if
>
> w.f = w
>
>and f is the restriction of w to some set s, f = w|s.
>
>Peter
>
OK, I understand that def. thanks, PTB.
b-
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread
Thread overview: 23+ messages
2005-05-13 15:44 Bug report: mdadm -E oddity Doug Ledford
2005-05-13 17:11 ` Doug Ledford
2005-05-13 23:01 ` Neil Brown
2005-05-14 13:28 ` Doug Ledford
2005-05-15 17:32 ` Luca Berra
2005-05-20 7:00 ` Neil Brown
2005-05-20 12:30 ` Doug Ledford
2005-05-20 16:04 ` Paul Clements
2005-05-20 17:16 ` Peter T. Breuer
2005-05-20 18:40 ` Doug Ledford
2005-05-20 19:15 ` Peter T. Breuer
2005-05-20 21:31 ` Doug Ledford
2005-05-20 17:45 ` Doug Ledford
2005-05-20 18:33 ` Peter T. Breuer
2005-05-20 20:01 ` berk walker
2005-05-20 21:00 ` Gil
2005-05-20 21:51 ` Peter T. Breuer
2005-05-20 22:14 ` berk walker
2005-05-20 20:05 ` Paul Clements
2005-05-16 16:46 ` Doug Ledford
2005-05-20 7:08 ` Neil Brown
2005-05-20 11:29 ` Doug Ledford
2005-05-16 22:11 ` Doug Ledford