linux-raid.vger.kernel.org archive mirror
* Time to  deprecate old RAID formats?
@ 2007-10-19 14:34 John Stoffel
  2007-10-19 15:09 ` Justin Piszcz
  0 siblings, 1 reply; 88+ messages in thread
From: John Stoffel @ 2007-10-19 14:34 UTC (permalink / raw)
  To: linux-raid


So, 

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to allow use of the older
formats.
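
For illustration, the superblock format can already be chosen explicitly
at creation time; a minimal sketch, with hypothetical device names:

    # create a three-disk RAID5 using the version-1.2 superblock
    mdadm --create /dev/md0 --metadata=1.2 --level=5 --raid-devices=3 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1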

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
john@stoffel.org

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 14:34 Time to deprecate old RAID formats? John Stoffel
@ 2007-10-19 15:09 ` Justin Piszcz
  2007-10-19 15:46   ` John Stoffel
  0 siblings, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 15:09 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid



On Fri, 19 Oct 2007, John Stoffel wrote:

>
> So,
>
> Is it time to start thinking about deprecating the old 0.9, 1.0 and
> 1.1 formats to just standardize on the 1.2 format?  What are the
> issues surrounding this?
>
> It's certainly easy enough to change mdadm to default to the 1.2
> format and to require a --force switch to  allow use of the older
> formats.
>
> I keep seeing that we support these old formats, and it's never been
> clear to me why we have four different ones available?  Why can't we
> start defining the canonical format for Linux RAID metadata?
>
> Thanks,
> John
> john@stoffel.org
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

I hope 00.90.03 is not deprecated; LILO cannot boot off of anything else!


Justin.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 15:09 ` Justin Piszcz
@ 2007-10-19 15:46   ` John Stoffel
  2007-10-19 16:15     ` Doug Ledford
  2007-10-19 16:34     ` Justin Piszcz
  0 siblings, 2 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-19 15:46 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: John Stoffel, linux-raid

>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:

Justin> On Fri, 19 Oct 2007, John Stoffel wrote:

>> 
>> So,
>> 
>> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>> 1.1 formats to just standardize on the 1.2 format?  What are the
>> issues surrounding this?
>> 
>> It's certainly easy enough to change mdadm to default to the 1.2
>> format and to require a --force switch to  allow use of the older
>> formats.
>> 
>> I keep seeing that we support these old formats, and it's never been
>> clear to me why we have four different ones available?  Why can't we
>> start defining the canonical format for Linux RAID metadata?
>> 
>> Thanks,
>> John
>> john@stoffel.org
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 

Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin> anything else!

Are you sure?  I find that GRUB is much easier to use and set up than
LILO these days.  But hey, just dropping down to support the 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 15:46   ` John Stoffel
@ 2007-10-19 16:15     ` Doug Ledford
  2007-10-19 16:35       ` Justin Piszcz
                         ` (2 more replies)
  2007-10-19 16:34     ` Justin Piszcz
  1 sibling, 3 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-19 16:15 UTC (permalink / raw)
  To: John Stoffel; +Cc: Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2852 bytes --]

On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
> >>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
> 
> Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
> 
> >> 
> >> So,
> >> 
> >> Is it time to start thinking about deprecating the old 0.9, 1.0 and
> >> 1.1 formats to just standardize on the 1.2 format?  What are the
> >> issues surrounding this?

1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
worst-case situations, I've seen lvm2 find a superblock on one RAID1
array member when the RAID1 array was down; the system came up, you
used the system, the two copies of the raid array were made drastically
inconsistent, then at the next reboot the situation that prevented the
RAID1 from starting was resolved, the array never knew it had failed to
start last time, and the two inconsistent members were put back into a
clean array.)  So, deprecating any of these is not really helpful.  And
you need to keep the old 0.90 format around for backward compatibility
with thousands of existing raid arrays.
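
To make the placement concrete, a minimal sketch (hypothetical device
names) of creating a mirror with the superblock at the very start of
each member:

    # version-1.1 puts the md superblock at offset 0 of each member
    mdadm --create /dev/md1 --metadata=1.1 --level=1 --raid-devices=2 \
          /dev/sdb1 /dev/sdc1
    mdadm --examine /dev/sdb1    # should report a version 1.1 superblock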

> >> It's certainly easy enough to change mdadm to default to the 1.2
> >> format and to require a --force switch to  allow use of the older
> >> formats.
> >> 
> >> I keep seeing that we support these old formats, and it's never been
> >> clear to me why we have four different ones available?  Why can't we
> >> start defining the canonical format for Linux RAID metadata?
> >> 
> >> Thanks,
> >> John
> >> john@stoffel.org
> >> -
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> 
> Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
> Justin> anything else!
> 
> Are you sure?  I find that GRUB is much easier to use and set up than
> LILO these days.  But hey, just dropping down to support the 00.90.03 and
> 1.2 formats would be fine too.  Let's just lessen the confusion if at
> all possible.
> 
> John
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 15:46   ` John Stoffel
  2007-10-19 16:15     ` Doug Ledford
@ 2007-10-19 16:34     ` Justin Piszcz
  2007-10-23 23:19       ` Bill Davidsen
  1 sibling, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 16:34 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid



On Fri, 19 Oct 2007, John Stoffel wrote:

>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>
> Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
>
>>>
>>> So,
>>>
>>> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>>> 1.1 formats to just standardize on the 1.2 format?  What are the
>>> issues surrounding this?
>>>
>>> It's certainly easy enough to change mdadm to default to the 1.2
>>> format and to require a --force switch to  allow use of the older
>>> formats.
>>>
>>> I keep seeing that we support these old formats, and it's never been
>>> clear to me why we have four different ones available?  Why can't we
>>> start defining the canonical format for Linux RAID metadata?
>>>
>>> Thanks,
>>> John
>>> john@stoffel.org
>>> -
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
> Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
> Justin> anything else!
>
> Are you sure?  I find that GRUB is much easier to use and set up than
> LILO these days.  But hey, just dropping down to support the 00.90.03 and
> 1.2 formats would be fine too.  Let's just lessen the confusion if at
> all possible.
>
> John
>

I am sure, I submitted a bug report to the LILO developer, he acknowledged 
the bug but I don't know if it was fixed.

I have not tried GRUB with a RAID1 setup yet.

Justin.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:15     ` Doug Ledford
@ 2007-10-19 16:35       ` Justin Piszcz
  2007-10-19 16:38       ` John Stoffel
  2007-10-20 14:09       ` Michael Tokarev
  2 siblings, 0 replies; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 16:35 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, linux-raid



On Fri, 19 Oct 2007, Doug Ledford wrote:

> On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
>>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>
>> Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
>>
>>>>
>>>> So,
>>>>
>>>> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>>>> 1.1 formats to just standardize on the 1.2 format?  What are the
>>>> issues surrounding this?
>
> 1.0, 1.1, and 1.2 are the same format, just in different positions on
> the disk.  Of the three, the 1.1 format is the safest to use since it
> won't allow you to accidentally have some sort of metadata between the
> beginning of the disk and the raid superblock (such as an lvm2
> superblock), and hence whenever the raid array isn't up, you won't be
> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
> worst-case situations, I've seen lvm2 find a superblock on one RAID1
> array member when the RAID1 array was down; the system came up, you
> used the system, the two copies of the raid array were made drastically
> inconsistent, then at the next reboot the situation that prevented the
> RAID1 from starting was resolved, the array never knew it had failed to
> start last time, and the two inconsistent members were put back into a
> clean array.)  So, deprecating any of these is not really helpful.  And
> you need to keep the old 0.90 format around for backward compatibility
> with thousands of existing raid arrays.

Agreed.  What is the benefit of deprecating them?  Is there really that
much old code involved?

>
>>>> It's certainly easy enough to change mdadm to default to the 1.2
>>>> format and to require a --force switch to  allow use of the older
>>>> formats.
>>>>
>>>> I keep seeing that we support these old formats, and it's never been
>>>> clear to me why we have four different ones available?  Why can't we
>>>> start defining the canonical format for Linux RAID metadata?
>>>>
>>>> Thanks,
>>>> John
>>>> john@stoffel.org
>>>> -
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
>> Justin> anything else!
>>
>> Are you sure?  I find that GRUB is much easier to use and set up than
>> LILO these days.  But hey, just dropping down to support the 00.90.03 and
>> 1.2 formats would be fine too.  Let's just lessen the confusion if at
>> all possible.
>>
>> John
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Doug Ledford <dledford@redhat.com>
>              GPG KeyID: CFBFF194
>              http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>              http://people.redhat.com/dledford/Infiniband
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:15     ` Doug Ledford
  2007-10-19 16:35       ` Justin Piszcz
@ 2007-10-19 16:38       ` John Stoffel
  2007-10-19 16:40         ` Justin Piszcz
  2007-10-19 17:11         ` Time to deprecate old RAID formats? Doug Ledford
  2007-10-20 14:09       ` Michael Tokarev
  2 siblings, 2 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-19 16:38 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Justin Piszcz, linux-raid

>>>>> "Doug" == Doug Ledford <dledford@redhat.com> writes:

Doug> On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
>> >>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>> 
Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
>> 
>> >> 
>> >> So,
>> >> 
>> >> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>> >> 1.1 formats to just standardize on the 1.2 format?  What are the
>> >> issues surrounding this?

Doug> 1.0, 1.1, and 1.2 are the same format, just in different positions on
Doug> the disk.  Of the three, the 1.1 format is the safest to use since it
Doug> won't allow you to accidentally have some sort of metadata between the
Doug> beginning of the disk and the raid superblock (such as an lvm2
Doug> superblock), and hence whenever the raid array isn't up, you won't be
Doug> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
Doug> worst-case situations, I've seen lvm2 find a superblock on one RAID1
Doug> array member when the RAID1 array was down; the system came up, you
Doug> used the system, the two copies of the raid array were made drastically
Doug> inconsistent, then at the next reboot the situation that prevented the
Doug> RAID1 from starting was resolved, the array never knew it had failed to
Doug> start last time, and the two inconsistent members were put back into a
Doug> clean array.)  So, deprecating any of these is not really helpful.  And
Doug> you need to keep the old 0.90 format around for backward compatibility
Doug> with thousands of existing raid arrays.

This is a great case for making the 1.1 format the default.  So
what are the advantages of the 1.0 and 1.2 formats then?  Or should we
be thinking about keeping two copies of the superblock on each RAID
member, one at the beginning and one at the end, for resiliency?

I just hate seeing this in the man page:

    Declare the style of superblock (raid metadata) to be used.
    The default is 0.90 for --create, and to guess for other operations.
    The default can be overridden by setting the metadata value for the
    CREATE keyword in mdadm.conf.

    Options are:

    0, 0.90, default

      Use the original 0.90 format superblock.  This format limits arrays to
      28 component devices and limits component devices of levels 1 and
      greater to 2 terabytes.

    1, 1.0, 1.1, 1.2

      Use the new version-1 format superblock.  This has few restrictions.
      The different sub-versions store the superblock at different locations
      on the device, either at the end (for 1.0), at the start (for 1.1) or
      4K from the start (for 1.2).
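
As that excerpt notes, the default can also be overridden in mdadm.conf
rather than on every command line; a minimal sketch:

    # mdadm.conf: make new arrays default to the version-1.1 superblock
    CREATE metadata=1.1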


It looks to me like the 1.1 format, combined with the 1.0, should be what
we use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

So at this point I'm not arguing to get rid of the 0.9 format, though
I think it should NOT be the default any more; we should be using the
1.1 format combined with 1.0.

John

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:38       ` John Stoffel
@ 2007-10-19 16:40         ` Justin Piszcz
  2007-10-19 16:44           ` John Stoffel
  2007-10-19 17:11         ` Time to deprecate old RAID formats? Doug Ledford
  1 sibling, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 16:40 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, linux-raid



On Fri, 19 Oct 2007, John Stoffel wrote:

>>>>>> "Doug" == Doug Ledford <dledford@redhat.com> writes:
>
> Doug> On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
>>>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>>
> Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
>>>
>>>>>
>>>>> So,
>>>>>
>>>>> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>>>>> 1.1 formats to just standardize on the 1.2 format?  What are the
>>>>> issues surrounding this?
>
> Doug> 1.0, 1.1, and 1.2 are the same format, just in different positions on
> Doug> the disk.  Of the three, the 1.1 format is the safest to use since it
> Doug> won't allow you to accidentally have some sort of metadata between the
> Doug> beginning of the disk and the raid superblock (such as an lvm2
> Doug> superblock), and hence whenever the raid array isn't up, you won't be
> Doug> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
> Doug> worst-case situations, I've seen lvm2 find a superblock on one RAID1
> Doug> array member when the RAID1 array was down; the system came up, you
> Doug> used the system, the two copies of the raid array were made drastically
> Doug> inconsistent, then at the next reboot the situation that prevented the
> Doug> RAID1 from starting was resolved, the array never knew it had failed to
> Doug> start last time, and the two inconsistent members were put back into a
> Doug> clean array.)  So, deprecating any of these is not really helpful.  And
> Doug> you need to keep the old 0.90 format around for backward compatibility
> Doug> with thousands of existing raid arrays.
>
> This is a great case for making the 1.1 format be the default.  So
> what are the advantages of the 1.0 and 1.2 formats then?  Or should be
> we thinking about making two copies of the data on each RAID member,
> one at the beginning and one at the end, for resiliency?
>
> I just hate seeing this in the man page:
>
>    Declare the style of superblock (raid metadata) to be used.
>    The default is 0.90 for --create, and to guess for other operations.
>    The default can be overridden by setting the metadata value for the
>    CREATE keyword in mdadm.conf.
>
>    Options are:
>
>    0, 0.90, default
>
>      Use the original 0.90 format superblock.  This format limits arrays to
>      28 component devices and limits component devices of levels 1 and
>      greater to 2 terabytes.
>
>    1, 1.0, 1.1, 1.2
>
>      Use the new version-1 format superblock.  This has few restrictions.
>      The different sub-versions store the superblock at different locations
>      on the device, either at the end (for 1.0), at the start (for 1.1) or
>      4K from the start (for 1.2).
>
>
> It looks to me that the 1.1, combined with the 1.0 should be what we
> use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*
>
> So at this point I'm not arguing to get rid of the 0.9 format, though
> I think it should NOT be the default any more, we should be using the
> 1.1 combined with 1.0 format.

Is a bitmap created by default with 1.x?  I remember seeing reports of 
15-30% performance degradation using a bitmap on a RAID5 with 1.x.
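
For what it's worth, a quick way to check on a live array, and to add or
drop one after the fact (sketch, hypothetical /dev/md0):

    cat /proc/mdstat                          # a "bitmap:" line shows up if one is in use
    mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap
    mdadm --grow /dev/md0 --bitmap=none       # remove it again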

>
> John
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:40         ` Justin Piszcz
@ 2007-10-19 16:44           ` John Stoffel
  2007-10-19 16:45             ` Justin Piszcz
  0 siblings, 1 reply; 88+ messages in thread
From: John Stoffel @ 2007-10-19 16:44 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: John Stoffel, Doug Ledford, linux-raid

>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:

Justin> Is a bitmap created by default with 1.x?  I remember seeing
Justin> reports of 15-30% performance degradation using a bitmap on a
Justin> RAID5 with 1.x.

Not according to the mdadm man page.  I'd probably give up that
performance if it meant that re-syncing an array went much faster
after a crash.

I certainly use it on my RAID1 setup on my home machine.  

John

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:44           ` John Stoffel
@ 2007-10-19 16:45             ` Justin Piszcz
  2007-10-19 17:04               ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 16:45 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, linux-raid



On Fri, 19 Oct 2007, John Stoffel wrote:

>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>
> Justin> Is a bitmap created by default with 1.x?  I remember seeing
> Justin> reports of 15-30% performance degradation using a bitmap on a
> Justin> RAID5 with 1.x.
>
> Not according to the mdadm man page.  I'd probably give up that
> performance if it meant that re-syncing an array went much faster
> after a crash.
>
> I certainly use it on my RAID1 setup on my home machine.
>
> John
>

The performance AFTER a crash, yes, but in general usage I remember
someone here doing benchmarks where it had a negative effect on performance.

Justin.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:45             ` Justin Piszcz
@ 2007-10-19 17:04               ` Doug Ledford
  2007-10-19 17:05                 ` Justin Piszcz
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-19 17:04 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: John Stoffel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1932 bytes --]

On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote:
> 
> On Fri, 19 Oct 2007, John Stoffel wrote:
> 
> >>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
> >
> > Justin> Is a bitmap created by default with 1.x?  I remember seeing
> > Justin> reports of 15-30% performance degradation using a bitmap on a
> > Justin> RAID5 with 1.x.
> >
> > Not according to the mdadm man page.  I'd probably give up that
> > performance if it meant that re-syncing an array went much faster
> > after a crash.
> >
> > I certainly use it on my RAID1 setup on my home machine.
> >
> > John
> >
> 
> The performance AFTER a crash, yes, but in general usage I remember
> someone here doing benchmarks where it had a negative effect on performance.

I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
never split up by a chunk size for stripes.  A 2MB read is a single
read, whereas on a raid4/5/6 array, a 2MB read will end up hitting a
series of stripes across all disks.  That means that on raid1 arrays,
total disk seeks < total reads/writes, whereas on a raid4/5/6, total
disk seeks are usually > total reads/writes.  That in turn implies that
in a raid1 setup, disk seek time is important to performance, but not
necessarily paramount.  For raid456, disk seek time is paramount because
of how many more seeks that format uses.  When you then use an internal
bitmap, you are adding writes to every member of the raid456 array,
which adds more seeks.  The same is true for raid1, but since raid1
doesn't have the same level of dependency on seek rates that raid456
has, it doesn't show the same performance hit that raid456 does.
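
(As a rough worked example, with the default 64K chunk: a single 2MB read
on RAID1 is one request to one disk, while on raid4/5/6 it spans
2048K / 64K = 32 chunks spread across the members, so it can turn into
dozens of seeks instead of one.)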

> 
> Justin.
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 17:04               ` Doug Ledford
@ 2007-10-19 17:05                 ` Justin Piszcz
  2007-10-19 17:23                   ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 17:05 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, linux-raid



On Fri, 19 Oct 2007, Doug Ledford wrote:

> On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote:
>>
>> On Fri, 19 Oct 2007, John Stoffel wrote:
>>
>>>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>>
>>> Justin> Is a bitmap created by default with 1.x?  I remember seeing
>>> Justin> reports of 15-30% performance degradation using a bitmap on a
>>> Justin> RAID5 with 1.x.
>>>
>>> Not according to the mdadm man page.  I'd probably give up that
>>> performance if it meant that re-syncing an array went much faster
>>> after a crash.
>>>
>>> I certainly use it on my RAID1 setup on my home machine.
>>>
>>> John
>>>
>>
>> The performance AFTER a crash, yes, but in general usage I remember
>> someone here doing benchmarks where it had a negative effect on performance.
>
> I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
> never split up by a chunk size for stripes.  A 2mb read is a single
> read, where as on a raid4/5/6 array, a 2mb read will end up hitting a
> series of stripes across all disks.  That means that on raid1 arrays,
> total disk seeks < total reads/writes, where as on a raid4/5/6, total
> disk seeks is usually > total reads/writes.  That in turn implies that
> in a raid1 setup, disk seek time is important to performance, but not
> necessarily paramount.  For raid456, disk seek time is paramount because
> of how many more seeks that format uses.  When you then use an internal
> bitmap, you are adding writes to every member of the raid456 array,
> which adds more seeks.  The same is true for raid1, but since raid1
> doesn't have the same level of dependency on seek rates that raid456
> has, it doesn't show the same performance hit that raid456 does.
>
>>
>> Justin.
> -- 
> Doug Ledford <dledford@redhat.com>
>              GPG KeyID: CFBFF194
>              http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>              http://people.redhat.com/dledford/Infiniband
>

Got it.  So for RAID1 it would make sense if LILO supported it (the
later versions of the md superblock, for those who use LILO), but for
RAID4/5/6, keep the bitmaps away :)

Justin.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:38       ` John Stoffel
  2007-10-19 16:40         ` Justin Piszcz
@ 2007-10-19 17:11         ` Doug Ledford
  2007-10-19 18:39           ` John Stoffel
  1 sibling, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-19 17:11 UTC (permalink / raw)
  To: John Stoffel; +Cc: Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]

On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote:


>     1, 1.0, 1.1, 1.2
> 
>       Use the new version-1 format superblock.  This has few restrictions.
>       The different sub-versions store the superblock at different locations
>       on the device, either at the end (for 1.0), at the start (for 1.1) or
>       4K from the start (for 1.2).
> 
> 
> It looks to me that the 1.1, combined with the 1.0 should be what we
> use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

You're somewhat misreading the man page.  You *can't* combine 1.0 with
1.1.  All of the above options (1, 1.0, 1.1, 1.2) specifically mean to
use a version 1 superblock.  1.0 means use a version 1 superblock at the
end of the disk.  1.1 means a version 1 superblock at the beginning of
the disk.  1.2 means version 1 at a 4K offset from the beginning of the
disk.  There really is no separate version 1.1 or 1.2; the .0, .1, and
.2 part of the version *only* means where to put the version 1
superblock on the disk.  If you just say version 1, then it goes to the
default location for version 1 superblocks, and last I checked that was
the end of the disk (aka 1.0).
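
For reference, one way to see this on an existing member (hypothetical
device name; the exact field names vary a little between mdadm versions):

    mdadm --examine /dev/sdb1 | grep -E 'Version|Offset'
    # for a 1.x superblock this should show the version plus the
    # superblock and data offsets, which is where 1.0/1.1/1.2 differ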

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 17:05                 ` Justin Piszcz
@ 2007-10-19 17:23                   ` Doug Ledford
  2007-10-19 17:47                     ` Justin Piszcz
  2007-10-19 22:43                     ` chunk size (was Re: Time to deprecate old RAID formats?) Michal Soltys
  0 siblings, 2 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-19 17:23 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: John Stoffel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2943 bytes --]

On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:

> > I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
> > never split up by a chunk size for stripes.  A 2mb read is a single
> > read, where as on a raid4/5/6 array, a 2mb read will end up hitting a
> > series of stripes across all disks.  That means that on raid1 arrays,
> > total disk seeks < total reads/writes, where as on a raid4/5/6, total
> > disk seeks is usually > total reads/writes.  That in turn implies that
> > in a raid1 setup, disk seek time is important to performance, but not
> > necessarily paramount.  For raid456, disk seek time is paramount because
> > of how many more seeks that format uses.  When you then use an internal
> > bitmap, you are adding writes to every member of the raid456 array,
> > which adds more seeks.  The same is true for raid1, but since raid1
> > doesn't have the same level of dependency on seek rates that raid456
> > has, it doesn't show the same performance hit that raid456 does.

> Got it, so for RAID1 it would make sense if LILO supported it (the 
> later versions of the md superblock)

Lilo doesn't know anything about the superblock format; however, lilo
expects the raid1 device to start at the beginning of the physical
partition.  In other words, format 1.0 would work with lilo.

>  (for those who use LILO) but for
> RAID4/5/6, keep the bitmaps away :)

I still use an internal bitmap regardless ;-)  To help mitigate the cost
of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB
or somewhere in that range).  As long as you can get 90%+ of your
reads/writes to fall into the space of a single chunk, then you start
performing more like a raid1 device without the extra seek overhead.  Of
course, this comes at the expense of peak throughput on the device.
Let's say you were building a mondo movie server, where you were
streaming out digital movie files.  In that case, you very well may care
more about throughput than seek performance since I suspect you wouldn't
have many small, random reads.  Then I would use a small chunk size,
sacrifice the seek performance, and get the throughput bonus of parallel
reads from the same stripe on multiple disks.  On the other hand, if I
was setting up a mail server then I would go with a large chunk size
because the filesystem activities themselves are going to produce lots
of random seeks, and you don't want your raid setup to make that problem
worse.  Plus, most mail doesn't come in or go out at any sort of massive
streaming speed, so you don't need the parallel reads from multiple
disks to perform well.  It all depends on your particular use scenario.
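
The chunk size itself is just another --create parameter; a minimal
sketch with hypothetical devices (the value is in kilobytes):

    # large 1024K (1MB) chunks, to keep most small I/Os inside one chunk
    mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=1024 \
          /dev/sd[b-e]1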

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 17:23                   ` Doug Ledford
@ 2007-10-19 17:47                     ` Justin Piszcz
  2007-10-20 18:38                       ` Michael Tokarev
  2007-10-19 22:43                     ` chunk size (was Re: Time to deprecate old RAID formats?) Michal Soltys
  1 sibling, 1 reply; 88+ messages in thread
From: Justin Piszcz @ 2007-10-19 17:47 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, linux-raid



On Fri, 19 Oct 2007, Doug Ledford wrote:

> On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
>
>>> I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
>>> never split up by a chunk size for stripes.  A 2mb read is a single
>>> read, where as on a raid4/5/6 array, a 2mb read will end up hitting a
>>> series of stripes across all disks.  That means that on raid1 arrays,
>>> total disk seeks < total reads/writes, where as on a raid4/5/6, total
>>> disk seeks is usually > total reads/writes.  That in turn implies that
>>> in a raid1 setup, disk seek time is important to performance, but not
>>> necessarily paramount.  For raid456, disk seek time is paramount because
>>> of how many more seeks that format uses.  When you then use an internal
>>> bitmap, you are adding writes to every member of the raid456 array,
>>> which adds more seeks.  The same is true for raid1, but since raid1
>>> doesn't have the same level of dependency on seek rates that raid456
>>> has, it doesn't show the same performance hit that raid456 does.
>
>> Got it, so for RAID1 it would make sense if LILO supported it (the
>> later versions of the md superblock)
>
> Lilo doesn't know anything about the superblock format, however, lilo
> expects the raid1 device to start at the beginning of the physical
> partition.  In otherwords, format 1.0 would work with lilo.

It did not work when I tried 1.x with LILO; I switched back to 00.90.03
and it worked fine.

>
>>  (for those who use LILO) but for
>> RAID4/5/6, keep the bitmaps away :)
>
> I still use an internal bitmap regardless ;-)  To help mitigate the cost
> of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB
> or somewhere in that range).  As long as you can get 90%+ of your
> reads/writes to fall into the space of a single chunk, then you start
> performing more like a raid1 device without the extra seek overhead.  Of
> course, this comes at the expense of peak throughput on the device.
> Let's say you were building a mondo movie server, where you were
> streaming out digital movie files.  In that case, you very well may care
> more about throughput than seek performance since I suspect you wouldn't
> have many small, random reads.  Then I would use a small chunk size,
> sacrifice the seek performance, and get the throughput bonus of parallel
> reads from the same stripe on multiple disks.  On the other hand, if I
> was setting up a mail server then I would go with a large chunk size
> because the filesystem activities themselves are going to produce lots
> of random seeks, and you don't want your raid setup to make that problem
> worse.  Plus, most mail doesn't come in or go out at any sort of massive
> streaming speed, so you don't need the paralllel reads from multiple
> disks to perform well.  It all depends on your particular use scenario.
>
> -- 
> Doug Ledford <dledford@redhat.com>
>              GPG KeyID: CFBFF194
>              http://people.redhat.com/dledford
>
> Infiniband specific RPMs available at
>              http://people.redhat.com/dledford/Infiniband
>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 17:11         ` Time to deprecate old RAID formats? Doug Ledford
@ 2007-10-19 18:39           ` John Stoffel
  2007-10-19 21:23             ` Iustin Pop
  2007-10-23 23:03             ` Bill Davidsen
  0 siblings, 2 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-19 18:39 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Justin Piszcz, linux-raid

>>>>> "Doug" == Doug Ledford <dledford@redhat.com> writes:

Doug> On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote:
>> 1, 1.0, 1.1, 1.2
>> 
>> Use the new version-1 format superblock.  This has few restrictions.
>> The different sub-versions store the superblock at different locations
>> on the device, either at the end (for 1.0), at the start (for 1.1) or
>> 4K from the start (for 1.2).
>> 
>> 
>> It looks to me that the 1.1, combined with the 1.0 should be what we
>> use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

Doug> You're somewhat misreading the man page. 

The man page is somewhat misleading then.  It's not clear from reading
it that the version 1 RAID superblock can be in one of three different
positions in the volume.  

Doug> You *can't* combine 1.0 with 1.1.  All of the above options: 1,
Doug> 1.0, 1.1, 1.2; specifically mean to use a version 1 superblock.
Doug> 1.0 means use a version 1 superblock at the end of the disk.
Doug> 1.1 means version 1 superblock at beginning of disk.  `1.2 means
Doug> version 1 at 4k offset from beginning of the disk.  There really
Doug> is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the
Doug> version *only* means where to put the version 1 superblock on
Doug> the disk.  If you just say version 1, then it goes to the
Doug> default location for version 1 superblocks, and last I checked
Doug> that was the end of disk (aka, 1.0).

So why not get rid of (deprecate) the version 1.0 and version 1.2
blocks, and only support the 1.1 version?  

Why do we have three different positions for storing the superblock?  

And if putting the superblock at the end is problematic, why is it the
default?  Shouldn't version 1.1 be the default?  

Or, alternatively, update the code so that we support RAID superblocks
at BOTH the beginning and end 4k of the disk, for maximum redundancy.

I guess I need to go and read the code to figure out the placement of
0.90 and 1.0 blocks to see how they are different.  It's just not
clear to me why we have such a muddle of 1.x formats to choose from
and what the advantages and tradeoffs are between them.

John



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 18:39           ` John Stoffel
@ 2007-10-19 21:23             ` Iustin Pop
  2007-10-19 21:42               ` Doug Ledford
  2007-10-23 23:03             ` Bill Davidsen
  1 sibling, 1 reply; 88+ messages in thread
From: Iustin Pop @ 2007-10-19 21:23 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, Justin Piszcz, linux-raid

On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
> And if putting the superblock at the end is problematic, why is it the
> default?  Shouldn't version 1.1 be the default?  

In my opinion, having the superblock *only* at the end (e.g. the 0.90
format) is the best option.

It allows one to mount the disk separately (in the case of RAID1) if the
MD superblock is corrupt or you just want to get at the raw data easily.

As for the people who complained about exactly this feature, LVM has
two mechanisms to protect against accessing PVs on the raw disks (the
ignore-raid-components option and the filter; I always set filters when
using LVM on top of MD).
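
A minimal sketch of the filter mechanism (the regexes and device names
are only an example, not a recommendation):

    # /etc/lvm/lvm.conf
    devices {
        # scan only md devices for PVs, reject the raw member disks
        filter = [ "a|^/dev/md|", "r|^/dev/sd|" ]
        md_component_detection = 1
    }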

regards,
iustin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 21:23             ` Iustin Pop
@ 2007-10-19 21:42               ` Doug Ledford
  2007-10-20  7:53                 ` Iustin Pop
  2007-10-23 23:09                 ` Bill Davidsen
  0 siblings, 2 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-19 21:42 UTC (permalink / raw)
  To: Iustin Pop; +Cc: John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3577 bytes --]

On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
> On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
> > And if putting the superblock at the end is problematic, why is it the
> > default?  Shouldn't version 1.1 be the default?  
> 
> In my opinion, having the superblock *only* at the end (e.g. the 0.90
> format) is the best option.
> 
> It allows one to mount the disk separately (in case of RAID 1), if the
> MD superblock is corrupt or you just want to get easily at the raw data.

Bad reasoning.  It's the reason that the default is at the end of the
device, but that was a bad decision made by Ingo long, long ago in a
galaxy far, far away.

The simple fact of the matter is there are only two types of raid devices
for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
and those that don't (raid1, linear).

For the purposes of this issue, there are only two states we care about:
the raid array works or doesn't work.

If the raid array works, then you *only* want the system to access the
data via the raid array.  If the raid array doesn't work, then for the
fragmented case you *never* want the system to see any of the data from
the raid array (such as an ext3 superblock) or a subsequent fsck could
see a valid superblock and actually start a filesystem scan on the raw
device, and end up hosing the filesystem beyond all repair after it hits
the first chunk size break (although in practice this is usually a
situation where fsck declares the filesystem so corrupt that it refuses
to touch it, that's leaving an awful lot to chance, you really don't
want fsck to *ever* see that superblock).

If the raid array is raid1, then the raid array should *never* fail to
start unless all disks are missing (in which case there is no raw device
to access anyway).  The very few failure types that will cause the raid
array to not start automatically *and* still have an intact copy of the
data usually happen when the raid array is perfectly healthy, in which
case automatically finding a constituent device when the raid array
failed to start is exactly the *wrong* thing to do (for instance, you
enable SELinux on a machine and it hasn't been relabeled and the raid
array fails to start because /dev/md<blah> can't be created because of
an SELinux denial...all the raid1 members are still there, but if you
touch a single one of them, then you run the risk of creating silent
data corruption).

It really boils down to this: for any reason that a raid array might
fail to start, you *never* want to touch the underlying data until
someone has taken manual measures to figure out why it didn't start and
corrected the problem.  Putting the superblock in front of the data does
not prevent manual measures (such as recreating superblocks) from
getting at the data.  But, putting superblocks at the end leaves the
door open for accidental access via constituent devices when you
*really* don't want that to happen.

So, no, the default should *not* be at the end of the device.

> As to the people who complained exactly because of this feature, LVM has
> two mechanisms to protect from accessing PVs on the raw disks (the
> ignore raid components option and the filter - I always set filters when
> using LVM ontop of MD).
> 
> regards,
> iustin
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* chunk size (was Re: Time to deprecate old RAID formats?)
  2007-10-19 17:23                   ` Doug Ledford
  2007-10-19 17:47                     ` Justin Piszcz
@ 2007-10-19 22:43                     ` Michal Soltys
  2007-10-20 13:29                       ` Doug Ledford
  1 sibling, 1 reply; 88+ messages in thread
From: Michal Soltys @ 2007-10-19 22:43 UTC (permalink / raw)
  To: linux-raid

Doug Ledford wrote:
> course, this comes at the expense of peak throughput on the device.
> Let's say you were building a mondo movie server, where you were
> streaming out digital movie files.  In that case, you very well may care
> more about throughput than seek performance since I suspect you wouldn't
> have many small, random reads.  Then I would use a small chunk size,
> sacrifice the seek performance, and get the throughput bonus of parallel
> reads from the same stripe on multiple disks.  On the other hand, if I
> 

Out of curiosity though - why wouldn't a large chunk size work well here?
If you stream video (I assume large files, so a good few MBs at least),
the reads are parallel either way.

Yes, the amount of data read from each of the disks will be in less
perfect proportion than in a small-chunk-size scenario, but that's pretty
negligible.  Benchmarks I've seen (like Justin's) seem not to care much
about chunk size in sequential read/write scenarios (and often favor
larger chunks).  Some of my own tests from a few months ago confirmed
that as well.
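
A quick way to sanity-check the sequential-read case (hypothetical array
name; dd is only a crude approximation of a streaming workload):

    # read the first 4GB of the array sequentially, bypassing the page cache
    dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct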

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 21:42               ` Doug Ledford
@ 2007-10-20  7:53                 ` Iustin Pop
  2007-10-20 13:11                   ` Doug Ledford
  2007-10-23 23:09                 ` Bill Davidsen
  1 sibling, 1 reply; 88+ messages in thread
From: Iustin Pop @ 2007-10-20  7:53 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Justin Piszcz, linux-raid

On Fri, Oct 19, 2007 at 05:42:09PM -0400, Doug Ledford wrote:
> The simple fact of the matter is there are only two type of raid devices
> for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
> and those that don't (raid1, linear).
> 
> For the purposes of this issue, there are only two states we care about:
> the raid array works or doesn't work.
Yes, but "doesn't work" doesn't mean only that the array fails to start.

> If the raid array works, then you *only* want the system to access the
> data via the raid array.  If the raid array doesn't work, then for the
> fragmented case you *never* want the system to see any of the data from
> the raid array (such as an ext3 superblock) or a subsequent fsck could
> see a valid superblock and actually start a filesystem scan on the raw
> device, and end up hosing the filesystem beyond all repair after it hits
> the first chunk size break (although in practice this is usually a
> situation where fsck declares the filesystem so corrupt that it refuses
> to touch it, that's leaving an awful lot to chance, you really don't
> want fsck to *ever* see that superblock).
Honestly, I don't see how a properly configured system would start
looking at the physical device by mistake. I suppose it's possible, but
I didn't have this issue.

> If the raid array is raid1, then the raid array should *never* fail to
> start unless all disks are missing (in which case there is no raw device
> to access anyway).  The very few failure types that will cause the raid
> array to not start automatically *and* still have an intact copy of the
> data usually happen when the raid array is perfectly healthy, in which
> case automatically finding a constituent device when the raid array
> failed to start is exactly the *wrong* thing to do (for instance, you
> enable SELinux on a machine and it hasn't been relabeled and the raid
> array fails to start because /dev/md<blah> can't be created because of
> an SELinux denial...all the raid1 members are still there, but if you
> touch a single one of them, then you run the risk of creating silent
> data corruption).

It's not only about the activation of the array. I'm mostly talking
about RAID1, but the fact that migrating between RAID1 and a plain disk
only costs a few hundred K at the end of the disk increases the
flexibility very much.  With the superblock at the start, you can't
decide to convert a plain disk to RAID1 without shifting all the data;
with the superblock at the end it's perfectly possible.
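
A sketch of the kind of conversion I mean, with hypothetical device names
(and assuming the existing filesystem leaves room at the end of the
partition for the superblock):

    # turn an existing single disk into a degraded RAID1 with the
    # superblock at the end (1.0), then attach the mirror later
    mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
          /dev/sdb1 missing
    mdadm --add /dev/md0 /dev/sdc1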

Also, sometimes you want to recover as much as possible from a copy of
the data that is not intact...

Of course, different people have different priorities, but as I said, I
like that this conversion is possible, and I never had the case of a
tool saying "hmm, /dev/md<something> is not there, let's look at
/dev/sdc instead".

thanks,
iustin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20  7:53                 ` Iustin Pop
@ 2007-10-20 13:11                   ` Doug Ledford
  2007-10-26  9:54                     ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-20 13:11 UTC (permalink / raw)
  To: Iustin Pop; +Cc: John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1955 bytes --]

On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:

> Honestly, I don't see how a properly configured system would start
> looking at the physical device by mistake. I suppose it's possible, but
> I didn't have this issue.

Mount by label support scans all devices in /proc/partitions looking for
the filesystem superblock that has the label you are trying to mount.
LVM (unless told not to) scans all devices in /proc/partitions looking
for valid LVM superblocks.  In fact, you can't build a linux system that
is resilient to device name changes without doing that.
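
For instance (hypothetical device, and assuming a 0.90 or 1.0 superblock
so the filesystem starts at offset 0 of the member):

    # with the data at offset 0, a label scan sees the bare member as a
    # perfectly normal filesystem and happily reports its LABEL/TYPE
    blkid /dev/sdb1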

> It's not only about the activation of the array. I'm mostly talking
> about RAID1, but the fact that migrating between RAID1 and plain disk is
> just a few hundred K at the end increases the flexibility very much.

Flexibility, no.  Convenience, yes.  You can do all the things with the
superblock at the front that you can with it at the end; it just takes a
little more effort.

> Also, sometime you want to recover as much as possible from a not intact
> copy of the data...

And you can with superblock at the front.  You can create a new single
disk raid1 over the existing superblock or you can munge the partition
table to have it point at the start of your data.  There are options;
they just require manual intervention.  But if you are trying to rescue
data off of a seriously broken device, you are already doing manual
intervention anyway.
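
A rough sketch of that first option (hypothetical names; the metadata
version and member order must match the original array for this to be
safe):

    # re-create a one-disk (degraded) RAID1 over the existing member;
    # with the second device "missing" there is nothing to resync, so
    # the data underneath is left untouched
    mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 \
          /dev/sdb1 missing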

> Of course, different people have different priorities, but as I said, I
> like that this conversion is possible, and I never had the case of a
> tool saying "hmm, /dev/md<something> is not there, let's look at
> /dev/sdc instead".

mount, pvscan.

> thanks,
> iustin
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: chunk size (was Re: Time to deprecate old RAID formats?)
  2007-10-19 22:43                     ` chunk size (was Re: Time to deprecate old RAID formats?) Michal Soltys
@ 2007-10-20 13:29                       ` Doug Ledford
  2007-10-23 19:21                         ` Michal Soltys
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-20 13:29 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2180 bytes --]

On Sat, 2007-10-20 at 00:43 +0200, Michal Soltys wrote:
> Doug Ledford wrote:
> > course, this comes at the expense of peak throughput on the device.
> > Let's say you were building a mondo movie server, where you were
> > streaming out digital movie files.  In that case, you very well may care
> > more about throughput than seek performance since I suspect you wouldn't
> > have many small, random reads.  Then I would use a small chunk size,
> > sacrifice the seek performance, and get the throughput bonus of parallel
> > reads from the same stripe on multiple disks.  On the other hand, if I
> > 
> 
> Out of curiosity though - why wouldn't large chunk work well here ? If you 
> stream video (I assume large files, so like a good few MBs at least), the 
> reads are parallel either way.

Well, first I was thinking of files of a few hundred megabytes to a few
gigabytes each, and when they are streamed, they are streamed at a rate
much lower than the full speed of the array, but still at a fast rate.
How parallel the reads are then would tend to be a function of chunk
size versus streaming rate.  I guess I should clarify what I'm talking
about anyway.  To me, a large chunk size is 1 to 2MB or so, and a small
chunk size is in the 64K to 256K range.  If you have a 10-disk raid5
array with a 2MB chunk size, and you aren't just copying files around,
then it's hard to ever get it to do full-speed parallel reads because
you simply won't access the data fast enough.
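
(Rough arithmetic behind that: a 10-disk raid5 with a 2MB chunk has a
9 x 2MB = 18MB full stripe, so a single reader has to stay roughly 18MB
ahead of itself before all of the data disks are kept busy at once.)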

> Yes, the amount of data read from each of the disks will be in less perfect 
> proportion than in small chunk size scenario, but it's pretty neglible. 
> Benchamrks I've seen (like Justin's one) seem not to care much about chunk 
> size in sequential read/write scenarios (and often favors larger chunks). 
> Some of my own tests I did few months ago confirmed that as well.

I'm not familiar with the benchmark you are referring to.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:15     ` Doug Ledford
  2007-10-19 16:35       ` Justin Piszcz
  2007-10-19 16:38       ` John Stoffel
@ 2007-10-20 14:09       ` Michael Tokarev
  2007-10-20 14:24         ` Doug Ledford
  2007-10-20 14:52         ` John Stoffel
  2 siblings, 2 replies; 88+ messages in thread
From: Michael Tokarev @ 2007-10-20 14:09 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Justin Piszcz, linux-raid

Doug Ledford wrote:
[]
> 1.0, 1.1, and 1.2 are the same format, just in different positions on
> the disk.  Of the three, the 1.1 format is the safest to use since it
> won't allow you to accidentally have some sort of metadata between the
> beginning of the disk and the raid superblock (such as an lvm2
> superblock), and hence whenever the raid array isn't up, you won't be
> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
> worst-case situations, I've seen lvm2 find a superblock on one RAID1
> array member when the RAID1 array was down; the system came up, you
> used the system, the two copies of the raid array were made drastically
> inconsistent, then at the next reboot the situation that prevented the
> RAID1 from starting was resolved, the array never knew it had failed to
> start last time, and the two inconsistent members were put back into a
> clean array.)  So, deprecating any of these is not really helpful.  And
> you need to keep the old 0.90 format around for backward compatibility
> with thousands of existing raid arrays.

Well, I strongly, completely disagree.  You described a real-world
situation, and that's unfortunate, BUT: for at least raid1, there ARE
cases, pretty valid ones, when one NEEDS to mount the filesystem without
bringing up raid.  Raid1 allows that.

/mjt

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 14:09       ` Michael Tokarev
@ 2007-10-20 14:24         ` Doug Ledford
  2007-10-20 14:52         ` John Stoffel
  1 sibling, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-20 14:24 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1673 bytes --]

On Sat, 2007-10-20 at 18:09 +0400, Michael Tokarev wrote:
> Doug Ledford wrote:
> []
> > 1.0, 1.1, and 1.2 are the same format, just in different positions on
> > the disk.  Of the three, the 1.1 format is the safest to use since it
> > won't allow you to accidentally have some sort of metadata between the
> > beginning of the disk and the raid superblock (such as an lvm2
> > superblock), and hence whenever the raid array isn't up, you won't be
> > able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worse
> > case situations, I've seen lvm2 find a superblock on one RAID1 array
> > member when the RAID1 array was down, the system came up, you used the
> > system, the two copies of the raid array were made drastically
> > inconsistent, then at the next reboot, the situation that prevented the
> > RAID1 from starting was resolved, and it never knew it failed to start
> > last time, and the two inconsistent members were put back into a clean
> > array).  So, deprecating any of these is not really helpful.  And you
> > need to keep the old 0.90 format around for back compatibility with
> > thousands of existing raid arrays.
> 
> Well, I strongly, completely disagree.  You described a real-world
> situation, and that's unfortunate, BUT: for at least raid1, there ARE
> cases, pretty valid ones, when one NEEDS to mount the filesystem without
> bringing up raid.  Raid1 allows that.

Name one.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 14:09       ` Michael Tokarev
  2007-10-20 14:24         ` Doug Ledford
@ 2007-10-20 14:52         ` John Stoffel
  2007-10-20 15:07           ` Iustin Pop
                             ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-20 14:52 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Doug Ledford, John Stoffel, Justin Piszcz, linux-raid

>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

Michael> Doug Ledford wrote:
Michael> []
>> 1.0, 1.1, and 1.2 are the same format, just in different positions on
>> the disk.  Of the three, the 1.1 format is the safest to use since it
>> won't allow you to accidentally have some sort of metadata between the
>> beginning of the disk and the raid superblock (such as an lvm2
>> superblock), and hence whenever the raid array isn't up, you won't be
>> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worst
>> case situations, I've seen lvm2 find a superblock on one RAID1 array
>> member when the RAID1 array was down, the system came up, you used the
>> system, the two copies of the raid array were made drastically
>> inconsistent, then at the next reboot, the situation that prevented the
>> RAID1 from starting was resolved, and it never knew it failed to start
>> last time, and the two inconsistent members were put back into a clean
>> array).  So, deprecating any of these is not really helpful.  And you
>> need to keep the old 0.90 format around for back compatibility with
>> thousands of existing raid arrays.

Michael> Well, I strongly, completely disagree.  You described a
Michael> real-world situation, and that's unfortunate, BUT: for at
Michael> least raid1, there ARE cases, pretty valid ones, when one
Michael> NEEDS to mount the filesystem without bringing up raid.
Michael> Raid1 allows that.

Please describe one such case please.  There have certainly been hacks
of various RAID systems on other OSes such as Solaris where the VxVM
and/or Solstice DiskSuite allowed you to encapsulate an existing
partition into a RAID array.  

But in my experience (and I'm a professional sysadm... :-) it's not
really all that useful, and can lead to problems like those described
by Doug.  

If you are going to mirror an existing filesystem, then by definition
you have a second disk or partition available for the purpose.  So you
would merely setup the new RAID1, in degraded mode, using the new
partition as the base.  Then you copy the data over to the new RAID1
device, change your boot setup, and reboot.

Once that is done, you can then add the original partition into the
RAID1 array.  
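
A minimal sketch of that sequence - device names and filesystem are
purely illustrative, not a recipe for any particular system:

  # build a degraded RAID1 on the new partition only
  mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/newroot
  cp -ax / /mnt/newroot            # copy the existing filesystem over
  # ...update the bootloader and fstab, reboot onto /dev/md0, then:
  mdadm /dev/md0 --add /dev/sda1   # pull the original partition in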

As Doug says, and I agree strongly, you DO NOT want to have the
possibility of confusion and data loss, especially on bootup.  And
this leads to the heart of my initial post on this matter, that the
confusion of having four different variations of RAID superblocks is
bad.  We should deprecate them down to just two, the old 0.90 format,
and the new 1.x format at the start of the RAID volume.

John

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 14:52         ` John Stoffel
@ 2007-10-20 15:07           ` Iustin Pop
  2007-10-20 15:36             ` Doug Ledford
  2007-10-20 18:24           ` Michael Tokarev
  2007-10-23 23:18           ` Bill Davidsen
  2 siblings, 1 reply; 88+ messages in thread
From: Iustin Pop @ 2007-10-20 15:07 UTC (permalink / raw)
  To: John Stoffel; +Cc: Michael Tokarev, Doug Ledford, Justin Piszcz, linux-raid

On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
> Michael> Well, I strongly, completely disagree.  You described a
> Michael> real-world situation, and that's unfortunate, BUT: for at
> Michael> least raid1, there ARE cases, pretty valid ones, when one
> Michael> NEEDS to mount the filesystem without bringing up raid.
> Michael> Raid1 allows that.
> 
> Please describe one such case please.

Boot from a raid1 array, such that everything - including the partition
table itself - is mirrored.

iustin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 15:07           ` Iustin Pop
@ 2007-10-20 15:36             ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-20 15:36 UTC (permalink / raw)
  To: Iustin Pop; +Cc: John Stoffel, Michael Tokarev, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2002 bytes --]

On Sat, 2007-10-20 at 17:07 +0200, Iustin Pop wrote:
> On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
> > Michael> Well, I strongly, completely disagree.  You described a
> > Michael> real-world situation, and that's unfortunate, BUT: for at
> > Michael> least raid1, there ARE cases, pretty valid ones, when one
> > Michael> NEEDS to mount the filesystem without bringing up raid.
> > Michael> Raid1 allows that.
> > 
> > Please describe one such case please.
> 
> Boot from a raid1 array, such that everything - including the partition
> table itself - is mirrored.

That's a *really* bad idea.  If you want to subpartition a raid array,
you really should either run lvm on top of raid or use a partitionable
raid array embedded in a raid partition.  If you don't, there are a
whole slew of failure cases that would result in the same sort of
accidental access and data corruption that I talked about.  For
instance, if you ever ran fdisk on the disk itself instead of the raid
array, fdisk would happily create a partition that runs off the end of
the raid device and into the superblock area.  The raid subsystem
autodetect only works on partitions labeled as type 0xfd, so it would
never search for a raid superblock at the end of the actual device, and
that means that if you boot from a rescue CD that doesn't contain an
mdadm.conf file that specifies the whole disk device as a search device,
then it is guaranteed not to start the device and may try to
modify the underlying constituent devices.  All around, it's just a
*really* bad idea.
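
(For reference, this is the kind of thing a rescue environment would need
in its mdadm.conf to find such a superblock - the whole-disk names here
are only an example:

  DEVICE /dev/sda /dev/sdb
  # plus the usual ARRAY line(s) identifying the array by UUID

without which the member devices are left exposed.)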

I've heard several descriptions of things you *could* do with the
superblock at the end, but as of yet, not one of them is a good idea if
you really care about your data.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 14:52         ` John Stoffel
  2007-10-20 15:07           ` Iustin Pop
@ 2007-10-20 18:24           ` Michael Tokarev
  2007-10-22 20:39             ` John Stoffel
  2007-10-24  0:36             ` Doug Ledford
  2007-10-23 23:18           ` Bill Davidsen
  2 siblings, 2 replies; 88+ messages in thread
From: Michael Tokarev @ 2007-10-20 18:24 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, Justin Piszcz, linux-raid

John Stoffel wrote:
>>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:
[]
> Michael> Well, I strongly, completely disagree.  You described a
> Michael> real-world situation, and that's unfortunate, BUT: for at
> Michael> least raid1, there ARE cases, pretty valid ones, when one
> Michael> NEEDS to mount the filesystem without bringing up raid.
> Michael> Raid1 allows that.
> 
> Please describe one such case please.  There have certainly been hacks
> of various RAID systems on other OSes such as Solaris where the VxVM
> and/or Solstice DiskSuite allowed you to encapsulate an existing
> partition into a RAID array.  
> 
> But in my experience (and I'm a professional sysadm... :-) it's not
> really all that useful, and can lead to problems like those described
> by Doug.  

I've been doing sysadmin work for about 15 or 20 years.

> If you are going to mirror an existing filesystem, then by definition
> you have a second disk or partition available for the purpose.  So you
> would merely setup the new RAID1, in degraded mode, using the new
> partition as the base.  Then you copy the data over to the new RAID1
> device, change your boot setup, and reboot.
[...]

And you have to copy the data twice as a result, instead of copying
it only once to the second disk.

> As Doug says, and I agree strongly, you DO NOT want to have the
> possibility of confusion and data loss, especially on bootup.  And

There are different points of view, and different settings, etc.
For example, I once dealt with a linux user who was unable to
use his disk partition, because his system (it was RedHat if I
remember correctly) recognized some LVM volume on his disk (it
was previously used with Windows) and tried to automatically
activate it, thus making it "busy".  What I'm talking about here
is that any automatic activation of anything should be done with
extreme care, using smart logic in the startup scripts if at
all.

Doug's example - in my opinion anyway - shows wrong tools
or bad logic in the startup sequence, not a general flaw in
superblock location.

Another example is ext[234]fs - it does not touch the first 512
bytes of the device, so if there was an msdos filesystem there
before, it will be recognized as such by many tools, and an
attempt to mount it automatically will lead at best to scary
output and nothing mounted, or at worst to fsck doing fatal
things to it.  Sure, the first 512 bytes should just be
cleared.. but that's another topic.

Speaking of cases where it was really helpful to be able
to mount individual raid components directly without the raid
layer - most of them were due to one or another operator error,
usually together with bugs and/or omissions in software.  I don't
remember the exact scenarios anymore (the last time was more than 2
years ago).  Most of the time it was one or another sort of
system recovery.

In almost all machines I maintain, there's a raid1 for the root
filesystem built of all the drives (be it 2 or 4 or even 6 of
them) - the key point is to be able to boot off any of them
in case some cable/drive/controller rearrangement has to be
done.  Root filesystem is quite small (256 or 512 Mb here),
and it's not too dynamic either -- so it's not a big deal to
waste space for it.
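
(Roughly this, assuming 4 drives with a small first partition on each -
the names are purely illustrative:

  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1
)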

Problems occur - obviously - when something goes wrong.
And most of the time, the issues we had happened at a remote site,
where there was no experienced operator/sysadmin handy.

For example, when one drive was almost dead and mdadm tried
to bring the array up, the machine just hung for an unknown amount
of time.  An inexperienced operator was there.  Instead of
trying to teach him how to pass a parameter to the initramfs
to stop it from trying to assemble the root array and then assembling
it manually, I told him to pass "root=/dev/sda1" to the
kernel.  Root mounts read-only, so it should be a safe thing
to do - I only needed the root fs and a minimal set of services
(which are even in the initramfs) just for it to boot up to SOME
state where I could log in remotely and fix things later.
(No, I didn't want to remove the drive yet, I wanted to
examine it first, and that turned out to be a good idea because
the hang was happening only at the beginning of the drive, and
while we tried to install a replacement and fill it up with
data, an unreadable sector was found on another
drive, so this old but not-yet-removed drive was really handy.)

Another situation - after some weird crash I had to examine
the filesystems found on both components - I wanted to look
at the filesystems and compare them, WITHOUT messing up
the raid superblocks (later on I wrote a tiny program to
save/restore 0.90 superblocks), and without attempting any
reconstruction.  In fact, this very case - examining
the contents - is something I've done many times for
one or another reason.  There's just no need to involve the
raid layer here at all, but it doesn't disturb things either
(in some cases anyway).

Yet another - many times we had to copy an old system to
a new one - the new machine boots with 3 drives in it, 2 new,
and the 3rd (the boot one) from the old machine.  I boot it off
the non-raided config on the 3rd drive (using only the
halves of the md devices), create new arrays on the 2 new
drives (note - had I started the raid from the 3rd drive, there'd
be a problem with md device numbering -- for consistency I
number all the partitions and raid arrays similarly on all
machines), and copy the data over.  There's no need to do the
complex procedure of adding components to the existing raid
arrays, dropping the old drive from them and resizing the
stuff - because of that last step (and because there's no
need to resync in the first place - the 2 new drives are
new, hence I use --no-resync since they're filled with
zeros anyway).
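
(In current mdadm terms the closest spelling I know of is --assume-clean;
a rough sketch, device names purely illustrative:

  mdadm --create /dev/md1 --level=1 --raid-devices=2 --assume-clean \
        /dev/sdb2 /dev/sdc2
)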

Another case - we had to copy a large amount of data from one
machine to another, from a raid array.  I just pulled out one
disk (bitmaps=yes, and I remounted the filesystem readonly),
inserted it into the other machine, mounted it - without raid -
there and did the copy.  The superblock was preserved, and when I
returned the drive, everything was ok.

And so on.  There have been countless cases like that,
some of which I've already forgotten.

Well.  I know about the loop device, which has an "offset=XXX" parameter,
so one can actually see and use the "insides" of a
raid1 component, even if the superblock is at the beginning.  But
see above, the very first case - go tell that operator how
to do it all ;)

> this leads to the heart of my initial post on this matter, that the
> confusion of having four different variations of RAID superblocks is
> bad.  We should deprecate them down to just two, the old 0.90 format,
> and the new 1.x format at the start of the RAID volume.

It's confusing for sure.  But see: the 0.90 format is the most commonly used
one, and most importantly it's historical - it has been here for
many years, and many systems are using it.  I don't want to come across
a situation where, some years later, I'll need to grab data from my
old disk and be unable to, because the 0.90 format isn't supported anymore.

0.90 has some real limitations (like 26 components at max, etc), hence
the 1.x format appeared.  And the various flavours of 1.x are all useful
too.  For example, if you're concerned about the safety of your data due to
defects(*) in your startup scripts -- use whichever 1.x format puts
the metadata at the beginning.  That's just it, I think ;)

/mjt

(*) Note: software like libvolume-id (part of udev) is able to recognize
parts of raid 0.90 arrays just fine.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 17:47                     ` Justin Piszcz
@ 2007-10-20 18:38                       ` Michael Tokarev
  2007-10-20 20:02                         ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Michael Tokarev @ 2007-10-20 18:38 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Doug Ledford, John Stoffel, linux-raid

Justin Piszcz wrote:
> 
> On Fri, 19 Oct 2007, Doug Ledford wrote:
> 
>> On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
[]
>>> Got it, so for RAID1 it would make sense if LILO supported it (the
>>> later versions of the md superblock)
>>
>> Lilo doesn't know anything about the superblock format, however, lilo
>> expects the raid1 device to start at the beginning of the physical
>> partition.  In other words, format 1.0 would work with lilo.
> Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
> it worked fine.

There are different 1.x formats - and the difference is exactly this -- the
location of the superblock.  In 1.0, the superblock is located at the end,
just like with 0.90, and lilo works just fine with it.  It gets confused
somehow (though I don't really see how, because it uses bmap() to get a list
of physical blocks for the files it wants to access - those should be
absolute numbers, regardless of the superblock location) when the
superblock is at the beginning (v1.1 or 1.2).

/mjt

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 18:38                       ` Michael Tokarev
@ 2007-10-20 20:02                         ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-20 20:02 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Justin Piszcz, John Stoffel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2032 bytes --]

On Sat, 2007-10-20 at 22:38 +0400, Michael Tokarev wrote:
> Justin Piszcz wrote:
> > 
> > On Fri, 19 Oct 2007, Doug Ledford wrote:
> > 
> >> On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
> []
> >>> Got it, so for RAID1 it would make sense if LILO supported it (the
> >>> later versions of the md superblock)
> >>
> >> Lilo doesn't know anything about the superblock format, however, lilo
> >> expects the raid1 device to start at the beginning of the physical
> >> partition.  In other words, format 1.0 would work with lilo.
> > Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
> > it worked fine.
> 
> There are different 1.x - and the difference is exactly this -- location
> of the superblock.  In 1.0, superblock is located at the end, just like
> with 0.90, and lilo works just fine with it.  It gets confused somehow
> (however I don't see how really, because it uses bmap() to get a list
> of physical blocks for the files it wants to access - those should be
> in absolute numbers, regardless of the superblock location) when the
> superblock is at the beginning (v 1.1 or 1.2).
> 
> /mjt

It's been a *long* time since I looked at the lilo raid1 support (I
wrote the original patch that Red Hat used, I have no idea if that's
what the lilo maintainer integrated though).  However, IIRC, it uses
bmap on the file, which implies it's via the filesystem mounted on the
raid device.  And I don't think the numbers are absolute, except with
respect to the filesystem.  So, I think the situation could be made to
work if you just taught lilo that on version 1.1 or version 1.2
superblock raids it should add the data offset of the raid to the
bmap numbers (which I think are already added to the partition offset
numbers).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 18:24           ` Michael Tokarev
@ 2007-10-22 20:39             ` John Stoffel
  2007-10-22 22:29               ` Michael Tokarev
  2007-10-24  0:42               ` Doug Ledford
  2007-10-24  0:36             ` Doug Ledford
  1 sibling, 2 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-22 20:39 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: John Stoffel, Doug Ledford, Justin Piszcz, linux-raid


[ I was going to reply to this earlier, but the Red Sox and good
weather got into the way this weekend.  ;-]

>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

Michael> I've been doing sysadmin work for about 15 or 20 years.

Welcome to the club!  It's a fun career, always something new to
learn. 

>> If you are going to mirror an existing filesystem, then by definition
>> you have a second disk or partition available for the purpose.  So you
>> would merely setup the new RAID1, in degraded mode, using the new
>> partition as the base.  Then you copy the data over to the new RAID1
>> device, change your boot setup, and reboot.

Michael> And you have to copy the data twice as a result, instead of
Michael> copying it only once to the second disk.

So?  Why is this such a big deal?  As I see it, there are two separate
ways to set up RAID1 on an OS.

1.  The mirror is built ahead of time and you install onto the
    mirror.  And twice as much data gets written, half to each disk.
    *grin* 

2.  You are encapsulating an existing OS install and you need to do a
    reboot from the un-mirrored OS to the mirrored setup.  So yes, you
    do have to copy the data from the orig to the mirror, reboot, then
    resync back onto the original disk which has been added into the
    RAID set.  
	
Neither case is really that big a deal.  And with the RAID super block
at the front of the disk, you don't have to worry about mixing up
which disk is which.  It's not fun when you boot one disk, thinking
it's the RAID disk, but end up booting the original disk.  

>> As Doug says, and I agree strongly, you DO NOT want to have the
>> possibility of confusion and data loss, especially on bootup.  And

Michael> There are different point of views, and different settings
Michael> etc.  For example, I once dealt with a linux user who was
Michael> unable to use his disk partition, because his system (it was
Michael> RedHat if I remember correctly) recognized some LVM volume on
Michael> his disk (it was previously used with Windows) and tried to
Michael> automatically activate it, thus making it "busy".  What I'm
Michael> talking about here is that any automatic activation of
Michael> anything should be done with extreme care, using smart logic
Michael> in the startup scripts if at all.

Ah... but you can also deactivate LVM partitions if you like.

Michael> Doug's example - in my opinion anyway - shows wrong tools
Michael> or bad logic in the startup sequence, not a general flaw in
Michael> superblock location.

I don't agree completely.  I think the superblock location is a key
issue, because if you have a superblock location which moves depending on
the filesystem or LVM you use to look at the partition (or full disk),
then you need to be even more careful about how to poke at things.

This is really true when you use the full disk for the mirror, because
then you don't have the partition table to base some initial
guesstimates on.  Since there is an explicit Linux RAID partition type,
as well as an explicit linux filesystem (filesystem is then decoded
from the first Nk of the partition), you have a modicum of safety.

If ext3 has its superblock in the first 4k of the disk, but you've
set up the disk to use RAID1 with the RAID superblock at the end of the
disk, you now need to be careful about how the disk is detected and
then mounted.

To the ext3 detection logic, it looks like an ext3 filesystem; to the RAID
detection logic, it looks like a RAID member.  Which is correct?  Which is
wrong?  How do you tell programmatically?

That's why I think all superblocks should be in the SAME
location on the disk and/or partitions if used.  It keeps down
problems like this.  

Michael> Another example is ext[234]fs - it does not touch first 512
Michael> bytes of the device, so if there was an msdos filesystem
Michael> there before, it will be recognized as such by many tools,
Michael> and an attempt to mount it automatically will lead to at
Michael> least scary output and nothing mounted, or in fsck doing
Michael> fatal things to it in worst scenario.  Sure thing the first
Michael> 512 bytes should be just cleared.. but that's another topic.

I would argue that ext[234] should be clearing those 512 bytes.  Why
aren't they cleared?
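
(One can of course zero them by hand before running mkfs - the device
name is purely illustrative:

  dd if=/dev/zero of=/dev/sdb3 bs=512 count=1
)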

Michael> Speaking of cases where it was really helpful to have an
Michael> ability to mount individual raid components directly without
Michael> the raid level - most of them was due to one or another
Michael> operator errors, usually together with bugs and/or omissions
Michael> in software.  I don't remember the exact scenarios anymore (last
Michael> time it was more than 2 years ago).  Most of the time it was
Michael> one or another sort of system recovery.

In this case, you're only talking about RAID1 mirrors, no other RAID
configuration fits this scenario.  And while this might look to be
helpful, I would strongly argue that it's not, because it's a special
case of the RAID code and can lead to all kinds of bugs and problems
if it's not exercised properly. 

Michael> In almost all machines I maintain, there's a raid1 for the
Michael> root filesystem built of all the drives (be it 2 or 4 or even
Michael> 6 of them) - the key point is to be able to boot off any of
Michael> them in case some cable/drive/controller rearrangement has to
Michael> be done.  Root filesystem is quite small (256 or 512 Mb
Michael> here), and it's not too dynamic either -- so it's not a big
Michael> deal to waste space for it.

Sure, I agree 100% here that mirroring your root partition
across two or more disks is good.

Michael> Problem occurs - obviously - when something goes wrong.  And
Michael> most of the time issues we had happened on a remote site,
Michael> where there was no experienced operator/sysadmin handy.

That's why I like Rackable's RMM, I get full serial console access to
the BIOS and the system.  Very handy.  Or get a PC Weasel PCI card and
install it into such systems, or a remote KVM, etc.  If you have to
support remote systems, you need the infrastructure to properly
support it.  

Michael> For example, when one drive was almost dead, and mdadm tried
Michael> to bring the array up, machine just hanged for unknown amount
Michael> of time.  An inexperienced operator was there.  Instead of
Michael> trying to teach him how to pass parameter to the initramfs to
Michael> stop trying to assemble root array and next assembling it
Michael> manually, I told him to pass "root=/dev/sda1" to the kernel.

I really don't like this, because you've now broken the RAID from the
underside and it's not clear which is the clean mirror half now.  

Michael> Root mounts read-only, so it should be a safe thing to do - I
Michael> only needed root fs and minimal set of services (which are
Michael> even in initramfs) just for it to boot up to SOME state where
Michael> I can log in remotely and fix things later.  (no I didn't
Michael> want to remove the drive yet, I wanted to examine it first,
Michael> and it turned to be a good idea because the hang was
Michael> happening only at the beginning of it, and while we tried to
Michael> install replacement and fill it up with data, there was an
Michael> unreadable sector found on another drive, so this old but not
Michael> removed drive was really handy).

Heh. I can see that but I honestly think this points back to a problem
LVM/MD have with failing disks, and that is they don't time out
properly when one half of the mirror is having problems.  

Michael> Another situation - after some weird crash I had to examine
Michael> the filesystems found on both components - I want to look
Michael> at the filesystems and compare them, WITHOUT messing up
Michael> with raid superblocks (later on I wrote a tiny program to
Michael> save/restore 0.90 superblocks), and without attempting a
Michael> reconstruction attempts.  In fact, this very case - examining
Michael> the contents - is something I've been doing many times for
Michael> one or another reason.  There's just no need to involve
Michael> raid layer here at all, but it doesn't disturb things either
Michael> (in some cases anyway).

This is where a netboot would be my preferred setup, or a LiveCD.
Booting off an OS half like this might seem like a good way to quickly
work around the problem, but I feel like you're breaking the
assumptions of the RAID setup and it's going to bite you worse one
day. 

Michael> Yet another - many times we had to copy an old system to
Michael> a new one - new machine boots with 3 drives in it, 2 new,
Michael> and 3rd (the boot one) from the old machine.  I boot it off
Michael> the non-raided config from the 3rd drive (using only the
Michael> halves of md devices), create new arrays on the 2 new
Michael> drives (note - had I started raid on the 3rd machine, there'd
Michael> be a problem with md device numbering, -- for consistency I
Michael> number all the partitions and raid arrays similarily on all
Michael> machines), and copy data over.  There's no need to do the
Michael> complex procedure of adding components to the existing raid
Michael> arrays, dropping the old drive from them and resizing the
Michael> stuff - because of the latter step (and because there's no
Michael> need to resync in the first place - the 2 new drives are
Michael> new, hence I use --no-resync because they're filled with
Michael> zeros anyway).

Michael> Another case - we had to copy large amount of data from one
Michael> machine to another, from a raid array.  I just pulled off the
Michael> disk (bitmaps=yes, and i remounted the filesystem readonly),
Michael> inserted it into another machine, mounted it - without raid -
Michael> here and did a copy.  Superblock was preserved, and when I
Michael> returned the drive back, everything was ok.

Michael> And so on.  There was countless number of cases like that,
Michael> something I forgot already too.

Michael> Well.  I know about a loop device which has "offset=XXX"
Michael> parameter, so one can actually see and use the "internals"
Michael> component of a raid1 array, even if the superblock is at the
Michael> beginning.  But see above, the very first case - go tell to
Michael> that operator how to do it all ;)

That's the real solution to me, using the loop device with an offset!
I keep forgetting about this too.  And I think it's a key thing. 
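
(Something along these lines, I believe - the offset has to match what
mdadm --examine reports as the Data Offset for that member, so treat the
numbers and device names as illustrative only:

  mdadm --examine /dev/sda1 | grep 'Data Offset'   # e.g. 2048 sectors
  losetup -o $((2048 * 512)) /dev/loop0 /dev/sda1
  mount -o ro /dev/loop0 /mnt

For 0.90 or 1.0 members no offset is needed at all, of course.)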

>> this leads to the heart of my initial post on this matter, that the
>> confusion of having four different variations of RAID superblocks is
>> bad.  We should deprecate them down to just two, the old 0.90 format,
>> and the new 1.x format at the start of the RAID volume.

Michael> It's confusing for sure.  But see: 0.90 format is the most
Michael> commonly used one, and the most important is that it's
Michael> historical - it was here for many years, many systems are
Michael> using it.  I don't want to come across a situation when, some
Michael> years later, I'll need to grab a data from my old disk and be
Michael> unable to, because 0.90 format isn't supported anymore.

I never said that we should drop *read* support for 0.90 format, I was
just suggesting that we make the 1.0 format with the superblock in the
first 4k of the partition be the default from now on.

Michael> 0.90 has some real limitations (like 26 components at max
Michael> etc), hence 1.x format appeared.  And various flavours of 1.x
Michael> format are all useful too.  For example, if you're concerned
Michael> about safety of your data due to defects(*) in your startup
Michael> scripts, -- use whatever 1.x format which puts the metadata
Michael> at the beginning.  That's just it, I think ;)

This should be the default in my mind: the new RAID 1.x format with the
superblock in the first 4k of the partition should be the default and only
version of 1.x used going forward.
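
(i.e. roughly what you only get today if you ask for it explicitly - an
illustrative invocation, not a recommendation for any particular box:

  mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1
  # or --metadata=1.2 for the variant offset 4K into the partition
)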

Michael> (*) Note: software like libvolume-id (part of udev) is able
Michael> to recognize parts of raid 0.90 arrays just fine.



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-22 20:39             ` John Stoffel
@ 2007-10-22 22:29               ` Michael Tokarev
  2007-10-24  0:42               ` Doug Ledford
  1 sibling, 0 replies; 88+ messages in thread
From: Michael Tokarev @ 2007-10-22 22:29 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, Justin Piszcz, linux-raid

John Stoffel wrote:

>>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

>>> If you are going to mirror an existing filesystem, then by definition
>>> you have a second disk or partition available for the purpose.  So you
>>> would merely setup the new RAID1, in degraded mode, using the new
>>> partition as the base.  Then you copy the data over to the new RAID1
>>> device, change your boot setup, and reboot.
> 
> Michael> And you have to copy the data twice as a result, instead of
> Michael> copying it only once to the second disk.
> 
> So?  Why is this such a big deal?  As I see it, there are two seperate
> ways to setup a RAID1 setup, on an OS.
[..]
That was just a tiny nitpick, so to say, about a particular way to
convert an existing system to raid1 - not something which is done every
day anyway.  Still, doubling the time for copying your terabyte-sized
drive is something to consider.

[]
> Michael> automatically activate it, thus making it "busy".  What I'm
> Michael> talking about here is that any automatic activation of
> Michael> anything should be done with extreme care, using smart logic
> Michael> in the startup scripts if at all.
> 
> Ah... but you can also de-active LVM partitions as well if you like.  

Yes, especially for a newbie user who just installed linux on his PC only
to find that he can't use his disk.. ;)  That was a real situation - I
helped someone who had never heard of LVM and had done little
with filesystems/disks before.

> Michael> Doug's example - in my opinion anyway - shows wrong tools
> Michael> or bad logic in the startup sequence, not a general flaw in
> Michael> superblock location.
> 
> I don't agree completely.  I think the superblock location is a key
> issue, because if you have a superblock location which moves depending
> the filesystem or LVM you use to look at the partition (or full disk)
> then you need to be even more careful about how to poke at things.

Superblock location does not depend on the filesystem.  Raid exports
the "inside" space only, excluding superblocks, to the next level
(filesystem or else).

> This is really true when you use the full disk for the mirror, because
> then you don't have the partition table to base some initial
> guestimates on.  Since there is an explicit Linux RAID partition type,
> as well as an explicit linux filesystem (filesystem is then decoded
> from the first Nk of the partition), you have a modicum of safety.

Speaking of whole disks - first, don't do that (for reasons suitable
for another topic), and second, using the whole disk or partitions
makes no real difference whatsoever to the topic being discussed.

There's just no need for the guesswork, except for the first install
(to automatically recognize existing devices, and to use them after
confirmation), and maybe for rescue systems, which again is a different
topic.

In any case, for a tool that does guesswork (like libvolume-id, to
create /dev/ symlinks), it's as easy to look at the end of the device
as at the beginning or at any other fixed place - since the tool has
to know the superblock format, it knows the superblock location as well.
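
(E.g. something as simple as

  mdadm --examine /dev/sda1

reports whether the device is a raid member and which superblock version
it carries, wherever the superblock happens to live - device name is just
an example.)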

Maybe "manual guesswork", based on hexdump of first several kilobytes
of data, is a bit more difficult in case where superblock is located
at the end.  But if one has to analyze hexdump, he doesn't care about
raid anymore.

> If ext3 has the superblock in the first 4k of the disk, but you've
> setup the disk to use RAID1 with the LVM superblock at the end of the
> disk, you now need to be careful about how the disk is detected and
> then mounted.

See above.  For tools, it's trivial to distinguish a component of a
raid volume from the volume itself, by looking for superblock at whatever
location.  Including stuff like mkfs, which - like mdadm does - may
warn one about previous filesystem/volume information on the device
in question.

> Michael> Speaking of cases where it was really helpful to have an
> Michael> ability to mount individual raid components directly without
> Michael> the raid level - most of them was due to one or another
> Michael> operator errors, usually together with bugs and/or omissions
> Michael> in software.  I don't remember the exact scenarios anymore (last
> Michael> time it was more than 2 years ago).  Most of the time it was
> Michael> one or another sort of system recovery.
> 
> In this case, you're only talking about RAID1 mirrors, no other RAID
> configuration fits this scenario.  And while this might look to be

Definitely.  However, linear arrays can - to some extent - be used partially.
But with much less usefulness, to be sure.

However, raid1 is a much more common setup than anything else - IMHO anyway.
It's the cheapest and the most reliable thing for an average user anyway -
it's cheaper to get 2 large drives than, say, 3 somewhat smaller drives.
Yes, raid1 has 1/2 the space "wasted", compared with, say, raid5 on top of 3
drives (only 1/3 wasted), but still, 3 smallish drives cost more than
2 larger drives.

> helpful, I would strongly argue that it's not, because it's a special
> case of the RAID code and can lead to all kinds of bugs and problems
> if it's not exercised properly. 

I'd say the key here is "if not excercised properly", not the superblock
location...

But we're now discussing a somewhat different thing.  It's historical: unix
has always allowed `rm -rf /' (well, almost - there used to be cases when
it wasn't possible to remove the file of a running executable - EBUSY).
For example, windows does not allow one to do such evil things, and a lot
of other similar stuff.  And for some strange reason I find unix to be
much more flexible and useful... oh well.  The question here is whether
an OS should think for/instead of the user or not.  Sure, tools
should not be dumb, and mdadm is probably a nice example of an intelligent
tool (I mean mdadm --create, which looks at the devices and warns you if
it thinks there may be something sensible in there).  But what's nice about
it is that when necessary, I'm able to do whatever I want to (and I actually
used `rm -rf /' once, for good.. but that was more for fun than for real,
removing some old test install while it was running).

> Michael> Problem occurs - obviously - when something goes wrong.  And
> Michael> most of the time issues we had happened on a remote site,
> Michael> where there was no experienced operator/sysadmin handy.
> 
> That's why I like Rackable's RMM, I get full serial console access to
> the BIOS and the system.  Very handy.  Or get a PC Weasel PCI card and
> install it into such systems, or a remote KVM, etc.  If you have to
> support remote systems, you need the infrastructure to properly
> support it.  

It isn't always possible.  The situation we have here is a lot of tiny
remote offices all around, in the city and around it, some 100Km away.
There's a single machine in each, which does communication tasks too -
by means of a second ethernet card or a dialup modem when good
connectivity isn't an option for whatever reason.  Maybe it'd be a
good idea to install tiny routers in each location, e.g.
linksys wrts ($80 or so each), but it doesn't buy much really --
because the downtime of the servers is very small (for about 10 years
this setup has been working at ~60 places, and only 3 or 4 times have
we needed to go on-site to fix things).

> Michael> For example, when one drive was almost dead, and mdadm tried
> Michael> to bring the array up, machine just hanged for unknown amount
> Michael> of time.  An inexperienced operator was there.  Instead of
> Michael> trying to teach him how to pass parameter to the initramfs to
> Michael> stop trying to assemble root array and next assembling it
> Michael> manually, I told him to pass "root=/dev/sda1" to the kernel.
> 
> I really don't like this, because you've now broken the RAID from the
> underside and it's not clear which is the clean mirror half now.  

Not clear?  Why, for God's sake??

Well.  One can speculate here about instability of device nodes and
other such things - that after the next reboot sda may suddenly swap
device nodes with sdb...  But that's irrelevant.  Even if the fs
is mounted read-write and one component of the raid has been modified behind
the md code's back, I can trivially fix things after the fact, even though
after the next reboot the md device will come back with all the components,
not noticing that some of them aren't the same.  Yes, things will be badly
broken if I changed the filesystem significantly while it was mounted from
a raid component.  But the thing is that I know what's going on, and
I will ensure things will be ok.

> Michael> Root mounts read-only, so it should be a safe thing to do - I
> Michael> only needed root fs and minimal set of services (which are
> Michael> even in initramfs) just for it to boot up to SOME state where
> Michael> I can log in remotely and fix things later.  (no I didn't
> Michael> want to remove the drive yet, I wanted to examine it first,
> Michael> and it turned to be a good idea because the hang was
> Michael> happening only at the beginning of it, and while we tried to
> Michael> install replacement and fill it up with data, there was an
> Michael> unreadable sector found on another drive, so this old but not
> Michael> removed drive was really handy).
> 
> Heh. I can see that but I honestly think this points back to a problem
> LVM/MD have with failing disks, and that is they don't time out
> properly when one half of the mirror is having problems.  

Yes - see above.  "Things work when everything works.  But problems
occur when something doesn't work as intended - be it hardware,
software bugs or operator errors" - something like that.

But the thing is - the bug - let's assume it was an error handling bug -
prevented the system from operating correctly.  The system provided some
business-critical tasks, and it had to be brought up.  I had a very simple
way to do it, to bring it up to a state where people were able to work with
it as usual, and THEN to look at bugs/patches/whatever.  Without even
going on-site, after a 15-minute phone call, it was
running within 20 minutes.

> Michael> Another situation - after some weird crash I had to examine
> Michael> the filesystems found on both components - I want to look
> Michael> at the filesystems and compare them, WITHOUT messing up
> Michael> with raid superblocks (later on I wrote a tiny program to
> Michael> save/restore 0.90 superblocks), and without attempting a
> Michael> reconstruction attempts.  In fact, this very case - examining
> Michael> the contents - is something I've been doing many times for
> Michael> one or another reason.  There's just no need to involve
> Michael> raid layer here at all, but it doesn't disturb things either
> Michael> (in some cases anyway).
> 
> This is where a netboot would be my preferred setup, or a LiveCD.
> Booting off an OS half like this might seem like a good way to quickly
> work around the problem, but I feel like you're breaking the
> assumptions of the RAID setup and it's going to bite you worse one
> day. 

"Be careful when you use force".  You can cut your finger or even
kill yourself or someone else with a sharp knife, but it doesn't mean
we should forbid knives.

> Michael> Well.  I know about a loop device which has "offset=XXX"
> Michael> parameter, so one can actually see and use the "internals"
> Michael> component of a raid1 array, even if the superblock is at the
> Michael> beginning.  But see above, the very first case - go tell to
> Michael> that operator how to do it all ;)
> 
> That's the real solution to me, using the loop device with an offset!
> I keep forgetting about this too.  And I think it's a key thing. 

Until this very thread, I hadn't thought about putting loop.ko into
my typical initramfs image... ;)

[]
> I never said that we should drop *read* support for 0.90 format, I was
> just suggesting that we make the 1.0 format with the superblock in the
> first 4k of the partition be the default from now on.

That's probably ok.  Not really sure what it buys us for real (again,
looking at the superblock at the end of the device like libvolume-id does
is trivial, but sure, not every "block device guesser" out there
has implemented this logic yet).  That is to say: there are still bugs in
other components (startup scripts, those guessers, ..) which may
confuse something so that your components get used improperly, and
placing the superblock at the beginning prevents some of them from
triggering.  Ditto for a user doing evil things by mistake - with
the superblock in front it becomes FAR less risky for the user -- who
uses force -- to do whatever he wants to, be it accidentally or
for real.

By the way.  The ability to mount/use a component device of a raid1
array independently of raid was a key point when I first used it.
The code (circa 1998) was new and buggy, it wasn't part of the
official kernel, and sure, I was afraid to use it.  But since I
knew I could always back off trivially by just removing the md layer and
one drive, I went on and deployed this stuff instead of an expensive
hardware raid solution.  This in turn allowed us to complete the
project, instead of rolling it back due to lack of money...

> Michael> 0.90 has some real limitations (like 26 components at max
> Michael> etc), hence 1.x format appeared.  And various flavours of 1.x
> Michael> format are all useful too.  For example, if you're concerned
> Michael> about safety of your data due to defects(*) in your startup
> Michael> scripts, -- use whatever 1.x format which puts the metadata
> Michael> at the beginning.  That's just it, I think ;)
> 
> This should be the default in my mind, having the new RAID 1.x format
> with the first 4k of the partition being the default and only version
> of 1.x that's used going forward.  

ok, at least 1.0 should be supported -- v.1 with the superblock at the end.
I for one will use it when I come across the limitations/whatever of the
0.90 format. ;)

Thanks for the reply!

/mjt

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: chunk size (was Re: Time to deprecate old RAID formats?)
  2007-10-20 13:29                       ` Doug Ledford
@ 2007-10-23 19:21                         ` Michal Soltys
  2007-10-24  0:14                           ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Michal Soltys @ 2007-10-23 19:21 UTC (permalink / raw)
  To: linux-raid; +Cc: Doug Ledford

Doug Ledford wrote:
> 
> Well, first I was thinking of files from a few hundred megabytes
> to gigabytes each, and when they are streamed, they are streamed at
> a rate much lower than the full speed of the array, but still at a fast
> rate.  How parallel the reads are would then tend to be a function of
> chunk size versus streaming rate.

Ahh, I see now. Thanks for explanation.

I wonder though, if setting a large readahead would help if you used a larger
chunk size - assuming the other options are not possible, i.e. streaming from
a larger buffer while reading into it at least a full stripe width at a time.
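
(Something like the following is what I have in mind - the value is only an
illustration, 65536 sectors being 32 MiB of readahead:

  blockdev --setra 65536 /dev/md0
)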

> 
> I'm not familiar with the benchmark you are referring to.
> 

I was thinking about 
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html

with the small discussion that happened after it.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 18:39           ` John Stoffel
  2007-10-19 21:23             ` Iustin Pop
@ 2007-10-23 23:03             ` Bill Davidsen
  2007-10-24  0:09               ` Doug Ledford
  2007-10-24 14:00               ` John Stoffel
  1 sibling, 2 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-23 23:03 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, Justin Piszcz, linux-raid

John Stoffel wrote:
> Why do we have three different positions for storing the superblock?  
>   
Why do you suggest changing anything until you get the answer to this 
question? If you don't understand why there are three locations, perhaps 
that would be a good initial investigation.

Clearly the short answer is that they reflect three stages of Neil's 
thinking on the topic, and I would bet that he had a good reason for 
moving the superblock when he did it.

Since you have to support all of them or break existing arrays, and they 
all use the same format so there's no saving of code size to mention, 
why even bring this up?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 21:42               ` Doug Ledford
  2007-10-20  7:53                 ` Iustin Pop
@ 2007-10-23 23:09                 ` Bill Davidsen
  1 sibling, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-23 23:09 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Iustin Pop, John Stoffel, Justin Piszcz, linux-raid

Doug Ledford wrote:
> On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
>   
>> On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
>>     
>>> And if putting the superblock at the end is problematic, why is it the
>>> default?  Shouldn't version 1.1 be the default?  
>>>       
>> In my opinion, having the superblock *only* at the end (e.g. the 0.90
>> format) is the best option.
>>
>> It allows one to mount the disk separately (in case of RAID 1), if the
>> MD superblock is corrupt or you just want to get easily at the raw data.
>>     
>
> Bad reasoning.  It's the reason that the default is at the end of the
> device, but that was a bad decision made by Ingo long, long ago in a
> galaxy far, far away.
>
> The simple fact of the matter is there are only two types of raid devices
> for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
> and those that don't (raid1, linear).
>
> For the purposes of this issue, there are only two states we care about:
> the raid array works or doesn't work.
>
> If the raid array works, then you *only* want the system to access the
> data via the raid array.  If the raid array doesn't work, then for the
> fragmented case you *never* want the system to see any of the data from
> the raid array (such as an ext3 superblock) or a subsequent fsck could
> see a valid superblock and actually start a filesystem scan on the raw
> device, and end up hosing the filesystem beyond all repair after it hits
> the first chunk size break (although in practice this is usually a
> situation where fsck declares the filesystem so corrupt that it refuses
> to touch it, that's leaving an awful lot to chance, you really don't
> want fsck to *ever* see that superblock).
>
> If the raid array is raid1, then the raid array should *never* fail to
> start unless all disks are missing (in which case there is no raw device
> to access anyway).  The very few failure types that will cause the raid
> array to not start automatically *and* still have an intact copy of the
> data usually happen when the raid array is perfectly healthy, in which
> case automatically finding a constituent device when the raid array
> failed to start is exactly the *wrong* thing to do (for instance, you
> enable SELinux on a machine and it hasn't been relabeled and the raid
> array fails to start because /dev/md<blah> can't be created because of
> an SELinux denial...all the raid1 members are still there, but if you
> touch a single one of them, then you run the risk of creating silent
> data corruption).
>
> It really boils down to this: for any reason that a raid array might
> fail to start, you *never* want to touch the underlying data until
> someone has taken manual measures to figure out why it didn't start and
> corrected the problem.  Putting the superblock in front of the data does
> not prevent manual measures (such as recreating superblocks) from
> getting at the data.  But, putting superblocks at the end leaves the
> door open for accidental access via constituent devices when you
> *really* don't want that to happen.
>   

You didn't mention some ill-behaved application using the raw device 
(i.e. a database) writing just a little more than it should and destroying
the superblock.
> So, no, the default should *not* be at the end of the device.
>
>   
You make a convincing argument.
>> As to the people who complained exactly because of this feature, LVM has
>> two mechanisms to protect from accessing PVs on the raw disks (the
>> ignore raid components option and the filter - I always set filters when
>> using LVM ontop of MD).
>>
>> regards,
>> iustin
>>     


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 14:52         ` John Stoffel
  2007-10-20 15:07           ` Iustin Pop
  2007-10-20 18:24           ` Michael Tokarev
@ 2007-10-23 23:18           ` Bill Davidsen
  2 siblings, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-23 23:18 UTC (permalink / raw)
  To: John Stoffel; +Cc: Michael Tokarev, Doug Ledford, Justin Piszcz, linux-raid

John Stoffel wrote:
>>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:
>>>>>>             
>
> Michael> Doug Ledford wrote:
> Michael> []
>   
>>> 1.0, 1.1, and 1.2 are the same format, just in different positions on
>>> the disk.  Of the three, the 1.1 format is the safest to use since it
>>> won't allow you to accidentally have some sort of metadata between the
>>> beginning of the disk and the raid superblock (such as an lvm2
>>> superblock), and hence whenever the raid array isn't up, you won't be
>>> able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worst
>>> case situations, I've seen lvm2 find a superblock on one RAID1 array
>>> member when the RAID1 array was down, the system came up, you used the
>>> system, the two copies of the raid array were made drastically
>>> inconsistent, then at the next reboot, the situation that prevented the
>>> RAID1 from starting was resolved, and it never knew it failed to start
>>> last time, and the two inconsistent members were put back into a clean
>>> array).  So, deprecating any of these is not really helpful.  And you
>>> need to keep the old 0.90 format around for back compatibility with
>>> thousands of existing raid arrays.
>>>       
>
> Michael> Well, I strongly, completely disagree.  You described a
> Michael> real-world situation, and that's unfortunate, BUT: for at
> Michael> least raid1, there ARE cases, pretty valid ones, when one
> Michael> NEEDS to mount the filesystem without bringing up raid.
> Michael> Raid1 allows that.
>
> Please describe one such case please.  There have certainly been hacks
> of various RAID systems on other OSes such as Solaris where the VxVM
> and/or Solstice DiskSuite allowed you to encapsulate an existing
> partition into a RAID array.  
>
> But in my experience (and I'm a professional sysadm... :-) it's not
> really all that useful, and can lead to problems like those described
> by Doug.  
>
> If you are going to mirror an existing filesystem, then by definition
> you have a second disk or partition available for the purpose.  So you
> would merely setup the new RAID1, in degraded mode, using the new
> partition as the base.  Then you copy the data over to the new RAID1
> device, change your boot setup, and reboot.
>
> Once that is done, you can then add the original partition into the
> RAID1 array.  
>
> As Doug says, and I agree strongly, you DO NOT want to have the
> possibility of confusion and data loss, especially on bootup.  And
> this leads to the heart of my initial post on this matter, that the
> confusion of having four different variations of RAID superblocks is
> bad.  We should deprecate them down to just two, the old 0.90 format,
> and the new 1.x format at the start of the RAID volume.
>   

Perhaps I am misreading you here: when you say "deprecate them down", do 
you mean the Adrian Bunk method of putting in a printk scolding the 
administrator and then removing the feature a version later, or did you 
mean "deprecate all but two", which clearly doesn't suggest removing the 
capability at all?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-19 16:34     ` Justin Piszcz
@ 2007-10-23 23:19       ` Bill Davidsen
  0 siblings, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-23 23:19 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: John Stoffel, linux-raid

Justin Piszcz wrote:
>
>
> On Fri, 19 Oct 2007, John Stoffel wrote:
>
>>>>>>> "Justin" == Justin Piszcz <jpiszcz@lucidpixels.com> writes:
>>
>> Justin> On Fri, 19 Oct 2007, John Stoffel wrote:
>>
>>>>
>>>> So,
>>>>
>>>> Is it time to start thinking about deprecating the old 0.9, 1.0 and
>>>> 1.1 formats to just standardize on the 1.2 format?  What are the
>>>> issues surrounding this?
>>>>
>>>> It's certainly easy enough to change mdadm to default to the 1.2
>>>> format and to require a --force switch to  allow use of the older
>>>> formats.
>>>>
>>>> I keep seeing that we support these old formats, and it's never been
>>>> clear to me why we have four different ones available?  Why can't we
>>>> start defining the canonical format for Linux RAID metadata?
>>>>
>>>> Thanks,
>>>> John
>>>> john@stoffel.org
>>>> -
>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>> linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
>> Justin> anything else!
>>
>> Are you sure?  I find that GRUB is much easier to use and setup than
>> LILO these days.  But hey, just dropping down to support 00.90.03 and
>> 1.2 formats would be fine too.  Let's just lessen the confusion if at
>> all possible.
>>
>> John
>>
>
> I am sure, I submitted a bug report to the LILO developer, he 
> acknowledged the bug but I don't know if it was fixed.
>
> I have not tried GRUB with a RAID1 setup yet.

Works fine.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-23 23:03             ` Bill Davidsen
@ 2007-10-24  0:09               ` Doug Ledford
  2007-10-24 23:55                 ` Neil Brown
  2007-10-24 14:00               ` John Stoffel
  1 sibling, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-24  0:09 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1710 bytes --]

On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
> John Stoffel wrote:
> > Why do we have three different positions for storing the superblock?  
> >   
> Why do you suggest changing anything until you get the answer to this 
> question? If you don't understand why there are three locations, perhaps 
> that would be a good initial investigation.
> 
> Clearly the short answer is that they reflect three stages of Neil's 
> thinking on the topic, and I would bet that he had a good reason for 
> moving the superblock when he did it.

I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
the device) is to satisfy people that want to get at their raid1 data
without bringing up the device or using a loop mount with an offset.
Version 1.1, at the beginning of the device, is to prevent accidental
access to a device when the raid array doesn't come up.  And version 1.2
(4k from the beginning of the device) would be suitable for those times
when you want to embed a boot sector at the very beginning of the device
(which really only needs 512 bytes, but a 4k offset is as easy to deal
with as anything else).  From the standpoint of wanting to make sure an
array is suitable for embedding a boot sector, the 1.2 superblock may be
the best default.
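
For reference, the location is simply selected at array creation time with
the metadata option - a rough sketch, the device names here are only
examples:

  mdadm -C /dev/md0 -l1 -n2 -e 1.0 /dev/sda1 /dev/sdb1   # at the end
  mdadm -C /dev/md0 -l1 -n2 -e 1.1 /dev/sda1 /dev/sdb1   # at the start
  mdadm -C /dev/md0 -l1 -n2 -e 1.2 /dev/sda1 /dev/sdb1   # 4K from the start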

> Since you have to support all of them or break existing arrays, and they 
> all use the same format so there's no saving of code size to mention, 
> why even bring this up?
> 
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: chunk size (was Re: Time to deprecate old RAID formats?)
  2007-10-23 19:21                         ` Michal Soltys
@ 2007-10-24  0:14                           ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-24  0:14 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1957 bytes --]

On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote:
> Doug Ledford wrote:
> > 
> > Well, first I was thinking of files in the few hundreds of megabytes
> > each to gigabytes each, and when they are streamed, they are streamed at
> > a rate much lower than the full speed of the array, but still at a fast
> > rate.  How parallel the reads are then would tend to be a function of
> > chunk size versus streaming rate. 
> 
> Ahh, I see now. Thanks for explanation.
> 
> I wonder though, if setting large readahead would help, if you used larger 
> chunk size. Assuming other options are not possible - i.e. streaming from 
> larger buffer, while reading to it in a full stripe width at least.

Probably not.  All my trial and error in the past with raid5 arrays and
various situations that would cause pathological worst case behavior
showed that once reads themselves reach 16k in size, and are sequential
in nature, then the disk firmware's read ahead kicks in and your
performance stays about the same regardless of increasing your OS read
ahead.  In a nutshell, once you've convinced the disk firmware that you
are going to be reading some data sequentially, it does the rest.  With
a large stripe size (say 256k+), you'll trigger this firmware read ahead
fairly early on in reading any given stripe, so you really don't buy
much by reading the next stripe before you need it, and in fact can end
up wasting a lot of RAM trying to do so, hurting overall performance.
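
The OS read ahead I mean here is the per-device setting you can inspect
and tune with blockdev - just a sketch, the device name and value are only
examples:

  blockdev --getra /dev/md0        # current read ahead, in 512-byte sectors
  blockdev --setra 4096 /dev/md0   # e.g. bump it to 2MB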

> > 
> > I'm not familiar with the benchmark you are referring to.
> > 
> 
> I was thinking about 
> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html
> 
> with small discussion that happend after that.
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 18:24           ` Michael Tokarev
  2007-10-22 20:39             ` John Stoffel
@ 2007-10-24  0:36             ` Doug Ledford
  1 sibling, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-24  0:36 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 4625 bytes --]

On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote:
> John Stoffel wrote:
> >>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

> > As Doug says, and I agree strongly, you DO NOT want to have the
> > possibility of confusion and data loss, especially on bootup.  And
> 
> There are different point of views, and different settings etc.

Indeed, there are different points of view.  And with that in mind, I'll
just point out that my point of view is that of an engineer who is
responsible for all the legitimate md bugs in our products once tech
support has weeded out the "you tried to do what?" cases.  From that
point of view, I deal with *every* user's preferred use case, not any
single use case.

> For example, I once dealt with a linux user who was unable to
> use his disk partition, because his system (it was RedHat if I
> remember correctly) recognized some LVM volume on his disk (it
> was previously used with Windows) and tried to automatically
> activate it, thus making it "busy".

Yep, that can still happen today under certain circumstances.

>   What I'm talking about here
> is that any automatic activation of anything should be done with
> extreme care, using smart logic in the startup scripts if at
> all.

We do.  Unfortunately, there is no logic smart enough to recognize all
the possible user use cases that we've seen given the way things are
created now.

> Doug's example - in my opinion anyway - shows the wrong tools
> or bad logic in the startup sequence, not a general flaw in
> superblock location.

Well, one of the problems is that you can both use an md device as an
LVM physical volume and use an LVM logical volume as an md constituent
device.  Users have done both.

> For example, when one drive was almost dead, and mdadm tried
> to bring the array up, the machine just hung for an unknown amount
> of time.  An inexperienced operator was there.  Instead of
> trying to teach him how to pass parameter to the initramfs
> to stop trying to assemble root array and next assembling
> it manually, I told him to pass "root=/dev/sda1" to the
> kernel.  Root mounts read-only, so it should be a safe thing
> to do - I only needed root fs and minimal set of services
> (which are even in initramfs) just for it to boot up to SOME
> state where I can log in remotely and fix things later.

Umm, no.  I can't speak for other distros, but both Fedora and RHEL
remount root rw even when coming up in single user mode.
The only time the fs is left in ro mode is when it drops to a shell
during rc.sysinit as a result of a failed fs check.  And if you are
using an ext3 filesystem and things didn't go down clean, then you also
get a journal replay.  So, then what happens when you think you've fixed
things, and you reboot, and then due to random chance, the ext3 fs check
gets the journal off the drive that wasn't mounted and replays things
again?  Will this overwrite your fixes possibly?  Yep.  Could do all
sorts of bad things.  In fact, unless you do a full binary compare of
your constituent devices, you could have silent data corruption and just
never know about it.  You may get off lucky and never *see* the
corruption, but it could well be there.  The only safe way to
reintegrate your raid after doing what you suggest is to kick the
unmounted drive out of the array before rebooting by using mdadm to zero
its superblock, boot up with a degraded raid1 array, and readd the
kicked device back in.
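
Roughly, the sequence I mean - a sketch only, md0/sda1/sdb1 are example
names, adjust to the real setup:

  # kick the member you did NOT mount, so its stale copy can't win:
  mdadm --zero-superblock /dev/sdb1
  # reboot; the array now assembles degraded from /dev/sda1 alone
  # then re-add the kicked member and let the resync rewrite it:
  mdadm /dev/md0 --add /dev/sdb1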

So, while you list several more examples of times when it was convenient
to do as you suggest, those cases can be handled in other, far safer ways
IMO (although it may mean keeping a rescue CD handy at each location just
for situations like this).

Now, putting all this back into the point of view I have to take, which
is what's the best default action to take for my customers, I'm sure you
can understand how a default setup and recommendation of use that leaves
silent data corruption is simply a non-starter for me.  If someone wants
to do this manually, then go right ahead.  But as for what we do by
default when the user asks us to create a raid array, we really need to
be on superblock 1.1 or 1.2 (although we aren't yet, we've waited for
the version 1 superblock issues to iron out and will do so in a future
release).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-22 20:39             ` John Stoffel
  2007-10-22 22:29               ` Michael Tokarev
@ 2007-10-24  0:42               ` Doug Ledford
  2007-10-24  9:40                 ` David Greaves
                                   ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-24  0:42 UTC (permalink / raw)
  To: John Stoffel; +Cc: Michael Tokarev, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2044 bytes --]

On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:

> I don't agree completely.  I think the superblock location is a key
> issue, because if you have a superblock location which moves depending
> the filesystem or LVM you use to look at the partition (or full disk)
> then you need to be even more careful about how to poke at things.

This is the heart of the matter.  When you consider that each file
system and each volume management stack has a superblock, and some store
their superblocks at the end of devices and some at the beginning,
and they can be stacked, then it becomes next to impossible to make sure
a stacked setup is never recognized incorrectly under any circumstance.
It might be possible if you use static device names, but our users
*long* ago complained very loudly when adding a new disk or removing a
bad disk caused their setup to fail to boot.  So, along came mount by
label and auto scans for superblocks.  Once you do that, you *really*
need all the superblocks at the same end of a device so when you stack
things, it always works properly.
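
By mount by label I mean the LABEL= style of fstab entry, which gets
resolved by scanning devices for a filesystem superblock carrying that
label - a sketch, the names are only examples:

  e2label /dev/md0 rootvol                 # put a label on the fs
  LABEL=rootvol  /  ext3  defaults  1 1    # matching /etc/fstab line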

> Michael> Another example is ext[234]fs - it does not touch first 512
> Michael> bytes of the device, so if there was an msdos filesystem
> Michael> there before, it will be recognized as such by many tools,
> Michael> and an attempt to mount it automatically will lead to at
> Michael> least scary output and nothing mounted, or in fsck doing
> Michael> fatal things to it in worst scenario.  Sure thing the first
> Michael> 512 bytes should be just cleared.. but that's another topic.
> 
> I would argue that ext[234] should be clearing those 512 bytes.  Why
> aren't they cleared?

Actually, I didn't think msdos used the first 512 bytes for the same
reason ext3 doesn't: space for a boot sector.
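
If someone really wants those bytes gone, clearing them by hand before
running mkfs is easy enough - a sketch, the device name is only an
example:

  dd if=/dev/zero of=/dev/sdb1 bs=512 count=1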

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24  0:42               ` Doug Ledford
@ 2007-10-24  9:40                 ` David Greaves
  2007-10-24 20:22                 ` Bill Davidsen
  2007-11-01 21:02                 ` H. Peter Anvin
  2 siblings, 0 replies; 88+ messages in thread
From: David Greaves @ 2007-10-24  9:40 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Michael Tokarev, Justin Piszcz, linux-raid

Doug Ledford wrote:
> On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
> 
>> I don't agree completely.  I think the superblock location is a key
>> issue, because if you have a superblock location which moves depending
>> the filesystem or LVM you use to look at the partition (or full disk)
>> then you need to be even more careful about how to poke at things.
> 
> This is the heart of the matter.  When you consider that each file
> system and each volume management stack has a superblock, and some store
> their superblocks at the end of devices and some at the beginning,
> and they can be stacked, then it becomes next to impossible to make sure
> a stacked setup is never recognized incorrectly under any circumstance.

I wonder if we should really be talking not about superblock versions 1.0, 1.1,
1.2 etc. but about a data format (0.9 vs 1.0) and a location (end, start, offset4k)?

This would certainly make things a lot clearer to new users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k


mdadm --detail /dev/md0

/dev/md0:
        Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
     Raid Level : raid0


And there you have the deprecation... only two superblock versions and no real
changes to code etc

David

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-23 23:03             ` Bill Davidsen
  2007-10-24  0:09               ` Doug Ledford
@ 2007-10-24 14:00               ` John Stoffel
  2007-10-24 15:18                 ` Mike Snitzer
  2007-10-24 15:32                 ` Bill Davidsen
  1 sibling, 2 replies; 88+ messages in thread
From: John Stoffel @ 2007-10-24 14:00 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: John Stoffel, Doug Ledford, Justin Piszcz, linux-raid

>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:

Bill> John Stoffel wrote:
>> Why do we have three different positions for storing the superblock?  

Bill> Why do you suggest changing anything until you get the answer to
Bill> this question? If you don't understand why there are three
Bill> locations, perhaps that would be a good initial investigation.

Because I've asked this question before and not gotten an answer, nor
does the mdadm man page explain why we have this setup.

Bill> Clearly the short answer is that they reflect three stages of
Bill> Neil's thinking on the topic, and I would bet that he had a good
Bill> reason for moving the superblock when he did it.

So let's hear Neil's thinking about all this?  Or should I just work
up a patch to do what I suggest and see how that flies? 

Bill> Since you have to support all of them or break existing arrays,
Bill> and they all use the same format so there's no saving of code
Bill> size to mention, why even bring this up?

Because of the confusion factor.  Again, since no one has been able to
articulate a reason why we have three different versions of the 1.x
superblock, nor have I seen any good reasons for why we should have
them, I'm going by the KISS principle to reduce the options to the
best one.

And no, I'm not advocating getting rid of legacy support, but I AM
advocating that we settle on ONE standard format going forward as the
default for all new RAID superblocks.

John

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to deprecate old RAID formats?
  2007-10-24 14:00               ` John Stoffel
@ 2007-10-24 15:18                 ` Mike Snitzer
  2007-10-24 15:32                 ` Bill Davidsen
  1 sibling, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2007-10-24 15:18 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid

On 10/24/07, John Stoffel <john@stoffel.org> wrote:
> >>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:
>
> Bill> John Stoffel wrote:
> >> Why do we have three different positions for storing the superblock?
>
> Bill> Why do you suggest changing anything until you get the answer to
> Bill> this question? If you don't understand why there are three
> Bill> locations, perhaps that would be a good initial investigation.
>
> Because I've asked this question before and not gotten an answer, nor
> is it answered in the man page for mdadm on why we have this setup.
>
> Bill> Clearly the short answer is that they reflect three stages of
> Bill> Neil's thinking on the topic, and I would bet that he had a good
> Bill> reason for moving the superblock when he did it.
>
> So let's hear Neil's thinking about all this?  Or should I just work
> up a patch to do what I suggest and see how that flies?
>
> Bill> Since you have to support all of them or break existing arrays,
> Bill> and they all use the same format so there's no saving of code
> Bill> size to mention, why even bring this up?
>
> Because of the confusion factor.  Again, since no one has been able to
> articulate a reason why we have three different versions of the 1.x
> superblock, nor have I seen any good reasons for why we should have
> them, I'm going by the KISS principle to reduce the options to the
> best one.
>
> And no, I'm not advocating getting rid of legacy support, but I AM
> advocating that we settle on ONE standard format going forward as the
> default for all new RAID superblocks.

Why exactly are you on this crusade to find the one "best" v1
superblock location?  Giving people the freedom to place the
superblock where they choose isn't a bad thing.  Would adding
something like "If in doubt, 1.1 is the safest choice." to the mdadm
man page give you the KISS warm-fuzzies you're pining for?

The fact that, after you read the manpage, you didn't even know that
the only difference between the v1.x variants is the location where the
superblock is placed indicates that you're not in a position to be so
tremendously evangelical about effecting code changes that limit
existing options.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24 14:00               ` John Stoffel
  2007-10-24 15:18                 ` Mike Snitzer
@ 2007-10-24 15:32                 ` Bill Davidsen
  1 sibling, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-24 15:32 UTC (permalink / raw)
  To: John Stoffel; +Cc: Doug Ledford, Justin Piszcz, linux-raid

John Stoffel wrote:
>>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:
>>>>>>             
>
> Bill> John Stoffel wrote:
>   
>>> Why do we have three different positions for storing the superblock?  
>>>       
>
> Bill> Why do you suggest changing anything until you get the answer to
> Bill> this question? If you don't understand why there are three
> Bill> locations, perhaps that would be a good initial investigation.
>
> Because I've asked this question before and not gotten an answer, nor
> is it answered in the man page for mdadm on why we have this setup. 
>
> Bill> Clearly the short answer is that they reflect three stages of
> Bill> Neil's thinking on the topic, and I would bet that he had a good
> Bill> reason for moving the superblock when he did it.
>
> So let's hear Neil's thinking about all this?  Or should I just work
> up a patch to do what I suggest and see how that flies? 
>   

If you are only going to change the default, I think you're done, since 
people report problems with bootloaders when booting from versions other 
than 0.90. And until I hear Neil's thinking on this, I'm not sure that I 
know what the default location and type should be. In fact, reading the 
discussion I suspect it should be different for RAID-1 (should be at the 
end) and all other types (should be near the front). That retains the 
ability to mount one part of the mirror as a single partition, while 
minimizing the possibility of bad applications seeing something which 
looks like a filesystem at the start of a partition and trying to run 
fsck on it.
> Bill> Since you have to support all of them or break existing arrays,
> Bill> and they all use the same format so there's no saving of code
> Bill> size to mention, why even bring this up?
>
> Because of the confusion factor.  Again, since no one has been able to
> articulate a reason why we have three different versions of the 1.x
> superblock, nor have I seen any good reasons for why we should have
> them, I'm going by the KISS principle to reduce the options to the
> best one.
>
> And no, I'm not advocating getting rid of legacy support, but I AM
> advocating that we settle on ONE standard format going forward as the
> default for all new RAID superblocks.
>   

Unfortunately the solution can't be any simpler than the problem, and 
that's why I'm dubious that anything but the documentation should be 
changed, or an additional metadata target added per the discussion 
above, perhaps "best1" for best 1.x format based on the raid level.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24  0:42               ` Doug Ledford
  2007-10-24  9:40                 ` David Greaves
@ 2007-10-24 20:22                 ` Bill Davidsen
  2007-10-25 16:29                   ` Doug Ledford
  2007-11-01 21:02                 ` H. Peter Anvin
  2 siblings, 1 reply; 88+ messages in thread
From: Bill Davidsen @ 2007-10-24 20:22 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Michael Tokarev, Justin Piszcz, linux-raid

Doug Ledford wrote:
> On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
>
>   
>> I don't agree completely.  I think the superblock location is a key
>> issue, because if you have a superblock location which moves depending
>> the filesystem or LVM you use to look at the partition (or full disk)
>> then you need to be even more careful about how to poke at things.
>>     
>
> This is the heart of the matter.  When you consider that each file
> > system and each volume management stack has a superblock, and some store
> > their superblocks at the end of devices and some at the beginning,
> and they can be stacked, then it becomes next to impossible to make sure
> a stacked setup is never recognized incorrectly under any circumstance.
> It might be possible if you use static device names, but our users
> *long* ago complained very loudly when adding a new disk or removing a
> bad disk caused their setup to fail to boot.  So, along came mount by
> label and auto scans for superblocks.  Once you do that, you *really*
> need all the superblocks at the same end of a device so when you stack
> things, it always works properly.
Let me play devil's advocate: I noted in another post that location might 
be raid level dependent. For raid-1 putting the superblock at the end 
allows the BIOS to treat a single partition as a bootable unit. For all 
other arrangements the end location puts the superblock where it is 
slightly more likely to be overwritten, and where it must be moved if 
the partition grows or whatever.

There really may be no "right" answer.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24  0:09               ` Doug Ledford
@ 2007-10-24 23:55                 ` Neil Brown
  2007-10-25  0:09                   ` Jeff Garzik
                                     ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Neil Brown @ 2007-10-24 23:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Bill Davidsen, John Stoffel, Justin Piszcz, linux-raid

On Tuesday October 23, dledford@redhat.com wrote:
> On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
> > John Stoffel wrote:
> > > Why do we have three different positions for storing the superblock?  
> > >   
> > Why do you suggest changing anything until you get the answer to this 
> > question? If you don't understand why there are three locations, perhaps 
> > that would be a good initial investigation.
> > 
> > Clearly the short answer is that they reflect three stages of Neil's 
> > thinking on the topic, and I would bet that he had a good reason for 
> > moving the superblock when he did it.
> 
> I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
> the device) is to satisfy people that want to get at their raid1 data
> without bringing up the device or using a loop mount with an offset.
> Version 1.1, at the beginning of the device, is to prevent accidental
> access to a device when the raid array doesn't come up.  And version 1.2
> (4k from the beginning of the device) would be suitable for those times
> when you want to embed a boot sector at the very beginning of the device
> (which really only needs 512 bytes, but a 4k offset is as easy to deal
> with as anything else).  From the standpoint of wanting to make sure an
> array is suitable for embedding a boot sector, the 1.2 superblock may be
> the best default.
> 

Exactly correct.

Another perspective is that I chickened out of making a decision and
chose to support all the credible possibilities that I could think of.
And showed that I didn't have enough imagination.  The other
possibility that I should have included (as has been suggested in this
conversation, and previously on this list) is to store the superblock
both at the beginning and the end for redundancy.  However I cannot
decide whether to combine the 1.0 and 1.1 locations, or the 1.0 and
1.2.  And I don't think I want to support both (maybe I've learned my
lesson).

As for where the metadata "should" be placed, it is interesting to
observe that the SNIA's "DDFv1.2" puts it at the end of the device.
And as DDF is an industry standard sponsored by multiple companies it
must be ......
Sorry.  I had intended to say "correct", but when it came to it, my
fingers refused to type that word in that context.

DDF is in a somewhat different situation though.  It assumes that the
components are whole devices, and that the controller has exclusive
access - there is no way another controller could interpret the
devices differently before the DDF controller has a chance.

DDF is also interesting in that it uses 512 byte alignment for
metadata.  The 'anchor' block is in the last sector of the device.
This contrasts with current md metadata which is all 4K aligned.
Given that the drive manufacturers seem to be telling us that "4096 is
the new 512", I think 4K alignment was a good idea.
It could be that DDF actually specifies the anchor to reside in the
last "block" rather than the last "sector", and it could be that the
spec allows for block size to be device specific - I'd have to hunt
through the spec again to be sure.

For the record, I have no intention of deprecating any of the metadata
formats, not even 0.90.
It is conceivable that I could change the default, though that would
require a decision as to what the new default would be.  I think it
would have to be 1.0 or it would cause too much confusion.

I think it would be entirely appropriate for a distro (especially an
'enterprise' distro) to choose a format and location that it was going
to standardise on and support, and make that the default on that
distro (by using a CREATE line in mdadm.conf).  Debian has already
done this by making 1.0 the default.
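
Something along these lines in mdadm.conf, for example - a sketch, the
exact keywords a distro chooses may vary:

  CREATE owner=root group=disk mode=0660 auto=yes metadata=1.0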

I certainly accept that the documentation is probably less than
perfect (by a large margin).  I am more than happy to accept patches
or concrete suggestions on how to improve that.  I always think it is
best if a non-developer writes documentation (and a developer reviews
it) as then it is more likely to address the issues that a
non-developer will want to read about, and in a way that will make
sense to a non-developer. (i.e. I'm too close to the subject to write
good doco).

NeilBrown


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24 23:55                 ` Neil Brown
@ 2007-10-25  0:09                   ` Jeff Garzik
  2007-10-25  8:09                     ` David Greaves
  2007-10-25  7:01                   ` Doug Ledford
  2007-10-25 14:49                   ` Bill Davidsen
  2 siblings, 1 reply; 88+ messages in thread
From: Jeff Garzik @ 2007-10-25  0:09 UTC (permalink / raw)
  To: Neil Brown
  Cc: Doug Ledford, Bill Davidsen, John Stoffel, Justin Piszcz,
	linux-raid

Neil Brown wrote:
> On Tuesday October 23, dledford@redhat.com wrote:
> As for where the metadata "should" be placed, it is interesting to
> observe that the SNIA's "DDFv1.2" puts it at the end of the device.
> And as DDF is an industry standard sponsored by multiple companies it
> must be ......
> Sorry.  I had intended to say "correct", but when it came to it, my
> fingers refused to type that word in that context.
> 
> DDF is in a somewhat different situation though.  It assumes that the
> components are whole devices, and that the controller has exclusive
> access - there is no way another controller could interpret the
> devices differently before the DDF controller has a chance.

<grin> agreed.


> DDF is also interesting in that it uses 512 byte alignment for
> metadata.  The 'anchor' block is in the last sector of the device.
> This contrasts with current md metadata which is all 4K aligned.
> Given that the drive manufacturers seem to be telling us that "4096 is
> the new 512", I think 4K alignment was a good idea.
> It could be that DDF actually specifies the anchor to reside in the
> last "block" rather than the last "sector", and it could be that the
> spec allows for block size to be device specific - I'd have to hunt
> through the spec again to be sure.

It's a bit of a mess.

Yes, with 1K and 4K sector devices starting to appear, as long as the 
underlying partitioning gets the initial partition alignment correct, 
this /should/ continue functioning as normal.

If for whatever reason you wind up with an odd-aligned 1K sector device 
and your data winds up aligned to even numbered [hard] sectors, 
performance will definitely suffer.

Mostly this is out of MD's hands, and up to the sysadmin and 
partitioning tools to get hard-sector alignment right.
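
A quick sanity check is just to look at the partition start sectors - a
sketch, assuming an msdos-labelled disk and that sda is the device in
question:

  fdisk -lu /dev/sda    # start sectors divisible by 8 are 4KB-aligned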


> For the record, I have no intention of deprecating any of the metadata
> formats, not even 0.90.

strongly agreed


> It is conceivable that I could change the default, though that would
> require a decision as to what the new default would be.  I think it
> would have to be 1.0 or it would cause too much confusion.

A newer default would be nice.

	Jeff



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24 23:55                 ` Neil Brown
  2007-10-25  0:09                   ` Jeff Garzik
@ 2007-10-25  7:01                   ` Doug Ledford
  2007-10-25 14:49                   ` Bill Davidsen
  2 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-25  7:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: Bill Davidsen, John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

On Thu, 2007-10-25 at 09:55 +1000, Neil Brown wrote:

> As for where the metadata "should" be placed, it is interesting to
> observe that the SNIA's "DDFv1.2" puts it at the end of the device.
> And as DDF is an industry standard sponsored by multiple companies it
> must be ......
> Sorry.  I had intended to say "correct", but when it came to it, my
> fingers refused to type that word in that context.
> 
> DDF is in a somewhat different situation though.  It assumes that the
> components are whole devices, and that the controller has exclusive
> access - there is no way another controller could interpret the
> devices differently before the DDF controller has a chance.

Putting a superblock at the end of a device works around OS
compatibility issues and other things related to transitioning the
device from part of an array to not, etc.  But, it works if and only if
you have the guarantee you mention.  Long, long ago I tinkered with the
idea of md multipath devices using an end of device superblock on the
whole device to allow reliable multipath detection and autostart,
failover of all partitions on a device when a command to any partition
failed, ability to use standard partition tables, etc. while being 100%
transparent to the rest of the OS.  The second you considered FC
connected devices and multi-OS access, that fell apart in a big way.
Very analogous.

So, I wouldn't necessarily call it wrong, but it's fragile.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-25  0:09                   ` Jeff Garzik
@ 2007-10-25  8:09                     ` David Greaves
  2007-10-26  6:16                       ` Neil Brown
  0 siblings, 1 reply; 88+ messages in thread
From: David Greaves @ 2007-10-25  8:09 UTC (permalink / raw)
  To: Jeff Garzik, Neil Brown
  Cc: Doug Ledford, Bill Davidsen, John Stoffel, Justin Piszcz,
	linux-raid

Jeff Garzik wrote:
> Neil Brown wrote:
>> As for where the metadata "should" be placed, it is interesting to
>> observe that the SNIA's "DDFv1.2" puts it at the end of the device.
>> And as DDF is an industry standard sponsored by multiple companies it
>> must be ......
>> Sorry.  I had intended to say "correct", but when it came to it, my
>> fingers refused to type that word in that context.

>> For the record, I have no intention of deprecating any of the metadata
>> formats, not even 0.90.
> 
> strongly agreed

I didn't get a reply to my suggestion of separating the data and location...

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new (and old!) users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location end

resulting in:
mdadm --detail /dev/md0

/dev/md0:
        Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
     Raid Level : raid0

You provide rational defaults for mortals and this approach allows people like
Doug to do wacky HA things explicitly.

I'm not sure you need any changes to the kernel code - probably just the docs
and mdadm.

>> It is conceivable that I could change the default, though that would
>> require a decision as to what the new default would be.  I think it
>> would have to be 1.0 or it would cause too much confusion.
> 
> A newer default would be nice.

I also suspect that a *lot* of people will assume that the highest superblock
version is the best and should be used for new installs etc.

So if you make 1.0 the default then how many users will try 'the bleeding edge'
and use 1.2? So then you have 1.3 which is the same as 1.0? Hmmmm? So to quote
from an old Soap: "Confused, you  will be..."

David

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24 23:55                 ` Neil Brown
  2007-10-25  0:09                   ` Jeff Garzik
  2007-10-25  7:01                   ` Doug Ledford
@ 2007-10-25 14:49                   ` Bill Davidsen
  2007-10-25 15:00                     ` David Greaves
  2007-10-26  5:56                     ` Neil Brown
  2 siblings, 2 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-25 14:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: Doug Ledford, John Stoffel, Justin Piszcz, linux-raid

Neil Brown wrote:
> I certainly accept that the documentation is probably less than
> perfect (by a large margin).  I am more than happy to accept patches
> or concrete suggestions on how to improve that.  I always think it is
> best if a non-developer writes documentation (and a developer reviews
> it) as then it is more likely to address the issues that a
> non-developer will want to read about, and in a way that will make
> sense to a non-developer. (i.e. I'm too close to the subject to write
> good doco).

Patches against what's in 2.6.4 I assume? I can't promise to write 
anything which pleases even me, but I will take a look at it.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-25 14:49                   ` Bill Davidsen
@ 2007-10-25 15:00                     ` David Greaves
  2007-10-26  5:56                     ` Neil Brown
  1 sibling, 0 replies; 88+ messages in thread
From: David Greaves @ 2007-10-25 15:00 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Neil Brown, Doug Ledford, John Stoffel, Justin Piszcz, linux-raid

Bill Davidsen wrote:
> Neil Brown wrote:
>> I certainly accept that the documentation is probably less than
>> perfect (by a large margin).  I am more than happy to accept patches
>> or concrete suggestions on how to improve that.  I always think it is
>> best if a non-developer writes documentation (and a developer reviews
>> it) as then it is more likely to address the issues that a
>> non-developer will want to read about, and in a way that will make
>> sense to a non-developer. (i.e. I'm too close to the subject to write
>> good doco).
> 
> Patches against what's in 2.6.4 I assume? I can't promise to write
> anything which pleases even me, but I will take a look at it.
> 

The man page is a great place for describing, eg, the superblock location; but
don't forget we have
  http://linux-raid.osdl.org/index.php/Main_Page
which is probably a better place for *discussions* (or essays) about the
superblock location (eg the LVM / v1.1 comment Janek picked up on)

In fact I was going to take some of the writings from this thread and put them
up there.

David

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24 20:22                 ` Bill Davidsen
@ 2007-10-25 16:29                   ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-25 16:29 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: John Stoffel, Michael Tokarev, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2678 bytes --]

On Wed, 2007-10-24 at 16:22 -0400, Bill Davidsen wrote:
> Doug Ledford wrote:
> > On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
> >
> >   
> >> I don't agree completely.  I think the superblock location is a key
> >> issue, because if you have a superblock location which moves depending
> >> the filesystem or LVM you use to look at the partition (or full disk)
> >> then you need to be even more careful about how to poke at things.
> >>     
> >
> > This is the heart of the matter.  When you consider that each file
> > system and each volume management stack has a superblock, and some store
> > their superblocks at the end of devices and some at the beginning,
> > and they can be stacked, then it becomes next to impossible to make sure
> > a stacked setup is never recognized incorrectly under any circumstance.
> > It might be possible if you use static device names, but our users
> > *long* ago complained very loudly when adding a new disk or removing a
> > bad disk caused their setup to fail to boot.  So, along came mount by
> > label and auto scans for superblocks.  Once you do that, you *really*
> > need all the superblocks at the same end of a device so when you stack
> > things, it always works properly.
> Let me be devil's advocate, I noted in another post that location might 
> be raid level dependent. For raid-1 putting the superblock at the end 
> allows the BIOS to treat a single partition as a bootable unit.

This is true for both the 1.0 and 1.2 superblock formats.  The BIOS
couldn't care less if there is an offset to the filesystem because it
doesn't try to read from the filesystem.  It just jumps to the first 512
byte sector and that's it.  Grub/Lilo are the ones that have to know
about the offset, and they would be made aware of the offset at install
time.

So, we are back to the exact same thing I was talking about.  With the
superblock at the beginning of the device, you don't hinder bootability
with or without the raid working; the raid would be bootable regardless,
as long as you made it bootable.  It only hinders accessing the
filesystem via a running linux installation without bringing up the
raid.

>  For all 
> other arrangements the end location puts the superblock where it is 
> slightly more likely to be overwritten, and where it must be moved if 
> the partition grows or whatever.
> 
> There really may be no "right" answer.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-25 14:49                   ` Bill Davidsen
  2007-10-25 15:00                     ` David Greaves
@ 2007-10-26  5:56                     ` Neil Brown
  1 sibling, 0 replies; 88+ messages in thread
From: Neil Brown @ 2007-10-26  5:56 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Doug Ledford, John Stoffel, Justin Piszcz, linux-raid

On Thursday October 25, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > I certainly accept that the documentation is probably less than
> > perfect (by a large margin).  I am more than happy to accept patches
> > or concrete suggestions on how to improve that.  I always think it is
> > best if a non-developer writes documentation (and a developer reviews
> > it) as then it is more likely to address the issues that a
> > non-developer will want to read about, and in a way that will make
> > sense to a non-developer. (i.e. I'm too close to the subject to write
> > good doco).
> 
> Patches against what's in 2.6.4 I assume? I can't promise to write 
> anything which pleases even me, but I will take a look at it.

Any text at all would be welcome, but yes; patches against 2.6.4 would
be easiest.

Thanks
NeilBrown

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-25  8:09                     ` David Greaves
@ 2007-10-26  6:16                       ` Neil Brown
  2007-10-26 14:18                         ` Bill Davidsen
  0 siblings, 1 reply; 88+ messages in thread
From: Neil Brown @ 2007-10-26  6:16 UTC (permalink / raw)
  To: David Greaves
  Cc: Jeff Garzik, Doug Ledford, Bill Davidsen, John Stoffel,
	Justin Piszcz, linux-raid

On Thursday October 25, david@dgreaves.com wrote:
> 
> I didn't get a reply to my suggestion of separating the data and location...

No. Sorry.

> 
> ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
> format (0.9 vs 1.0) and a location (end,start,offset4k)?
> 
> This would certainly make things a lot clearer to new (and old!) users:
> 
> mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
> or
> mdadm --create /dev/md0 --metadata 1.0 --meta-location start
> or
> mdadm --create /dev/md0 --metadata 1.0 --meta-location end

I'm happy to support synonyms.  How about

   --metadata 1-end
   --metadata 1-start

??

> 
> resulting in:
> mdadm --detail /dev/md0
> 
> /dev/md0:
>         Version : 01.0
>   Metadata-locn : End-of-device

It already lists the superblock location as a sector offset, but I
don't have a problem with reporting:

          Version : 1.0 (metadata at end of device)
	  Version : 1.1 (metadata at start of device)

Would that help?


>   Creation Time : Fri Aug  4 23:05:02 2006
>      Raid Level : raid0
> 
> You provide rational defaults for mortals and this approach allows people like
> Doug to do wacky HA things explicitly.
> 
> I'm not sure you need any changes to the kernel code - probably just the docs
> and mdadm.

True.

> 
> >> It is conceivable that I could change the default, though that would
> >> require a decision as to what the new default would be.  I think it
> >> would have to be 1.0 or it would cause too much confusion.
> > 
> > A newer default would be nice.
> 
> I also suspect that a *lot* of people will assume that the highest superblock
> version is the best and should be used for new installs etc.

Grumble... why can't people expect what I want them to expect?

> 
> So if you make 1.0 the default then how many users will try 'the bleeding edge'
> and use 1.2? So then you have 1.3 which is the same as 1.0? Hmmmm? So to quote
> from an old Soap: "Confused, you  will be..."
:-)

NeilBrown

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-20 13:11                   ` Doug Ledford
@ 2007-10-26  9:54                     ` Luca Berra
  2007-10-26 16:22                       ` Gabor Gombas
  2007-10-26 18:52                       ` Doug Ledford
  0 siblings, 2 replies; 88+ messages in thread
From: Luca Berra @ 2007-10-26  9:54 UTC (permalink / raw)
  To: linux-raid

On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
>On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:
>
>> Honestly, I don't see how a properly configured system would start
>> looking at the physical device by mistake. I suppose it's possible, but
>> I didn't have this issue.
>
>Mount by label support scans all devices in /proc/partitions looking for
>the filesystem superblock that has the label you are trying to mount.
it could probably be smarter, but in any case there is no point in
mounting an md device by label.
>LVM (unless told not to) scans all devices in /proc/partitions looking
yes, but lvm, unless told otherwise, will ignore devices that have a valid
md superblock.
>for valid LVM superblocks.  In fact, you can't build a linux system that
>is resilient to device name changes without doing that.
i dislike labels, especially for devices that contain the os. we should
take great care that these are identified correctly, and
mount-by-label does not guarantee that (usb drives that migrate from one
system to another are so common that you can't ignore them)

you forgot udev ;)

but the fix is easy.
remove the partition detection code from the kernel and start working on
a smart userspace replacement for device detection. we already have
vol_id from udev and blkid from e2fsprogs, which support detection of many
device formats.
just apply some rules, so if you find a partition table _AND_ an md
superblock at the end, read both and you can tell if it is an md on a
partition or a partitioned md raid1 device.
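
both will already tell you what they think is on a device - a sketch, the
device name is only an example and the option names are from memory:

  blkid /dev/sda1          # e.g. TYPE="linux_raid_member" or TYPE="ext3"
  vol_id --type /dev/sda1  # udev's equivalent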

>And you can with superblock at the front.  You can create a new single
>disk raid1 over the existing superblock or you can munge the partition
>table to have it point at the start of your data.  There are options,
Please don't do that;
use device-mapper to set the device up, without mucking with partition
tables.
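
e.g. a linear device-mapper mapping that just skips the metadata - a
sketch only, the device name is an example and the 8-sector offset is only
an illustration (check mdadm -E for where the data really starts):

  SIZE=$(blockdev --getsz /dev/sdb1)
  echo "0 $((SIZE - 8)) linear /dev/sdb1 8" | dmsetup create sdb1_inner
  mount -o ro /dev/mapper/sdb1_inner /mnt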

L.


-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26  6:16                       ` Neil Brown
@ 2007-10-26 14:18                         ` Bill Davidsen
  2007-10-26 18:41                           ` Doug Ledford
  2007-10-30  3:25                           ` Neil Brown
  0 siblings, 2 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-26 14:18 UTC (permalink / raw)
  To: Neil Brown
  Cc: David Greaves, Jeff Garzik, Doug Ledford, John Stoffel,
	Justin Piszcz, linux-raid

Neil Brown wrote:
> On Thursday October 25, david@dgreaves.com wrote:
>   
>> I didn't get a reply to my suggestion of separating the data and location...
>>     
>
> No. Sorry.
>
>   
>> ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
>> format (0.9 vs 1.0) and a location (end,start,offset4k)?
>>
>> This would certainly make things a lot clearer to new (and old!) users:
>>
>> mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
>> or
>> mdadm --create /dev/md0 --metadata 1.0 --meta-location start
>> or
>> mdadm --create /dev/md0 --metadata 1.0 --meta-location end
>>     
>
> I'm happy to support synonyms.  How about
>
>    --metadata 1-end
>    --metadata 1-start
>
> ??
>   
Offset? Do you like "1-offset4k" or maybe "1-start4k" or even 
"1-start+4k" for that? The last is most intuitive but I don't know how 
you feel about the + in there.
>   
>> resulting in:
>> mdadm --detail /dev/md0
>>
>> /dev/md0:
>>         Version : 01.0
>>   Metadata-locn : End-of-device
>>     
>
> It already lists the superblock location as a sector offset, but I
> don't have a problem with reporting:
>
>           Version : 1.0 (metadata at end of device)
> 	  Version : 1.1 (metadata at start of device)
>
> Would that help?
>
>   
Same comments on the reporting, "metadata at block 4k" or something.
>   
>>   Creation Time : Fri Aug  4 23:05:02 2006
>>      Raid Level : raid0
>>
>> You provide rational defaults for mortals and this approach allows people like
>> Doug to do wacky HA things explicitly.
>>
>> I'm not sure you need any changes to the kernel code - probably just the docs
>> and mdadm.
>>     
>
> True.
>
>   
>>>> It is conceivable that I could change the default, though that would
>>>> require a decision as to what the new default would be.  I think it
>>>> would have to be 1.0 or it would cause too much confusion.
>>>>         
>>> A newer default would be nice.
>>>       
>> I also suspect that a *lot* of people will assume that the highest superblock
>> version is the best and should be used for new installs etc.
>>     
>
> Grumble... why can't people expect what I want them to expect?
>
>   
I confess that I thought 1.x was a series of solutions reflecting your 
evolving opinion on what was best, so maybe in retrospect you made a 
non-intuitive choice of nomenclature. Or bluntly, you picked confusing 
names for this and confused people. If 1.0 meant start, 1.1 meant 4k, 
and 1.2 meant end, at least it would be easy to remember for people who 
only create a new array a few times a year, or once in the lifetime of a 
new computer.
>> So if you make 1.0 the default then how many users will try 'the bleeding edge'
>> and use 1.2? So then you have 1.3 which is the same as 1.0? Hmmmm? So to quote
>> from an old Soap: "Confused, you  will be..."
>>     

Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26  9:54                     ` Luca Berra
@ 2007-10-26 16:22                       ` Gabor Gombas
  2007-10-26 17:06                         ` Gabor Gombas
  2007-10-26 18:52                       ` Doug Ledford
  1 sibling, 1 reply; 88+ messages in thread
From: Gabor Gombas @ 2007-10-26 16:22 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 26, 2007 at 11:54:18AM +0200, Luca Berra wrote:

> but the fix is easy.
> remove the partition detection code from the kernel and start working on
> a smart userspace replacement for device detection. we already have
> vol_id from udev and blkid from ext3 which support detection of many
> device formats.

You got the ordering wrong. You should get userspace support ready and
accepted _first_, and then you can start the
flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
configurable. But even if you have the perfect userspace solution ready
today, removing partitioning support from the kernel is a pretty
invasive ABI change so it will take many years if it ever happens at
all.

I saw the "let's move partition detection to user space" argument
several times on l-k in the past years but it never gained support...
So if you want to make it happen, stop talking and start coding, and
persuade all major distros to accept your changes. _Then_ you can start
arguing to remove partition detection from the kernel, and even then it
won't be easy.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 16:22                       ` Gabor Gombas
@ 2007-10-26 17:06                         ` Gabor Gombas
  2007-10-27 10:34                           ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Gabor Gombas @ 2007-10-26 17:06 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:

> You got the ordering wrong. You should get userspace support ready and
> accepted _first_, and then you can start the
> flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
> configurable.

Oh wait that is possible even today. So you can build your own kernel
without any partition table format support - problem solved.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 14:18                         ` Bill Davidsen
@ 2007-10-26 18:41                           ` Doug Ledford
  2007-10-26 22:20                             ` Gabor Gombas
                                               ` (2 more replies)
  2007-10-30  3:25                           ` Neil Brown
  1 sibling, 3 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-26 18:41 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Neil Brown, David Greaves, Jeff Garzik, John Stoffel,
	Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 5103 bytes --]

On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
> Neil Brown wrote:
> > On Thursday October 25, david@dgreaves.com wrote:
> >   
> >> I didn't get a reply to my suggestion of separating the data and location...
> >>     
> >
> > No. Sorry.
> >
> >   
> >> ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
> >> format (0.9 vs 1.0) and a location (end,start,offset4k)?
> >>
> >> This would certainly make things a lot clearer to new (and old!) users:
> >>
> >> mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
> >> or
> >> mdadm --create /dev/md0 --metadata 1.0 --meta-location start
> >> or
> >> mdadm --create /dev/md0 --metadata 1.0 --meta-location end
> >>     
> >
> > I'm happy to support synonyms.  How about
> >
> >    --metadata 1-end
> >    --metadata 1-start
> >
> > ??
> >   
> Offset? Do you like "1-offset4k" or maybe "1-start4k" or even 
> "1-start+4k" for that? The last is most intuitive but I don't know how 
> you feel about the + in there.

Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions start at exactly the
same logical block address across all devices.  That limitation can be
extremely hazardous when a drive dies and has to be replaced, as newer
drives may not share the older drive's geometry, and you may have to
start the replacement boot partition at an odd location just to make the
logical block addresses match (a quick way to check this is sketched
after the summary below).

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
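
(Aside: a quick way to verify the "same logical block address"
requirement mentioned above -- just an illustrative sketch, with /dev/sda
and /dev/sdb standing in for the actual member disks -- is to dump both
partition tables in sector units and compare the start columns:

  sfdisk -uS -l /dev/sda
  sfdisk -uS -l /dev/sdb

If the boot partitions don't start at the same sector on every member, a
partition-based grub install won't survive a disk swap cleanly.)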

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the MBR and the start of the first partition
to embed the remainder of the boot loader.  When you point grub2 at an
md device, they automatically only use the second method of boot loader
installation.  This gives them the freedom to be able to modify the
second stage boot loader on a boot disk by boot disk basis.  The
downside to this is that they need lots of room after the MBR and before
the first partition in order to put their core.img file in place.  I
*think*, and I'll know for sure later today, that the core.img file is
generated during grub install from the list of optional modules you
specify during setup.  E.g., the pc module gives partition table support,
the lvm module lvm support, etc.  You list the modules you need, and
grub then builds a core.img out of all those modules.  The normal amount
of space between the MBR and the first partition is (sectors_per_track -
1).  With the standard geometry of 63 sectors per track, that basically
leaves 62 sectors, or 31k of space.  This might not be enough if
you have a complex boot environment.  In that case, you would need to
bump at least the starting track of your first partition to make room
for your boot loader.  Unfortunately, how is a person to know how much
room their setup needs until after they've installed and it's too late
to bump the partition table start?  They can't.  So, that's another
thing I think I will check out today, what the maximum size of grub2
might be with all modules included, and what a common size might be.
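
As a rough way to gauge that size question, one could build a core.img by
hand and look at it -- a sketch only; the exact grub-mkimage invocation,
and any module names beyond the pc/lvm/raid/ext2 ones mentioned above,
are assumptions and may differ between grub2 snapshots:

  grub-mkimage -o /tmp/core.img biosdisk pc ext2 raid lvm
  ls -l /tmp/core.img   # compare against the free sectors before partition 1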

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26  9:54                     ` Luca Berra
  2007-10-26 16:22                       ` Gabor Gombas
@ 2007-10-26 18:52                       ` Doug Ledford
  2007-10-26 22:30                         ` Gabor Gombas
  2007-10-27  8:00                         ` Luca Berra
  1 sibling, 2 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-26 18:52 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 773 bytes --]

On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
> On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
> just apply some rules, so if you find a partition table _AND_ an md
> superblock at the end, read both and you can tell if it is an md on a
> partition or a partitioned md raid1 device.

In fact, no you can't.  I know, because I've created a device that had
both but wasn't a raid device.  And it's matching partner still existed
too.  What you are talking about would have misrecognized this
situation, guaranteed.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 18:41                           ` Doug Ledford
@ 2007-10-26 22:20                             ` Gabor Gombas
  2007-10-26 22:58                               ` Doug Ledford
  2007-10-27 11:11                               ` Luca Berra
  2007-10-27 15:20                             ` Bill Davidsen
  2007-10-27 21:11                             ` Doug Ledford
  2 siblings, 2 replies; 88+ messages in thread
From: Gabor Gombas @ 2007-10-26 22:20 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Bill Davidsen, Neil Brown, David Greaves, Jeff Garzik,
	John Stoffel, Justin Piszcz, linux-raid

On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:

> * When using lilo to boot from a raid device, it automatically installs
> itself to the mbr, not to the partition.  This can not be changed.  Only
> 0.90 and 1.0 superblock types are supported because lilo doesn't
> understand the offset to the beginning of the fs otherwise.

Huh? I have several machines that boot with LILO and the root is on
RAID1. All install LILO to the boot sector of the mdX device (having
"boot=/dev/mdX" in lilo.conf), while the MBR is installed by
install-mbr. Since install-mbr has its own prompt that is displayed
before LILO's prompt on boot, I can be pretty sure that LILO did not
write anything to the MBR...
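
For reference, the relevant bits of such a lilo.conf look roughly like
this (device names and kernel path are illustrative):

  boot=/dev/md0        # LILO goes into the md boot sector, not the MBR
  root=/dev/md0
  image=/boot/vmlinuz
      label=linux
      read-only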

What you say is only true for "skewed" RAID setups, but I always
considered such a setup too risky for anything critical (not because of
LILO, but because of the increased administrative complexity).

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 18:52                       ` Doug Ledford
@ 2007-10-26 22:30                         ` Gabor Gombas
  2007-10-28  0:26                           ` Doug Ledford
  2007-10-27  8:00                         ` Luca Berra
  1 sibling, 1 reply; 88+ messages in thread
From: Gabor Gombas @ 2007-10-26 22:30 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Luca Berra, linux-raid

On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:

> In fact, no you can't.  I know, because I've created a device that had
> both but wasn't a raid device.  And it's matching partner still existed
> too.  What you are talking about would have misrecognized this
> situation, guaranteed.

Maybe we need a 2.0 superblock that contains the physical size of every
component, not just the logical size that is used for RAID. That way if
the size read from the superblock does not match the size of the device,
you know that this device should be ignored.
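
The check itself would be trivial -- roughly the following, where
"sb_phys_size" is a hypothetical field that no current superblock has:

  dev_size=$(blockdev --getsize /dev/sdX)   # device size in 512-byte sectors
  if [ "$dev_size" -ne "$sb_phys_size" ]; then
      echo "superblock was written for a differently-sized device, ignoring"
  fi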

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 22:20                             ` Gabor Gombas
@ 2007-10-26 22:58                               ` Doug Ledford
  2007-10-27 11:11                               ` Luca Berra
  1 sibling, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-26 22:58 UTC (permalink / raw)
  To: Gabor Gombas
  Cc: Bill Davidsen, Neil Brown, David Greaves, Jeff Garzik,
	John Stoffel, Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1511 bytes --]

On Sat, 2007-10-27 at 00:20 +0200, Gabor Gombas wrote:
> On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:
> 
> > * When using lilo to boot from a raid device, it automatically installs
> > itself to the mbr, not to the partition.  This can not be changed.  Only
> > 0.90 and 1.0 superblock types are supported because lilo doesn't
> > understand the offset to the beginning of the fs otherwise.
> 
> Huh? I have several machines that boot with LILO and the root is on
> RAID1. All install LILO to the boot sector of the mdX device (having
> "boot=/dev/mdX" in lilo.conf), while the MBR is installed by
> install-mbr. Since install-mbr has its own prompt that is displayed
> before LILO's prompt on boot, I can be pretty sure that LILO did not
> write anything to the MBR...

Then this has changed.  It used to only install lilo to the mbr.
However, even if it installs to the partition, it doesn't change the
rest of what I said about it not understanding the offset from partition
to file system on 1.1 and 1.2 superblocks.

> What you say is only true for "skewed" RAID setups, but I always
> considered such a setup too risky for anything critical (not because of
> LILO, but because of the increased administrative complexity).
> 
> Gabor
> 
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 18:52                       ` Doug Ledford
  2007-10-26 22:30                         ` Gabor Gombas
@ 2007-10-27  8:00                         ` Luca Berra
  2007-10-27 20:09                           ` Doug Ledford
  1 sibling, 1 reply; 88+ messages in thread
From: Luca Berra @ 2007-10-27  8:00 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
>On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
>> On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
>> just apply some rules, so if you find a partition table _AND_ an md
>> superblock at the end, read both and you can tell if it is an md on a
>> partition or a partitioned md raid1 device.
>
>In fact, no you can't.  I know, because I've created a device that had
>both but wasn't a raid device.  And it's matching partner still existed
>too.  What you are talking about would have misrecognized this
>situation, guaranteed.
then just ignore the device and log a warning, instead of doing a random
choice.
L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 17:06                         ` Gabor Gombas
@ 2007-10-27 10:34                           ` Luca Berra
  0 siblings, 0 replies; 88+ messages in thread
From: Luca Berra @ 2007-10-27 10:34 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 26, 2007 at 07:06:46PM +0200, Gabor Gombas wrote:
>On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:
>
>> You got the ordering wrong. You should get userspace support ready and
>> accepted _first_, and then you can start the
>> flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
>> configurable.
sorry, i did not intend to start a flamewar.

>Oh wait that is possible even today. So you can build your own kernel
>without any partition table format support - problem solved.
yes, i can build my own, i just thought it could be useful for someone
other than myself. maybe even Doug's enterprise customers....

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 22:20                             ` Gabor Gombas
  2007-10-26 22:58                               ` Doug Ledford
@ 2007-10-27 11:11                               ` Luca Berra
  1 sibling, 0 replies; 88+ messages in thread
From: Luca Berra @ 2007-10-27 11:11 UTC (permalink / raw)
  To: linux-raid

On Sat, Oct 27, 2007 at 12:20:12AM +0200, Gabor Gombas wrote:
>On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:
>
>> * When using lilo to boot from a raid device, it automatically installs
>> itself to the mbr, not to the partition.  This can not be changed.  Only
>> 0.90 and 1.0 superblock types are supported because lilo doesn't
>> understand the offset to the beginning of the fs otherwise.
>
>Huh? I have several machines that boot with LILO and the root is on
>RAID1. All install LILO to the boot sector of the mdX device (having
>"boot=/dev/mdX" in lilo.conf), while the MBR is installed by
>install-mbr. Since install-mbr has its own prompt that is displayed
>before LILO's prompt on boot, I can be pretty sure that LILO did not
>write anything to the MBR...

the behaviour is documented in the lilo man page, under the
raid-extra-boot option.


-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 18:41                           ` Doug Ledford
  2007-10-26 22:20                             ` Gabor Gombas
@ 2007-10-27 15:20                             ` Bill Davidsen
  2007-10-28  0:18                               ` Doug Ledford
  2007-10-27 21:11                             ` Doug Ledford
  2 siblings, 1 reply; 88+ messages in thread
From: Bill Davidsen @ 2007-10-27 15:20 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Neil Brown, David Greaves, Jeff Garzik, John Stoffel,
	Justin Piszcz, linux-raid

Doug Ledford wrote:
> On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
>   
> [___snip___]
>   

> Actually, after doing some research, here's what I've found:
>
> * When using lilo to boot from a raid device, it automatically installs
> itself to the mbr, not to the partition.  This can not be changed.  Only
> 0.90 and 1.0 superblock types are supported because lilo doesn't
> understand the offset to the beginning of the fs otherwise.
>   

I'm reasonably sure that's wrong. I used to set up dual-boot machines by 
putting LILO in the partition and making that the boot partition; by 
changing the active partition flag I could just have the machine boot 
Windows, to keep people from getting confused.
> * When using grub to boot from a raid device, only 0.90 and 1.0
> superblocks are supported[1] (because grub is ignorant of the raid and
> it requires the fs to start at the start of the partition).  You can use
> either MBR or partition based installs of grub.  However, partition
> based installs require that all bootable partitions be in exactly the
> same logical block address across all devices.  This limitation can be
> an extremely hazardous limitation in the event a drive dies and you have
> to replace it with a new drive as newer drives may not share the older
> drive's geometry and will require starting your boot partition in an odd
> location to make the logical block addresses match.
>
> * When using grub2, there is supposedly already support for raid/lvm
> devices.  However, I do not know if this includes version 1.0, 1.1, or
> 1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
> install to an md device, it searches out all constituent devices and
> installs to the MBR on each device[2].  This can't be changed (at least
> right now, probably not ever though).
>   

That sounds like a good reason to avoid grub2, frankly. Software which 
decides that it knows what to do better than the user isn't my 
preference. If I wanted software which forces me to do things "their way" 
I'd be running Windows.
> So, given the above situations, really, superblock format 1.2 is likely
> to never be needed.  None of the shipping boot loaders work with 1.2
> regardless, and the boot loader under development won't install to the
> partition in the event of an md device and therefore doesn't need that
> 4k buffer that 1.2 provides.
>   

Sounds right, although it may have other uses for clever people.
> [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
> person could probably hack it to work, but since grub development has
> stopped in preference to the still under development grub2, they won't
> take the patches upstream unless they are bug fixes, not new features.
>   

If the patches were available, "doesn't work with existing raid formats" 
would probably qualify as a bug.
> [2] There are two ways to install to a master boot record.  The first is
> to use the first 512 bytes *only* and hardcode the location of the
> remainder of the boot loader into those 512 bytes.  The second way is to
> use the free space between the MBR and the start of the first partition
> to embed the remainder of the boot loader.  When you point grub2 at an
> md device, they automatically only use the second method of boot loader
> installation.  This gives them the freedom to be able to modify the
> second stage boot loader on a boot disk by boot disk basis.  The
> downside to this is that they need lots of room after the MBR and before
> the first partition in order to put their core.img file in place.  I
> *think*, and I'll know for sure later today, that the core.img file is
> generated during grub install from the list of optional modules you
> specify during setup.  Eg., the pc module gives partition table support,
> the lvm module lvm support, etc.  You list the modules you need, and
> grub then builds a core.img out of all those modules.  The normal amount
> of space between the MBR and the first partition is (sectors_per_track -
> 1).  For standard disk geometries, that basically leaves 254 sectors, or
> 127k of space.  This might not be enough for your particular needs if
> you have a complex boot environment.  In that case, you would need to
> bump at least the starting track of your first partition to make room
> for your boot loader.  Unfortunately, how is a person to know how much
> room their setup needs until after they've installed and it's too late
> to bump the partition table start?  They can't.  So, that's another
> thing I think I will check out today, what the maximum size of grub2
> might be with all modules included, and what a common size might be.
>
>   
Based on your description, it sounds as if grub2 may not have given 
adequate thought to what users other than the authors might need (that 
may be a premature conclusion). I have multiple installs on several of 
my machines, and I assume that the grub2 for 32 and 64 bit will be 
different. Thanks for the research.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-27  8:00                         ` Luca Berra
@ 2007-10-27 20:09                           ` Doug Ledford
  2007-10-28 13:46                             ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-27 20:09 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> >On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
> >> On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
> >> just apply some rules, so if you find a partition table _AND_ an md
> >> superblock at the end, read both and you can tell if it is an md on a
> >> partition or a partitioned md raid1 device.
> >
> >In fact, no you can't.  I know, because I've created a device that had
> >both but wasn't a raid device.  And it's matching partner still existed
> >too.  What you are talking about would have misrecognized this
> >situation, guaranteed.
> then just ignore the device and log a warning, instead of doing a random
> choice.
> L.

It also happened to be my OS drive pair.  Ignoring it would have
rendered the machine unusable.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 18:41                           ` Doug Ledford
  2007-10-26 22:20                             ` Gabor Gombas
  2007-10-27 15:20                             ` Bill Davidsen
@ 2007-10-27 21:11                             ` Doug Ledford
  2007-10-29  0:48                               ` Bill Davidsen
  2 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-27 21:11 UTC (permalink / raw)
  To: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 2935 bytes --]

On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
> Actually, after doing some research, here's what I've found:

> * When using grub2, there is supposedly already support for raid/lvm
> devices.  However, I do not know if this includes version 1.0, 1.1, or
> 1.2 superblocks.  I intend to find that out today.

It does not include support for any version 1 superblocks.  It's noted
in the code that it should, but doesn't yet.  However, the interesting
bit is that they rearchitected grub so that any reads from a device
during boot are filtered through the stack that provides the device.
So, when you tell grub2 to set root=md0, then all reads from md0 are
filtered through the raid module, and the raid module then calls the
reads from the IO module, which then does the actual int 13 call.  This
allows the raid module to read superblocks, detect the raid level and
layout, and actually attempt to work on raid0/1/5 devices (at the
moment).  It also means that all the calls from the ext2 module when it
attempts to read from the md device are filtered through the md module
and therefore it would be simple for it to implement an offset into the
real device to get past the version 1.1/1.2 superblocks.
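
So the boot-time configuration ends up looking roughly like this (a
sketch based on the description above; the module names, the (md0)
device syntax and the kernel path are illustrative and may not match
what a given grub2 snapshot actually accepts):

  insmod raid
  insmod ext2
  set root=(md0)
  linux /boot/vmlinuz root=/dev/md0 ro
  initrd /boot/initrd.img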

In terms of resilience, the raid module actually tries to utilize the
raid itself during any failure.  On raid1 devices, if it gets a read
failure on any block it attempts to read, then it goes to the next
device in the raid1 array and attempts to read from it.  So, in the
event that your normal boot disk suffers a sector failure in your actual
kernel image, but the other raid disk is fine, grub2 should be able
to boot from the kernel image on the next raid device.  Similarly, on
raid5 it will attempt to recover from a block read failure by using the
parity to generate the missing data unless the array is already in
degraded mode at which point it will bail on any read failure.

The lvm module attempts to properly map extents to physical volumes and
allows you to have your bootable files in an lvm logical volume.  In that
case you set root=logical-volume-name-as-it-appears-in-/dev/mapper and
the lvm module then figures out what physical volumes contain that
logical volume and where the extents are mapped and goes from there.

I should note that both the lvm code and raid code are simplistic at the
moment.  For example, the raid5 mapping only supports the default raid5
layout.  If you use any other layout, game over.  Getting it to work
with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
getting it to the point where it handles all the relevant setups
properly would require a reasonable amount of coding.
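
(For reference, md's default raid5 layout is left-symmetric, so checking
whether an existing array falls into the currently supported case is as
simple as -- /dev/md0 being illustrative:

  mdadm --detail /dev/md0 | grep -i layout
  #      Layout : left-symmetric

Anything else reported there is out of grub2's reach for now.)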

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-27 15:20                             ` Bill Davidsen
@ 2007-10-28  0:18                               ` Doug Ledford
  2007-10-29  0:44                                 ` Bill Davidsen
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-28  0:18 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Neil Brown, David Greaves, Jeff Garzik, John Stoffel,
	Justin Piszcz, linux-raid

[-- Attachment #1: Type: text/plain, Size: 8957 bytes --]

On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote:
> > * When using lilo to boot from a raid device, it automatically installs
> > itself to the mbr, not to the partition.  This can not be changed.  Only
> > 0.90 and 1.0 superblock types are supported because lilo doesn't
> > understand the offset to the beginning of the fs otherwise.
> >   
> 
> I'm reasonably sure that's wrong, I used to set up dual boot machines by 
> putting LILO in the partition and making that the boot partition, by 
> changing the active partition flag I could just have the machine boot 
> Windows, to keep people from getting confused.

Yeah, someone else pointed this out too.  The original patch to lilo
*did* do as I suggest, so they must have improved on the patch later.

> > * When using grub to boot from a raid device, only 0.90 and 1.0
> > superblocks are supported[1] (because grub is ignorant of the raid and
> > it requires the fs to start at the start of the partition).  You can use
> > either MBR or partition based installs of grub.  However, partition
> > based installs require that all bootable partitions be in exactly the
> > same logical block address across all devices.  This limitation can be
> > an extremely hazardous limitation in the event a drive dies and you have
> > to replace it with a new drive as newer drives may not share the older
> > drive's geometry and will require starting your boot partition in an odd
> > location to make the logical block addresses match.
> >
> > * When using grub2, there is supposedly already support for raid/lvm
> > devices.  However, I do not know if this includes version 1.0, 1.1, or
> > 1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
> > install to an md device, it searches out all constituent devices and
> > installs to the MBR on each device[2].  This can't be changed (at least
> > right now, probably not ever though).
> >   
> 
> That sounds like a good reason to avoid grub2, frankly. Software which 
> decides that it knows what to do better than the user isn't my 
> preference. If I wanted software which fores me to do things "their way" 
> I'd be running Windows.

It's not really all that unreasonable of a restriction.  Most people
aren't aware that when you put a boot sector at the beginning of a
partition, you only have 512 bytes of space, so the boot loader that you
put there is basically nothing more than code to read the remainder of
the boot loader from the file system space.  Now, traditionally, most
boot loaders have had to hard code the block addresses of certain key
components into these second stage boot loaders.  If a user isn't aware
of the fact that the boot loader does this at install time (or at kernel
selection update time in the case of lilo), then they aren't aware that
the files must reside at exactly the same logical block address on all
devices.  Without that knowledge, they can easily create an unbootable
setup by having the various boot partitions in slightly different
locations on the disks.  And intelligent partition editors like parted
can compound the problem because as they insulate the user from having
to pick which partition number is used for what partition, etc., they
can end up placing the various boot partitions in different areas of
different drives.  The requirement above is a means of making sure that
users aren't surprised by a non-working setup.  The whole element of
least surprise thing.  Of course, if they keep that requirement, then I
would expect it to be well documented so that people know this going
into putting the boot loader in place, but I would argue that this is at
least better than finding out when a drive dies that your system isn't
bootable.

> > So, given the above situations, really, superblock format 1.2 is likely
> > to never be needed.  None of the shipping boot loaders work with 1.2
> > regardless, and the boot loader under development won't install to the
> > partition in the event of an md device and therefore doesn't need that
> > 4k buffer that 1.2 provides.
> >   
> 
> Sounds right, although it may have other uses for clever people.
> > [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
> > person could probably hack it to work, but since grub development has
> > stopped in preference to the still under development grub2, they won't
> > take the patches upstream unless they are bug fixes, not new features.
> >   
> 
> If the patches were available, "doesn't work with existing raid formats" 
> would probably qualify as a bug.

Possibly.  I'm a bit overbooked on other work at the moment, but I may
try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2
superblocks.

> > [2] There are two ways to install to a master boot record.  The first is
> > to use the first 512 bytes *only* and hardcode the location of the
> > remainder of the boot loader into those 512 bytes.  The second way is to
> > use the free space between the MBR and the start of the first partition
> > to embed the remainder of the boot loader.  When you point grub2 at an
> > md device, they automatically only use the second method of boot loader
> > installation.  This gives them the freedom to be able to modify the
> > second stage boot loader on a boot disk by boot disk basis.  The
> > downside to this is that they need lots of room after the MBR and before
> > the first partition in order to put their core.img file in place.  I
> > *think*, and I'll know for sure later today, that the core.img file is
> > generated during grub install from the list of optional modules you
> > specify during setup.  Eg., the pc module gives partition table support,
> > the lvm module lvm support, etc.  You list the modules you need, and
> > grub then builds a core.img out of all those modules.  The normal amount
> > of space between the MBR and the first partition is (sectors_per_track -
> > 1).  For standard disk geometries, that basically leaves 254 sectors, or
> > 127k of space.  This might not be enough for your particular needs if
> > you have a complex boot environment.  In that case, you would need to
> > bump at least the starting track of your first partition to make room
> > for your boot loader.  Unfortunately, how is a person to know how much
> > room their setup needs until after they've installed and it's too late
> > to bump the partition table start?  They can't.  So, that's another
> > thing I think I will check out today, what the maximum size of grub2
> > might be with all modules included, and what a common size might be.
> >
> >   
> Based on your description, it sounds as if grub2 may not have given 
> adequate thought to what users other than the authors might need (that 
> may be a premature conclusion). I have multiple installs on several of 
> my machines, and I assume that the grub2 for 32 and 64 bit will be 
> different. Thanks for the research.

No, not really.  The grub command on the two is different, but they
actually build the boot sector out of 16 bit non-protected mode code,
just like DOS.  So either one would build the same boot sector given the
same config.  And you can always use the same trick I've used in the
past of creating a large /boot partition (say 250MB) and using that same
partition as /boot in all of your installs.  Then they share a single
grub config (while the grub binaries are in the individual / partitions)
and from the single grub instance you can boot to any of the installs;
a kernel update in any of the installs then updates that global grub
config.  The other option is to use separate /boot partitions and chain
load the grub instances, but I find that clunky in comparison.  Of
course, in my case I also made /lib/modules its own partition and also
shared it between all the installs so that I could manually edit the
various kernel boot params to specify different root partitions and in
so doing I could boot a RHEL5 kernel using a RHEL4 install and vice
versa.  But if you do that, you have to manually
patch /etc/rc.d/rc.sysinit to mount the /lib/modules partition before
ever trying to do anything with modules (and you have to mount it rw so
they can do a depmod if needed), then remount it ro for the fsck, then
it gets remounted rw again after the fs check.  It was a pain in the ass
to maintain because every update to initscripts would wipe out the patch
and if you forgot to repatch the file, the system wouldn't boot and
you'd have to boot into another install, mount the / partition of the
broken install, patch the file, then it would work again in that
install.
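
(To make the shared-/boot trick concrete: every install's /etc/fstab
simply carries the same line for the shared partition -- the device name
here is illustrative:

  /dev/sda1   /boot   ext3   defaults   1 2

while each install keeps its own / entry; the shared /lib/modules
partition gets the same treatment, plus the rc.sysinit handling described
above.)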


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 22:30                         ` Gabor Gombas
@ 2007-10-28  0:26                           ` Doug Ledford
  2007-10-28 14:13                             ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-28  0:26 UTC (permalink / raw)
  To: Gabor Gombas; +Cc: Luca Berra, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2763 bytes --]

On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> 
> > In fact, no you can't.  I know, because I've created a device that had
> > both but wasn't a raid device.  And it's matching partner still existed
> > too.  What you are talking about would have misrecognized this
> > situation, guaranteed.
> 
> Maybe we need a 2.0 superblock that contains the physical size of every
> component, not just the logical size that is used for RAID. That way if
> the size read from the superblock does not match the size of the device,
> you know that this device should be ignored.

In my case that wouldn't have helped.  What actually happened was I
created a two-disk raid1 device using whole devices and a version 1.0
superblock.  I knew a version 1.1 wouldn't work because it would be
where the boot sector needed to be, and wasn't sure if a 1.2 would work
either.  Then I tried to make the whole disk raid device a partitioned
device.  This obviously put a partition table right where the BIOS and
the kernel would look for it whether the raid was up or not.  I also
tried doing an lvm setup to split the raid up into chunks and that
didn't work either.  So, then I redid the partition table and created
individual raid devices from the partitions.  But, I didn't think to
zero the old whole disk superblock.  When I made the individual raid
devices, I used all 1.1 superblocks.  So, when it was all said and done,
I had a bunch of partitions that looked like a valid set of partitions
for the whole disk raid device and a whole disk raid superblock, but I
also had superblocks in each partition with their own bitmaps and so on.
It was only because I wasn't using mdadm in the initrd and specifying
uuids that it found the right devices to start and ignored the whole
disk devices.  But, when I later made some more devices and went to
update the mdadm.conf file using mdadm -Eb, it found the devices and
added them to the mdadm.conf.  If I hadn't checked it before remaking my
initrd, it would have hosed the system.  And it would have passed all
the tests you can throw at it.  Quite simply, there is no way to tell
the difference between those two situations with 100% certainty.  Mdadm
tries to be smart and start the newest devices, but Luca's original
suggestion of skipping the partition scanning in the kernel and figuring it
out from user space would not have shown mdadm the new devices and would
have gotten it wrong every time.
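
(For anyone backing out of a similar situation: the stale whole-disk
superblocks can be wiped explicitly -- device names illustrative, and
only once you are certain nothing is still using the whole-disk array:

  mdadm --stop /dev/md0                     # if the stale array is assembled
  mdadm --zero-superblock /dev/sda /dev/sdb

That is the step I skipped, and it is what left the ambiguous on-disk
state described above.)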

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-27 20:09                           ` Doug Ledford
@ 2007-10-28 13:46                             ` Luca Berra
  0 siblings, 0 replies; 88+ messages in thread
From: Luca Berra @ 2007-10-28 13:46 UTC (permalink / raw)
  To: linux-raid

On Sat, Oct 27, 2007 at 04:09:03PM -0400, Doug Ledford wrote:
>On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
>> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
>> >On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
>> >> On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
>> >> just apply some rules, so if you find a partition table _AND_ an md
>> >> superblock at the end, read both and you can tell if it is an md on a
>> >> partition or a partitioned md raid1 device.
>> >
>> >In fact, no you can't.  I know, because I've created a device that had
>> >both but wasn't a raid device.  And it's matching partner still existed
>> >too.  What you are talking about would have misrecognized this
>> >situation, guaranteed.
>> then just ignore the device and log a warning, instead of doing a random
>> choice.
>> L.
>
>It also happened to be my OS drive pair.  Ignoring it would have
>rendered the machine unusable.

I wonder what would have happened if it got it wrong

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-28  0:26                           ` Doug Ledford
@ 2007-10-28 14:13                             ` Luca Berra
  2007-10-28 17:47                               ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Luca Berra @ 2007-10-28 14:13 UTC (permalink / raw)
  To: linux-raid

On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
>On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
>> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
>> 
>> > In fact, no you can't.  I know, because I've created a device that had
>> > both but wasn't a raid device.  And it's matching partner still existed
>> > too.  What you are talking about would have misrecognized this
>> > situation, guaranteed.
>> 
>> Maybe we need a 2.0 superblock that contains the physical size of every
>> component, not just the logical size that is used for RAID. That way if
>> the size read from the superblock does not match the size of the device,
>> you know that this device should be ignored.
>
>In my case that wouldn't have helped.  What actually happened was I
>create a two disk raid1 device using whole devices and a version 1.0
>superblock.  I know a version 1.1 wouldn't work because it would be
>where the boot sector needed to be, and wasn't sure if a 1.2 would work
>either.  Then I tried to make the whole disk raid device a partitioned
>device.  This obviously put a partition table right where the BIOS and
>the kernel would look for it whether the raid was up or not.  I also
the only reason i can think of for the above setup not working is udev
mucking with your device too early.

>tried doing an lvm setup to split the raid up into chunks and that
>didn't work either.  So, then I redid the partition table and created
>individual raid devices from the partitions.  But, I didn't think to
>zero the old whole disk superblock.  When I made the individual raid
>devices, I used all 1.1 superblocks.  So, when it was all said and done,
>I had a bunch of partitions that looked like a valid set of partitions
>for the whole disk raid device and a whole disk raid superblock, but I
>also had superblocks in each partition with their own bitmaps and so on.
OK

>It was only because I wasn't using mdadm in the initrd and specifying
>uuids that it found the right devices to start and ignored the whole
>disk devices.  But, when I later made some more devices and went to
>update the mdadm.conf file using mdadm -Eb, it found the devices and
>added it to the mdadm.conf.  If I hadn't checked it before remaking my
>initrd, it would have hosed the system.  And it would have passed all
the above is not clear to me, afair redhat initrd still uses
raidautorun, which iirc does not work with recent superblocks,
so did you use uuids on the kernel command line?
or do you use something else for the initrd?
why would remaking the initrd break it?

>the tests you can throw at it.  Quite simply, there is no way to tell
>the difference between those two situations with 100% certainty.  Mdadm
>tries to be smart and start the newest devices, but Luca's original
>suggestion of skip the partition scanning in the kernel and figure it
>out from user space would not have shown mdadm the new devices and would
>have gotten it wrong every time.
yes, in this particular case it would have. congratulations, you found a new
creative way of shooting yourself in the foot.

maybe mdadm should do checks when creating a device to prevent this kind
of mistakes.
i.e.
if creating an array on a partition, check the whole device for a
superblock and refuse in case it finds one

if creating an array on a whole device that has a partition table,
either require --force, or check for superblocks in every possible
partition.
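
in shell terms the first rule boils down to something like this
(illustrative only, not actual mdadm code, and the parent-device guess is
deliberately naive):

  dev=/dev/sda1
  parent=$(echo "$dev" | sed 's/[0-9]*$//')   # crude: /dev/sda1 -> /dev/sda
  if mdadm --examine "$parent" >/dev/null 2>&1; then
      echo "$parent already carries an md superblock, refusing to create on $dev"
      exit 1
  fi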

L.
-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-28 14:13                             ` Luca Berra
@ 2007-10-28 17:47                               ` Doug Ledford
  2007-10-29  8:41                                 ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-28 17:47 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5006 bytes --]

On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
> On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
> >On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
> >> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> >> 
> >> > In fact, no you can't.  I know, because I've created a device that had
> >> > both but wasn't a raid device.  And it's matching partner still existed
> >> > too.  What you are talking about would have misrecognized this
> >> > situation, guaranteed.
> >> 
> >> Maybe we need a 2.0 superblock that contains the physical size of every
> >> component, not just the logical size that is used for RAID. That way if
> >> the size read from the superblock does not match the size of the device,
> >> you know that this device should be ignored.
> >
> >In my case that wouldn't have helped.  What actually happened was I
> >create a two disk raid1 device using whole devices and a version 1.0
> >superblock.  I know a version 1.1 wouldn't work because it would be
> >where the boot sector needed to be, and wasn't sure if a 1.2 would work
> >either.  Then I tried to make the whole disk raid device a partitioned
> >device.  This obviously put a partition table right where the BIOS and
> >the kernel would look for it whether the raid was up or not.  I also
> the only reason i can think for the above setup not working is udev
> mucking with your device too early.

It was a combination of boot loader issues and an inability to get this
device partitioned up the way I needed.  I went with a totally different
setup in the end because I essentially started out with a two-drive
raid1 for the OS and another two-drive raid1 for data, but I wanted to
span them and I was attempting to do so with a mixture of md raid and
lvm physical volume striping.  Didn't work.

> >tried doing an lvm setup to split the raid up into chunks and that
> >didn't work either.  So, then I redid the partition table and created
> >individual raid devices from the partitions.  But, I didn't think to
> >zero the old whole disk superblock.  When I made the individual raid
> >devices, I used all 1.1 superblocks.  So, when it was all said and done,
> >I had a bunch of partitions that looked like a valid set of partitions
> >for the whole disk raid device and a whole disk raid superblock, but I
> >also had superblocks in each partition with their own bitmaps and so on.
> OK
> 
> >It was only because I wasn't using mdadm in the initrd and specifying
> >uuids that it found the right devices to start and ignored the whole
> >disk devices.  But, when I later made some more devices and went to
> >update the mdadm.conf file using mdadm -Eb, it found the devices and
> >added it to the mdadm.conf.  If I hadn't checked it before remaking my
> >initrd, it would have hosed the system.  And it would have passed all
> the above is not clear to me, afair redhat initrd still uses
> raidautorun,

RHEL does, but this is on a personal machine I installed Fedora on, and
the latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
starts the needed devices using the UUID.  My first sentence above
should have read that I *was* using mdadm.

>  which iirc does not works with recent superblocks,
> so you used uuids on kernel command line?
> or you use something else for initrd?
> why would remaking the initrd break it?

Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk array and its UUID.  Therein would
have been the problem.
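
(To illustrate the failure mode: with the stale superblock still present,
mdadm -Eb emits an extra ARRAY line for the whole-disk array alongside
the real ones, roughly like the following -- the UUIDs here are
placeholders:

  ARRAY /dev/md0 level=raid1 num-devices=2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
  ARRAY /dev/md1 level=raid1 num-devices=2 UUID=yyyyyyyy:yyyyyyyy:yyyyyyyy:yyyyyyyy

and once that lands in the initrd's mdadm.conf, the whole-disk array can
get assembled right over the top of the partition-based ones.)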

> >the tests you can throw at it.  Quite simply, there is no way to tell
> >the difference between those two situations with 100% certainty.  Mdadm
> >tries to be smart and start the newest devices, but Luca's original
> >suggestion of skip the partition scanning in the kernel and figure it
> >out from user space would not have shown mdadm the new devices and would
> >have gotten it wrong every time.
> yes, in this particular case it would have, congratulation you found a new
> creative way of shooting yourself in the feet.

Creative, not so much.  I just backed out of what I started and tried
something else.  Lots of people do that.

> maybe mdadm should do checks when creating a device to prevent this kind
> of mistakes.
> i.e.
> if creating an array on a partition, check the whole device for a
> superblock and refuse in case it finds one
> 
> if creating an array on a whole device that has a partition table,
> either require --force, or check for superblocks in every possible
> partition.

What happens if you add the partition table *after* you make the whole
disk device and there are stale superblocks in the partitions?  This
still isn't infallible.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-28  0:18                               ` Doug Ledford
@ 2007-10-29  0:44                                 ` Bill Davidsen
  0 siblings, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-29  0:44 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Neil Brown, David Greaves, Jeff Garzik, John Stoffel,
	Justin Piszcz, linux-raid

Doug Ledford wrote:
> On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote:
>   
>>> * When using lilo to boot from a raid device, it automatically installs
>>> itself to the mbr, not to the partition.  This can not be changed.  Only
>>> 0.90 and 1.0 superblock types are supported because lilo doesn't
>>> understand the offset to the beginning of the fs otherwise.
>>>   
>>>       
>> I'm reasonably sure that's wrong, I used to set up dual boot machines by 
>> putting LILO in the partition and making that the boot partition, by 
>> changing the active partition flag I could just have the machine boot 
>> Windows, to keep people from getting confused.
>>     
>
> Yeah, someone else pointed this out too.  The original patch to lilo
> *did* do as I suggest, so they must have improved on the patch later.
>
>   
>>> * When using grub to boot from a raid device, only 0.90 and 1.0
>>> superblocks are supported[1] (because grub is ignorant of the raid and
>>> it requires the fs to start at the start of the partition).  You can use
>>> either MBR or partition based installs of grub.  However, partition
>>> based installs require that all bootable partitions be in exactly the
>>> same logical block address across all devices.  This limitation can be
>>> an extremely hazardous limitation in the event a drive dies and you have
>>> to replace it with a new drive as newer drives may not share the older
>>> drive's geometry and will require starting your boot partition in an odd
>>> location to make the logical block addresses match.
>>>
>>> * When using grub2, there is supposedly already support for raid/lvm
>>> devices.  However, I do not know if this includes version 1.0, 1.1, or
>>> 1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
>>> install to an md device, it searches out all constituent devices and
>>> installs to the MBR on each device[2].  This can't be changed (at least
>>> right now, probably not ever though).
>>>   
>>>       
>> That sounds like a good reason to avoid grub2, frankly. Software which 
>> decides that it knows what to do better than the user isn't my 
>> preference. If I wanted software which fores me to do things "their way" 
>> I'd be running Windows.
>>     
>
> It's not really all that unreasonable of a restriction.  Most people
> aren't aware than when you put a boot sector at the beginning of a
> partition, you only have 512 bytes of space, so the boot loader that you
> put there is basically nothing more than code to read the remainder of
> the boot loader from the file system space.  Now, traditionally, most
> boot loaders have had to hard code the block addresses of certain key
> components into these second stage boot loaders.  If a user isn't aware
> of the fact that the boot loader does this at install time (or at kernel
> selection update time in the case of lilo), then they aren't aware that
> the files must reside at exactly the same logical block address on all
> devices.  Without that knowledge, they can easily create an unbootable
> setup by having the various boot partitions in slightly different
> locations on the disks.  And intelligent partition editors like parted
> can compound the problem because as they insulate the user from having
> to pick which partition number is used for what partition, etc., they
> can end up placing the various boot partitions in different areas of
> different drives.  The requirement above is a means of making sure that
> users aren't surprise by a non-working setup.  The whole element of
> least surprise thing.  Of course, if they keep that requirement, then I
> would expect it to be well documented so that people know this going
> into putting the boot loader in place, but I would argue that this is at
> least better than finding out when a drive dies that your system isn't
> bootable.
>
>   
>>> So, given the above situations, really, superblock format 1.2 is likely
>>> to never be needed.  None of the shipping boot loaders work with 1.2
>>> regardless, and the boot loader under development won't install to the
>>> partition in the event of an md device and therefore doesn't need that
>>> 4k buffer that 1.2 provides.
>>>   
>>>       
>> Sounds right, although it may have other uses for clever people.
>>     
>>> [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
>>> person could probably hack it to work, but since grub development has
>>> stopped in preference to the still under development grub2, they won't
>>> take the patches upstream unless they are bug fixes, not new features.
>>>   
>>>       
>> If the patches were available, "doesn't work with existing raid formats" 
>> would probably qualify as a bug.
>>     
>
> Possibly.  I'm a bit overbooked on other work at the moment, but I may
> try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2
> superblocks.
>
>   
>>> [2] There are two ways to install to a master boot record.  The first is
>>> to use the first 512 bytes *only* and hardcode the location of the
>>> remainder of the boot loader into those 512 bytes.  The second way is to
>>> use the free space between the MBR and the start of the first partition
>>> to embed the remainder of the boot loader.  When you point grub2 at an
>>> md device, they automatically only use the second method of boot loader
>>> installation.  This gives them the freedom to be able to modify the
>>> second stage boot loader on a boot disk by boot disk basis.  The
>>> downside to this is that they need lots of room after the MBR and before
>>> the first partition in order to put their core.img file in place.  I
>>> *think*, and I'll know for sure later today, that the core.img file is
>>> generated during grub install from the list of optional modules you
>>> specify during setup.  Eg., the pc module gives partition table support,
>>> the lvm module lvm support, etc.  You list the modules you need, and
>>> grub then builds a core.img out of all those modules.  The normal amount
>>> of space between the MBR and the first partition is (sectors_per_track -
>>> 1).  For standard disk geometries, that basically leaves 254 sectors, or
>>> 127k of space.  This might not be enough for your particular needs if
>>> you have a complex boot environment.  In that case, you would need to
>>> bump at least the starting track of your first partition to make room
>>> for your boot loader.  Unfortunately, how is a person to know how much
>>> room their setup needs until after they've installed and it's too late
>>> to bump the partition table start?  They can't.  So, that's another
>>> thing I think I will check out today, what the maximum size of grub2
>>> might be with all modules included, and what a common size might be.
>>>
>>>   
>>>       
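To get a feel for the size question, one could build a core.img by hand and
measure it.  A rough sketch only; the grub-mkimage invocation and the module
names are assumptions based on the description above and vary between grub2
snapshots:

  # build a core image containing just the modules this setup needs
  grub-mkimage -d /usr/lib/grub/i386-pc -o core.img biosdisk pc ext2 raid lvm
  ls -l core.img   # compare against the space available before the first partition
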
>> Based on your description, it sounds as if grub2 may not have given 
>> adequate thought to what users other than the authors might need (that 
>> may be a premature conclusion). I have multiple installs on several of 
>> my machines, and I assume that the grub2 for 32 and 64 bit will be 
>> different. Thanks for the research.
>>     
>
> No, not really.  The grub command on the two is different, but they
> actually build the boot sector out of 16 bit non-protected mode code,
> just like DOS.  So either one would build the same boot sector given the
> same config.  And you can always use the same trick I've used in the
> past of creating a large /boot partition (say 250MB) and using that same
> partition as /boot in all of your installs.  Then they share a single
> grub config (while the grub binaries are in the individual / partitions)
> and from the single grub instance you can boot to any of the installs,
> and a kernel update in any install updates that global grub
> config.  The other option is to use separate /boot partitions and chain
> load the grub instances, but I find that clunky in comparison.  Of
>   

I just copy a stanza of the 64 bit grub file into the 32 bit grub file, 
and that seems to work okay; the 32 bit boot mounts /mnt/boot64, and the 
64 bit boot mounts /mnt/boot64, so I can just copy the data. I confess 
that the 64 bit stuff has seen little use recently; nothing I'm doing runs 
appreciably faster, and I know the 32 bit code is more used and 
therefore likely to be better debugged. Note "likely" in that. ;-)
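Concretely, the shared config ends up with one stanza per install, differing
mainly in the kernel image and the root= parameter.  Everything below is
illustrative only; device names, paths and versions are invented:

  title Fedora (32-bit)
          root (hd0,0)
          kernel /vmlinuz-2.6.23.i686 ro root=/dev/md1
          initrd /initrd-2.6.23.i686.img

  title Fedora (64-bit)
          root (hd0,0)
          kernel /vmlinuz-2.6.23.x86_64 ro root=/dev/md2
          initrd /initrd-2.6.23.x86_64.img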

> course, in my case I also made /lib/modules its own partition and also
> shared it between all the installs so that I could manually edit the
> various kernel boot params to specify different root partitions and in
> so doing I could boot a RHEL5 kernel using a RHEL4 install and vice
> versa.  But if you do that, you have to manually
> patch /etc/rc.d/rc.sysinit to mount the /lib/modules partition before
> ever trying to do anything with modules (and you have to mount it rw so
> they can do a depmod if needed), then remount it ro for the fsck, then
> it gets remounted rw again after the fs check.  It was a pain in the ass
> to maintain because every update to initscripts would wipe out the patch
> and if you forgot to repatch the file, the system wouldn't boot and
> you'd have to boot into another install, mount the / partition of the
> broken install, patch the file, then it would work again in that
> install.
>
>   
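By way of illustration, the rc.sysinit change being described amounts to
something like the following; the device name, filesystem type and exact
placement are assumptions, not the actual patch:

  # mount the shared modules partition read-write before anything touches modules
  mount -t ext3 -o rw /dev/md3 /lib/modules
  depmod -A                          # let depmod refresh module dependencies if needed
  mount -o remount,ro /lib/modules   # read-only again so the filesystem check runs cleanly
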
That sounds like *way* more complexity than appeals to me. I stand in 
awe, but have no urge to join you.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-27 21:11                             ` Doug Ledford
@ 2007-10-29  0:48                               ` Bill Davidsen
  0 siblings, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-10-29  0:48 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Linux RAID Mailing List

Doug Ledford wrote:
> On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
>   
>> Actually, after doing some research, here's what I've found:
>>     
> I should note that both the lvm code and raid code are simplistic at the
> moment.  For example, the raid5 mapping only supports the default raid5
> layout.  If you use any other layout, game over.  Getting it to work
> with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
> getting it to the point where it handles all the relevant setups
> properly would require a reasonable amount of coding.
>
>   
My first thought is that after the /boot partition is read (assuming you 
use one) restrictions go away. Performance of /boot is not much of an 
issue, for me at least, but more complex setups are sometimes needed for 
the rest of the system.

Thanks for the research.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-28 17:47                               ` Doug Ledford
@ 2007-10-29  8:41                                 ` Luca Berra
  2007-10-29 15:30                                   ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Luca Berra @ 2007-10-29  8:41 UTC (permalink / raw)
  To: linux-raid

On Sun, Oct 28, 2007 at 01:47:55PM -0400, Doug Ledford wrote:
>On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
>> On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
>> >It was only because I wasn't using mdadm in the initrd and specifying
>> >uuids that it found the right devices to start and ignored the whole
>> >disk devices.  But, when I later made some more devices and went to
>> >update the mdadm.conf file using mdadm -Eb, it found the devices and
>> >added it to the mdadm.conf.  If I hadn't checked it before remaking my
>> >initrd, it would have hosed the system.  And it would have passed all
>> the above is not clear to me, afair redhat initrd still uses
>> raidautorun,
>
>RHEL does, but this is on a personal machine I installed Fedora on, and
>latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
>starts the needed devices using the UUID.  My first sentence above
>should have read that I *was* using mdadm.
ah, ok i should look again at fedora's mkinitrd, last one i checked was
6.0.9-1 and i see mdadm was added in 6.0.9-2

>>  which iirc does not work with recent superblocks,
>> so you used uuids on kernel command line?
>> or you use something else for initrd?
>> why would remaking the initrd break it?
>
>Remaking the initrd installs the new mdadm.conf file, which would have
>then contained the whole disk devices and its UUID.  Therein would
>have been the problem.
yes, i read the patch, i don't like that code, as i don't like most of
what has been put in mkinitrd from 5.0 onward.
Imho the correct thing here would not have been copying the existing
mdadm.conf but generating a safe one from output of mdadm -D (note -D,
not -E)

>> >the tests you can throw at it.  Quite simply, there is no way to tell
>> >the difference between those two situations with 100% certainty.  Mdadm
>> >tries to be smart and start the newest devices, but Luca's original
>> >suggestion of skip the partition scanning in the kernel and figure it
>> >out from user space would not have shown mdadm the new devices and would
>> >have gotten it wrong every time.
>> yes, in this particular case it would have; congratulations, you found a new
>> creative way of shooting yourself in the feet.
>
>Creative, not so much.  I just backed out of what I started and tried
>something else.  Lots of people do that.
>
>> maybe mdadm should do checks when creating a device to prevent this kind
>> of mistake.
>> i.e.
>> if creating an array on a partition, check the whole device for a
>> superblock and refuse in case it finds one
>> 
>> if creating an array on a whole device that has a partition table,
>> either require --force, or check for superblocks in every possible
>> partition.
>
>What happens if you add the partition table *after* you make the whole
>disk device and there are stale superblocks in the partitions?  This
>still isn't infallible.
It depends on what you do with that partitioned device *after* having
created the partition table.
- If you try again to run mdadm on it (and the above is implemented) it
would fail, and you will be given a chance to wipe the stale sb.
- If you don't, and use them as plain devices, _and_ leave the line in
mdadm.conf, you will suffer a lot of pain. Since the problem is known and
since fdisk/sfdisk/parted already do a lot of checks on the device, this
could be another useful one.
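For what it's worth, the manual version of such a check is straightforward;
a sketch, assuming /dev/sdb is the whole device and /dev/sdb1 the partition
about to be used:

  mdadm --examine /dev/sdb           # look for a leftover whole-disk superblock
  mdadm --examine /dev/sdb1          # and for stale ones inside the partition
  mdadm --zero-superblock /dev/sdb   # wipe a stale superblock once you are sure it is unused
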

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-29  8:41                                 ` Luca Berra
@ 2007-10-29 15:30                                   ` Doug Ledford
  2007-10-29 21:44                                     ` Luca Berra
  0 siblings, 1 reply; 88+ messages in thread
From: Doug Ledford @ 2007-10-29 15:30 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:

> >Remaking the initrd installs the new mdadm.conf file, which would have
> >then contained the whole disk devices and its UUID.  Therein would
> >have been the problem.
> yes, i read the patch, i don't like that code, as i don't like most of
> what has been put in mkinitrd from 5.0 onward.
> Imho the correct thing here would not have been copying the existing
> mdadm.conf but generating a safe one from output of mdadm -D (note -D,
> not -E)

I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-29 15:30                                   ` Doug Ledford
@ 2007-10-29 21:44                                     ` Luca Berra
  2007-10-29 23:05                                       ` Doug Ledford
  0 siblings, 1 reply; 88+ messages in thread
From: Luca Berra @ 2007-10-29 21:44 UTC (permalink / raw)
  To: linux-raid

On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
>On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
>
>> >Remaking the initrd installs the new mdadm.conf file, which would have
>> >then contained the whole disk devices and its UUID.  Therein would
>> >have been the problem.
>> yes, i read the patch, i don't like that code, as i don't like most of
>> what has been put in mkinitrd from 5.0 onward.
in case you wonder i am referring to things like

emit dm create "$1" $UUID $(/sbin/dmsetup table "$1")

>> Imho the correct thing here would not have been copying the existing
>> mdadm.conf but generating a safe one from output of mdadm -D (note -D,
>> not -E)
>
>I'm not sure I'd want that.  Besides, what makes you say -D is safer
>than -E?

"mdadm -D  /dev/mdX" works on an active md device, so i strongly doubt the information
gathered from there would be stale
while "mdadm -Es" will scan disk devices for md superblock, thus
possibly even finding stale superblocks or leftovers.
I would strongly recommend against blindly doing "mdadm -Es >>
/etc/mdadm.conf" and not supervising the result.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-29 21:44                                     ` Luca Berra
@ 2007-10-29 23:05                                       ` Doug Ledford
  2007-10-30  3:10                                         ` Neil Brown
  2007-10-30  6:55                                         ` Luca Berra
  0 siblings, 2 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-29 23:05 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2781 bytes --]

On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote:
> On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
> >On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
> >
> >> >Remaking the initrd installs the new mdadm.conf file, which would have
> >> >then contained the whole disk devices and its UUID.  Therein would
> >> >have been the problem.
> >> yes, i read the patch, i don't like that code, as i don't like most of
> >> what has been put in mkinitrd from 5.0 onward.
> in case you wonder i am referring to things like
> 
> emit dm create "$1" $UUID $(/sbin/dmsetup table "$1")

I make no judgments on the dm setup stuff, I know too little about the
dm stack to be qualified.

> >> Imho the correct thing here would not have been copying the existing
> >> mdadm.conf but generating a safe one from output of mdadm -D (note -D,
> >> not -E)
> >
> >I'm not sure I'd want that.  Besides, what makes you say -D is safer
> >than -E?
> 
> "mdadm -D  /dev/mdX" works on an active md device, so i strongly doubt the information
> gathered from there would be stale
> while "mdadm -Es" will scan disk devices for md superblock, thus
> possibly even finding stale superblocks or leftovers.
> I would strongly recommend against blindly doing "mdadm -Es >>
> /etc/mdadm.conf" and not supervising the result.

Well, I agree that blindly doing mdadm -Esb >> mdadm.conf would be bad,
but that's not what mkinitrd is doing; it uses the mdadm.conf that's
in place, so you can update the mdadm.conf whenever you find it
appropriate.

And I agree -D has less chance of finding a stale superblock, but it's
also true that it has no chance of finding non-stale superblocks on
devices that aren't even started.  So, as a method of getting all the
right information in the event of system failure and rescuecd boot, it
leaves something to be desired ;-)  In other words, I'd rather use a
mode that finds everything and lets me remove the stale than a mode that
might miss something.  But, that's a matter of personal choice.
Considering that we only ever update mdadm.conf automatically during
installs, and after that the user makes any mdadm.conf changes manually
themselves, they are free to use whichever they prefer.

The one thing I *do* like about mdadm -E over -D is it includes the
superblock format in its output.  The one thing I don't like, is it
almost universally gets the name wrong.  What I really want is a brief
query format that both gives me the right name (-D) and the superblock
format (-E).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-29 23:05                                       ` Doug Ledford
@ 2007-10-30  3:10                                         ` Neil Brown
  2007-10-30  6:55                                         ` Luca Berra
  1 sibling, 0 replies; 88+ messages in thread
From: Neil Brown @ 2007-10-30  3:10 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Luca Berra, linux-raid

On Monday October 29, dledford@redhat.com wrote:
> 
> The one thing I *do* like about mdadm -E above -D is it includes the
> superblock format in its output.  The one thing I don't like, is it
> almost universally gets the name wrong.  What I really want is a brief
> query format that both gives me the right name (-D) and the superblock
> format (-E).
> 


You need only ask :-)

Following patch will be in next release.  Thanks for the suggestion.

NeilBrown


### Diffstat output
 ./Detail.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff .prev/Detail.c ./Detail.c
--- .prev/Detail.c	2007-10-30 14:04:25.000000000 +1100
+++ ./Detail.c	2007-10-30 14:08:28.000000000 +1100
@@ -143,7 +143,10 @@ int Detail(char *dev, int brief, int exp
 	}
 
 	if (brief)
-		printf("ARRAY %s level=%s num-devices=%d", dev, c?c:"-unknown-",array.raid_disks );
+		printf("ARRAY %s level=%s metadata=%d.%d num-devices=%d", dev,
+		       c?c:"-unknown-",
+		       array.major_version, array.minor_version,
+		       array.raid_disks );
 	else {
 		mdu_bitmap_file_t bmf;
 		unsigned long long larray_size;
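With that change, the brief output would show the metadata format alongside
the array name; roughly like this, with the array details invented for
illustration:

  # mdadm -Db /dev/md0
  ARRAY /dev/md0 level=raid1 metadata=1.0 num-devices=2 UUID=...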


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-26 14:18                         ` Bill Davidsen
  2007-10-26 18:41                           ` Doug Ledford
@ 2007-10-30  3:25                           ` Neil Brown
  2007-11-02 12:31                             ` Bill Davidsen
  1 sibling, 1 reply; 88+ messages in thread
From: Neil Brown @ 2007-10-30  3:25 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: David Greaves, Jeff Garzik, Doug Ledford, John Stoffel,
	Justin Piszcz, linux-raid

On Friday October 26, davidsen@tmr.com wrote:
> 
> Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
> beginning? Isn't hindsight wonderful?
> 

Those names seem good to me.  I wonder if it is safe to generate them
in "-Eb" output....

Maybe the key confusion here is between "version" numbers and
"revision" numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. "Here is my version of what happened, now
let's hear yours".
When you have multiple revisions, you do assume ongoing improvement.

v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which
itself is a revision of the v0...

NeilBrown

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-29 23:05                                       ` Doug Ledford
  2007-10-30  3:10                                         ` Neil Brown
@ 2007-10-30  6:55                                         ` Luca Berra
  2007-10-30 16:48                                           ` Doug Ledford
  1 sibling, 1 reply; 88+ messages in thread
From: Luca Berra @ 2007-10-30  6:55 UTC (permalink / raw)
  To: linux-raid

On Mon, Oct 29, 2007 at 07:05:42PM -0400, Doug Ledford wrote:
>And I agree -D has less chance of finding a stale superblock, but it's
>also true that it has no chance of finding non-stale superblocks on
Well it might be a matter of personal preference, but i would prefer
an initrd doing just the minimum necessary to mount the root filesystem
(and/or activating resume from a swap device), and leaving all the rest
to initscripts, rather than an initrd that tries to do everything.

>devices that aren't even started.  So, as a method of getting all the
>right information in the event of system failure and rescuecd boot, it
>leaves something to be desired ;-)  In other words, I'd rather use a
>mode that finds everything and lets me remove the stale than a mode that
>might miss something.  But, that's a matter of personal choice.
In case of a rescuecd boot, you will probably not have any md devices
activated, and you will probably run "mdadm -Es" to check what md arrays are
available; the data should still be on the disk, else you would be hosed
anyway.

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-30  6:55                                         ` Luca Berra
@ 2007-10-30 16:48                                           ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-10-30 16:48 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

On Tue, 2007-10-30 at 07:55 +0100, Luca Berra wrote:

> Well it might be a matter of personal preference, but i would prefer
> an initrd doing just the minumum necessary to mount the root filesystem
> (and/or activating resume from a swap device), and leaving all the rest
> to initscripts, then an initrd that tries to do everything.

The initrd does exactly that.  The rescan for superblocks does not
happen in initrd or mkinitrd; it must be done manually.  The code in
mkinitrd uses the mdadm.conf file as it stands, but in the initrd image
it doesn't start all the arrays, just the needed arrays to get booted
into your / partition.
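For example, the assembly step in the initrd amounts to little more than
something like this; the device name, UUID and paths below are made up for
illustration:

  # assemble only the array holding /, by UUID, using the bundled mdadm.conf
  mdadm --assemble /dev/md0 --config=/etc/mdadm.conf --uuid=<root-array-uuid> --run
  mount /dev/md0 /sysroot    # then switch root and let the initscripts start the rest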

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-24  0:42               ` Doug Ledford
  2007-10-24  9:40                 ` David Greaves
  2007-10-24 20:22                 ` Bill Davidsen
@ 2007-11-01 21:02                 ` H. Peter Anvin
  2007-11-02 15:50                   ` Doug Ledford
  2 siblings, 1 reply; 88+ messages in thread
From: H. Peter Anvin @ 2007-11-01 21:02 UTC (permalink / raw)
  To: Doug Ledford; +Cc: John Stoffel, Michael Tokarev, linux-raid

Doug Ledford wrote:
>>
>> I would argue that ext[234] should be clearing those 512 bytes.  Why
>> aren't they cleared  
> 
> Actually, I didn't think msdos used the first 512 bytes for the same
> reason ext3 doesn't: space for a boot sector.
> 

The creators of MS-DOS put the superblock in the bootsector, so that the 
BIOS loads them both.  It made sense in some diseased Microsoft 
programmer's mind.

Either way, for RAID-1 booting, the boot sector really should be part of 
the protected area (and go through the MD stack.)  The bootloader should 
deal with the offset problem by storing partition/filesystem-relative 
pointers, not absolute ones.

	-hpa

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-10-30  3:25                           ` Neil Brown
@ 2007-11-02 12:31                             ` Bill Davidsen
  0 siblings, 0 replies; 88+ messages in thread
From: Bill Davidsen @ 2007-11-02 12:31 UTC (permalink / raw)
  To: Neil Brown
  Cc: David Greaves, Jeff Garzik, Doug Ledford, John Stoffel,
	Justin Piszcz, linux-raid

Neil Brown wrote:
> On Friday October 26, davidsen@tmr.com wrote:
>   
>> Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
>> beginning? Isn't hindsight wonderful?
>>
>>     
>
> Those names seem good to me.  I wonder if it is safe to generate them
> in "-Eb" output....
>
>   
If you agree that they are better, using them in the obvious places 
would be better now than later. Are you going to put them in the 
metadata options as well? Let me know; looking at the documentation is on 
my list for next week, and I could include some text.
> Maybe the key confusion here is between "version" numbers and
> "revision" numbers.
> When you have multiple versions, there is no implicit assumption that
> one is better than another. "Here is my version of what happened, now
> let's hear yours".
> When you have multiple revisions, you do assume ongoing improvement.
>
> v1.0  v1.1 and v1.2 are different version of the v1 superblock, which
> itself is a revision of the v0...
>   

As with kernel releases, people assume that the first number means *big* 
changes and the second incremental changes.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: Time to  deprecate old RAID formats?
  2007-11-01 21:02                 ` H. Peter Anvin
@ 2007-11-02 15:50                   ` Doug Ledford
  0 siblings, 0 replies; 88+ messages in thread
From: Doug Ledford @ 2007-11-02 15:50 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: John Stoffel, Michael Tokarev, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> >>
> >> I would argue that ext[234] should be clearing those 512 bytes.  Why
> >> aren't they cleared  
> > 
> > Actually, I didn't think msdos used the first 512 bytes for the same
> > reason ext3 doesn't: space for a boot sector.
> > 
> 
> The creators of MS-DOS put the superblock in the bootsector, so that the 
> BIOS loads them both.  It made sense in some diseased Microsoft 
> programmer's mind.
> 
> Either way, for RAID-1 booting, the boot sector really should be part of 
> the protected area (and go through the MD stack.)

It depends on what you are calling the protected area.  If by that you
mean outside the filesystem itself, and in a non-replicated area like
where the superblock and internal bitmaps go, then yes, that would be
ideal.  If you mean in the file system proper, then that depends on the
boot loader.

>   The bootloader should 
> deal with the offset problem by storing partition/filesystem-relative 
> pointers, not absolute ones.

Grub2 is on the way to this, but it isn't there yet.

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2007-11-02 15:50 UTC | newest]

Thread overview: 88+ messages
2007-10-19 14:34 Time to deprecate old RAID formats? John Stoffel
2007-10-19 15:09 ` Justin Piszcz
2007-10-19 15:46   ` John Stoffel
2007-10-19 16:15     ` Doug Ledford
2007-10-19 16:35       ` Justin Piszcz
2007-10-19 16:38       ` John Stoffel
2007-10-19 16:40         ` Justin Piszcz
2007-10-19 16:44           ` John Stoffel
2007-10-19 16:45             ` Justin Piszcz
2007-10-19 17:04               ` Doug Ledford
2007-10-19 17:05                 ` Justin Piszcz
2007-10-19 17:23                   ` Doug Ledford
2007-10-19 17:47                     ` Justin Piszcz
2007-10-20 18:38                       ` Michael Tokarev
2007-10-20 20:02                         ` Doug Ledford
2007-10-19 22:43                     ` chunk size (was Re: Time to deprecate old RAID formats?) Michal Soltys
2007-10-20 13:29                       ` Doug Ledford
2007-10-23 19:21                         ` Michal Soltys
2007-10-24  0:14                           ` Doug Ledford
2007-10-19 17:11         ` Time to deprecate old RAID formats? Doug Ledford
2007-10-19 18:39           ` John Stoffel
2007-10-19 21:23             ` Iustin Pop
2007-10-19 21:42               ` Doug Ledford
2007-10-20  7:53                 ` Iustin Pop
2007-10-20 13:11                   ` Doug Ledford
2007-10-26  9:54                     ` Luca Berra
2007-10-26 16:22                       ` Gabor Gombas
2007-10-26 17:06                         ` Gabor Gombas
2007-10-27 10:34                           ` Luca Berra
2007-10-26 18:52                       ` Doug Ledford
2007-10-26 22:30                         ` Gabor Gombas
2007-10-28  0:26                           ` Doug Ledford
2007-10-28 14:13                             ` Luca Berra
2007-10-28 17:47                               ` Doug Ledford
2007-10-29  8:41                                 ` Luca Berra
2007-10-29 15:30                                   ` Doug Ledford
2007-10-29 21:44                                     ` Luca Berra
2007-10-29 23:05                                       ` Doug Ledford
2007-10-30  3:10                                         ` Neil Brown
2007-10-30  6:55                                         ` Luca Berra
2007-10-30 16:48                                           ` Doug Ledford
2007-10-27  8:00                         ` Luca Berra
2007-10-27 20:09                           ` Doug Ledford
2007-10-28 13:46                             ` Luca Berra
2007-10-23 23:09                 ` Bill Davidsen
2007-10-23 23:03             ` Bill Davidsen
2007-10-24  0:09               ` Doug Ledford
2007-10-24 23:55                 ` Neil Brown
2007-10-25  0:09                   ` Jeff Garzik
2007-10-25  8:09                     ` David Greaves
2007-10-26  6:16                       ` Neil Brown
2007-10-26 14:18                         ` Bill Davidsen
2007-10-26 18:41                           ` Doug Ledford
2007-10-26 22:20                             ` Gabor Gombas
2007-10-26 22:58                               ` Doug Ledford
2007-10-27 11:11                               ` Luca Berra
2007-10-27 15:20                             ` Bill Davidsen
2007-10-28  0:18                               ` Doug Ledford
2007-10-29  0:44                                 ` Bill Davidsen
2007-10-27 21:11                             ` Doug Ledford
2007-10-29  0:48                               ` Bill Davidsen
2007-10-30  3:25                           ` Neil Brown
2007-11-02 12:31                             ` Bill Davidsen
2007-10-25  7:01                   ` Doug Ledford
2007-10-25 14:49                   ` Bill Davidsen
2007-10-25 15:00                     ` David Greaves
2007-10-26  5:56                     ` Neil Brown
2007-10-24 14:00               ` John Stoffel
2007-10-24 15:18                 ` Mike Snitzer
2007-10-24 15:32                 ` Bill Davidsen
2007-10-20 14:09       ` Michael Tokarev
2007-10-20 14:24         ` Doug Ledford
2007-10-20 14:52         ` John Stoffel
2007-10-20 15:07           ` Iustin Pop
2007-10-20 15:36             ` Doug Ledford
2007-10-20 18:24           ` Michael Tokarev
2007-10-22 20:39             ` John Stoffel
2007-10-22 22:29               ` Michael Tokarev
2007-10-24  0:42               ` Doug Ledford
2007-10-24  9:40                 ` David Greaves
2007-10-24 20:22                 ` Bill Davidsen
2007-10-25 16:29                   ` Doug Ledford
2007-11-01 21:02                 ` H. Peter Anvin
2007-11-02 15:50                   ` Doug Ledford
2007-10-24  0:36             ` Doug Ledford
2007-10-23 23:18           ` Bill Davidsen
2007-10-19 16:34     ` Justin Piszcz
2007-10-23 23:19       ` Bill Davidsen
