linux-raid.vger.kernel.org archive mirror
* RFC - new raid superblock layout for md driver
@ 2002-11-20  4:09 Neil Brown
From: Neil Brown @ 2002-11-20  4:09 UTC
  To: linux-kernel, linux-raid


The md driver in Linux uses a 'superblock' written to all devices in a
RAID to record the current state and geometry of the RAID and to allow
the various parts to be re-assembled reliably.

The current superblock layout is sub-optimal.  It contains a lot of
redundancy and wastes space.  In 4K it can only fit 27 component
devices.  It has other limitations.

I (and others) would like to define a new (version 1) format that
resolves the problems in the current (0.90.0) format.

The code in 2.5.latest has all the superblock handling factored out so
that defining a new format is very straightforward.

I would like to propose a new layout, and to receive comments on it.

My current design looks like:
	/* constant array information - 128 bytes */
    u32  md_magic
    u32  major_version == 1
    u32  feature_map     /* bit map of extra features in superblock */
    u32  set_uuid[4]
    u32  ctime
    u32  level
    u32  layout
    u64  size		/* size of component devices, if they are all
			 * required to be the same (RAID 1/5) */
    u32  chunksize
    u32  raid_disks
    char name[32]
    u32  pad1[10];

	/* constant this-device information - 64 bytes */
    u64  address of superblock in device
    u32  number of this device in array  /* constant over reconfigurations */
    u32  device_uuid[4]
    u32  pad3[9]

	/* array state information - 64 bytes */
    u32  utime
    u32  state    /* clean, resync-in-progress */
    u32  sb_csum
    u64  events
    u64  resync-position	/* flag in state if this is valid */
    u32  number of devices
    u32  pad2[8]

	/* device state information, indexed by 'number of device in array' 
	   4 bytes per device */
    for each device:
      u16 position     /* in raid array or 0xffff for a spare. */
      u16 state flags  /* error detected,  in-sync */


This has 128 bytes for essentially constant information about the
array, 64 bytes for constant information about this device, 64 bytes
for changeable state information about the array, and 4 bytes per
device for state information about the devices.  This would allow an
array with 192 devices in a 1K superblock, and 960 devices in a 4K
superblock (the current size).
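
The device counts above follow from (superblock size - 256) / 4.  A
minimal C sketch of that arithmetic; the identifiers (SB_FIXED_BYTES,
sb_max_devices, and so on) are hypothetical names chosen for
illustration and are not part of the proposal or of the md driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Section sizes taken from the proposed layout; names are illustrative. */
enum {
    SB_ARRAY_INFO_BYTES  = 128, /* constant array information */
    SB_DEVICE_INFO_BYTES = 64,  /* constant this-device information */
    SB_ARRAY_STATE_BYTES = 64,  /* array state information */
    SB_FIXED_BYTES       = SB_ARRAY_INFO_BYTES + SB_DEVICE_INFO_BYTES +
                           SB_ARRAY_STATE_BYTES,
    SB_PER_DEVICE_BYTES  = 4    /* u16 position + u16 state flags */
};

/* One per-device entry as described in the layout. */
struct sb_dev_entry {
    uint16_t position;    /* slot in the array, or 0xffff for a spare */
    uint16_t state_flags; /* error detected, in-sync */
};

static_assert(sizeof(struct sb_dev_entry) == SB_PER_DEVICE_BYTES,
              "per-device entry must be 4 bytes");

/* Number of per-device entries that fit after the fixed 256 bytes. */
static inline size_t sb_max_devices(size_t sb_bytes)
{
    if (sb_bytes < (size_t)SB_FIXED_BYTES)
        return 0;
    return (sb_bytes - (size_t)SB_FIXED_BYTES) / SB_PER_DEVICE_BYTES;
}
```

With these definitions, sb_max_devices(1024) gives 192 and
sb_max_devices(4096) gives 960, matching the figures quoted above.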

Other features:
   A feature map instead of a minor version number.
   64 bit component device size field.
   field for storing current position of resync process if array is
       shut down while resync is running.
   no "minor" field but a textual "name" field instead.
   address of superblock in superblock to avoid misidentifying
      superblock. e.g. is it in a partition or a whole device.
   uuid for each device.  This is not directly used by the md driver,
      but it is maintained, even if a drive is moved between arrays, 
      and user-space can use it for tracking devices.
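
As an illustration of the "address of superblock" check, here is a
small C sketch.  It assumes the recorded address is relative to the
start of the device the superblock was written to, and that the reader
knows the device-relative address it read the block from; the function
name and the unit (sectors) are assumptions, not part of the proposal:

```c
#include <stdint.h>

/*
 * Returns nonzero when the address stored inside the superblock equals
 * the device-relative address it was actually read from.  A superblock
 * written to a partition but found by scanning the enclosing whole disk
 * is read from a different device-relative address, so the check fails
 * and the whole disk is not mistaken for the array member.
 */
static int sb_address_matches(uint64_t recorded_sb_addr,
                              uint64_t read_from_addr)
{
    return recorded_sb_addr == read_from_addr;
}
```

For example, if a partition starts 63 sectors into a disk, a superblock
recorded at address 1000 in the partition appears at address 1063 when
read through the whole disk, and sb_address_matches(1000, 1063) rejects
it.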

md would, of course, continue to support the current layout
indefinitely, but this new layout would be available for use by people
who don't need compatibility with 2.4 and do want more than 27 devices
etc. 

To create an array with the new superblock layout, the user-space
tool would write directly to the devices (like mkfs does) and then
assemble the array.  Creating an array using the ioctl interface will
still create an array with the old superblock.

When the kernel loads a superblock, it would check the major_version
to see which piece of code to use to handle it.
When it writes out a superblock, it would use the same version as was
read in (of course).
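
A minimal C sketch of that dispatch.  The struct and function names are
hypothetical, the real handlers would carry load/validate/write
callbacks rather than just a label, and the sketch assumes the current
0.90.0 format stores 0 in major_version:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a per-format handler table; illustrative only. */
struct sb_ops {
    const char *label;
};

static const struct sb_ops sb_v0_ops = { "0.90.0" };
static const struct sb_ops sb_v1_ops = { "1" };

/*
 * Pick the handler from the major_version read off the device.  The same
 * handler would later be used to write the superblock back, so an array
 * keeps the format it was created with.  Unknown versions yield NULL.
 */
static const struct sb_ops *sb_ops_for_version(uint32_t major_version)
{
    switch (major_version) {
    case 0:
        return &sb_v0_ops;
    case 1:
        return &sb_v1_ops;
    default:
        return NULL;
    }
}
```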

This superblock would *not* support in-kernel auto-assembly as that
requires the "minor" field that I have deliberately removed.  However I
don't think this is a big cost as it looks like in-kernel
auto-assembly is about to disappear with the early-user-space patches.

The interpretation of the 'name' field would be up to the user-space
tools and the system administrator.
I imagine having something like:
	host:name
where if "host" isn't the current host name, auto-assembly is not
tried, and if "host" is the current host name then:
  if "name" looks like "md[0-9]*" then the array is assembled as that
    device
  else the array is assembled as /dev/mdN for some large, unused N,
    and a symlink is created from /dev/md/name to /dev/mdN
If the "host" part is empty or non-existent, then the array would be
assembled no matter what the hostname is.  This would be important
e.g. for assembling the device that stores the root filesystem, as we
may not know the host name until after the root filesystem has been
loaded.
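
The rules above could be sketched in user-space C roughly as follows.
This is only an illustration: the enum and function names are made up,
the field is assumed to have been copied into a NUL-terminated string,
and "md[0-9]*" is read here as "md" followed by one or more digits,
which is one possible interpretation:

```c
#include <ctype.h>
#include <string.h>

enum assemble_action {
    SKIP_FOREIGN_HOST,     /* "host" names another machine: do nothing */
    ASSEMBLE_AS_NAMED_MD,  /* "name" is mdN: assemble as that device */
    ASSEMBLE_WITH_SYMLINK  /* assemble as some unused mdN, link /dev/md/name */
};

/* Assumed reading of "md[0-9]*": "md" plus at least one digit. */
static int is_md_device_name(const char *s)
{
    if (strncmp(s, "md", 2) != 0 || s[2] == '\0')
        return 0;
    for (s += 2; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/*
 * Split "host:name" and decide what to do.  A missing colon or an empty
 * host part means "assemble on any host".  *name_out points at the name
 * part inside `field`.
 */
static enum assemble_action classify_name(const char *field,
                                          const char *hostname,
                                          const char **name_out)
{
    const char *colon = strchr(field, ':');
    const char *name = colon ? colon + 1 : field;
    size_t host_len = colon ? (size_t)(colon - field) : 0;

    *name_out = name;
    if (host_len != 0 &&
        (strlen(hostname) != host_len ||
         strncmp(field, hostname, host_len) != 0))
        return SKIP_FOREIGN_HOST;

    return is_md_device_name(name) ? ASSEMBLE_AS_NAMED_MD
                                   : ASSEMBLE_WITH_SYMLINK;
}
```

Under these assumptions "alpha:md3" is assembled as md3 on host alpha
and skipped elsewhere, while ":root" is assembled on any host and gets a
symlink under /dev/md/.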

This would make auto-assembly much more flexible.

Comments welcome.

NeilBrown

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 15:55 Steve Pratt
From: Steve Pratt @ 2002-11-20 15:55 UTC
  To: Neil Brown; +Cc: linux-kernel, linux-raid


Neil Brown wrote:

>I would like to propose a new layout, and to receive comment on it..


 >/* constant this-device information - 64 bytes */
    >u64  address of superblock in device
    >u32  number of this device in array  /* constant over reconfigurations */

 Does this mean that there can be holes in the numbering for disks that die
    and are replaced?

    >u32  device_uuid[4]
    >u32  pad3[9]

 >/* array state information - 64 bytes */
    >u32  utime
    >u32  state    /* clean, resync-in-progress */
    >u32  sb_csum

 These next 2 fields are not 64 bit aligned. Either rearrange or add
    padding.
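
 The misalignment can be checked by summing field sizes in the proposed
 order.  A minimal C sketch, assuming the array-state section starts at
 byte 192 (after the 128-byte and 64-byte sections) and that fields are
 packed with no implicit padding; the table and function names are
 illustrative only:

```c
#include <stddef.h>

/* Array-state section starts after the 128 + 64 byte sections. */
enum { ARRAY_STATE_BASE = 128 + 64 };

struct field {
    const char *name;
    size_t size; /* in bytes */
};

/* Fields in the order given in the proposal. */
static const struct field array_state_fields[] = {
    { "utime", 4 },
    { "state", 4 },
    { "sb_csum", 4 },
    { "events", 8 },
    { "resync_position", 8 },
    { "number_of_devices", 4 },
};

/* Byte offset of field `idx` within the superblock, packed layout. */
static size_t packed_offset(size_t idx)
{
    size_t off = ARRAY_STATE_BASE;
    for (size_t i = 0; i < idx; i++)
        off += array_state_fields[i].size;
    return off;
}

/* A field is naturally aligned when its offset is a multiple of its size. */
static int is_naturally_aligned(size_t idx)
{
    return packed_offset(idx) % array_state_fields[idx].size == 0;
}
```

 With this order, events lands at offset 204 and resync_position at 212,
 neither of which is a multiple of 8.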

    >u64  events
    >u64  resync-position     /* flag in state if this is valid */
    >u32  number of devices
    >u32  pad2[8]



>Other features:
   >A feature map instead of a minor version number.

Good.

   >64 bit component device size field.

Size in sectors not blocks please.


   >no "minor" field but a textual "name" field instead.

Ok, I assume that there will be some way for userspace to query the minor
   which gets dynamically assigned when the array is started.

   >address of superblock in superblock to avoid misidentifying superblock.
   >e.g. is it in a partition or a whole device.

Really needed this.


>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.

Yes, so let's leave this out of this discussion.


EVMS 2.0 with full user-space discovery should be able to support the new
superblock format without any problems. We would like to work together on
this new format.

Keep up the good work, Steve

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 23:47 Lars Marowsky-Bree
From: Lars Marowsky-Bree @ 2002-11-20 23:47 UTC
  To: Neil Brown; +Cc: linux-kernel, linux-raid

>The md driver in linux uses a 'superblock' written to all devices in a
>RAID to record the current state and geometry of a RAID and to allow
>the various parts to be re-assembled reliably.
>
>The current superblock layout is sub-optimal.  It contains a lot of
>redundancy and wastes space.  In 4K it can only fit 27 component
>devices.  It has other limitations.

Yes. (In particular, getting all the various counters to agree with each other
is a pain ;-)

Steven raises the valid point that multihost operation isn't currently
possible; I just don't agree with his solution:

- Activating a drive only on one host is already entirely possible.
  (can be done by device uuid in initrd for example)
- Activating a RAID device from multiple hosts is still not possible.
  (Requires way more sophisticated locking support than we currently have)
  
However, for non-RAID devices like multipathing I believe that activating a
drive on multiple hosts should be possible; i.e., for these it might not be
necessary to scribble to the superblock every time.

(The md patch for 2.4 I sent you already does that; it reconstructs the
available paths fully dynamically on startup (by activating all paths
present); however, it still updates the superblock afterwards.)

Anyway, on to the format:

>The code in 2.5.latest has all the superblock handling factored out so
>that defining a new format is very straightforward.
>
>I would like to propose a new layout, and to receive comment on it..
>
>My current design looks like:
>	/* constant array information - 128 bytes */
>   u32  md_magic
>   u32  major_version == 1
>   u32  feature_map     /* bit map of extra features in superblock */
>   u32  set_uuid[4]
>   u32  ctime
>   u32  level
>   u32  layout
>   u64  size		/* size of component devices, if they are all
>			 * required to be the same (Raid 1/5 */
>   u32  chunksize
>   u32  raid_disks
>   char name[32]
>   u32  pad1[10];

Looks good so far.

>	/* constant this-device information - 64 bytes */
>   u64  address of superblock in device
>   u32  number of this device in array  /* constant over reconfigurations 
>   */
>   u32  device_uuid[4]

What is "address of superblock in device"? It seems redundant; otherwise you
would have been unable to read it. Or am I missing something?

Special case here might be required for multipathing. (ie, device_uuid == 0)

>   u32  pad3[9]
>
>	/* array state information - 64 bytes */
>   u32  utime

Timestamps (also above, ctime) are always difficult. Time might not be set
correctly at any given time, in particular during early bootup. This field
should only be advisory.

>   u32  state    /* clean, resync-in-progress */
>   u32  sb_csum
>   u64  events
>   u64  resync-position	/* flag in state if this is valid */
>   u32  number of devices
>   u32  pad2[8]
>
>	/* device state information, indexed by 'number of device in array' 
>	   4 bytes per device */
>   for each device:
>     u16 position     /* in raid array or 0xffff for a spare. */
>     u16 state flags  /* error detected,  in-sync */

u16 != u32; your position flags don't match up. I'd like to be able to take
the "position in the superblock" as a mapping here so it can be found in this
list, or what is the proposed relationship between the two?

>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.
>I imagine having something like:
>	host:name
>where if "host" isn't the current host name, auto-assembly is not
>tried, and if "host" is the current host name then:

Oh, well. You seem to sort of have Steven's idea here too ;-) In that case,
I'd go with the idea of Steven. Make that field a uuid of the host.



Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Principal Squirrel 
SuSE Labs - Research & Development, SuSE Linux AG
  
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
  -- Capt. Edward A. Murphy            -- Louis Pasteur


Thread overview: 51+ messages
2002-11-20  4:09 RFC - new raid superblock layout for md driver Neil Brown
2002-11-20 10:03 ` Anton Altaparmakov
2002-11-20 23:02   ` Neil Brown
2002-11-22  0:08   ` Kenneth D. Merry
2002-12-09  3:52     ` Neil Brown
2002-12-09 23:50       ` large await discrepancies Joe Pruett
2002-12-10 15:59         ` Joe Pruett
2002-12-12 15:30           ` Joe Pruett
2002-12-10  6:28       ` RFC - new raid superblock layout for md driver Kenneth D. Merry
2002-12-11  0:07         ` Neil Brown
2002-11-20 13:58 ` Bill Rugolsky Jr.
2002-11-20 23:17   ` Neil Brown
2002-11-20 14:09 ` Alan Cox
2002-11-20 23:11   ` Neil Brown
2002-11-21  0:30     ` Alan Cox
2002-11-21  0:10       ` John Adams
2002-11-21  0:30     ` Alan Cox
2002-11-20 16:03 ` Joel Becker
2002-11-20 23:31   ` Neil Brown
2002-11-21  1:46     ` Doug Ledford
2002-11-21 19:34       ` Joel Becker
2002-11-21 19:54         ` Doug Ledford
2002-11-21 19:57           ` Steven Dake
2002-11-21 20:38             ` Doug Ledford
2002-11-21 20:49               ` Steven Dake
2002-11-21 20:35                 ` Kevin Corry
2002-11-21 21:29             ` Alan Cox
2002-11-21 21:22               ` Doug Ledford
2002-11-21 20:53                 ` Kevin Corry
2002-11-21 21:55                   ` Doug Ledford
2002-11-21 23:49               ` DM vs MD (Was: RFC - new raid superblock layout for md driver) Luca Berra
2002-11-21 20:06           ` RFC - new raid superblock layout for md driver Joel Becker
2002-11-21 23:35           ` Luca Berra
2002-11-22 10:13   ` Joe Thornber
2002-12-02 21:38     ` Neil Brown
2002-12-03  8:24       ` Luca Berra
2002-11-20 17:05 ` Steven Dake
2002-11-20 23:30   ` Lars Marowsky-Bree
2002-11-20 23:48   ` Neil Brown
2002-11-21  0:29     ` Steven Dake
2002-11-21 15:23       ` John Stoffel
2002-11-21 19:36   ` Joel Becker
2002-11-22  7:11 ` Jeremy Fitzhardinge
  -- strict thread matches above, loose matches on Subject: below --
2002-11-20 15:55 Steve Pratt
2002-11-20 23:24 ` Neil Brown
2002-11-20 23:47 Lars Marowsky-Bree
2002-11-21  0:31 ` Neil Brown
2002-11-21  0:35 ` Steven Dake
2002-11-21  1:10   ` Alan Cox
2002-12-08 22:35   ` Neil Brown
2002-11-21 19:39 ` Joel Becker
