* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 15:55 Steve Pratt
From: Steve Pratt @ 2002-11-20 15:55 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, linux-raid

Neil Brown wrote:

>I would like to propose a new layout, and to receive comment on it..

>/* constant this-device information - 64 bytes */
>u64 address of superblock in device
>u32 number of this device in array /* constant over reconfigurations */

Does this mean that there can be holes in the numbering for disks that
die and are replaced?

>u32 device_uuid[4]
>u32 pad3[9]

>/* array state information - 64 bytes */
>u32 utime
>u32 state /* clean, resync-in-progress */
>u32 sb_csum

These next 2 fields are not 64 bit aligned. Either rearrange or add
padding.

>u64 events
>u64 resync-position /* flag in state if this is valid */
>u32 number of devices
>u32 pad2[8]

>Other features:
>A feature map instead of a minor version number.

Good.

>64 bit component device size field.

Size in sectors not blocks please.

>no "minor" field but a textual "name" field instead.

Ok, I assume that there will be some way for userspace to query the
minor which gets dynamically assigned when the array is started.

>address of superblock in superblock to avoid misidentifying
>superblock. e.g. is it in a partition or a whole device.

Really needed this.

>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.

Yes, so let's leave this out of this discussion. EVMS 2.0 with full
user-space discovery should be able to support the new superblock
format without any problems. We would like to work together on this
new format.

Keep up the good work,
Steve

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 23:24 Neil Brown
From: Neil Brown @ 2002-11-20 23:24 UTC (permalink / raw)
To: Steve Pratt; +Cc: linux-kernel, linux-raid

On Wednesday November 20, slpratt@us.ibm.com wrote:
>
> Neil Brown wrote:
>
> >I would like to propose a new layout, and to receive comment on it..
> >
> >/* constant this-device information - 64 bytes */
> >u64 address of superblock in device
> >u32 number of this device in array /* constant over reconfigurations
> */
>
> Does this mean that there can be holes in the numbering for disks that
> die and are replaced?

Yes.  When a drive is added to an array it gets a number which it keeps
for its life in the array.  This is completely separate from the number
that says what its role in the array is.
This number, together with the set_uuid, forms the 'name' of the device
as long as it is part of the array.
So there could well be holes in the numbering of devices, but in
general the set of numbers would be fairly dense (the max number of
holes is the max number of hot-spares that you have had in the array at
any one time).

>
> >u32 device_uuid[4]
> >u32 pad3[9]
>
> >/* array state information - 64 bytes */
> >u32 utime
> >u32 state /* clean, resync-in-progress */
> >u32 sb_csum
>
> These next 2 fields are not 64 bit aligned. Either rearrange or add
> padding.

Thanks.  I think I did check that once, but then I changed things
again :-(
Actually, making utime a u64 makes this properly aligned again, but I
will group the u64s together at the top.

>
> >u64 events
> >u64 resync-position /* flag in state if this is valid */
> >u32 number of devices
> >u32 pad2[8]
>
> >Other features:
> >A feature map instead of a minor version number.
>
> Good.
>
> >64 bit component device size field.
>
> Size in sectors not blocks please.

Another possibility that I considered was a size in chunks, but sectors
is less confusing.

>
> >no "minor" field but a textual "name" field instead.
>
> Ok, I assume that there will be some way for userspace to query the
> minor which gets dynamically assigned when the array is started.

Well, actually it is user-space which dynamically assigns a minor.  It
then has the option of recording, possibly as a symlink in /dev, the
relationship between the 'name' of the array and the dynamically
assigned minor.

>
> >address of superblock in superblock to avoid misidentifying
> >superblock. e.g. is it in a partition or a whole device.
>
> Really needed this.
>
> >The interpretation of the 'name' field would be up to the user-space
> >tools and the system administrator.
>
> Yes, so let's leave this out of this discussion.

... except to show that it is sufficient to meet the needs of users.

Thanks for your comments,
NeilBrown

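To make the alignment point above concrete, here is a small stand-alone
C sketch (not from the thread; the struct names are invented).  Laid out
exactly as proposed, 'events' starts at byte 12 of the array-state
block; grouping the u64s first, or widening utime to a u64, puts both
64-bit fields back on 8-byte boundaries:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* The "array state" block exactly as listed in the proposal; packed
     * so the offsets match the on-disk byte layout rather than the
     * compiler's natural padding. */
    struct state_as_proposed {
            uint32_t utime;
            uint32_t state;
            uint32_t sb_csum;
            uint64_t events;           /* byte offset 12: misaligned */
            uint64_t resync_position;  /* byte offset 20: misaligned */
    } __attribute__((packed));

    /* One possible rearrangement: the u64s grouped at the top. */
    struct state_rearranged {
            uint64_t events;           /* byte offset 0 */
            uint64_t resync_position;  /* byte offset 8 */
            uint32_t utime;
            uint32_t state;
            uint32_t sb_csum;
    } __attribute__((packed));

    int main(void)
    {
            printf("proposed:   events at %zu, resync_position at %zu\n",
                   offsetof(struct state_as_proposed, events),
                   offsetof(struct state_as_proposed, resync_position));
            printf("rearranged: events at %zu, resync_position at %zu\n",
                   offsetof(struct state_rearranged, events),
                   offsetof(struct state_rearranged, resync_position));
            return 0;
    }
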
* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 23:47 Lars Marowsky-Bree
From: Lars Marowsky-Bree @ 2002-11-20 23:47 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, linux-raid
>The md driver in linux uses a 'superblock' written to all devices in a
>RAID to record the current state and geometry of a RAID and to allow
>the various parts to be re-assembled reliably.
>
>The current superblock layout is sub-optimal. It contains a lot of
>redundancy and wastes space. In 4K it can only fit 27 component
>devices. It has other limitations.
Yes. (In particular, getting all the various counters to agree with each other
is a pain ;-)
Steven raises the valid point that multihost operation isn't currently
possible; I just don't agree with his solution:
- Activating a drive only on one host is already entirely possible.
(can be done by device uuid in initrd for example)
- Activating a RAID device from multiple hosts is still not possible.
(Requires way more sophisticated locking support than we currently have)
However, for non-RAID devices like multipathing I believe that activating a
drive on multiple hosts should be possible; ie, for these it might not be
necessary to scribble to the superblock every time.
(The md patch for 2.4 I sent you already does that; it reconstructs the
available paths fully dynamic on startup (by activating all paths present);
however it still updates the superblock afterwards)
Anyway, on to the format:
>The code in 2.5.latest has all the superblock handling factored out so
>that defining a new format is very straight forward.
>
>I would like to propose a new layout, and to receive comment on it..
>
>My current design looks like:
> /* constant array information - 128 bytes */
> u32 md_magic
> u32 major_version == 1
> u32 feature_map /* bit map of extra features in superblock */
> u32 set_uuid[4]
> u32 ctime
> u32 level
> u32 layout
> u64 size /* size of component devices, if they are all
> * required to be the same (Raid 1/5) */
> u32 chunksize
> u32 raid_disks
> char name[32]
> u32 pad1[10];
Looks good so far.
> /* constant this-device information - 64 bytes */
> u64 address of superblock in device
> u32 number of this device in array /* constant over reconfigurations
> */
> u32 device_uuid[4]
What is "address of superblock in device" ? Seems redundant, otherwise you
would have been unable to read it, or am missing something?
Special case here might be required for multipathing. (ie, device_uuid == 0)
> u32 pad3[9]
>
> /* array state information - 64 bytes */
> u32 utime
Timestamps (also above, ctime) are always difficult. Time might not be set
correctly at any given time, in particular during early bootup. This field
should only be advisory.
> u32 state /* clean, resync-in-progress */
> u32 sb_csum
> u64 events
> u64 resync-position /* flag in state if this is valid */
> u32 number of devices
> u32 pad2[8]
>
> /* device state information, indexed by 'number of device in array'
> 4 bytes per device */
> for each device:
> u16 position /* in raid array or 0xffff for a spare. */
> u16 state flags /* error detected, in-sync */
u16 != u32; your position flags don't match up. I'd like to be able to take
the "position in the superblock" as a mapping here so it can be found in this
list, or what is the proposed relationship between the two?
>The interpretation of the 'name' field would be up to the user-space
>tools and the system administrator.
>I imagine having something like:
> host:name
>where if "host" isn't the current host name, auto-assembly is not
>tried, and if "host" is the current host name then:
Oh, well. You seem to sort of have Steven's idea here too ;-) In that case,
I'd go with the idea of Steven. Make that field a uuid of the host.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
Principal Squirrel
SuSE Labs - Research & Development, SuSE Linux AG
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur
* Re: RFC - new raid superblock layout for md driver
@ 2002-11-21 0:31 Neil Brown
From: Neil Brown @ 2002-11-21 0:31 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux-kernel, linux-raid

On Thursday November 21, lmb@suse.de wrote:
>
> However, for non-RAID devices like multipathing I believe that activating a
> drive on multiple hosts should be possible; ie, for these it might not be
> necessary to scribble to the superblock every time.
>
> (The md patch for 2.4 I sent you already does that; it reconstructs the
> available paths fully dynamic on startup (by activating all paths present);
> however it still updates the superblock afterwards)

I haven't thought much about multipath, I admit.....
My feeling is that a multipath superblock should never be updated.
Just written once at creation and left like that (raid0 and linear are
much the same).
The only loss would be the utime update, and I don't think that is a
real loss.

> > /* constant this-device information - 64 bytes */
> > u64 address of superblock in device
> > u32 number of this device in array /* constant over reconfigurations
> > */
> > u32 device_uuid[4]
>
> What is "address of superblock in device"? Seems redundant, otherwise you
> would have been unable to read it, or am I missing something?

Suppose I have a device with a partition that ends at the end of the
device (and starts at a 64k-aligned location).  Then if there is a
superblock in the whole device, it will also be in the final
partition... but which is right?
Storing the location of the superblock allows us to disambiguate.

>
> Special case here might be required for multipathing. (ie, device_uuid == 0)
>
> > u32 pad3[9]
> >
> > /* array state information - 64 bytes */
> > u32 utime
>
> Timestamps (also above, ctime) are always difficult. Time might not be set
> correctly at any given time, in particular during early bootup. This field
> should only be advisory.

Indeed, they are only advisory.

>
> > u32 state /* clean, resync-in-progress */
> > u32 sb_csum
> > u64 events
> > u64 resync-position /* flag in state if this is valid */
> > u32 number of devices
> > u32 pad2[8]
> >
> > /* device state information, indexed by 'number of device in array'
> > 4 bytes per device */
> > for each device:
> > u16 position /* in raid array or 0xffff for a spare. */
> > u16 state flags /* error detected, in-sync */
>
> u16 != u32; your position flags don't match up. I'd like to be able to take
> the "position in the superblock" as a mapping here so it can be found in this
> list, or what is the proposed relationship between the two?

u16 for device flags.  u32 (overkill) for array flags.  Is there a
problem that I am missing?

There is an array of
   struct {
      u16 position;  /* aka role. 0xffff for spare */
      u16 state;     /* error/insync */
   }
in each copy of the superblock.  It is indexed by 'number of this
device in array', which is constant for any given device despite any
configuration changes (until the device is removed from the array).

If you have two hot spares, then their 'position' (aka role) will
initially be 0xffff.  After a failure, one will be swapped in and its
position becomes (say) 3.  Once rebuild is complete, the insync flag is
set and the device becomes fully active.

Does that make it clear?

NeilBrown

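A small illustrative sketch of the disambiguation Neil describes (this
is not md code, and the end-of-device placement below simply borrows
the 0.90 convention as an assumption): a candidate superblock is
believed only if the location recorded inside it matches where it was
actually read from on this particular device.

    #include <stdint.h>

    /* Where a superblock would sit on a device of 'dev_sectors'
     * 512-byte sectors, assuming the same last-64K-aligned-64K
     * placement as the 0.90 format (128 sectors == 64K). */
    static uint64_t expected_sb_offset(uint64_t dev_sectors)
    {
            return (dev_sectors & ~127ULL) - 128;
    }

    /* sb_offset_field is the "address of superblock in device" read out
     * of the candidate superblock; found_at is where we actually read
     * it, relative to the start of *this* device (whole disk or
     * partition).  A superblock written to the whole disk but seen
     * through a partition that ends at the end of the disk fails this
     * test, because the offset relative to the partition differs from
     * the recorded one. */
    static int superblock_belongs_here(uint64_t sb_offset_field,
                                       uint64_t found_at)
    {
            return sb_offset_field == found_at;
    }
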
* Re: RFC - new raid superblock layout for md driver
@ 2002-11-21 0:35 Steven Dake
From: Steven Dake @ 2002-11-21 0:35 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: Neil Brown, linux-kernel, linux-raid

Lars Marowsky-Bree wrote:

>>The md driver in linux uses a 'superblock' written to all devices in a
>>RAID to record the current state and geometry of a RAID and to allow
>>the various parts to be re-assembled reliably.
>>
>>The current superblock layout is sub-optimal. It contains a lot of
>>redundancy and wastes space. In 4K it can only fit 27 component
>>devices. It has other limitations.
>
>Yes. (In particular, getting all the various counters to agree with each other
>is a pain ;-)
>
>Steven raises the valid point that multihost operation isn't currently
>possible; I just don't agree with his solution:
>
>- Activating a drive only on one host is already entirely possible.
>  (can be done by device uuid in initrd for example)

This technique doesn't work if autostart is set (the partition type is
tagged as a RAID volume) or if the user is stupid and starts the wrong
uuid by accident.  It also requires the user to keep track of which
uuids are used by which hosts, which is a pain.  Trust me, users will
start the wrong RAID volume and have a hard time keeping track of the
right UUIDs to assemble.

The technique I use ensures that the RAID volumes can all be set to
autostart and only the correct volumes will be started on the correct
host.

>- Activating a RAID device from multiple hosts is still not possible.
>  (Requires way more sophisticated locking support than we currently have)

The only application where having a RAID volume shareable between two
hosts is useful is for a clustering filesystem (GFS comes to mind).
Since RAID is an important need for GFS (if a disk node fails, you
don't want to lose the entire filesystem as you would on GFS) this
possibility may be worth exploring.

Since GFS isn't GPL at this point and OpenGFS needs a lot of work, I've
not spent the time looking at it.

Neil have you thought about sharing an active volume between two hosts
and what sort of support would be needed in the superblock?

Thanks
-steve

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-21 1:10 Alan Cox
From: Alan Cox @ 2002-11-21 1:10 UTC (permalink / raw)
To: Steven Dake
Cc: Lars Marowsky-Bree, Neil Brown, Linux Kernel Mailing List, linux-raid

On Thu, 2002-11-21 at 00:35, Steven Dake wrote:
> Since GFS isn't GPL at this point and OpenGFS needs a lot of work, I've
> not spent the time looking at it.

OCFS is probably the right place to be looking in terms of development
in this area right now.

* Re: RFC - new raid superblock layout for md driver
@ 2002-12-08 22:35 Neil Brown
From: Neil Brown @ 2002-12-08 22:35 UTC (permalink / raw)
To: Steven Dake; +Cc: Lars Marowsky-Bree, linux-kernel, linux-raid

(sorry for the delay in replying, I had a week off, and then a week
catching up...)

On Wednesday November 20, sdake@mvista.com wrote:
> The only application where having a RAID volume shareable between two
> hosts is useful is for a clustering filesystem (GFS comes to mind).
> Since RAID is an important need for GFS (if a disk node fails, you
> don't want to lose the entire filesystem as you would on GFS) this
> possibility may be worth exploring.
>
> Since GFS isn't GPL at this point and OpenGFS needs a lot of work, I've
> not spent the time looking at it.
>
> Neil have you thought about sharing an active volume between two hosts
> and what sort of support would be needed in the superblock?

I think that the only way shared access could work is if different
hosts controlled different slices of the device.  The hosts would have
to somehow negotiate and record who was managing which bit.

It is quite appropriate that this information be stored on the raid
array, and quite possibly in a superblock.  But I think that this is a
sufficiently major departure from how md/raid normally works that I
would want it to go in a secondary superblock.

There is 60K free at the end of each device on an MD array.  Whoever
was implementing this scheme could just have a flag in the main
superblock to say "there is a secondary superblock" and then read the
info about who owns what from somewhere in that extra 60K.

So in short, I think the metadata needed for this sort of thing is
sufficiently large and sufficiently unknown that I wouldn't make any
allowance for it in the primary superblock.

Does that sound reasonable?

NeilBrown

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-21 19:39 Joel Becker
From: Joel Becker @ 2002-11-21 19:39 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: Neil Brown, linux-kernel, linux-raid

On Thu, Nov 21, 2002 at 12:47:43AM +0100, Lars Marowsky-Bree wrote:
> However, for non-RAID devices like multipathing I believe that activating a
> drive on multiple hosts should be possible; ie, for these it might not be
> necessary to scribble to the superblock every time.

Again, if you don't use a persistent superblock and merely run mkraid
from your initscripts (or initrd), raid0 and multipath work just fine
today.

Joel

--
"We will have to repent in this generation not merely for the
 vitriolic words and actions of the bad people, but for the
 appalling silence of the good people."
        - Rev. Dr. Martin Luther King, Jr.

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* RFC - new raid superblock layout for md driver
@ 2002-11-20 4:09 Neil Brown
From: Neil Brown @ 2002-11-20 4:09 UTC (permalink / raw)
To: linux-kernel, linux-raid
The md driver in linux uses a 'superblock' written to all devices in a
RAID to record the current state and geometry of a RAID and to allow
the various parts to be re-assembled reliably.
The current superblock layout is sub-optimal. It contains a lot of
redundancy and wastes space. In 4K it can only fit 27 component
devices. It has other limitations.
I (and others) would like to define a new (version 1) format that
resolves the problems in the current (0.90.0) format.
The code in 2.5.latest has all the superblock handling factored out so
that defining a new format is very straight forward.
I would like to propose a new layout, and to receive comment on it..
My current design looks like:
/* constant array information - 128 bytes */
u32 md_magic
u32 major_version == 1
u32 feature_map /* bit map of extra features in superblock */
u32 set_uuid[4]
u32 ctime
u32 level
u32 layout
u64 size /* size of component devices, if they are all
* required to be the same (Raid 1/5) */
u32 chunksize
u32 raid_disks
char name[32]
u32 pad1[10];
/* constant this-device information - 64 bytes */
u64 address of superblock in device
u32 number of this device in array /* constant over reconfigurations */
u32 device_uuid[4]
u32 pad3[9]
/* array state information - 64 bytes */
u32 utime
u32 state /* clean, resync-in-progress */
u32 sb_csum
u64 events
u64 resync-position /* flag in state if this is valid */
u32 number of devices
u32 pad2[8]
/* device state information, indexed by 'number of device in array'
4 bytes per device */
for each device:
u16 position /* in raid array or 0xffff for a spare. */
u16 state flags /* error detected, in-sync */
This has 128 bytes for essentially constant information about the
array, 64 bytes for constant information about this device, 64 bytes
for changable state information about the array, and 4 bytes per
device for state information about the devices. This would allow an
array with 192 devices in a 1K superblock, and 960 devices in a 4k
superblock (the current size).
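
For readers who prefer to see the proposal spelled out as C, here is a
minimal sketch of the layout above.  The struct and field names are
illustrative guesses only, not a final on-disk definition; the trailing
per-device table uses a GNU C zero-length array, and the struct is
packed so that the byte counts match the on-disk image rather than any
particular ABI:

    #include <stdint.h>

    struct md_superblock_v1 {          /* name is illustrative only */
            /* constant array information - 128 bytes */
            uint32_t md_magic;
            uint32_t major_version;    /* == 1 */
            uint32_t feature_map;      /* bit map of extra features */
            uint32_t set_uuid[4];
            uint32_t ctime;
            uint32_t level;
            uint32_t layout;
            uint64_t size;             /* component device size, in sectors */
            uint32_t chunksize;
            uint32_t raid_disks;
            char     name[32];
            uint32_t pad1[10];

            /* constant this-device information - 64 bytes */
            uint64_t sb_offset;        /* address of superblock in device */
            uint32_t dev_number;       /* constant over reconfigurations */
            uint32_t device_uuid[4];
            uint32_t pad3[9];

            /* array state information - 64 bytes */
            uint32_t utime;
            uint32_t state;            /* clean, resync-in-progress */
            uint32_t sb_csum;
            uint64_t events;
            uint64_t resync_position;  /* flag in state says if this is valid */
            uint32_t nr_devices;
            uint32_t pad2[8];

            /* device state information, 4 bytes per device,
             * indexed by 'number of device in array' */
            struct {
                    uint16_t position; /* role, or 0xffff for a spare */
                    uint16_t state;    /* error detected, in-sync */
            } disk[0];                 /* GNU C zero-length array */
    } __attribute__((packed));

As written here the two u64 fields in the state block do not fall on
8-byte boundaries, which is exactly the point Steve Pratt raises
elsewhere in the thread and which Neil answers by deciding to group the
u64s together.
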
Other features:
A feature map instead of a minor version number.
64 bit component device size field.
field for storing current position of resync process if array is
shut down while resync is running.
no "minor" field but a textual "name" field instead.
address of superblock in superblock to avoid misidentifying
superblock. e.g. is it in a partition or a whole device.
uuid for each device. This is not directly used by the md driver,
but it is maintained, even if a drive is moved between arrays,
and user-space can use it for tracking devices.
md would, of course, continue to support the current layout
indefinitely, but this new layout would be available for use by people
who don't need compatibility with 2.4 and do want more than 27 devices
etc.
To create an array with the new superblock layout, the user-space
tool would write directly to the devices (like mkfs does) and then
assemble the array. Creating an array using the ioctl interface will
still create an array with the old superblock.
When the kernel loads a superblock, it would check the major_version
to see which piece of code to use to handle it.
When it writes out a superblock, it would use the same version as was
read in (of course).
This superblock would *not* support in-kernel auto-assembly as that
requires the "minor" field that I have deliberatly removed. However I
don't think this is a big cost as it looks like in-kernel
auto-assembly is about to disappear with the early-user-space patches.
The interpretation of the 'name' field would be up to the user-space
tools and the system administrator.
I imagine having something like:
host:name
where if "host" isn't the current host name, auto-assembly is not
tried, and if "host" is the current host name then:
if "name" looks like "md[0-9]*" then the array is assembled as that
device
else the array is assembled as /dev/mdN for some large, unused N,
and a symlink is created from /dev/md/name to /dev/mdN
If the "host" part is empty or non-existant, then the array would be
assembled no-matter what the hostname is. This would be important
e.g. for assembling the device that stores the root filesystem, as we
may not know the host name until after the root filesystem has been loaded.
This would make auto-assembly much more flexible.
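
As a purely illustrative reading of those rules (this is not mdadm code,
and the helper names are invented), a user-space tool might implement
the host check and the "md[0-9]*" test roughly like this, assuming the
32-byte name field has first been copied into a NUL-terminated buffer:

    #include <ctype.h>
    #include <string.h>

    /* Return 1 if auto-assembly should be attempted on 'hostname':
     * either there is no "host:" prefix, the prefix is empty, or it
     * matches the current host name exactly. */
    static int should_autoassemble(const char *name, const char *hostname)
    {
            const char *colon = strchr(name, ':');
            size_t hostlen;

            if (colon == NULL || colon == name)
                    return 1;
            hostlen = (size_t)(colon - name);
            return strncmp(name, hostname, hostlen) == 0 &&
                   hostname[hostlen] == '\0';
    }

    /* Return 1 if the part after any "host:" prefix looks like
     * "md[0-9]*", i.e. the administrator asked for a specific /dev/mdN. */
    static int names_md_device(const char *name)
    {
            const char *base = strchr(name, ':');

            base = base ? base + 1 : name;
            if (strncmp(base, "md", 2) != 0)
                    return 0;
            for (base += 2; *base != '\0'; base++)
                    if (!isdigit((unsigned char)*base))
                            return 0;
            return 1;
    }

If names_md_device() succeeds, the tool would assemble under that minor
directly; otherwise it would pick a large unused N and leave a
/dev/md/name symlink pointing at /dev/mdN, as described above.
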
Comments welcome.
NeilBrown
* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 10:03 Anton Altaparmakov
From: Anton Altaparmakov @ 2002-11-20 10:03 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, linux-raid

Hi,

On Wed, 20 Nov 2002, Neil Brown wrote:
> I (and others) would like to define a new (version 1) format that
> resolves the problems in the current (0.90.0) format.
>
> The code in 2.5.latest has all the superblock handling factored out so
> that defining a new format is very straight forward.
>
> I would like to propose a new layout, and to receive comment on it..

If you are making a new layout anyway, I would suggest to actually add
the complete information about each disk which is in the array into the
raid superblock of each disk in the array.  In that way if a disk blows
up, you can just replace the disk, use some to-be-written (?) utility
to write the correct superblock to the new disk, and add it to the
array, which then reconstructs the disk.  Preferably all of this
happens without ever rebooting given a hotplug ide/scsi controller. (-;

From a quick read of the layout it doesn't seem to be possible to do
the above trivially (or certainly not without help of /etc/raidtab),
but perhaps I missed something...

Also, autoassembly would be greatly helped if the superblock contained
the uuid for each of the devices contained in the array.  It is then
trivial to unplug all raid devices and move them around on the
controller and it would still just work.  Again I may be missing
something and that is already possible to do trivially.

Best regards,
Anton
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

* Re: RFC - new raid superblock layout for md driver
@ 2002-11-20 23:02 Neil Brown
From: Neil Brown @ 2002-11-20 23:02 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: linux-kernel, linux-raid

On Wednesday November 20, aia21@cantab.net wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.latest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add
> the complete information about each disk which is in the array into the
> raid superblock of each disk in the array.  In that way if a disk blows
> up, you can just replace the disk, use some to-be-written (?) utility
> to write the correct superblock to the new disk, and add it to the
> array, which then reconstructs the disk.  Preferably all of this
> happens without ever rebooting given a hotplug ide/scsi controller. (-;

What sort of 'complete information about each disk' are you thinking
of?  Hot-spares already work.  Auto-detecting a new drive that has just
been physically plugged in and adding it to a raid array is an issue
that requires configuration well beyond the scope of the superblock, I
believe.  But if you could be more concrete, I might be convinced.

> From a quick read of the layout it doesn't seem to be possible to do
> the above trivially (or certainly not without help of /etc/raidtab),
> but perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained
> the uuid for each of the devices contained in the array.  It is then
> trivial to unplug all raid devices and move them around on the
> controller and it would still just work.  Again I may be missing
> something and that is already possible to do trivially.

Well... it depends on whether you want a 'name' or an 'address' in the
superblock.  A 'name' is something you can use to recognise the device
when you see it; an 'address' is some way to go and find the device if
you don't have it.

Each superblock already has the 'name' of every other device
implicitly, as a device's 'name' is the set_uuid plus a device number.

I think storing addresses in the superblock is a bad idea as they are
in general not stable, and if you did try to store some sort of stable
address, you would need to allocate quite a lot of space which I don't
think is justified.

Just storing a name is enough for auto-assembly providing you can
enumerate all devices.  I think at this stage we have to assume that
userspace can enumerate all devices and so can find the device for each
name, i.e. find all devices with the correct set_uuid.

Does that make sense?

Thank you for your feedback.

NeilBrown

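Purely as an illustration of that last point (not mdadm code; the
structures are invented), 'finding all devices with the correct
set_uuid' is just a scan over superblocks that user-space has already
read from every candidate device:

    #include <stdint.h>
    #include <string.h>

    struct candidate {
            const char *devname;      /* e.g. "/dev/sdb1" */
            uint32_t    set_uuid[4];  /* copied out of the on-disk superblock */
            uint32_t    dev_number;   /* 'number of this device in array' */
    };

    /* Count the members of the array identified by 'wanted'; a real
     * tool would record each devname and assemble the devices in
     * dev_number order. */
    static int count_members(const struct candidate *c, int ncand,
                             const uint32_t wanted[4])
    {
            int found = 0;

            for (int i = 0; i < ncand; i++) {
                    if (memcmp(c[i].set_uuid, wanted,
                               sizeof c[i].set_uuid) == 0) {
                            /* c[i].devname claims slot c[i].dev_number */
                            found++;
                    }
            }
            return found;
    }
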
* Re: RFC - new raid superblock layout for md driver
@ 2002-11-22 0:08 Kenneth D. Merry
From: Kenneth D. Merry @ 2002-11-22 0:08 UTC (permalink / raw)
To: Anton Altaparmakov; +Cc: Neil Brown, linux-kernel, linux-raid

On Wed, Nov 20, 2002 at 10:03:26 +0000, Anton Altaparmakov wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.latest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add
> the complete information about each disk which is in the array into the
> raid superblock of each disk in the array.  In that way if a disk blows
> up, you can just replace the disk, use some to-be-written (?) utility
> to write the correct superblock to the new disk, and add it to the
> array, which then reconstructs the disk.  Preferably all of this
> happens without ever rebooting given a hotplug ide/scsi controller. (-;
>
> From a quick read of the layout it doesn't seem to be possible to do
> the above trivially (or certainly not without help of /etc/raidtab),
> but perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained
> the uuid for each of the devices contained in the array.  It is then
> trivial to unplug all raid devices and move them around on the
> controller and it would still just work.  Again I may be missing
> something and that is already possible to do trivially.

This is a good idea.  Having all of the devices listed in the metadata
on each disk is very helpful.  (See below for why.)

Here are some of my ideas about the features you'll want out of a new
type of metadata:

[ these you've already got ]

- each array has a unique identifier (you've got this already)

- each disk/partition/component has a unique identifier (you've got
  this already)

- a monotonically increasing serial number that gets incremented every
  time you write out the metadata (you've got this, the 'events' field)

[ these are features I think would be good to have ]

- Per-array state that lets you know whether you're doing a resync,
  reconstruction, verify, verify and fix, and so on.  This is part of
  the state you'll need to do checkpointing -- picking up where you
  left off after a reboot during the middle of an operation.

- Per-array block number that tells you how far along you are in a
  verify, resync, reconstruction, etc.  If you reboot, you can, for
  example, pick a verify back up where you left off.

- Enough per-disk state so you can determine, if you're doing a resync
  or reconstruction, which disk is the target of the operation.  When I
  was doing a lot of work on md a while back, one of the things I ran
  into is that when you do a resync of a RAID-1, it always resyncs from
  the first to the second disk, even if the first disk is the one out
  of sync.  (I changed this, with Adaptec metadata at least, so it
  would resync onto the correct disk.)

- Each component knows about every other component in the array.  (It
  knows by UUID, not just that there are N other devices in the array.)

  This is an important piece of information:

  - You can compose the array now, using each disk's set_uuid and the
    position of the device in the array, and by using the events field
    to filter out the older of two disks that claim the same position.

  The problem comes in more complicated scenarios.  For example:

  - user pulls one disk out of a RAID-1 with a spare
  - md reconstructs onto the spare
  - user shuts down machine, pulls the (former) spare that is now part
    of the machine, and replaces the disk that he originally pulled.

  So now you've got a scenario where you have a disk that claims to be
  part of the array (same set_uuid), but its events field is a little
  behind.  You could just resync the disk since it is out of date, but
  still claims to be part of the array.  But you'd be back in the same
  position if the user pulls the disk again and puts the former spare
  back in -- you'd have to resync again.

  If each disk had a list of the uuids of every disk in the array, you
  could tell from the disk table on the "freshest" disk that the disk
  the user stuck back in isn't part of the array, despite the fact that
  it claims to be.  (It was at one point, and then was removed.)  You
  can then make the user add it back explicitly, instead of just
  resyncing onto it.

- Possibly the ability to setup multilevel arrays within a given piece
  of metadata.  As far as multilevel arrays go, there are two basic
  approaches to the metadata:

  - Integrated metadata defines all levels of the array in a single
    chunk of metadata.  So, for example, by reading metadata off of
    sdb, you can figure out that it is a component of a RAID-1 array,
    and that that RAID-1 array is a component of a RAID-10.

    There are a couple of advantages to integrated metadata:

    - You can keep state that applies to the whole array (clean/dirty,
      for example) in one place.

    - It helps in autoconfiguring an array, since you don't have to go
      through multiple steps to find out all the levels of an array.
      You just read the metadata from one place on one disk, and
      you've got everything.

    There are a couple of disadvantages to integrated metadata:

    - Possibly reduced/limited space for defining multiple array levels
      or arrays with lots of disks.  This is not a problem, though,
      given sufficient metadata space.

    - Marginally more difficulty handling metadata updates, depending
      on how you handle your multilevel arrays.  If you handle them
      like md currently does (separate block devices for each level and
      component of the array), it'll be pretty difficult to use
      integrated metadata.

  - Recursive metadata defines each level of the array separately.  So,
    for example, you'd read the metadata from the end of a disk and
    determine it is part of a RAID-1 array.  Then, you configure the
    RAID-1 array, and read the metadata from the end of that array, and
    determine it is part of a RAID-0 array.  So then you configure the
    RAID-0 array, look at the end, fail to find metadata, and figure
    out that you've reached the top level of the array.  This is almost
    how md currently does things, except that it really has no
    mechanism for autoconfiguring multilevel arrays.

    There are a couple of advantages to recursive metadata:

    - It is easier to handle metadata updates for multilevel arrays,
      especially if the various levels of the array are handled by
      different block devices, as md does.

    - You've potentially got more space for defining disks as part of
      the array, since you're only defining one level at a time.

    There are a couple of disadvantages to recursive metadata:

    - You have to have multiple copies of any state that applies to the
      whole array (e.g. clean/dirty).

    - More windows of opportunity for incomplete metadata writes.
      Since metadata is in multiple places, there are more
      opportunities for scenarios where you'll have metadata for one
      part of the array written out, but not another part, before you
      crash or a disk crashes... etc.

I know Neil has philosophical issues with autoconfiguration (or perhaps
in-kernel autoconfiguration), but it really is helpful, especially in
certain situations.

As for recursive versus integrated metadata, it would be nice if md
could handle autoconfiguration with either type of multilevel array.
The reason I say this is that Adaptec HostRAID adapters use integrated
metadata.  So if you want to support multilevel arrays with md on
HostRAID adapters, you have to have support for multilevel arrays with
integrated metadata.

When I did the first port of md to work on HostRAID, I pretty much had
to skip doing RAID-10 support because it wasn't structurally feasible
to autodetect and configure a multilevel array.  (I ended up doing a
full rewrite of md that I was partially done with when I got laid off
from Adaptec.)

Anyway, if you want to see the Adaptec HostRAID support, which includes
metadata definitions:

http://people.freebsd.org/~ken/linux/md.html

The patches are against 2.4.18, but you should be able to get an idea
of what I'm talking about as far as integrated metadata goes.

This is all IMO, maybe it'll be helpful, maybe not, but hopefully it'll
be useful to consider these ideas.

Ken
--
Kenneth Merry
ken@kdm.org

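To make the 'events' mechanism concrete, here is a toy comparison of
the kind both sides are assuming (illustrative only; the struct is a
stand-in for whatever in-core form the superblock data takes).  When
two devices carry the same set_uuid and claim the same slot, the copy
written most recently wins:

    #include <stdint.h>

    struct member_info {
            uint32_t dev_number;  /* slot this device claims in the array */
            uint64_t events;      /* bumped on every superblock update */
    };

    /* Of two devices with the same set_uuid claiming the same slot,
     * keep the more recently written one; the loser is the stale disk
     * that was pulled earlier. */
    static const struct member_info *
    pick_freshest(const struct member_info *a, const struct member_info *b)
    {
            return (a->events >= b->events) ? a : b;
    }

On its own this cannot distinguish 'stale but still a member' from
'stale and since removed', which is the gap Kenneth describes above;
Neil's replies below argue that the event counter plus the
failed-device flags recorded in the freshest superblocks already cover
it.
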
* Re: RFC - new raid superblock layout for md driver
@ 2002-12-09 3:52 Neil Brown
From: Neil Brown @ 2002-12-09 3:52 UTC (permalink / raw)
To: Kenneth D. Merry; +Cc: Anton Altaparmakov, linux-kernel, linux-raid

On Thursday November 21, ken@kdm.org wrote:
>
> This is a good idea.  Having all of the devices listed in the metadata
> on each disk is very helpful.  (See below for why.)
>
> Here are some of my ideas about the features you'll want out of a new
> type of metadata:
...
>
> [ these are features I think would be good to have ]
>
> - Per-array state that lets you know whether you're doing a resync,
>   reconstruction, verify, verify and fix, and so on.  This is part of
>   the state you'll need to do checkpointing -- picking up where you
>   left off after a reboot during the middle of an operation.

Yes, a couple of flags in the 'state' field could do this.

> - Per-array block number that tells you how far along you are in a
>   verify, resync, reconstruction, etc.  If you reboot, you can, for
>   example, pick a verify back up where you left off.

Got that, called "resync-position" (though I guess I have to change the
hyphen...).

> - Enough per-disk state so you can determine, if you're doing a resync
>   or reconstruction, which disk is the target of the operation.  When I
>   was doing a lot of work on md a while back, one of the things I ran
>   into is that when you do a resync of a RAID-1, it always resyncs from
>   the first to the second disk, even if the first disk is the one out
>   of sync.  (I changed this, with Adaptec metadata at least, so it
>   would resync onto the correct disk.)

When a raid1 array is out of sync, it doesn't mean anything to say
which disc is out of sync.  They all are, with each other...
Nonetheless, the per-device state flags have an 'in-sync' bit which can
be set or cleared as appropriate.

> - Each component knows about every other component in the array.  (It
>   knows by UUID, not just that there are N other devices in the array.)
>   This is an important piece of information:
>
>   - You can compose the array now, using each disk's set_uuid and the
>     position of the device in the array, and by using the events field
>     to filter out the older of two disks that claim the same position.
>
>   The problem comes in more complicated scenarios.  For example:
>
>   - user pulls one disk out of a RAID-1 with a spare
>   - md reconstructs onto the spare
>   - user shuts down machine, pulls the (former) spare that is now part
>     of the machine, and replaces the disk that he originally pulled.
>
>   So now you've got a scenario where you have a disk that claims to be
>   part of the array (same set_uuid), but its events field is a little
>   behind.  You could just resync the disk since it is out of date, but
>   still claims to be part of the array.  But you'd be back in the same
>   position if the user pulls the disk again and puts the former spare
>   back in -- you'd have to resync again.
>
>   If each disk had a list of the uuids of every disk in the array, you
>   could tell from the disk table on the "freshest" disk that the disk
>   the user stuck back in isn't part of the array, despite the fact that
>   it claims to be.  (It was at one point, and then was removed.)  You
>   can then make the user add it back explicitly, instead of just
>   resyncing onto it.

The event counter is enough to determine if a device is really part of
the current array or not, and I cannot see why you need more than that.
As far as I can tell, everything that you have said can be achieved
with set_uuid/devnumber/event.

> - Possibly the ability to setup multilevel arrays within a given piece
>   of metadata.  As far as multilevel arrays go, there are two basic
>   approaches to the metadata:

How many actual uses of multi-level arrays are there??

The most common one is raid0 over raid1, and I think there is a strong
case for implementing a 'raid10' module that does that, but also allows
a raid10 of an odd number of drives and things like that.

I don't think anything else is sufficiently common to really deserve
special treatment: recursive metadata is adequate I think.

Concerning the auto-assembly of multi-level arrays, that is not
particularly difficult, it just needs to be described precisely, and
coded.
It is a user-space thing and easily solved at that level.

> I know Neil has philosophical issues with autoconfiguration (or perhaps
> in-kernel autoconfiguration), but it really is helpful, especially in
> certain situations.

I have issues with autoconfiguration that is not adequately
configurable, and current linux in-kernel autoconfiguration is not
adequately configurable.  With mdadm, autoconfiguration is (very
nearly) adequately configurable and is fine.  There is still room for
some improvement, but not much.

Thanks for your input,
NeilBrown

* Re: RFC - new raid superblock layout for md driver
@ 2002-12-10 6:28 Kenneth D. Merry
From: Kenneth D. Merry @ 2002-12-10 6:28 UTC (permalink / raw)
To: Neil Brown; +Cc: Anton Altaparmakov, linux-kernel, linux-raid

On Mon, Dec 09, 2002 at 14:52:11 +1100, Neil Brown wrote:
> > - Enough per-disk state so you can determine, if you're doing a resync
> >   or reconstruction, which disk is the target of the operation.  When I
> >   was doing a lot of work on md a while back, one of the things I ran
> >   into is that when you do a resync of a RAID-1, it always resyncs from
> >   the first to the second disk, even if the first disk is the one out
> >   of sync.  (I changed this, with Adaptec metadata at least, so it
> >   would resync onto the correct disk.)
>
> When a raid1 array is out of sync, it doesn't mean anything to say
> which disc is out of sync.  They all are, with each other...
> Nonetheless, the per-device state flags have an 'in-sync' bit which can
> be set or cleared as appropriate.

This sort of information (if it is used) would be very useful for
dealing with Adaptec metadata.  Adaptec HostRAID adapters let you build
a RAID-1 by copying one disk onto the other, with the state set to
indicate the source and target disks.

Since the BIOS on those adapters takes a long time to do a copy, it's
easier to break out of the build after it gets started, and let the
kernel pick back up where the BIOS left off.  To do that, you need
checkpointing support (i.e. be able to figure out where we left off
with a particular operation) and you need to be able to determine which
disk is the source and which is the target.

To do this with the first set of Adaptec metadata patches I wrote for
md, I had to kinda "bolt on" some extra state in the kernel, so I could
figure out which disk was the target, since md doesn't really pay
attention to the current per-disk in-sync flags.  I solved this the
second time around by making the in-core metadata generic (and thus a
superset of all the metadata types I planned on supporting), and each
metadata personality could supply target disk information if possible.

> > - Each component knows about every other component in the array.  (It
> >   knows by UUID, not just that there are N other devices in the array.)
> >   This is an important piece of information:
> >
> >   - You can compose the array now, using each disk's set_uuid and the
> >     position of the device in the array, and by using the events field
> >     to filter out the older of two disks that claim the same position.
> >
> >   The problem comes in more complicated scenarios.  For example:
> >
> >   - user pulls one disk out of a RAID-1 with a spare
> >   - md reconstructs onto the spare
> >   - user shuts down machine, pulls the (former) spare that is now part
> >     of the machine, and replaces the disk that he originally pulled.
> >
> >   So now you've got a scenario where you have a disk that claims to be
> >   part of the array (same set_uuid), but its events field is a little
> >   behind.  You could just resync the disk since it is out of date, but
> >   still claims to be part of the array.  But you'd be back in the same
> >   position if the user pulls the disk again and puts the former spare
> >   back in -- you'd have to resync again.
> >
> >   If each disk had a list of the uuids of every disk in the array, you
> >   could tell from the disk table on the "freshest" disk that the disk
> >   the user stuck back in isn't part of the array, despite the fact that
> >   it claims to be.  (It was at one point, and then was removed.)  You
> >   can then make the user add it back explicitly, instead of just
> >   resyncing onto it.
>
> The event counter is enough to determine if a device is really part of
> the current array or not, and I cannot see why you need more than that.
> As far as I can tell, everything that you have said can be achieved
> with set_uuid/devnumber/event.

It'll work with just the set_uuid/devnumber/event, but as I mentioned
in the last paragraph above, you'll end up resyncing onto the disk that
is pulled and then reinserted, because you don't really have any way of
knowing it is no longer a part of the array.  All you know is that it
is out of date.

> - Possibly the ability to setup multilevel arrays within a given piece
>   of metadata.  As far as multilevel arrays go, there are two basic
>   approaches to the metadata:
>
> How many actual uses of multi-level arrays are there??
>
> The most common one is raid0 over raid1, and I think there is a strong
> case for implementing a 'raid10' module that does that, but also
> allows a raid10 of an odd number of drives and things like that.

RAID-10 is the most common, but RAID-50 is found in the "wild" as well.

It would be more flexible if you could stack personalities on top of
each other.  This would give people the option of combining whatever
personalities they want (within reason; the multipath personality
doesn't make a whole lot of sense to stack).

> I don't think anything else is sufficiently common to really deserve
> special treatment: recursive metadata is adequate I think.

Recursive metadata is fine, but I would encourage you to think about
how you would (structurally) support multilevel arrays that use
integrated metadata.  (e.g. like RAID-10 on an Adaptec HostRAID board)

> Concerning the auto-assembly of multi-level arrays, that is not
> particularly difficult, it just needs to be described precisely, and
> coded.
> It is a user-space thing and easily solved at that level.

How does it work if you're trying to boot off the array?  The kernel
needs to know how to auto-assemble the array in order to run init and
everything else that makes a userland program run.

> > I know Neil has philosophical issues with autoconfiguration (or perhaps
> > in-kernel autoconfiguration), but it really is helpful, especially in
> > certain situations.
>
> I have issues with autoconfiguration that is not adequately
> configurable, and current linux in-kernel autoconfiguration is not
> adequately configurable.  With mdadm, autoconfiguration is (very
> nearly) adequately configurable and is fine.  There is still room for
> some improvement, but not much.

I agree that userland configuration is very flexible, but I think there
is a place for kernel-level autoconfiguration as well.  With something
like an Adaptec HostRAID board (i.e. something you can boot from), you
need kernel-level autoconfiguration in order for it to work smoothly.

Ken
--
Kenneth Merry
ken@kdm.org

* Re: RFC - new raid superblock layout for md driver
@ 2002-12-11 0:07 Neil Brown
From: Neil Brown @ 2002-12-11 0:07 UTC (permalink / raw)
To: Kenneth D. Merry; +Cc: Anton Altaparmakov, linux-kernel, linux-raid

On Monday December 9, ken@kdm.org wrote:
> On Mon, Dec 09, 2002 at 14:52:11 +1100, Neil Brown wrote:
> > > - Enough per-disk state so you can determine, if you're doing a resync
> > >   or reconstruction, which disk is the target of the operation.  When I
> > >   was doing a lot of work on md a while back, one of the things I ran
> > >   into is that when you do a resync of a RAID-1, it always resyncs from
> > >   the first to the second disk, even if the first disk is the one out
> > >   of sync.  (I changed this, with Adaptec metadata at least, so it
> > >   would resync onto the correct disk.)
> >
> > When a raid1 array is out of sync, it doesn't mean anything to say
> > which disc is out of sync.  They all are, with each other...
> > Nonetheless, the per-device state flags have an 'in-sync' bit which can
> > be set or cleared as appropriate.
>
> This sort of information (if it is used) would be very useful for
> dealing with Adaptec metadata.  Adaptec HostRAID adapters let you build
> a RAID-1 by copying one disk onto the other, with the state set to
> indicate the source and target disks.
>
> Since the BIOS on those adapters takes a long time to do a copy, it's
> easier to break out of the build after it gets started, and let the
> kernel pick back up where the BIOS left off.  To do that, you need
> checkpointing support (i.e. be able to figure out where we left off
> with a particular operation) and you need to be able to determine which
> disk is the source and which is the target.
>
> To do this with the first set of Adaptec metadata patches I wrote for
> md, I had to kinda "bolt on" some extra state in the kernel, so I could
> figure out which disk was the target, since md doesn't really pay
> attention to the current per-disk in-sync flags.

The way to solve this that would be most in keeping with the raid code
in 2.4 would be for the drives that were not yet in-sync to appear as
'spare' drives.  On array assembly, the first spare would get rebuilt
by md and then fully incorporated into the array.  I agree that this is
not a very good conceptual fit.

The 2.5 code is a lot tidier with respect to this.  Each device has an
'in-sync' flag, so when an array has a missing drive, a spare is added
and marked not-in-sync.  When recovery finishes, the drive that was
spare has the in-sync flag set.
2.5 code has an insync flag too, but it is not used sensibly.

Note that this relates to a drive being out-of-sync (as in a
reconstruction or recovery operation).  It is quite different to the
array being out-of-sync, which requires a resync operation.

> > >   If each disk had a list of the uuids of every disk in the array, you
> > >   could tell from the disk table on the "freshest" disk that the disk
> > >   the user stuck back in isn't part of the array, despite the fact
> > >   that it claims to be.  (It was at one point, and then was removed.)
> > >   You can then make the user add it back explicitly, instead of just
> > >   resyncing onto it.
> >
> > The event counter is enough to determine if a device is really part of
> > the current array or not, and I cannot see why you need more than
> > that.
> > As far as I can tell, everything that you have said can be achieved
> > with set_uuid/devnumber/event.
>
> It'll work with just the set_uuid/devnumber/event, but as I mentioned
> in the last paragraph above, you'll end up resyncing onto the disk that
> is pulled and then reinserted, because you don't really have any way of
> knowing it is no longer a part of the array.  All you know is that it
> is out of date.

If you pull drive N, then it will appear to fail and all other drives
will be marked to say that 'drive N is faulty'.
If you plug drive N back in, the md code simply won't notice.  If you
tell it to 'hot-add' the drive, it will rebuild onto it, but that is
what you asked it to do.

If you shut down and restart, the auto-detection may well find drive N,
but even if its event number is sufficiently recent (which would
require an unclean shutdown of the array), the fact that the most
recent superblocks will say that drive N is failed will mean that it
doesn't get incorporated into the array.  You still have to explicitly
hot-add it before it will resync.

I still don't see the problem, sorry.

> > - Possibly the ability to setup multilevel arrays within a given piece
> >   of metadata.  As far as multilevel arrays go, there are two basic
> >   approaches to the metadata:
> >
> > How many actual uses of multi-level arrays are there??
> >
> > The most common one is raid0 over raid1, and I think there is a strong
> > case for implementing a 'raid10' module that does that, but also
> > allows a raid10 of an odd number of drives and things like that.
>
> RAID-10 is the most common, but RAID-50 is found in the "wild" as well.
>
> It would be more flexible if you could stack personalities on top of
> each other.  This would give people the option of combining whatever
> personalities they want (within reason; the multipath personality
> doesn't make a whole lot of sense to stack).
>
> > I don't think anything else is sufficiently common to really deserve
> > special treatment: recursive metadata is adequate I think.
>
> Recursive metadata is fine, but I would encourage you to think about
> how you would (structurally) support multilevel arrays that use
> integrated metadata.  (e.g. like RAID-10 on an Adaptec HostRAID board)

How about this:

Option 1:
  Assertion: the only sensible raid stacks involve two levels: a level
  that provides redundancy (raid1/raid5) on the bottom, and a level
  that combines capacity on the top (raid0/linear).

  Observation: in-kernel knowledge of the superblock is only needed for
  levels that provide redundancy (raid1/raid5) and so need to update
  the superblock after errors, etc.  raid0/linear can be managed fine
  without any in-kernel knowledge of superblocks.

  Approach: teach the kernel to read your Adaptec raid10 superblock and
  present it to md as N separate raid1 arrays.  Have a user-space tool
  that assembles the array as follows:
    1/ read the superblocks
    2/ build the raid1 arrays
    3/ build the raid0 on top using non-persistent superblocks.

  There may need to be small changes to the md code to make this work
  properly, but I feel that it is a good approach.

Option 2:
  Possibly you disagree with the above assertion.  Possibly you think
  that a raid5 built from a number of raid1's is a good idea.  And
  maybe you are right.

  Approach: add an ioctl, or possibly a 'magic' address, so that it is
  possible to read a raw superblock from an md array.  Define two
  in-kernel superblock reading methods.  One reads the superblock and
  presents it as the bottom level only.  The other reads the raw
  superblock out of the underlying device, using the ioctl or magic
  address (e.g.
read from MAX_SECTOR-8) and presents it as the next level of the raid stack. I think this approach, possibly with some refinement, would be adequate to support any sort of stacking and any sort of raid superblock, and it would be my preferred way to go, if this were necessary. > > > Concerning the auto-assembly of multi-level arrays, that is not > > particularly difficult, it just needs to be described precisely, and > > coded. > > It is a user-space thing and easily solved at that level. > > How does it work if you're trying to boot off the array? The kernel needs > to know how to auto-assemble the array in order to run init and everything > else that makes a userland program run. initramdisk or initramfs or whatever is appropriate for the kernel you are using. Also, remember to keep the concepts of 'boot' and 'root' distinct. To boot off an array, your BIOS needs to know about the array. There are no two ways about that. It doesn't need to know a lot about the array, and for raid1 all it needs to know is 'try this device, and if it fails, try that device'. To have root on an array, you need to be able to assemble the array before root is mounted. md= kernel parameters are one option, but not a very extensible one. initramfs will be the preferred approach in 2.6. i.e. an initial root is loaded along with the kernel, and it has the user-space tools for finding, assembling and mounting the root device. > > > > > > > I know Neil has philosophical issues with autoconfiguration (or perhaps > > > in-kernel autoconfiguration), but it really is helpful, especially in > > > certain situations. > > > > I have issues with autoconfiguration that is not adequately > > configurable, and current linux in-kernel autoconfiguration is not > > adequately configurable. With mdadm autoconfiguration is (very > > nearly) adequately configurable and is fine. There is still room for > > some improvement, but not much. > > I agree that userland configuration is very flexible, but I think there is > a place for kernel-level autoconfiguration as well. With something like an > Adaptec HostRAID board (i.e. something you can boot from), you need kernel > level autoconfiguration in order for it to work smoothly. I disagree, and the development directions of 2.5 tend to support me. You certainly need something before root is mounted, but 2.5 is leading us to 'early-user-space configuration' rather than 'in-kernel configuration'. NeilBrown ^ permalink raw reply [flat|nested] 47+ messages in thread
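To make "read the raw superblock out of the underlying device" a little more concrete, here is a rough user-space sketch for the existing 0.90 format, which (as I understand the 0.90 convention) keeps its superblock in the last 64K-aligned 64K of the device. Error handling is trimmed, the constants are the conventional 0.90 values, and nothing here is part of the proposed v1 layout.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only: locate and check a 0.90 md superblock on a component device. */
#define MD_RESERVED_SECTORS 128U          /* 64K / 512 */
#define MD_SB_MAGIC         0xa92b4efcU

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    uint64_t sectors = (uint64_t)lseek(fd, 0, SEEK_END) / 512;
    uint64_t sb_sector = (sectors & ~(uint64_t)(MD_RESERVED_SECTORS - 1))
                         - MD_RESERVED_SECTORS;
    uint32_t sb[1024];                    /* first 4K of the superblock */

    if (pread(fd, sb, sizeof(sb), (off_t)(sb_sector * 512)) == (ssize_t)sizeof(sb)
        && sb[0] == MD_SB_MAGIC)
        printf("0.90 superblock found at sector %llu\n",
               (unsigned long long)sb_sector);
    else
        printf("no 0.90 superblock here\n");
    close(fd);
    return 0;
}

The same mechanics are what make the proposed "address of superblock in device" field valuable: a superblock that turns up at an offset other than the one it records (for example, seen through a partition rather than the whole device) can be rejected instead of misidentified.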
* Re: RFC - new raid superblock layout for md driver 2002-11-20 4:09 Neil Brown 2002-11-20 10:03 ` Anton Altaparmakov @ 2002-11-20 13:58 ` Bill Rugolsky Jr. 2002-11-20 23:17 ` Neil Brown 2002-11-20 14:09 ` Alan Cox ` (3 subsequent siblings) 5 siblings, 1 reply; 47+ messages in thread From: Bill Rugolsky Jr. @ 2002-11-20 13:58 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, linux-raid On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote: > u32 feature_map /* bit map of extra features in superblock */ Perhaps compat/incompat feature flags, like ext[23]? Also, journal information, such as a journal UUID? Regards, Bill Rugolsky ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 13:58 ` Bill Rugolsky Jr. @ 2002-11-20 23:17 ` Neil Brown 0 siblings, 0 replies; 47+ messages in thread From: Neil Brown @ 2002-11-20 23:17 UTC (permalink / raw) To: Bill Rugolsky Jr.; +Cc: linux-kernel, linux-raid On Wednesday November 20, brugolsky@telemetry-investments.com wrote: > On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote: > > u32 feature_map /* bit map of extra features in superblock */ > > Perhaps compat/incompat feature flags, like ext[23]? I thought about that, but am not sure that it makes sense as there is much less metadata in a raid array than there is in a filesystem. I think I am happier to have initial code require feature_map == 0 or it doesn't get loaded, and if it becomes an issue, get user-space to clear any 'compatible' flags before passing the device to an 'old' kernel. > > Also, journal information, such as a journal UUID? As there is no current code, or serious project that I know of, to add journalling to md (I have thought about it, but it isn't a priority) I wouldn't like to pre-empt it at all by defining fields now. I would rather that presence-of-a-journal be indicated by a bit in the feature map, and that would imply a uuid was stored in one of the current 'pad' fields. I think there is plenty of space. Thanks, NeilBrown > > Regards, > > Bill Rugolsky ^ permalink raw reply [flat|nested] 47+ messages in thread
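The two policies being weighed here can be sketched in a few lines of C. The flag names and masks below are invented purely for illustration; neither form is part of the proposed superblock.

#include <stdint.h>
#include <stdio.h>

/* Neil's initial rule: refuse to assemble unless the map is empty. */
static int accept_strict(uint32_t feature_map)
{
    return feature_map == 0;
}

/* The ext2/3-style split Bill suggests: two maps, where unknown
 * 'compat' bits are harmless but unknown 'incompat' bits are fatal. */
#define KNOWN_INCOMPAT_BITS 0x00000001u   /* hypothetical */

static int accept_split(uint32_t compat_map, uint32_t incompat_map)
{
    (void)compat_map;                     /* never a reason to refuse */
    return (incompat_map & ~KNOWN_INCOMPAT_BITS) == 0;
}

int main(void)
{
    printf("strict, empty map: %d\n", accept_strict(0));
    printf("strict, any bit set: %d\n", accept_strict(0x2));
    printf("split, unknown compat bit only: %d\n", accept_split(0x2, 0));
    printf("split, unknown incompat bit: %d\n", accept_split(0, 0x2));
    return 0;
}

Neil's suggestion of having user-space clear 'compatible' flags before handing a device to an old kernel amounts to applying the split policy offline, so the kernel itself only ever needs the strict check.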
* Re: RFC - new raid superblock layout for md driver 2002-11-20 4:09 Neil Brown 2002-11-20 10:03 ` Anton Altaparmakov 2002-11-20 13:58 ` Bill Rugolsky Jr. @ 2002-11-20 14:09 ` Alan Cox 2002-11-20 23:11 ` Neil Brown 2002-11-20 16:03 ` Joel Becker ` (2 subsequent siblings) 5 siblings, 1 reply; 47+ messages in thread From: Alan Cox @ 2002-11-20 14:09 UTC (permalink / raw) To: Neil Brown; +Cc: Linux Kernel Mailing List, linux-raid On Wed, 2002-11-20 at 04:09, Neil Brown wrote: > u32 set_uuid[4] Wouldn't u8 for the uuid avoid a lot of endian mess > u32 ctime Use some padding so you can go to 64bit times ^ permalink raw reply [flat|nested] 47+ messages in thread
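Alan's point about a byte array can be seen in a small sketch: a uuid kept as sixteen bytes compares identically on every host, while u32 words stored in a fixed on-disk byte order need decoding before they can be compared with, or printed as, host values. The helpers below are illustrative stand-ins, not kernel code.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* u8[16]: byte-for-byte identical on disk and in memory everywhere. */
static int uuid_equal(const uint8_t a[16], const uint8_t b[16])
{
    return memcmp(a, b, 16) == 0;          /* no byte swapping anywhere */
}

/* u32 stored little-endian on disk: decode before using it as a number.
 * (A portable, illustrative stand-in for the kernel's le32_to_cpu().) */
static uint32_t le32_decode(const void *buf)
{
    const uint8_t *p = buf;
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(void)
{
    uint8_t id[16] = { 0xde, 0xad, 0xbe, 0xef };        /* rest zero */
    uint8_t raw_word[4] = { 0x2a, 0x00, 0x00, 0x00 };   /* 42, little-endian */

    printf("uuid matches itself: %d\n", uuid_equal(id, id));
    printf("decoded on-disk word: %u\n", (unsigned)le32_decode(raw_word));
    return 0;
}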
* Re: RFC - new raid superblock layout for md driver 2002-11-20 14:09 ` Alan Cox @ 2002-11-20 23:11 ` Neil Brown 2002-11-21 0:30 ` Alan Cox 2002-11-21 0:30 ` Alan Cox 0 siblings, 2 replies; 47+ messages in thread From: Neil Brown @ 2002-11-20 23:11 UTC (permalink / raw) To: Alan Cox; +Cc: Linux Kernel Mailing List, linux-raid On November 20, alan@lxorguk.ukuu.org.uk wrote: > On Wed, 2002-11-20 at 04:09, Neil Brown wrote: > > u32 set_uuid[4] > > Wouldn't u8 for the uuid avoid a lot of endian mess Probably.... This makes it very similar to 'name'. The difference is partly the intent for how user-space would use it, and partly that set_uuid must *never* change, while you would probably want name to be allowed to change. > > > u32 ctime > > Use some padding so you can go to 64bit times > Before or after? Or just make it 64bits of seconds now? This brings up endian-ness? Should I assert 'little-endian' or should the code check the endianness of the magic number and convert if necessary? The former is less code which will be exercised more often, so it is probably safe. So: All values shall be little-endian and times shall be stored in 64 bits with the top 20 bits representing microseconds (so we & with (1<<44)-1 to get seconds). Thanks. NeilBrown ^ permalink raw reply [flat|nested] 47+ messages in thread
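A quick sketch of the packing just described: seconds live in the low 44 bits (hence the mask), microseconds in the top 20 bits (1,000,000 fits comfortably in 2^20), and the resulting u64 would be written to disk little-endian. The helper names are invented for the example.

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define MD_TIME_SEC_MASK ((1ULL << 44) - 1)

static uint64_t md_pack_time(uint64_t sec, uint32_t usec)
{
    return ((uint64_t)usec << 44) | (sec & MD_TIME_SEC_MASK);
}

static uint64_t md_time_seconds(uint64_t t)
{
    return t & MD_TIME_SEC_MASK;       /* the "& with (1<<44)-1" step */
}

static uint32_t md_time_usec(uint64_t t)
{
    return (uint32_t)(t >> 44);
}

int main(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);

    uint64_t t = md_pack_time((uint64_t)tv.tv_sec, (uint32_t)tv.tv_usec);
    /* on disk this value would be stored little-endian, per the rule above */
    printf("packed=%#llx sec=%llu usec=%u\n",
           (unsigned long long)t,
           (unsigned long long)md_time_seconds(t), md_time_usec(t));
    return 0;
}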
* Re: RFC - new raid superblock layout for md driver 2002-11-20 23:11 ` Neil Brown @ 2002-11-21 0:30 ` Alan Cox 2002-11-21 0:10 ` John Adams 2002-11-21 0:30 ` Alan Cox 1 sibling, 1 reply; 47+ messages in thread From: Alan Cox @ 2002-11-21 0:30 UTC (permalink / raw) To: Neil Brown; +Cc: Linux Kernel Mailing List, linux-raid On Wed, 2002-11-20 at 23:11, Neil Brown wrote: > This brings up endian-ness? Should I assert 'little-endian' or should > the code check the endianness of the magic number and convert if > necessary? > The former is less code which will be exercised more often, so it is > probably safe. From my own experience pick a single endianness otherwise some tool will always get one endian case wrong on one platform with one word size. > > So: > All values shall be little-endian and times shall be stored in 64 > bits with the top 20 bits representing microseconds (so we & with > (1<<44)-1 to get seconds. Could do - or struct timeval or whatever ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 0:30 ` Alan Cox @ 2002-11-21 0:10 ` John Adams 0 siblings, 0 replies; 47+ messages in thread From: John Adams @ 2002-11-21 0:10 UTC (permalink / raw) To: linux-raid On Wednesday 20 November 2002 07:30 pm, Alan Cox wrote: > On Wed, 2002-11-20 at 23:11, Neil Brown wrote: > > This brings up endian-ness? Should I assert 'little-endian' or should > > the code check the endianness of the magic number and convert if > > necessary? > > The former is less code which will be exercised more often, so it is > > probably safe. > > From my own experience pick a single endianness otherwise some tool will > always get one endian case wrong on one platform with one word size. > Use network byte order. hton[sl] macros already exist. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 4:09 Neil Brown ` (2 preceding siblings ...) 2002-11-20 14:09 ` Alan Cox @ 2002-11-20 16:03 ` Joel Becker 2002-11-20 23:31 ` Neil Brown 2002-11-22 10:13 ` Joe Thornber 2002-11-20 17:05 ` Steven Dake 2002-11-22 7:11 ` Jeremy Fitzhardinge 5 siblings, 2 replies; 47+ messages in thread From: Joel Becker @ 2002-11-20 16:03 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, linux-raid On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote: > The interpretation of the 'name' field would be up to the user-space > tools and the system administrator. > I imagine having something like: > host:name > where if "host" isn't the current host name, auto-assembly is not > tried, and if "host" is the current host name then: > if "name" looks like "md[0-9]*" then the array is assembled as that > device > else the array is assembled as /dev/mdN for some large, unused N, > and a symlink is created from /dev/md/name to /dev/mdN > If the "host" part is empty or non-existant, then the array would be > assembled no-matter what the hostname is. This would be important > e.g. for assembling the device that stores the root filesystem, as we > may not know the host name until after the root filesystem were loaded. Hmm, what is the intended future interaction of DM and MD? Two ways at the same problem? Just curious. Assuming MD as a continually used feature, the "name" bits above seem to be preparing to support multiple shared users of the array. If that is the case, shouldn't the superblock contain everything needed for "clustered" operation? Joel -- "When I am working on a problem I never think about beauty. I only think about how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." - Buckminster Fuller Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 16:03 ` Joel Becker @ 2002-11-20 23:31 ` Neil Brown 2002-11-21 1:46 ` Doug Ledford 2002-11-22 10:13 ` Joe Thornber 1 sibling, 1 reply; 47+ messages in thread From: Neil Brown @ 2002-11-20 23:31 UTC (permalink / raw) To: Joel Becker; +Cc: linux-kernel, linux-raid On Wednesday November 20, Joel.Becker@oracle.com wrote: > On Wed, Nov 20, 2002 at 03:09:18PM +1100, Neil Brown wrote: > > The interpretation of the 'name' field would be up to the user-space > > tools and the system administrator. > > I imagine having something like: > > host:name > > where if "host" isn't the current host name, auto-assembly is not > > tried, and if "host" is the current host name then: > > if "name" looks like "md[0-9]*" then the array is assembled as that > > device > > else the array is assembled as /dev/mdN for some large, unused N, > > and a symlink is created from /dev/md/name to /dev/mdN > > If the "host" part is empty or non-existant, then the array would be > > assembled no-matter what the hostname is. This would be important > > e.g. for assembling the device that stores the root filesystem, as we > > may not know the host name until after the root filesystem were loaded. > > Hmm, what is the intended future interaction of DM and MD? Two > ways at the same problem? Just curious. I see MD and DM as quite different, though I haven't looked much at DM so I could be wrong. I see raid1 and raid5 as being the key elements of MD. i.e. handling redundancy, rebuilding hot spares, stuff like that. raid0 and linear are almost optional frills. DM on the other hand doesn't do redundancy (I don't think) but helps to chop devices up into little bits and put them back together into other devices.... a bit like a filesystem really, but it provides block devices instead of files. So raid0 and linear are more the domain of DM than MD in my mind. But they are currently supported by MD and there is no real need to change that. > Assuming MD as a continually used feature, the "name" bits above > seem to be preparing to support multiple shared users of the array. If > that is the case, shouldn't the superblock contain everything needed for > "clustered" operation? Only if I knew what 'everything needed for clustered operation' was.... There is room for expansion in the superblock so stuff could be added. If there were some specific things that you think would help clustered operation, I'd be happy to hear the details. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 23:31 ` Neil Brown @ 2002-11-21 1:46 ` Doug Ledford 2002-11-21 19:34 ` Joel Becker 0 siblings, 1 reply; 47+ messages in thread From: Doug Ledford @ 2002-11-21 1:46 UTC (permalink / raw) To: Neil Brown; +Cc: Joel Becker, linux-kernel, linux-raid On Thu, Nov 21, 2002 at 10:31:47AM +1100, Neil Brown wrote: > I see MD and DM as quite different, though I haven't looked much at DM > so I could be wrong. I haven't yet played with the new dm code, but if it's like I expect it to be, then I predict that in a few years, or maybe much less, md and dm will be two parts of the same whole. The purpose of md is to map from a single logical device to all the underlying physical devices. The purpose of LVM code in general is to handle the creation, organization, and mapping of multiple physical devices into a single logical device. LVM code is usually shy on advanced mapping routines like RAID5, relying instead on underlying hardware to handle things like that while the LVM code itself just concentrates on physical volumes in the logical volume similar to how linear would do things. But the things LVM does do that are very handy are things like adding a new disk to a volume group and having the volume group automatically expand to fill the additional space, making it possible to increase the size of a logical volume on the fly. When you get right down to it, MD is 95% advanced mapping of physical disks with different possibilities for redundancy and performance. DM is 95% advanced handling of logical volumes including snapshot support, shrink/grow on the fly support, labelling, sharing, etc. The best of both worlds would be to make all of the MD modules be plug-ins in the DM code so that anyone creating a logical volume from a group of physical disks could pick which mapping they want used; linear, raid0, raid1, raid5, etc. You would also want all the md modules inside the DM/LVM core to support the advanced features of LVM, with the online resizing being the primary one that the md modules would need to implement and export an interface for. I would think that the snapshot support would be done at the LVM/DM level instead of in the individual md modules. Anyway, that's my take on how the two *should* go over the next year or so, who knows if that's what will actually happen. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 1:46 ` Doug Ledford @ 2002-11-21 19:34 ` Joel Becker 2002-11-21 19:54 ` Doug Ledford 0 siblings, 1 reply; 47+ messages in thread From: Joel Becker @ 2002-11-21 19:34 UTC (permalink / raw) To: Neil Brown, linux-kernel, linux-raid On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote: > I haven't yet played with the new dm code, but if it's like I expect it to > be, then I predict that in a few years, or maybe much less, md and dm will > be two parts of the same whole. The purpose of md is to map from a single Most LVMs support mirroring as an essential function. They don't usually support RAID5, leaving that to hardware. I certainly don't want to have to deal with two disparate systems to get my code up and running. I don't want to be limited in my mirroring options at the block device level. DM supports mirroring. It's a simple 1:2 map. Imagine this LVM volume layout, where volume 1 is data and mirrored, and volume 2 is some scratch space crossing both disks. [Disk 1] [Disk 2] [volume 1] [volume 1 copy] [ volume 2 ] If DM handles the mirroring, this works great. Disk 1 and disk 2 are handled either as the whole disk (sd[ab]) or one big partition on each disk (sd[ab]1), with DM handling the sizing and layout, even dynamically. If MD is handling this, then the disks have to be partitioned. sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I can't resize the partitions on the fly, I can't break the mirror to add space to volume 2 quickly, etc. Joel -- "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle." - Albert Einstein Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:34 ` Joel Becker @ 2002-11-21 19:54 ` Doug Ledford 2002-11-21 19:57 ` Steven Dake ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Doug Ledford @ 2002-11-21 19:54 UTC (permalink / raw) To: Joel Becker; +Cc: Neil Brown, linux-kernel, linux-raid On Thu, Nov 21, 2002 at 11:34:24AM -0800, Joel Becker wrote: > On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote: > > I haven't yet played with the new dm code, but if it's like I expect it to > > be, then I predict that in a few years, or maybe much less, md and dm will > > be two parts of the same whole. The purpose of md is to map from a single > > Most LVMs support mirroring as an essential function. They > don't usually support RAID5, leaving that to hardware. > I certainly don't want to have to deal with two disparate > systems to get my code up and running. I don't want to be limited in my > mirroring options at the block device level. > DM supports mirroring. It's a simple 1:2 map. Imagine this LVM > volume layout, where volume 1 is data and mirrored, and volume 2 is some > scratch space crossing both disks. > > [Disk 1] [Disk 2] > [volume 1] [volume 1 copy] > [ volume 2 ] > > If DM handles the mirroring, this works great. Disk 1 and disk > 2 are handled either as the whole disk (sd[ab]) or one big partition on > each disk (sd[ab]1), with DM handling the sizing and layout, even > dynamically. > If MD is handling this, then the disks have to be partitioned. > sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I > can't resize the partitions on the fly, I can't break the mirror to add > space to volume 2 quickly, etc. Not at all. That was the point of me entire email, that the LVM code should handle these types of shuffles of space and simply use md modules as the underlying mapper technology. Then, you go to one place to both specify how things are laid out and what mapping is used in those laid out spaces. Basically, I'm saying how I think things *should* be, and you're telling me how they *are*. I know this, and I'm saying how things *are* is wrong. There *should* be no md superblocks, there should only be dm superblocks on LVM physical devices and those DM superblocks should include the data needed to fire up the proper md module on the proper physical extents based upon what mapper technology is specified in the DM superblock and what layout is specified in the DM superblock. In my opinion, the existence of both an MD and DM driver is wrong because they are inherently two sides of the same coin, logical device mapping support, with one being better at putting physical disks into intelligent arrays and one being better at mapping different logical volumes onto one or more physical volume groups. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:54 ` Doug Ledford @ 2002-11-21 19:57 ` Steven Dake 2002-11-21 20:38 ` Doug Ledford 2002-11-21 21:29 ` Alan Cox 2002-11-21 20:06 ` Joel Becker 2002-11-21 23:35 ` Luca Berra 2 siblings, 2 replies; 47+ messages in thread From: Steven Dake @ 2002-11-21 19:57 UTC (permalink / raw) To: Doug Ledford; +Cc: Joel Becker, Neil Brown, linux-kernel, linux-raid Doug, EVMS integrates all of this stuff together into one cohesive peice of technology. But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD should be modified to support volume management. Since RAID 1 and RAID 5 are easier to implement, LVM is probably the best place to put all this stuff. Doug Ledford wrote: >On Thu, Nov 21, 2002 at 11:34:24AM -0800, Joel Becker wrote: > > >>On Wed, Nov 20, 2002 at 08:46:25PM -0500, Doug Ledford wrote: >> >> >>>I haven't yet played with the new dm code, but if it's like I expect it to >>>be, then I predict that in a few years, or maybe much less, md and dm will >>>be two parts of the same whole. The purpose of md is to map from a single >>> >>> >> Most LVMs support mirroring as an essential function. They >>don't usually support RAID5, leaving that to hardware. >> I certainly don't want to have to deal with two disparate >>systems to get my code up and running. I don't want to be limited in my >>mirroring options at the block device level. >> DM supports mirroring. It's a simple 1:2 map. Imagine this LVM >>volume layout, where volume 1 is data and mirrored, and volume 2 is some >>scratch space crossing both disks. >> >> [Disk 1] [Disk 2] >> [volume 1] [volume 1 copy] >> [ volume 2 ] >> >> If DM handles the mirroring, this works great. Disk 1 and disk >>2 are handled either as the whole disk (sd[ab]) or one big partition on >>each disk (sd[ab]1), with DM handling the sizing and layout, even >>dynamically. >> If MD is handling this, then the disks have to be partitioned. >>sd[ab]1 contain the portions of md0, and sd[ab]2 are managed by DM. I >>can't resize the partitions on the fly, I can't break the mirror to add >>space to volume 2 quickly, etc. >> >> > >Not at all. That was the point of me entire email, that the LVM code >should handle these types of shuffles of space and simply use md modules >as the underlying mapper technology. Then, you go to one place to both >specify how things are laid out and what mapping is used in those laid out >spaces. Basically, I'm saying how I think things *should* be, and you're >telling me how they *are*. I know this, and I'm saying how things *are* >is wrong. There *should* be no md superblocks, there should only be dm >superblocks on LVM physical devices and those DM superblocks should >include the data needed to fire up the proper md module on the proper >physical extents based upon what mapper technology is specified in the >DM superblock and what layout is specified in the DM superblock. In my >opinion, the existence of both an MD and DM driver is wrong because they >are inherently two sides of the same coin, logical device mapping support, >with one being better at putting physical disks into intelligent arrays >and one being better at mapping different logical volumes onto one or more >physical volume groups. > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:57 ` Steven Dake @ 2002-11-21 20:38 ` Doug Ledford 2002-11-21 20:49 ` Steven Dake 2002-11-21 21:29 ` Alan Cox 1 sibling, 1 reply; 47+ messages in thread From: Doug Ledford @ 2002-11-21 20:38 UTC (permalink / raw) To: Steven Dake; +Cc: Joel Becker, Neil Brown, linux-kernel, linux-raid On Thu, Nov 21, 2002 at 12:57:42PM -0700, Steven Dake wrote: > Doug, > > EVMS integrates all of this stuff together into one cohesive peice of > technology. > > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD > should be modified to support volume management. Since RAID 1 and RAID > 5 are easier to implement, LVM is probably the best place to put all > this stuff. Yep. I tend to agree there. A little work to make device mapping modular in LVM, and a little work to make the md modules plug into LVM, and you could be done. All that would be left then is adding the right stuff into the user space tools. Basically, what irks me about the current situation is that right now in the Red Hat installer, if I want LVM features I have to create one type of object with a disk, and if I want reasonable software RAID I have to create another type of object with partitions. That shouldn't be the case, I should just create an LVM logical volume, assign physical disks to it, and then additionally assign the redundancy or performance layout I want (IMNSHO) :-) -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 20:38 ` Doug Ledford @ 2002-11-21 20:49 ` Steven Dake 2002-11-21 20:35 ` Kevin Corry 0 siblings, 1 reply; 47+ messages in thread From: Steven Dake @ 2002-11-21 20:49 UTC (permalink / raw) To: Doug Ledford; +Cc: Joel Becker, Neil Brown, linux-kernel, linux-raid Doug, Yup this would be ideal and I think this is what EVMS tries to do, although I haven't tried it. The advantage of doing such a thing would also be that MD could be made to work with shared LVM VGs for shared storage environments. now to write the code... -steve Doug Ledford wrote: >On Thu, Nov 21, 2002 at 12:57:42PM -0700, Steven Dake wrote: > > >>Doug, >> >>EVMS integrates all of this stuff together into one cohesive peice of >>technology. >> >>But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD >>should be modified to support volume management. Since RAID 1 and RAID >>5 are easier to implement, LVM is probably the best place to put all >>this stuff. >> >> > >Yep. I tend to agree there. A little work to make device mapping modular >in LVM, and a little work to make the md modules plug into LVM, and you >could be done. All that would be left then is adding the right stuff into >the user space tools. Basically, what irks me about the current situation >is that right now in the Red Hat installer, if I want LVM features I have >to create one type of object with a disk, and if I want reasonable >software RAID I have to create another type of object with partitions. >That shouldn't be the case, I should just create an LVM logical volume, >assign physical disks to it, and then additionally assign the redundancy >or performance layout I want (IMNSHO) :-) > > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 20:49 ` Steven Dake @ 2002-11-21 20:35 ` Kevin Corry 0 siblings, 0 replies; 47+ messages in thread From: Kevin Corry @ 2002-11-21 20:35 UTC (permalink / raw) To: Steven Dake, Doug Ledford Cc: Joel Becker, Neil Brown, linux-kernel, linux-raid On Thursday 21 November 2002 14:49, Steven Dake wrote: > Doug, > > Yup this would be ideal and I think this is what EVMS tries to do, > although I haven't tried it. This is indeed what EVMS's new design does. It has user-space plugins that understand a variety of on-disk-metadata formats. There are plugins for LVM volumes, for MD RAID devices, for partitions, as well as others. The plugins communicate with the MD driver to activate MD devices, and with the device-mapper driver to activate other devices. As for whether DM and MD kernel drivers should be merged: I imagine it could be done, since DM already has support for easily adding new modules, but I don't see any overwhelming reason to merge them right now. I'm sure it will be discussed more when 2.7 comes out. For now they seem to work fine as separate drivers doing what each specializes in. All the integration issues that have been brought up can usually be dealt with in user-space. -- Kevin Corry corryk@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:57 ` Steven Dake 2002-11-21 20:38 ` Doug Ledford @ 2002-11-21 21:29 ` Alan Cox 2002-11-21 21:22 ` Doug Ledford 1 sibling, 1 reply; 47+ messages in thread From: Alan Cox @ 2002-11-21 21:29 UTC (permalink / raw) To: Steven Dake Cc: Doug Ledford, Joel Becker, Neil Brown, Linux Kernel Mailing List, linux-raid On Thu, 2002-11-21 at 19:57, Steven Dake wrote: > Doug, > > EVMS integrates all of this stuff together into one cohesive peice of > technology. > > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD > should be modified to support volume management. Since RAID 1 and RAID > 5 are easier to implement, LVM is probably the best place to put all > this stuff. User space issue. Its about the tools view not about the kernel drivers. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 21:29 ` Alan Cox @ 2002-11-21 21:22 ` Doug Ledford 2002-11-21 20:53 ` Kevin Corry 0 siblings, 1 reply; 47+ messages in thread From: Doug Ledford @ 2002-11-21 21:22 UTC (permalink / raw) To: Alan Cox Cc: Steven Dake, Joel Becker, Neil Brown, Linux Kernel Mailing List, linux-raid On Thu, Nov 21, 2002 at 09:29:36PM +0000, Alan Cox wrote: > On Thu, 2002-11-21 at 19:57, Steven Dake wrote: > > Doug, > > > > EVMS integrates all of this stuff together into one cohesive peice of > > technology. > > > > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD > > should be modified to support volume management. Since RAID 1 and RAID > > 5 are easier to implement, LVM is probably the best place to put all > > this stuff. > > User space issue. Its about the tools view not about the kernel drivers. Not entirely true. You could do everything in user space except online resize of raid0/4/5 arrays, that requires specific support in the md modules and it begs for integration between LVM and MD since the MD is what has to resize the underlying device yet it's the LVM that usually handles filesystem resizing. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 21:22 ` Doug Ledford @ 2002-11-21 20:53 ` Kevin Corry 2002-11-21 21:55 ` Doug Ledford 0 siblings, 1 reply; 47+ messages in thread From: Kevin Corry @ 2002-11-21 20:53 UTC (permalink / raw) To: Doug Ledford, Alan Cox Cc: Steven Dake, Joel Becker, Neil Brown, Linux Kernel Mailing List, linux-raid On Thursday 21 November 2002 15:22, Doug Ledford wrote: > On Thu, Nov 21, 2002 at 09:29:36PM +0000, Alan Cox wrote: > > On Thu, 2002-11-21 at 19:57, Steven Dake wrote: > > > Doug, > > > > > > EVMS integrates all of this stuff together into one cohesive peice of > > > technology. > > > > > > But I agree, LVM should be modified to support RAID 1 and RAID 5, or MD > > > should be modified to support volume management. Since RAID 1 and RAID > > > 5 are easier to implement, LVM is probably the best place to put all > > > this stuff. > > > > User space issue. Its about the tools view not about the kernel drivers. > > Not entirely true. You could do everything in user space except online > resize of raid0/4/5 arrays, that requires specific support in the md > modules and it begs for integration between LVM and MD since the MD is > what has to resize the underlying device yet it's the LVM that usually > handles filesystem resizing. LVM doesn't handle the filesystem resizing, the filesystem tools do. The only thing you need is something in user-space to ensure the correct ordering. For an expand, the MD device must be expanded first. When that is complete, resizefs is called to expand the filesystem. MD currently doesn't allow resize of RAID 0, 4 or 5, because expanding striped devices is way ugly. If it was determined to be possible, the MD driver may need additional support to allow online resize. But it is just as easy to add this support to MD rather than have to merge MD and DM. -- Kevin Corry corryk@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 20:53 ` Kevin Corry @ 2002-11-21 21:55 ` Doug Ledford 0 siblings, 0 replies; 47+ messages in thread From: Doug Ledford @ 2002-11-21 21:55 UTC (permalink / raw) To: Kevin Corry Cc: Alan Cox, Steven Dake, Joel Becker, Neil Brown, Linux Kernel Mailing List, linux-raid On Thu, Nov 21, 2002 at 02:53:23PM -0600, Kevin Corry wrote: > > LVM doesn't handle the filesystem resizing, the filesystem tools do. The only > thing you need is something in user-space to ensure the correct ordering. For > an expand, the MD device must be expanded first. When that is complete, > resizefs is called to expand the filesystem. > > MD currently doesn't allow resize of RAID 0, 4 or 5, because expanding > striped devices is way ugly. MD doesn't, raidreconf does but not online. > If it was determined to be possible, the MD > driver may need additional support to allow online resize. Yes, it would. It's not impossible, just difficult. > But it is just as > easy to add this support to MD rather than have to merge MD and DM. Well, merging the two would actually be rather a simple task I think since you would still keep each md mode a separate module, the only difference might be some inter-communication call backs between LVM and MD, but even those aren't necessarily required. The prime benefit I would see from making the two into one is being able to integrate all the disparate superblocks into a single superblock format that helps to avoid any possible startup errors between the different logical mapping levels. -- Doug Ledford <dledford@redhat.com> 919-754-3700 x44233 Red Hat, Inc. 1801 Varsity Dr. Raleigh, NC 27606 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:54 ` Doug Ledford 2002-11-21 19:57 ` Steven Dake @ 2002-11-21 20:06 ` Joel Becker 2002-11-21 23:35 ` Luca Berra 2 siblings, 0 replies; 47+ messages in thread From: Joel Becker @ 2002-11-21 20:06 UTC (permalink / raw) To: Neil Brown, linux-kernel, linux-raid On Thu, Nov 21, 2002 at 02:54:06PM -0500, Doug Ledford wrote: > opinion, the existence of both an MD and DM driver is wrong because they > are inherently two sides of the same coin This is exactly my point. I got "MD and DM should be used together" out of your email, and I guess I didn't get your stance clearly. Joel -- Life's Little Instruction Book #69 "Whistle" Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-21 19:54 ` Doug Ledford 2002-11-21 19:57 ` Steven Dake 2002-11-21 20:06 ` Joel Becker @ 2002-11-21 23:35 ` Luca Berra 2 siblings, 0 replies; 47+ messages in thread From: Luca Berra @ 2002-11-21 23:35 UTC (permalink / raw) To: linux-raid; +Cc: linux-kernel On Thu, Nov 21, 2002 at 02:54:06PM -0500, Doug Ledford wrote: >is wrong. There *should* be no md superblocks, there should only be dm >superblocks on LVM physical devices and those DM superblocks should >include the data needed to fire up the proper md module on the proper >physical extents based upon what mapper technology is specified in the >DM superblock and what layout is specified in the DM superblock. In my There are no DM superblocks; DM only maps sectors of existing devices into new (logical) devices. The decision of which sectors should be mapped, and where, rests in user-space, be it LVM2, dmsetup, EVMS or whatever. Regards, L. -- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 16:03 ` Joel Becker 2002-11-20 23:31 ` Neil Brown @ 2002-11-22 10:13 ` Joe Thornber 2002-12-02 21:38 ` Neil Brown 1 sibling, 1 reply; 47+ messages in thread From: Joe Thornber @ 2002-11-22 10:13 UTC (permalink / raw) To: Joel Becker; +Cc: Neil Brown, linux-kernel, linux-raid On Wed, Nov 20, 2002 at 08:03:00AM -0800, Joel Becker wrote: > Hmm, what is the intended future interaction of DM and MD? Two > ways at the same problem? Just curious. There are a couple of good arguments for moving the _mapping_ code from md into dm targets: 1) Building a mirror is essentially just copying large amounts of data around, exactly what is needed to implement move functionality for arbitrarily remapping volumes. (see http://people.sistina.com/~thornber/pvmove_outline.txt). So I've always had every intention of implementing a mirror target for dm. 2) Extending raid 5 volumes becomes very simple if they are dm targets since you just add another segment, this new segment could even have different numbers of stripes. eg, old volume new volume +--------------------+ +--------------------+--------------------+ | raid5 across 3 LVs | => | raid5 across 3 LVs | raid5 across 5 LVs | +--------------------+ +--------------------+--------------------+ Of course this could be done ATM by stacking 'bottom LVs' -> mds -> 'top LV', but that does create more intermediate devices and sacrifices space to the md metadata (eg, LVM has its own metadata and doesn't need md to duplicate it). MD would continue to exist as a separate driver, it needs to read and write the md metadata at the beginning of the physical volumes, and implement all the nice recovery/hot spare features. ie. dm does the mapping, md implements the policies by driving dm appropriately. If other volume managers such as EVMS or LVM want to implement features not provided by md, they are free to drive dm directly. I don't think it's a huge amount of work to refactor the md code, and now might be the right time if Neil is already changing things. I would be more than happy to work on this if Neil agrees. - Joe ^ permalink raw reply [flat|nested] 47+ messages in thread
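The "just add another segment" idea reduces to a table lookup at map time: the logical address space is an ordered list of segments, each remembering its own stripe count, and a request is first routed to its segment and then striped by that segment's own rules. The structure below is an invented illustration, not dm or md code.

#include <stdint.h>
#include <stdio.h>

struct segment {
    uint64_t start;     /* first logical sector covered by this segment */
    uint64_t len;       /* length in sectors */
    int      stripes;   /* e.g. 3 for the old part, 5 for the new part */
};

/* Find which segment a logical sector falls in; each segment can then
 * apply its own raid5 mapping with its own stripe count. */
static const struct segment *find_segment(const struct segment *segs,
                                          int nsegs, uint64_t sector)
{
    for (int i = 0; i < nsegs; i++)
        if (sector >= segs[i].start && sector < segs[i].start + segs[i].len)
            return &segs[i];
    return NULL;
}

int main(void)
{
    struct segment vol[] = {
        { 0,       1 << 20, 3 },   /* original raid5 across 3 LVs */
        { 1 << 20, 1 << 21, 5 },   /* appended raid5 across 5 LVs */
    };
    const struct segment *s = find_segment(vol, 2, (1 << 20) + 42);
    if (s)
        printf("sector maps into the %d-stripe segment\n", s->stripes);
    return 0;
}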
* Re: RFC - new raid superblock layout for md driver 2002-11-22 10:13 ` Joe Thornber @ 2002-12-02 21:38 ` Neil Brown 2002-12-03 8:24 ` Luca Berra 0 siblings, 1 reply; 47+ messages in thread From: Neil Brown @ 2002-12-02 21:38 UTC (permalink / raw) To: Joe Thornber; +Cc: Joel Becker, linux-kernel, linux-raid On Friday November 22, joe@fib011235813.fsnet.co.uk wrote: > On Wed, Nov 20, 2002 at 08:03:00AM -0800, Joel Becker wrote: > > Hmm, what is the intended future interaction of DM and MD? Two > > ways at the same problem? Just curious. > > > There are a couple of good arguments for moving the _mapping_ code > from md into dm targets: > > 1) Building a mirror is essentially just copying large amounts of data > around, exactly what is needed to implement move functionality for > arbitrarily remapping volumes. (see > http://people.sistina.com/~thornber/pvmove_outline.txt). Building a mirror may be just moving data around. But the interesting issues in raid1 are more about maintaining a mirror: read balancing, retry on error, hot spares, etc. > > So I've always had every intention of implementing a mirror target > for dm. > > 2) Extending raid 5 volumes becomes very simple if they are dm targets > since you just add another segment, this new segment could even > have different numbers of stripes. eg, > > > old volume new volume > +--------------------+ +--------------------+--------------------+ > | raid5 across 3 LVs | => | raid5 across 3 LVs | raid5 across 5 LVs | > +--------------------+ +--------------------+--------------------+ > > Of course this could be done ATM by stacking 'bottom LVs' -> mds -> > 'top LV', but that does create more intermediate devices and > sacrifices space to the md metadata (eg, LVM has its own metadata > and doesn't need md to duplicate it). But is this something that you would *want* to do??? To my mind, the raid1/raid5 almost always lives below any LVM or partitioning scheme. You use raid1/raid5 to combine drives (real, physical drives) into virtual drives that are more reliable, and then you partition them or whatever you want to do. raid1 and raid5 on top of LVM bits just doesn't make sense to me. I say 'almost' above because there is one situation where something else makes sense. That is when you have a small number of drives in a machine (3 to 5) and you really want RAID5 for all of these, but booting only really works for RAID1. So you partition the drives, use RAID1 for the first partitions, and RAID5 for the rest. put boot (or maybe root) on the RAID1 bit and all your interesting data on the RAID5 bit. [[ I just had this really sick idea of creating a raid level that did data duplication (aka mirroring) for the first N stripes, and stripe/parity (aka raid5) for the remaining stripes. Then you just combine your set of drives together with this level, and depending on your choice of N, you get all raid1, all raid5, or a mixture which allows booting off the early sectors....]] > > MD would continue to exist as a seperate driver, it needs to read and > write the md metadata at the beginning of the physical volumes, and > implement all the nice recovery/hot spare features. ie. dm does the > mapping, md implements the policies by driving dm appropriately. If > other volume managers such as EVMS or LVM want to implement features > not provided by md, they are free to drive dm directly. > > I don't think it's a huge amount of work to refactor the md code, and > now might be the right time if Neil is already changing things. 
I > would be more than happy to work on this if Neil agrees. I would probably need a more concrete proposal before I had something to agree with :-) I really think the raid1/raid5 parts of MD are distinctly different from DM, and they should remain separate. However I am quite happy to improve the interfaces so that seamless connections can be presented by user-space tools. For example, md currently gets its 'super-block' information by reading the device directly. Soon it will have two separate routines that get the super-block info, one for each super-block format. I would be quite happy for there to be a way for DM to give a device to MD along with some routine that provided super-block info by getting it out of some near-by LVM superblock rather than out of the device itself. Similarly, if an API could be designed for MD to provide higher levels with access to the spare parts of its superblock, e.g. for partition table information, then that might make sense. To summarise: If you want tighter integration between MD and DM, do it by defining useful interfaces, not by trying to tie them both together into one big lump. NeilBrown ^ permalink raw reply [flat|nested] 47+ messages in thread
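The "two separate routines that get the super-block info, one for each super-block format" naturally look like a small table of per-format operations, and a volume manager could in principle register its own entry that feeds md the same information out of its own metadata. The sketch below only illustrates the shape of such an interface; the names do not reproduce the actual 2.5 md code.

#include <stdio.h>

struct mddev;                  /* opaque for this sketch */
struct mdrdev;                 /* one component device, also opaque */

/* One entry per superblock format. */
struct super_ops {
    const char *name;
    int  (*load_super)(struct mdrdev *rdev);                       /* read + sanity check */
    int  (*validate_super)(struct mddev *mddev, struct mdrdev *rdev);
    void (*sync_super)(struct mddev *mddev, struct mdrdev *rdev);  /* write back */
};

/* Stubs standing in for real per-format code. */
static int load_super_090(struct mdrdev *r)  { (void)r; return 0; }
static int load_super_1(struct mdrdev *r)    { (void)r; return 0; }
static int validate_stub(struct mddev *m, struct mdrdev *r) { (void)m; (void)r; return 0; }
static void sync_stub(struct mddev *m, struct mdrdev *r)    { (void)m; (void)r; }

static const struct super_ops super_types[] = {
    { "0.90.0", load_super_090, validate_stub, sync_stub },
    { "1",      load_super_1,   validate_stub, sync_stub },
};

int main(void)
{
    for (unsigned i = 0; i < sizeof(super_types) / sizeof(super_types[0]); i++)
        printf("superblock format %s registered\n", super_types[i].name);
    return 0;
}

A DM/LVM-supplied entry would simply implement load_super by pulling the equivalent fields out of the volume manager's own metadata instead of reading the component device, which is exactly the hook Neil describes.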
* Re: RFC - new raid superblock layout for md driver 2002-12-02 21:38 ` Neil Brown @ 2002-12-03 8:24 ` Luca Berra 0 siblings, 0 replies; 47+ messages in thread From: Luca Berra @ 2002-12-03 8:24 UTC (permalink / raw) To: Neil Brown; +Cc: Joe Thornber, linux-kernel, linux-raid On Tue, Dec 03, 2002 at 08:38:25AM +1100, Neil Brown wrote: >> 1) Building a mirror is essentially just copying large amounts of data >> around, exactly what is needed to implement move functionality for >> arbitrarily remapping volumes. (see >> http://people.sistina.com/~thornber/pvmove_outline.txt). > >Building a mirror may be just moving data around. But the interesting >issues in raid1 are more about maintaining a mirror: read balancing, >retry on error, hot spares, etc. True, that's why LVM (dm) should use md for the raid work. >> >> 2) Extending raid 5 volumes becomes very simple if they are dm targets >> since you just add another segment, this new segment could even >> have different numbers of stripes. eg, >> >But is this something that you would *want* to do??? > >To my mind, the raid1/raid5 almost always lives below any LVM or >partitioning scheme. You use raid1/raid5 to combine drives (real, >physical drives) into virtual drives that are more reliable, and then >you partition them or whatever you want to do. raid1 and raid5 on top >of LVM bits just doesn't make sense to me. Well, to me it does: - you might want to split a mirror of a portion of data for backup purposes (when snapshots won't do) or for safety before attempting a risky operation. - you might also want to have different raid strategies for different data. Think of a medium-sized storage setup with Oracle: you might want to do a fast mirror for online redo logs(1) and raid5 for datafiles.(2) - you might want to add mirroring after having put data on your disks, and the current way to do it with MD on partitions is complex, while with LVM over MD it is really hard to do right. - stackable devices are harder to maintain; a single interface to deal with mirroring and volume management would be easier. - we won't have any more problems with 'switching cache buffer size' :)))) (1) Yes, I know they are mirrored by Oracle, but having a fs unavailable due to disk failure is a pita anyway. (2) A dba will tell you to use different disks, but I never found anyone willing to use 4 73GB disks for redo logs. >[[ I just had this really sick idea of creating a raid level that did >data duplication (aka mirroring) for the first N stripes, and I had another sick idea of teaching lilo how to do raid5, but it won't fit in 512b. Anyway, for the normal MD-on-partitions case, creating one n-way raid1 for /boot and raid5 for the rest is enough. >I really think the raid1/raid5 parts of MD are distinctly different >from DM, and they should remain separate. However I am quite happy to >improve the interfaces so that seamless connections can be presented >by user-space tools. Reading this, it looks like the only way dm could get raid is by reimplementing or duplicating code from existing md, thus duplicating code in the kernel. >To summarise: If you want tighter integration between MD and DM, do it >by defining useful interfaces, not by trying to tie them both together >into one big lump. We can think of md as split into these major areas: 1) the superblock interface, which I believe we all agree should go to user mode for all the array setup functions, and should keep the portion for updating the superblock in kernel space; 
2) the raid logic code; 3) the interface to the lower block device layer; 4) the interface to the upper block device layer (in md these 3 are tightly coupled). Some of these areas overlap with dm, and it could be possible to merge the duplicated functionality. Having said that, and having looked 'briefly' at the code, I believe that doing something like this would mean completely reworking the logic behind md, and adding some major parts to dm, or better to a separate module. In my idea we will have: a core that handles request mapping; metadata plugins for both the md superblock format and lvm metadata (those would deal with keeping the metadata current with the array's current status); and layout plugins for raid?, striping, linear, multipath (does this belong here or at a different level?). L. -- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 4:09 Neil Brown ` (3 preceding siblings ...) 2002-11-20 16:03 ` Joel Becker @ 2002-11-20 17:05 ` Steven Dake 2002-11-20 23:30 ` Lars Marowsky-Bree ` (2 more replies) 2002-11-22 7:11 ` Jeremy Fitzhardinge 5 siblings, 3 replies; 47+ messages in thread From: Steven Dake @ 2002-11-20 17:05 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, linux-raid Neil, I would suggest adding a 64 bit field called "unique_identifier" to the per-device structure. This would allow a RAID volume to be locked to a specific host, allowing the ability for true multihost operation. For FibreChannel, we have a patch which places the host's FC WWN into the superblock structure, and only allows importing an array disk (via ioctl or autostart) if the superblock's WWN matches the target dev_t's host fibrechannel WWN. We also use this for environments where slots are used to house either CPU or disks and lock a RAID array to a specific cpu slot. WWNs are 64 bit, which is why I suggest such a large bitsize for this field. This really helps in hotswap environments where a CPU blade is replaced and should use the same disk, but the disk naming may have changed between reboots. This could be done without this field, but then the RAID arrays could be started unintentionally by the wrong host. Imagine a host starting the wrong RAID array while it has been already started by some other party (forcing a rebuild) ugh! Thanks -steve Neil Brown wrote: >The md driver in linux uses a 'superblock' written to all devices in a >RAID to record the current state and geometry of a RAID and to allow >the various parts to be re-assembled reliably. > >The current superblock layout is sub-optimal. It contains a lot of >redundancy and wastes space. In 4K it can only fit 27 component >devices. It has other limitations. > >I (and others) would like to define a new (version 1) format that >resolves the problems in the current (0.90.0) format. > >The code in 2.5.lastest has all the superblock handling factored out so >that defining a new format is very straight forward. > >I would like to propose a new layout, and to receive comment on it.. > >My current design looks like: > /* constant array information - 128 bytes */ > u32 md_magic > u32 major_version == 1 > u32 feature_map /* bit map of extra features in superblock */ > u32 set_uuid[4] > u32 ctime > u32 level > u32 layout > u64 size /* size of component devices, if they are all > * required to be the same (Raid 1/5 */ > u32 chunksize > u32 raid_disks > char name[32] > u32 pad1[10]; > > /* constant this-device information - 64 bytes */ > u64 address of superblock in device > u32 number of this device in array /* constant over reconfigurations */ > u32 device_uuid[4] > u32 pad3[9] > > /* array state information - 64 bytes */ > u32 utime > u32 state /* clean, resync-in-progress */ > u32 sb_csum > u64 events > u64 resync-position /* flag in state if this is valid) > u32 number of devices > u32 pad2[8] > > /* device state information, indexed by 'number of device in array' > 4 bytes per device */ > for each device: > u16 position /* in raid array or 0xffff for a spare. */ > u16 state flags /* error detected, in-sync */ > > >This has 128 bytes for essentially constant information about the >array, 64 bytes for constant information about this device, 64 bytes >for changable state information about the array, and 4 bytes per >device for state information about the devices. 
This would allow an >array with 192 devices in a 1K superblock, and 960 devices in a 4k >superblock (the current size). > >Other features: > A feature map instead of a minor version number. > 64 bit component device size field. > field for storing current position of resync process if array is > shut down while resync is running. > no "minor" field but a textual "name" field instead. > address of superblock in superblock to avoid misidentifying > superblock. e.g. is it in a partition or a whole device. > uuid for each device. This is not directly used by the md driver, > but it is maintained, even if a drive is moved between arrays, > and user-space can use it for tracking devices. > >md would, of course, continue to support the current layout >indefinately, but this new layout would be available for use by people >who don't need compatability with 2.4 and do want more than 27 devices >etc. > >To create an array with the new superblock layout, the user-space >tool would write directly to the devices, (like mkfs does) and then >assemble the array. Creating an array using the ioctl interface will >still create an array with the old superblock. > >When the kernel loads a superblock, it would check the major_version >to see which piece of code to use to handle it. >When it writes out a superblock, it would use the same version as was >read in (of course). > >This superblock would *not* support in-kernel auto-assembly as that >requires the "minor" field that I have deliberatly removed. However I >don't think this is a big cost as it looks like in-kernel >auto-assembly is about to disappear with the early-user-space patches. > >The interpretation of the 'name' field would be up to the user-space >tools and the system administrator. >I imagine having something like: > host:name >where if "host" isn't the current host name, auto-assembly is not >tried, and if "host" is the current host name then: > if "name" looks like "md[0-9]*" then the array is assembled as that > device > else the array is assembled as /dev/mdN for some large, unused N, > and a symlink is created from /dev/md/name to /dev/mdN >If the "host" part is empty or non-existant, then the array would be >assembled no-matter what the hostname is. This would be important >e.g. for assembling the device that stores the root filesystem, as we >may not know the host name until after the root filesystem were loaded. > >This would make auto-assembly much more flexable. > >Comments welcome. > >NeilBrown >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ > > > > > ^ permalink raw reply [flat|nested] 47+ messages in thread
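For readability, here is a direct C transliteration of the layout quoted above. The field names follow the prose (sb_offset, dev_number, nr_devices and so on are my own choices, not an official definition), and because the u64 words in the state section are not naturally 8-byte aligned in this ordering, the sketch packs the struct so the listed 128+64+64-byte section offsets hold.

#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

/* Illustrative rendering of the proposed version-1 superblock. */
struct mdp_superblock_1 {
    /* constant array information - 128 bytes */
    u32 md_magic;
    u32 major_version;        /* == 1 */
    u32 feature_map;          /* bit map of extra features */
    u32 set_uuid[4];
    u32 ctime;
    u32 level;
    u32 layout;
    u64 size;                 /* component device size (raid 1/5) */
    u32 chunksize;
    u32 raid_disks;
    char name[32];
    u32 pad1[10];

    /* constant this-device information - 64 bytes */
    u64 sb_offset;            /* address of superblock in device */
    u32 dev_number;           /* constant over reconfigurations */
    u32 device_uuid[4];
    u32 pad3[9];

    /* array state information - 64 bytes */
    u32 utime;
    u32 state;                /* clean, resync-in-progress */
    u32 sb_csum;
    u64 events;
    u64 resync_position;      /* valid only if flagged in state */
    u32 nr_devices;
    u32 pad2[8];

    /* device state information, indexed by dev_number - 4 bytes each */
    struct {
        u16 position;         /* role in array, or 0xffff for a spare */
        u16 state;            /* error detected, in-sync */
    } devs[];
} __attribute__((packed));

The headline figures in the proposal fall out of this arithmetic: the fixed part is 256 bytes, so a 1K superblock leaves (1024 - 256) / 4 = 192 per-device entries and a 4K superblock leaves (4096 - 256) / 4 = 960.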
* Re: RFC - new raid superblock layout for md driver 2002-11-20 17:05 ` Steven Dake @ 2002-11-20 23:30 ` Lars Marowsky-Bree 2002-11-20 23:48 ` Neil Brown 2002-11-21 19:36 ` Joel Becker 2 siblings, 0 replies; 47+ messages in thread From: Lars Marowsky-Bree @ 2002-11-20 23:30 UTC (permalink / raw) To: Steven Dake, Neil Brown; +Cc: linux-kernel, linux-raid On 2002-11-20T10:05:29, Steven Dake <sdake@mvista.com> said: > This could be done without this field, but then the RAID arrays could be > started unintentionally by the wrong host. Imagine a host starting the > wrong RAID array while it has been already started by some other party > (forcing a rebuild) ugh! This is already easy and does not require the addition of a field to the md superblock. Just explicitly start only the disks with the proper uuid in the md superblock. Don't simply start them all. (I'll reply to Neil's mail momentarily) Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- Principal Squirrel SuSE Labs - Research & Development, SuSE Linux AG "If anything can go wrong, it will." "Chance favors the prepared (mind)." -- Capt. Edward A. Murphy -- Louis Pasteur ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver 2002-11-20 17:05 ` Steven Dake 2002-11-20 23:30 ` Lars Marowsky-Bree @ 2002-11-20 23:48 ` Neil Brown 2002-11-21 0:29 ` Steven Dake 2002-11-21 19:36 ` Joel Becker 2 siblings, 1 reply; 47+ messages in thread From: Neil Brown @ 2002-11-20 23:48 UTC (permalink / raw) To: Steven Dake; +Cc: linux-kernel, linux-raid On Wednesday November 20, sdake@mvista.com wrote: > Neil, > > I would suggest adding a 64 bit field called "unique_identifier" to the > per-device structure. This would allow a RAID volume to be locked to a > specific host, allowing the ability for true multihost operation. You seem to want a unique id in 'per device' which will identify the 'volume'. That doesn't make sense to me so maybe I am missing something. If you want to identify the 'volume', you put some sort of id in the 'per-volume' data structure. This is what the 'name' field is for. > > For FibreChannel, we have a patch which places the host's FC WWN into > the superblock structure, and only allows importing an array disk (via > ioctl or autostart) if the superblock's WWN matches the target dev_t's > host fibrechannel WWN. We also use this for environments where slots > are used to house either CPU or disks and lock a RAID array to a > specific cpu slot. WWNs are 64 bit, which is why I suggest such a large > bitsize for this field. This really helps in hotswap environments where > a CPU blade is replaced and should use the same disk, but the disk > naming may have changed between reboots. > > This could be done without this field, but then the RAID arrays could be > started unintentionally by the wrong host. Imagine a host starting the > wrong RAID array while it has been already started by some other party > (forcing a rebuild) ugh! Put your 64 bit WWN in the 'name' field, and teach user-space to match 'name' to FC adapter. Does that work for you? NeilBrown ^ permalink raw reply [flat|nested] 47+ messages in thread
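The host matching that Neil keeps pushing to user-space is only a few lines of code there. Below is a sketch of the "host:name" convention described earlier in the thread; it is purely illustrative and not taken from any existing tool. For Steven's FibreChannel case the comparison key would simply be the adapter's WWN instead of the hostname.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Decide whether an array whose superblock carries 'sb_name' should be
 * auto-assembled on this machine.  A missing or empty host part means
 * "assemble anywhere". */
static int name_matches_this_host(const char *sb_name)
{
    char host[64];
    const char *colon = strchr(sb_name, ':');

    if (!colon || colon == sb_name)
        return 1;                          /* no host restriction */
    if (gethostname(host, sizeof(host)) != 0)
        return 0;
    host[sizeof(host) - 1] = '\0';
    return strlen(host) == (size_t)(colon - sb_name) &&
           strncmp(sb_name, host, (size_t)(colon - sb_name)) == 0;
}

int main(void)
{
    const char *examples[] = { "alpha:scratch", ":shared", "md3" };
    for (int i = 0; i < 3; i++)
        printf("%-14s -> %s\n", examples[i],
               name_matches_this_host(examples[i]) ? "assemble" : "skip");
    return 0;
}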
* Re: RFC - new raid superblock layout for md driver
  2002-11-20 23:48 ` Neil Brown
@ 2002-11-21  0:29 ` Steven Dake
  2002-11-21 15:23   ` John Stoffel
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Dake @ 2002-11-21 0:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-kernel, linux-raid

Neil Brown wrote:

>On Wednesday November 20, sdake@mvista.com wrote:
>
>>Neil,
>>
>>I would suggest adding a 64 bit field called "unique_identifier" to the
>>per-device structure.  This would allow a RAID volume to be locked to a
>>specific host, allowing the ability for true multihost operation.
>>
>
>You seem to want a unique id in the 'per device' structure which will
>identify the 'volume'.  That doesn't make sense to me, so maybe I am
>missing something.  If you want to identify the 'volume', you put some
>sort of id in the 'per-volume' data structure.
>
>This is what the 'name' field is for.
>

This is useful, at least in the current raid implementation, because
md_import can be changed to return an error if the device's unique
identifier doesn't match the host identifier.  In this way, each device
of a RAID volume is individually locked to a specific host, and the
rejection occurs at device import time.

Perhaps locking using the name field would work, except that other
userspace applications may reuse that name field for some other purpose,
not providing any kind of uniqueness.

Thanks for the explanation of how the name field was intended to be used.

-steve

^ permalink raw reply	[flat|nested] 47+ messages in thread
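The import-time rejection Steven describes is not shown anywhere in the thread; as a rough, self-contained approximation (plain user-space C rather than md kernel code, with a hypothetical per-device owner field that is not in Neil's layout), the check might look like this:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in per-device superblock; device_owner_id is hypothetical. */
struct example_sb {
	uint64_t device_owner_id;
};

/* Return 0 if this host may import the device, -EPERM otherwise.
 * An owner id of 0 means the device is not locked to any host. */
static int check_device_owner(const struct example_sb *sb,
			      uint64_t local_host_id)
{
	if (sb->device_owner_id != 0 &&
	    sb->device_owner_id != local_host_id)
		return -EPERM;
	return 0;
}

int main(void)
{
	struct example_sb sb = { .device_owner_id = 0x1234 };

	printf("import by 0x1234: %d\n", check_device_owner(&sb, 0x1234));
	printf("import by 0x9999: %d\n", check_device_owner(&sb, 0x9999));
	return 0;
}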
* Re: RFC - new raid superblock layout for md driver
  2002-11-21  0:29 ` Steven Dake
@ 2002-11-21 15:23 ` John Stoffel
  0 siblings, 0 replies; 47+ messages in thread
From: John Stoffel @ 2002-11-21 15:23 UTC (permalink / raw)
  To: Steven Dake; +Cc: Neil Brown, linux-kernel, linux-raid

Steven> This is useful, at least in the current raid implementation,
Steven> because md_import can be changed to return an error if the
Steven> device's unique identifier doesn't match the host identifier.
Steven> In this way, each device of a RAID volume is individually
Steven> locked to a specific host, and the rejection occurs at device
Steven> import time.

This is a key issue on SANs as well.  I think that having the host's
UUID in the RAID superblock will allow rejection to happen gracefully.
If needed, the user-land tools can have a --force option.

Steven> Perhaps locking using the name field would work, except that
Steven> other userspace applications may reuse that name field for
Steven> some other purpose, not providing any kind of uniqueness.

I think there need to be two fields: a UUID field for the host owning
the RAID superblocks, and then a name field, so that the host, along
with any other systems which can *view* the RAID superblock, can know
the user-defined name.

John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-399-0479

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver
  2002-11-20 17:05 ` Steven Dake
  2002-11-20 23:30 ` Lars Marowsky-Bree
  2002-11-20 23:48 ` Neil Brown
@ 2002-11-21 19:36 ` Joel Becker
  2 siblings, 0 replies; 47+ messages in thread
From: Joel Becker @ 2002-11-21 19:36 UTC (permalink / raw)
  To: Steven Dake; +Cc: Neil Brown, linux-kernel, linux-raid

On Wed, Nov 20, 2002 at 10:05:29AM -0700, Steven Dake wrote:
> per-device structure.  This would allow a RAID volume to be locked to a
> specific host, allowing the ability for true multihost operation.

	Locking to a specific host isn't the only thing to do though.
Allowing multiple hosts to share the disk is quite interesting as well.

Joel

-- 
The zen have a saying:
	"When you learn how to listen, ANYONE can be your teacher."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: RFC - new raid superblock layout for md driver
  2002-11-20  4:09 Neil Brown
             ` (4 preceding siblings ...)
  2002-11-20 17:05 ` Steven Dake
@ 2002-11-22  7:11 ` Jeremy Fitzhardinge
  5 siblings, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2002-11-22 7:11 UTC (permalink / raw)
  To: Neil Brown; +Cc: Linux Kernel List, linux-raid

On Tue, 2002-11-19 at 20:09, Neil Brown wrote:
> My current design looks like:
>  /* constant array information - 128 bytes */
>  u32 md_magic
>  u32 major_version == 1
>  u32 feature_map   /* bit map of extra features in superblock */
>  u32 set_uuid[4]
>  u32 ctime
>  u32 level
>  u32 layout
>  u64 size          /* size of component devices, if they are all
>                     * required to be the same (Raid 1/5) */

Can you make 64 bit fields 64 bit aligned?  I think PPC will lay this
structure out with padding before "size", which may well cause confusion.

If your routines to load and save the header don't depend on structure
layout, then it doesn't matter.

	J

^ permalink raw reply	[flat|nested] 47+ messages in thread
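Jeremy's point can be demonstrated with a small stand-alone C program. The structures below are stand-ins, not the proposed superblock; they simply show how the offset of a u64 placed after an odd number of u32s can differ between ABIs (typically 12 on i386, 16 on PPC and most 64-bit ABIs) unless explicit padding, or grouping the u64s first, pins the layout down.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* A u64 after an odd number of u32s: the compiler may or may not
 * insert 4 bytes of padding before it, depending on the ABI, so the
 * on-disk offsets would differ between architectures. */
struct implicit_pad {
	u32 a, b, c;
	u64 size;
};

/* Making the padding explicit removes the ambiguity everywhere. */
struct explicit_pad {
	u32 a, b, c;
	u32 pad0;	/* keeps "size" 64-bit aligned on all ABIs */
	u64 size;
};

int main(void)
{
	printf("implicit: size at offset %zu\n",
	       offsetof(struct implicit_pad, size));
	printf("explicit: size at offset %zu\n",
	       offsetof(struct explicit_pad, size));
	return 0;
}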
end of thread, other threads:[~2002-12-11  0:07 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-20 15:55 RFC - new raid superblock layout for md driver Steve Pratt
2002-11-20 23:24 ` Neil Brown
  -- strict thread matches above, loose matches on Subject: below --
2002-11-20 23:47 Lars Marowsky-Bree
2002-11-21  0:31 ` Neil Brown
2002-11-21  0:35 ` Steven Dake
2002-11-21  1:10 ` Alan Cox
2002-12-08 22:35 ` Neil Brown
2002-11-21 19:39 ` Joel Becker
2002-11-20  4:09 Neil Brown
2002-11-20 10:03 ` Anton Altaparmakov
2002-11-20 23:02 ` Neil Brown
2002-11-22  0:08 ` Kenneth D. Merry
2002-12-09  3:52 ` Neil Brown
2002-12-10  6:28 ` Kenneth D. Merry
2002-12-11  0:07 ` Neil Brown
2002-11-20 13:58 ` Bill Rugolsky Jr.
2002-11-20 23:17 ` Neil Brown
2002-11-20 14:09 ` Alan Cox
2002-11-20 23:11 ` Neil Brown
2002-11-21  0:30 ` Alan Cox
2002-11-21  0:10 ` John Adams
2002-11-21  0:30 ` Alan Cox
2002-11-20 16:03 ` Joel Becker
2002-11-20 23:31 ` Neil Brown
2002-11-21  1:46 ` Doug Ledford
2002-11-21 19:34 ` Joel Becker
2002-11-21 19:54 ` Doug Ledford
2002-11-21 19:57 ` Steven Dake
2002-11-21 20:38 ` Doug Ledford
2002-11-21 20:49 ` Steven Dake
2002-11-21 20:35 ` Kevin Corry
2002-11-21 21:29 ` Alan Cox
2002-11-21 21:22 ` Doug Ledford
2002-11-21 20:53 ` Kevin Corry
2002-11-21 21:55 ` Doug Ledford
2002-11-21 20:06 ` Joel Becker
2002-11-21 23:35 ` Luca Berra
2002-11-22 10:13 ` Joe Thornber
2002-12-02 21:38 ` Neil Brown
2002-12-03  8:24 ` Luca Berra
2002-11-20 17:05 ` Steven Dake
2002-11-20 23:30 ` Lars Marowsky-Bree
2002-11-20 23:48 ` Neil Brown
2002-11-21  0:29 ` Steven Dake
2002-11-21 15:23 ` John Stoffel
2002-11-21 19:36 ` Joel Becker
2002-11-22  7:11 ` Jeremy Fitzhardinge