* Trouble increasing md component size
@ 2008-06-18 18:26 Chris Webb
2008-06-18 19:22 ` Peter Rabbitson
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-18 18:26 UTC (permalink / raw)
To: linux-raid
Hi. I'm having some trouble increasing the component size of arrays to match
changes in the underlying block devices. I'm running mdadm 2.6.7 against a
stock 2.6.24.4 kernel.
I'm seeing the problem in a cluster management application I'm writing, but I
can reproduce it simply using two LV block devices, 150M each:
2# blockdev --getsz /dev/disk.4/{one,two}
311296
311296
I build an array that only fills two thirds of the component disks:
2# mdadm --create /dev/md4 --level=1 --raid-disks=2 --size 102400 \
--metadata=1.2 /dev/disk.4/{one,two}
mdadm: largest drive (/dev/disk.4/one) exceed size (102400K) by more than 1%
Continue creating array? y
mdadm: array /dev/md4 started.
2# blockdev --getsz /dev/md4
204800
When the array is fully synced and up [UU], I try to grow the array to use
more of the component disks:
2# mdadm --grow /dev/md4 --size max
2# blockdev --getsz /dev/md4
311272
and so this has worked fine.
However, if I increase the size of the underlying block devices:
2# lvm lvresize -L 200M /dev/disk.4/one
Extending logical volume one to 200.00 MB
Logical volume one successfully resized
2# lvm lvresize -L 200M /dev/disk.4/two
Extending logical volume two to 200.00 MB
Logical volume two successfully resized
2# blockdev --getsz /dev/disk.4/{one,two}
409600
409600
and then try to grow the array:
2# mdadm --grow /dev/md4 --size max
2# blockdev --getsz /dev/md4
311272
2# mdadm --grow /dev/md4 --size 155637 just 1kB more
mdadm: Cannot set device size for /dev/md4: No space left on device
it fails. Somehow, the change in the underlying component device sizes doesn't
seem to be noticed even though the kernel's idea (as returned by blockdev/ioctl
BLKGETSIZE[64]) has correctly changed. I can even remove and re-add both drives
in turn without it noticing the size change and allowing me to grow the array.
Trying to do something like
echo 155637 > /sys/block/md4/md/component_size
doesn't work either. It'll accept values up as far as the size of the block
device when the array was created, but no larger.
The only recipe I've found which works is, for each drive in turn, to fail it,
remove it, zero the superblock and then re-add it. This seems to mean the md
driver sees is as a new component (e.g. different desc_nr in /proc/mdstat) and
therefore re-reads its size? This isn't at all friendly to the raid array,
though. While each device is out, any redundancy has gone. Is there some less
drastic way to persuade the array to grow?
Cheers,
Chris.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-18 18:26 Trouble increasing md component size Chris Webb
@ 2008-06-18 19:22 ` Peter Rabbitson
2008-06-18 20:00 ` Chris Webb
0 siblings, 1 reply; 18+ messages in thread
From: Peter Rabbitson @ 2008-06-18 19:22 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-raid
Chris Webb wrote:
> Hi. I'm having some trouble increasing the component size of arrays to match
> changes in the underlying block devices. I'm running mdadm 2.6.7 against a
> stock 2.6.24.4 kernel.
>
> I'm seeing the problem in a cluster management application I'm writing, but I
> can reproduce it simply using two LV block devices, 150M each:
>
> 2# blockdev --getsz /dev/disk.4/{one,two}
> 311296
> 311296
>
> I build an array that only fills two thirds of the component disks:
>
> 2# mdadm --create /dev/md4 --level=1 --raid-disks=2 --size 102400 \
> --metadata=1.2 /dev/disk.4/{one,two}
> mdadm: largest drive (/dev/disk.4/one) exceed size (102400K) by more than 1%
> Continue creating array? y
> mdadm: array /dev/md4 started.
> 2# blockdev --getsz /dev/md4
> 204800
>
> When the array is fully synced and up [UU], I try to grow the array to use
> more of the component disks:
>
> 2# mdadm --grow /dev/md4 --size max
> 2# blockdev --getsz /dev/md4
> 311272
>
> and so this has worked fine.
>
> However, if I increase the size of the underlying block devices:
>
> 2# lvm lvresize -L 200M /dev/disk.4/one
> Extending logical volume one to 200.00 MB
> Logical volume one successfully resized
> 2# lvm lvresize -L 200M /dev/disk.4/two
> Extending logical volume two to 200.00 MB
> Logical volume two successfully resized
> 2# blockdev --getsz /dev/disk.4/{one,two}
> 409600
> 409600
>
> and then try to grow the array:
>
> 2# mdadm --grow /dev/md4 --size max
> 2# blockdev --getsz /dev/md4
> 311272
> 2# mdadm --grow /dev/md4 --size 155637 just 1kB more
> mdadm: Cannot set device size for /dev/md4: No space left on device
>
> it fails. Somehow, the change in the underlying component device sizes doesn't
> seem to be noticed even though the kernel's idea (as returned by blockdev/ioctl
> BLKGETSIZE[64]) has correctly changed. I can even remove and re-add both drives
> in turn without it noticing the size change and allowing me to grow the array.
>
> Trying to do something like
>
> echo 155637 > /sys/block/md4/md/component_size
>
> doesn't work either. It'll accept values up as far as the size of the block
> device when the array was created, but no larger.
>
> The only recipe I've found which works is, for each drive in turn, to fail it,
> remove it, zero the superblock and then re-add it. This seems to mean the md
> driver sees is as a new component (e.g. different desc_nr in /proc/mdstat) and
> therefore re-reads its size? This isn't at all friendly to the raid array,
> though. While each device is out, any redundancy has gone. Is there some less
> drastic way to persuade the array to grow?
>
Hi Chris,
I had exactly the same problem some months ago. The key is in the (obscure)
assembly option --update=devicesize. Find it in the mdadm manpage and the
whole picture will come together instantly. The wisdom is not very widespread
as this affects only V1.1 and 1.2 superblocks.
Cheers
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-18 19:22 ` Peter Rabbitson
@ 2008-06-18 20:00 ` Chris Webb
2008-06-19 3:42 ` Neil Brown
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-18 20:00 UTC (permalink / raw)
To: Peter Rabbitson; +Cc: linux-raid
Peter Rabbitson <rabbit+list@rabbit.us> writes:
> I had exactly the same problem some months ago. The key is in the
> (obscure) assembly option --update=devicesize. Find it in the mdadm
> manpage and the whole picture will come together instantly. The wisdom is
> not very widespread as this affects only V1.1 and 1.2 superblocks.
Thanks, that's great. I've been using a man page slightly older (2.5.6) than
the mdadm binary I'm using (2.6.7), which doesn't mention update=devicesize,
although I'd probably have missed it anyway!
However, is there a way to do this without stopping the array first? I'm
using it as an LVM PV and slicing it up into logical volumes which clients
will be running off while I'm wanting to grow it.
Cheers,
Chris.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-18 20:00 ` Chris Webb
@ 2008-06-19 3:42 ` Neil Brown
2008-06-19 15:45 ` Chris Webb
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-06-19 3:42 UTC (permalink / raw)
To: Chris Webb; +Cc: Peter Rabbitson, linux-raid
On Wednesday June 18, chris@arachsys.com wrote:
> Peter Rabbitson <rabbit+list@rabbit.us> writes:
>
> > I had exactly the same problem some months ago. The key is in the
> > (obscure) assembly option --update=devicesize. Find it in the mdadm
> > manpage and the whole picture will come together instantly. The wisdom is
> > not very widespread as this affects only V1.1 and 1.2 superblocks.
>
> Thanks, that's great. I've been using a man page slightly older (2.5.6) than
> the mdadm binary I'm using (2.6.7), which doesn't mention update=devicesize,
> although I'd probably have missed it anyway!
>
> However, is there a way to do this without stopping the array first? I'm
> using it as an LVM PV and slicing it up into logical volumes which clients
> will be running off while I'm wanting to grow it.
Not currently, no.
It would require changing the 'size' as known by md of each device
first. i.e. /sys/block/mdX/md/dev-*/size.
However this currently cannot be changed while the array is active.
I cannot immediately see a good reason why not. It may be as simple
as removing
if (my_mddev->pers)
return -EBUSY;
from the start of rdev_size_store, but I wouldn't recommend
experimenting with that on a production system....
Then you would need some code to update the device sizes to match
their real size (but only if the metadata is at the start of the
device .. or maybe relocating the metadata first ...)
Then you could "--grow --size=max" happily.
I'd put this on my todo list if it wasn't so long already :-)
(Well, I've put it there anyway, but no promises).
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-19 3:42 ` Neil Brown
@ 2008-06-19 15:45 ` Chris Webb
2008-06-19 23:10 ` Chris Webb
2008-06-19 23:49 ` Chris Webb
0 siblings, 2 replies; 18+ messages in thread
From: Chris Webb @ 2008-06-19 15:45 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2361 bytes --]
Hi. Thanks for the helpful pointers.
Neil Brown <neilb@suse.de> writes:
> I'd put this on my todo list if it wasn't so long already :-)
> (Well, I've put it there anyway, but no promises).
I'll have a stab at implementing this if you like: it's only fair as I'm the
one who wants the feature!
> It would require changing the 'size' as known by md of each device
> first. i.e. /sys/block/mdX/md/dev-*/size.
> However this currently cannot be changed while the array is active.
>
> I cannot immediately see a good reason why not. It may be as simple
> as removing
>
> if (my_mddev->pers)
> return -EBUSY;
>
> from the start of rdev_size_store, but I wouldn't recommend
> experimenting with that on a production system....
Looking through the kernel md.c driver, I can't see anywhere that will
obviously break if rdev->size increases on a running array; it's used very
rarely. Given this, I've tried taking this check out in a scratch virtual
machine, and it works nicely: it's sufficient to echo the new size into the
sysfs file exactly as you indicate, then --grow --size=max works correctly.
I added some checks that the new value of rdev->size won't allow the data to
overlap the superblock, with rdev_size_store truncating the size value if
it is too large for the device, and filling the device if size = 0 is passed
in.
The only glitch is that the available dev size doesn't get updated in the
metadata, so next time the device is assembled, it fails to add the device
with an invalid argument error. Calling --assemble with --update=devicesize
fixes this of course.
I guess that ideally rdev_size_store should write out the changed rdev->size
to the metadata when it's called on a running array, in a similar manner to
size_store for component_size? If the superblock is version 1.x, I'm already
updating the data_size in the superblock ready to write out:
if (sb->major_version == 1) {
((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2);
/* TODO: write superblock out */
}
However, I'm not sure what the correct call to get the superblock written
out at this point would be. I tried
md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
rdev->sb_page);
after updating sb->data_size as above, but that doesn't seem to do the right
thing.
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.patch --]
[-- Type: text/plain, Size: 1168 bytes --]
--- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4-mdplay/drivers/md/md.c 2008-06-19 16:39:29.000000000 +0100
@@ -1946,8 +1946,21 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ if (rdev->sb_offset > rdev->data_offset/2) {
+ if (!size || size > rdev->sb_offset - rdev->data_offset/2)
+ size = rdev->sb_offset - rdev->data_offset/2;
+ } else if (!size || 2*size + rdev->data_offset > rdev->bdev->bd_inode->i_size >> 9)
+ size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ if (sb->major_version == 1) {
+ ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ /* TODO: write superblock out */
+ }
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-19 15:45 ` Chris Webb
@ 2008-06-19 23:10 ` Chris Webb
2008-06-19 23:49 ` Chris Webb
1 sibling, 0 replies; 18+ messages in thread
From: Chris Webb @ 2008-06-19 23:10 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Chris Webb <chris@arachsys.com> writes:
> I guess that ideally rdev_size_store should write out the changed rdev->size
> to the metadata when it's called on a running array, in a similar manner to
> size_store for component_size? If the superblock is version 1.x, I'm already
> updating the data_size in the superblock ready to write out:
(Also, once I get the superblock writing out okay, I can presumably support
metadata format 1.0 too, updating the sb_offset and writing a new copy at
the end of the disk.)
Cheers,
Chris.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-19 15:45 ` Chris Webb
2008-06-19 23:10 ` Chris Webb
@ 2008-06-19 23:49 ` Chris Webb
2008-06-20 11:13 ` Chris Webb
1 sibling, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-19 23:49 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 530 bytes --]
Chris Webb <chris@arachsys.com> writes:
> However, I'm not sure what the correct call to get the superblock written
> out at this point would be. I tried
>
> md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
> rdev->sb_page);
>
> after updating sb->data_size as above, but that doesn't seem to do the right
> thing.
It works without the typo!
> + ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2);
should read size*2 rather than rdev->size*2.
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.patch --]
[-- Type: text/plain, Size: 1125 bytes --]
--- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4-mdplay/drivers/md/md.c 2008-06-20 00:44:50.000000000 +0100
@@ -1946,8 +1946,20 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ if (rdev->sb_offset > rdev->data_offset/2) {
+ if (!size || size > rdev->sb_offset - rdev->data_offset/2)
+ size = rdev->sb_offset - rdev->data_offset/2;
+ } else if (!size || 2*size + rdev->data_offset > rdev->bdev->bd_inode->i_size >> 9)
+ size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ if (sb->major_version == 1) {
+ ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(size*2);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ }
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-19 23:49 ` Chris Webb
@ 2008-06-20 11:13 ` Chris Webb
2008-06-20 14:24 ` Chris Webb
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-20 11:13 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2057 bytes --]
Here's an updated, cleaned up version of my patch. Since the action is
superblock version-specific, it now indirects through super_types[] as other
version-specific parts of md.c do. At present it allows the rdev->size to be
updated through sysfs for metadata types 1.1 and 1.2, making all the correct
checks on the supplied value. It also supports growing rdev->size as large
as possible for the block device using echo "0" >/sys/block/mdX/md/rdY/size.
If you try to extend the size of 1.0 and 0.90 devices so that the data area
would overlap the superblock, rdev_size_store just returns EBUSY, the same
behaviour as previously. I'd like to support metadata types 0.90 and 1.0
too, relocating the superblock. I think it'd be sufficient to write a new
one at the end of the device then update rdev->sb_offset. I'll try this next
week.
I'm a bit puzzled about what to do if there's an internal bitmap. I can
ignore the issue at present because I only handle 1.1 and 1.2 metadata,
where any bitmap is stored before the start of the data area.
For metadata at the end of the disk, the bitmap will either be just before
or just after the superblock, and so I'll presumably need to move it to the
new end of the device along with the superblock. However, it looks at first
glance like the bitmap offset is constant across the array, not a per-device
setting. This suggests the bitmap has to move when the array is grown, not
earlier when rdev->size changes.
Unfortunately, that doesn't square with rdev->size being changed for
individual disks first! I can't move the bitmap without changing rdev->size
on all disks, but I can't increase rdev->size to a valid value on any disk
without moving the bitmap at the same point. Either I'm missing something,
or changing rdev->size online, drive-by-drive won't be possible if there's
an internal bitmap on a 0.90 or 1.0 format array.
I'm unlikely to get a chance to work on this again before Monday, but
I'd be very grateful for any feedback or pointers. md.c is a fairly
involved piece of code!
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.2.patch --]
[-- Type: text/plain, Size: 3633 bytes --]
--- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 12:08:01.000000000 +0100
@@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup
*/
struct super_type {
- char *name;
- struct module *owner;
- int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
- int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
- void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ char *name;
+ struct module *owner;
+ int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
+ int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size);
};
/*
@@ -994,6 +995,19 @@ static void super_90_sync(mddev_t *mddev
}
/*
+ * rdev_size_change for 0.90.0
+ */
+static unsigned long long
+super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ /* TODO: relocate sb to use full device */
+ if (!size || size > rdev->sb_offset - rdev->data_offset/2)
+ size = rdev->sb_offset - rdev->data_offset/2;
+ return size;
+}
+
+
+/*
* version 1 superblock
*/
@@ -1310,21 +1324,52 @@ static void super_1_sync(mddev_t *mddev,
sb->sb_csum = calc_sb_1_csum(sb);
}
+static unsigned long long
+super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ struct mdp_superblock_1 *sb;
+ unsigned long long max_size;
+ sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ if (rdev->sb_offset < rdev->data_offset/2) {
+ /* minor versions 1 and 2; superblock before data */
+ max_size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2;
+ if (!size || size > max_size)
+ size = max_size;
+ } else {
+ /* minor version 0; superblock after data */
+ if (rdev->mddev->bitmap_offset < 0) {
+ /* don't overlap data with bitmap */
+ max_size = rdev->sb_offset*2 + rdev->mddev->bitmap_offset;
+ max_size = (max_size - rdev->data_offset)/2;
+ } else
+ max_size = rdev->sb_offset - rdev->data_offset/2;
+ /* TODO: relocate sb to use full device */
+ if (!size || size > max_size/2)
+ size = max_size/2;
+ }
+ sb->data_size = cpu_to_le64(size*2);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ return size;
+}
+
static struct super_type super_types[] = {
[0] = {
.name = "0.90.0",
.owner = THIS_MODULE,
- .load_super = super_90_load,
- .validate_super = super_90_validate,
- .sync_super = super_90_sync,
+ .load_super = super_90_load,
+ .validate_super = super_90_validate,
+ .sync_super = super_90_sync,
+ .rdev_size_change = super_90_rdev_size_change,
},
[1] = {
.name = "md-1",
.owner = THIS_MODULE,
- .load_super = super_1_load,
- .validate_super = super_1_validate,
- .sync_super = super_1_sync,
+ .load_super = super_1_load,
+ .validate_super = super_1_validate,
+ .sync_super = super_1_sync,
+ .rdev_size_change = super_1_rdev_size_change,
},
};
@@ -1946,8 +1991,13 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ size = super_types[sb->major_version].rdev_size_change(rdev, size);
+ if (!size)
+ return -EBUSY;
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-20 11:13 ` Chris Webb
@ 2008-06-20 14:24 ` Chris Webb
2008-06-23 1:26 ` Neil Brown
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-20 14:24 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 546 bytes --]
Chris Webb <chris@arachsys.com> writes:
> I'd like to support metadata types 0.90 and 1.0 too, relocating the
> superblock. I think it'd be sufficient to write a new one at the end of
> the device then update rdev->sb_offset. I'll try this next week.
I've had a quick stab at this before I leave along with fixing a couple of
other oversights. (I just return EBUSY if there's a bitmap present in 0.90
and 1.0.) It appears to work correctly for 0.90 but generates a corrupt
superblock for 1.0, which I'll need to fix on Monday.
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.2.patch --]
[-- Type: text/plain, Size: 3721 bytes --]
--- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 15:02:01.000000000 +0100
@@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup
*/
struct super_type {
- char *name;
- struct module *owner;
- int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
- int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
- void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ char *name;
+ struct module *owner;
+ int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
+ int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size);
};
/*
@@ -994,6 +995,23 @@ static void super_90_sync(mddev_t *mddev
}
/*
+ * rdev_size_change for 0.90.0
+ */
+static unsigned long long
+super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
+ if (!size || size > rdev->sb_offset)
+ size = rdev->sb_offset;
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ return size;
+}
+
+
+/*
* version 1 superblock
*/
@@ -1310,21 +1328,49 @@ static void super_1_sync(mddev_t *mddev,
sb->sb_csum = calc_sb_1_csum(sb);
}
+static unsigned long long
+super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ struct mdp_superblock_1 *sb;
+ sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ if (rdev->sb_offset < rdev->data_offset/2) {
+ /* minor versions 1 and 2; superblock before data */
+ unsigned long long max_size;
+ max_size = (rdev->bdev->bd_inode->i_size >> 10) - rdev->data_offset/2;
+ if (!size || size > max_size)
+ size = max_size;
+ } else {
+ /* minor version 0; superblock after data */
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8;
+ rdev->sb_offset &= ~(sector_t)(4 - 1);
+ if (!size || size > rdev->sb_offset - rdev->data_offset/2)
+ size = rdev->sb_offset - rdev->data_offset/2;
+ }
+ sb->data_size = cpu_to_le64(size*2);
+ sb->sb_csum = calc_sb_1_csum(sb);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ return size;
+}
static struct super_type super_types[] = {
[0] = {
.name = "0.90.0",
.owner = THIS_MODULE,
- .load_super = super_90_load,
- .validate_super = super_90_validate,
- .sync_super = super_90_sync,
+ .load_super = super_90_load,
+ .validate_super = super_90_validate,
+ .sync_super = super_90_sync,
+ .rdev_size_change = super_90_rdev_size_change,
},
[1] = {
.name = "md-1",
.owner = THIS_MODULE,
- .load_super = super_1_load,
- .validate_super = super_1_validate,
- .sync_super = super_1_sync,
+ .load_super = super_1_load,
+ .validate_super = super_1_validate,
+ .sync_super = super_1_sync,
+ .rdev_size_change = super_1_rdev_size_change,
},
};
@@ -1946,8 +1992,13 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ size = super_types[sb->major_version].rdev_size_change(rdev, size);
+ if (!size)
+ return -EBUSY;
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-20 14:24 ` Chris Webb
@ 2008-06-23 1:26 ` Neil Brown
2008-06-23 11:18 ` Chris Webb
0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-06-23 1:26 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-raid
On Friday June 20, chris@arachsys.com wrote:
> Chris Webb <chris@arachsys.com> writes:
>
> > I'd like to support metadata types 0.90 and 1.0 too, relocating the
> > superblock. I think it'd be sufficient to write a new one at the end of
> > the device then update rdev->sb_offset. I'll try this next week.
>
> I've had a quick stab at this before I leave along with fixing a couple of
> other oversights. (I just return EBUSY if there's a bitmap present in 0.90
> and 1.0.) It appears to work correctly for 0.90 but generates a corrupt
> superblock for 1.0, which I'll need to fix on Monday.
Thanks for doing this!!!
- I am perfectly happy with getting -EBUSY if there is an internal
bitmap. It is quite easy to turn off the bitmap, resize the
devices, then turn it on again, and there is little cost in
doing this.
- I would much prefer using "sector" numbers rather than "K" numbers
in any new code. I'd eventually like to alway use sector numbers
internally, but that's a lower priority. So if you could change
the rdev_size_change methods to convert to sectors and then use
that I'd appreciate it.
- I think you really should call md_super_wait after md_super_write.
You really don't want the device to appear bigger until the
new metadata really is safe of disk.
- I think you need some protection to make sure that size doesn't get
set below my_mddev->size while the array is active. That would be
bad.
What I'd really like is for md to get a call-back when the device size
changes, so that the metadata can be relocated immediately. However
that is a little way off, and I think this is a useful thing to have
now.
Thanks,
NeilBrown
>
> Cheers,
>
> Chris.
> --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
> +++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 15:02:01.000000000 +0100
> @@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup
> */
>
> struct super_type {
> - char *name;
> - struct module *owner;
> - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
> - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
> - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
> + char *name;
> + struct module *owner;
> + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
> + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
> + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
> + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size);
> };
>
> /*
> @@ -994,6 +995,23 @@ static void super_90_sync(mddev_t *mddev
> }
>
> /*
> + * rdev_size_change for 0.90.0
> + */
> +static unsigned long long
> +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
> +{
> + if (rdev->mddev->bitmap_offset)
> + return 0; /* can't move bitmap */
> + rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
> + if (!size || size > rdev->sb_offset)
> + size = rdev->sb_offset;
> + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
> + rdev->sb_page);
> + return size;
> +}
> +
> +
> +/*
> * version 1 superblock
> */
>
> @@ -1310,21 +1328,49 @@ static void super_1_sync(mddev_t *mddev,
> sb->sb_csum = calc_sb_1_csum(sb);
> }
>
> +static unsigned long long
> +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
> +{
> + struct mdp_superblock_1 *sb;
> + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
> + if (rdev->sb_offset < rdev->data_offset/2) {
> + /* minor versions 1 and 2; superblock before data */
> + unsigned long long max_size;
> + max_size = (rdev->bdev->bd_inode->i_size >> 10) - rdev->data_offset/2;
> + if (!size || size > max_size)
> + size = max_size;
> + } else {
> + /* minor version 0; superblock after data */
> + if (rdev->mddev->bitmap_offset)
> + return 0; /* can't move bitmap */
> + rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8;
> + rdev->sb_offset &= ~(sector_t)(4 - 1);
> + if (!size || size > rdev->sb_offset - rdev->data_offset/2)
> + size = rdev->sb_offset - rdev->data_offset/2;
> + }
> + sb->data_size = cpu_to_le64(size*2);
> + sb->sb_csum = calc_sb_1_csum(sb);
> + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
> + rdev->sb_page);
> + return size;
> +}
>
> static struct super_type super_types[] = {
> [0] = {
> .name = "0.90.0",
> .owner = THIS_MODULE,
> - .load_super = super_90_load,
> - .validate_super = super_90_validate,
> - .sync_super = super_90_sync,
> + .load_super = super_90_load,
> + .validate_super = super_90_validate,
> + .sync_super = super_90_sync,
> + .rdev_size_change = super_90_rdev_size_change,
> },
> [1] = {
> .name = "md-1",
> .owner = THIS_MODULE,
> - .load_super = super_1_load,
> - .validate_super = super_1_validate,
> - .sync_super = super_1_sync,
> + .load_super = super_1_load,
> + .validate_super = super_1_validate,
> + .sync_super = super_1_sync,
> + .rdev_size_change = super_1_rdev_size_change,
> },
> };
>
> @@ -1946,8 +1992,13 @@ rdev_size_store(mdk_rdev_t *rdev, const
> unsigned long long size = simple_strtoull(buf, &e, 10);
> if (e==buf || (*e && *e != '\n'))
> return -EINVAL;
> - if (rdev->mddev->pers)
> - return -EBUSY;
> + if (rdev->mddev->pers) {
> + mdp_super_t *sb;
> + sb = (mdp_super_t *) page_address(rdev->sb_page);
> + size = super_types[sb->major_version].rdev_size_change(rdev, size);
> + if (!size)
> + return -EBUSY;
> + }
> rdev->size = size;
> if (size < rdev->mddev->size || rdev->mddev->size == 0)
> rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-23 1:26 ` Neil Brown
@ 2008-06-23 11:18 ` Chris Webb
2008-06-23 22:53 ` Neil Brown
2008-06-24 2:40 ` Trouble increasing md component size Neil Brown
0 siblings, 2 replies; 18+ messages in thread
From: Chris Webb @ 2008-06-23 11:18 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2934 bytes --]
Hi Neil. Thanks for your advice on this patch.
Neil Brown <neilb@suse.de> writes:
> Thanks for doing this!!!
No problem at all; it's great to be able to help out with the md driver
which I use and rely upon so heavily!
> - I would much prefer using "sector" numbers rather than "K" numbers
> in any new code. I'd eventually like to alway use sector numbers
> internally, but that's a lower priority. So if you could change
> the rdev_size_change methods to convert to sectors and then use
> that I'd appreciate it.
I've done this, and commented the conversion so that it'll be easy to remove
if you later decide to change rdev->size to work with sectors instead of kB.
To be honest, one of the things I found most confusing reading the md driver
for the first time was the use of kilobyte blocks for some sizes, and
sectors for others. Migrating to sectors throughout would be great for
readability, especially with most other kernel code using 512-byte sectors.
It looks like quite a big job, though, and changing the sysfs interface is
presumably out unless you rename the files?
> - I think you really should call md_super_wait after md_super_write.
> You really don't want the device to appear bigger until the
> new metadata really is safe of disk.
Ah, thanks for pointing out this one. I wasn't aware that md_super_write
could return before the write had finished. Looking at the description of
md_super_wait, it looks like it might even have to retry, so the superblock
might never get written at all without it? I've added md_super_wait after
both md_super_write calls.
> - I think you need some protection to make sure that size doesn't get
> set below my_mddev->size while the array is active. That would be
> bad.
Definitely! I've put in the appropriate size guards for both metadata
formats. This morning I've also fixed the mysterious corruption of metadata
1.0 superblocks: turns out I forgot to update the sb->super_offset to match
rdev->sb_offset before writing out the superblock.
As far as I can tell from repeated testing on scratch arrays here, this
patch (attached) now behaves correctly and reliably for all supported
metadata formats.
> What I'd really like is for md to get a call-back when the device size
> changes, so that the metadata can be relocated immediately. However
> that is a little way off, and I think this is a useful thing to have
> now.
If it's easy to register for such a call-back (?), I think it would be
sufficient for the call-back to run that new rdev_size_change superblock
function as
super_types[sb->major_version].rdev_size_change(rdev, 0)
to update the rdev->size & superblock, and move the metadata if necessary.
For a shrink you probably want to resize before the block device changes
size rather than afterwards, although that's presumably not going to be
easy/possible to achieve for many block device changes.
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.3.patch --]
[-- Type: text/plain, Size: 4795 bytes --]
Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to
zero to fill all the available space on the underlying device.
Signed-off-by: Chris Webb <chris@arachsys.com>
---
md.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 81 insertions(+), 13 deletions(-)
--- linux-2.6.24.4.orig/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4/drivers/md/md.c 2008-06-23 10:29:34.000000000 +0100
@@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup
*/
struct super_type {
- char *name;
- struct module *owner;
- int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
- int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
- void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ char *name;
+ struct module *owner;
+ int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
+ int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size);
};
/*
@@ -994,6 +995,27 @@ static void super_90_sync(mddev_t *mddev
}
/*
+ * rdev_size_change for 0.90.0
+ */
+static unsigned long long
+super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
+ if (!size || size > rdev->sb_offset*2)
+ size = rdev->sb_offset*2;
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
+
+
+/*
* version 1 superblock
*/
@@ -1310,21 +1332,62 @@ static void super_1_sync(mddev_t *mddev,
sb->sb_csum = calc_sb_1_csum(sb);
}
+static unsigned long long
+super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ struct mdp_superblock_1 *sb;
+ unsigned long long max_size;
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ if (rdev->sb_offset < rdev->data_offset/2) {
+ /* minor versions 1 and 2; superblock before data */
+ max_size = (rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset;
+ if (!size || size > max_size)
+ size = max_size;
+ } else {
+ /* minor version 0; superblock after data */
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8;
+ rdev->sb_offset &= ~(sector_t)(4 - 1);
+ max_size = rdev->sb_offset*2 - rdev->data_offset;
+ /* space reservation for bitmap */
+ if (max_size - 64*2 >= 200*1024*1024*2)
+ max_size -= 128*2;
+ else if (max_size - 4*2 >= 8*1024*1024*2)
+ max_size -= 64*2;
+ else if (max_size >= 64*2)
+ max_size -= 4*2;
+ if (!size || size > max_size)
+ size = max_size;
+ }
+ sb->data_size = cpu_to_le64(size);
+ sb->super_offset = rdev->sb_offset*2;
+ sb->sb_csum = calc_sb_1_csum(sb);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
static struct super_type super_types[] = {
[0] = {
.name = "0.90.0",
.owner = THIS_MODULE,
- .load_super = super_90_load,
- .validate_super = super_90_validate,
- .sync_super = super_90_sync,
+ .load_super = super_90_load,
+ .validate_super = super_90_validate,
+ .sync_super = super_90_sync,
+ .rdev_size_change = super_90_rdev_size_change,
},
[1] = {
.name = "md-1",
.owner = THIS_MODULE,
- .load_super = super_1_load,
- .validate_super = super_1_validate,
- .sync_super = super_1_sync,
+ .load_super = super_1_load,
+ .validate_super = super_1_validate,
+ .sync_super = super_1_sync,
+ .rdev_size_change = super_1_rdev_size_change,
},
};
@@ -1946,8 +2009,13 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ size = super_types[sb->major_version].rdev_size_change(rdev, size);
+ if (!size)
+ return -EBUSY;
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-23 11:18 ` Chris Webb
@ 2008-06-23 22:53 ` Neil Brown
2008-06-24 11:47 ` Chris Webb
2008-06-24 2:40 ` Trouble increasing md component size Neil Brown
1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-06-23 22:53 UTC (permalink / raw)
To: Chris Webb; +Cc: Neil Brown, linux-raid
On Monday June 23, chris@arachsys.com wrote:
> Hi Neil. Thanks for your advice on this patch.
And thank you for fixing it up. It looks good.
I'll try to give is a proper review and do some testing myself in a
few days.
I'm a bit bothered about the code for leaving space for a bitmap for a
1.0 array. I'd rather avoid too much of this sort of 'policy' in the
kernel. We could just leave it out and expect user-space to write a
size that leaves room for a bitmap if that is needed, but maybe that
is going too far the other way.
I thought of just scaling the "space between end of data and start of
metadata" by the same amount that the device is scaled, rounding up to
the next sector. That feels "right" if it doesn't turn out to be too
clumsy.
Also, you add 4 lines greater than 80 columns :-) if you run
./scripts/checkpatch.pl name-of-patch-file
it will tell you these things.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-23 11:18 ` Chris Webb
2008-06-23 22:53 ` Neil Brown
@ 2008-06-24 2:40 ` Neil Brown
1 sibling, 0 replies; 18+ messages in thread
From: Neil Brown @ 2008-06-24 2:40 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-raid
On Monday June 23, chris@arachsys.com wrote:
> > What I'd really like is for md to get a call-back when the device size
> > changes, so that the metadata can be relocated immediately. However
> > that is a little way off, and I think this is a useful thing to have
> > now.
>
> If it's easy to register for such a call-back (?), I think it would be
> sufficient for the call-back to run that new rdev_size_change superblock
> function as
>
> super_types[sb->major_version].rdev_size_change(rdev, 0)
>
> to update the rdev->size & superblock, and move the metadata if necessary.
> For a shrink you probably want to resize before the block device changes
> size rather than afterwards, although that's presumably not going to be
> easy/possible to achieve for many block device changes.
I'd meant to respond to this bit in my first reply, but got
distracted.
There currently is no mechanism for registering callbacks. One day I
would like to create one.
The approach I have in mind involves leveraging the bd_claim/bd_holder
stuff.
Current when someone "claim"s a block_device, they give a unique (void *)
to identify them. My idea is to change that to be a struct with
defined contents.
e.g.
struct bd_holder {
struct block_dev_callback_operations *ops;
};
Where
struct block_dev_callback_operations {
int (*size_change_request)(struct block_dev *bdev, sector_t newsize);
void (*size_change_commit)(struct block_dev *bdev, sector_t newsize);
....
}
so if a blockdev wants to change it's size, and someone has claimed
it, it first calls
bdev->bd_holder->ops->size_change_request()
with the new size. If that fails, it has to give up.
If it succeeds, it makes the change, the calls ->size_change_commit.
I think dm and md are currently the only devices which spontaneously
change size, so they would be the first place to make these calls.
Possibly we could then get the partition management code to
allow size changes of active partitions if there was a
size_change_request that could be called and would return success.
There are quite a lot of places where bd_claim is called.
Filesystems claim the block device they use, md and dm and swap do as
well.
In the first instance, we could make the "ops" pointer "NULL" and get
the calling code to cope with that. Then one by one we could
introduce useful functionality.
I would then use these callbacks to also implement freeze_bdev.
It currently hunts through the mount table for a filesystem on the
bdev, and calls the s_op->write_super_lockfs method on that
filesystem. This is somewhat ugly. Doing a callback through the
bd_holder structure would be much more elegant.
The only difficult issue is locking. Exactly what lock should be
required when calling various block_dev_callback_operations?
The easiest would be to hold the bdev_lock spinlock. That would
be enough to make sure the holder doesn't disappear on us. But
there isn't much you can do under a spinlock. You certainly cannot
write new metadata to a device.
Maybe you could get by with ->trylock and ->unlock
block_dev_callback_operations which could be called under the
spinlock, and all other operations much be called with that lock
held. One would probably need to try writing code and see what falls
out.
And yes, with that in place, rdev_size_change(rdev, 0) would be very
close to what you want the size_change_commit to do.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-23 22:53 ` Neil Brown
@ 2008-06-24 11:47 ` Chris Webb
2008-06-24 23:19 ` Chris Webb
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-24 11:47 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 2986 bytes --]
Neil Brown <neilb@suse.de> writes:
> I'm a bit bothered about the code for leaving space for a bitmap for a
> 1.0 array. I'd rather avoid too much of this sort of 'policy' in the
> kernel.
You're right, it does seem a bit arbitrary. I'll take it out. Originally I
put it in there to avoid a discrepancy between the array you'd get if you
used mdadm to create (which hard-codes this policy), and one reshaped in the
kernel. That said, you get discrepancy anyway for v1.1 and v1.2 metadata,
where the bitmap reserved area is at the start of the array and can't grow.
Maybe the right thing to do is to copy the behaviour for v1.1 and v1.2
arrays, and keep the reserved area constant for v1.0 arrays too, rather than
attempting to scale it all? I've implemented this in the attached patch.
This also means that v1.0 arrays share the nice property of v0.9, v1.1 and
v1.2 arrays that, modulo chunk-size rounding, growing a device by some
amount always leads the same growth in the component_size (and hence it isn't
too awkward to calculate the resulting change in the available array size).
More generally, being a user-space consumer of md arrays which change size
isn't especially convenient at the moment. If, say, you use your md as an
lvm pv, and build it out of lower-level lvm volumes so you can change the
underlying device sizes, growing the top level isn't too painful. You can
use the 'maximum size' shortcut: increase the underlying devices, grow the
array to max size, then grow the top level structure to max size.
However, shrinking an md backed pv is quite a bit harder. You need to
calculate the array size for a given underlying device size in advance of
actually making the change, so you can shrink the top level pv before
resizing the array then the underlying devices. There's no defined route for
doing this calculation, either kernel or userspace. The only reasonable way
forward is to do everything relative to the existing sizes, comparing
against the values in sysfs and relying on the change to device sizes being
the same as the change to component sizes (allowing for chunk-size
rounding), then using a switch dispatching on RAID level and layout (for
raid10).
Even using mdadm create --size=X isn't especially easy because the size
you're specifying is the usable component size, whose connection to the
underlying device size is non-trivial: the metadata offset and size vary
depending on type, mdadm will automatically reserve space for a bitmap
according to a set of heuristics (which can't be overridden). You end up
having to know the details of mdadm's internal operation just to reliable
calculate the correct size to ask for to get the right component size.
I'd like to do something about this... but it's not yet clear to me what the
right thing would be to fit in tidily with the existing interface,
especially for the 'dry run' calculation.
> Also, you add 4 lines greater than 80 columns :-)
Oops. I've fixed these!
Best wishes,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.4.patch --]
[-- Type: text/plain, Size: 4816 bytes --]
From: Chris Webb <chris@arachsys.com>
Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to
zero to fill all the available space on the underlying device.
Signed-off-by: Chris Webb <chris@arachsys.com>
---
drivers/md/md.c | 94 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 81 insertions(+), 13 deletions(-)
diff -uNpr linux-2.6.24.4.orig/drivers/md/md.c linux-2.6.24.4/drivers/md/md.c
--- linux-2.6.24.4.orig/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000
+++ linux-2.6.24.4/drivers/md/md.c 2008-06-24 11:35:47.000000000 +0100
@@ -652,11 +652,14 @@ static unsigned int calc_sb_csum(mdp_sup
*/
struct super_type {
- char *name;
- struct module *owner;
- int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
- int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
- void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ char *name;
+ struct module *owner;
+ int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev,
+ int minor_version);
+ int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev,
+ unsigned long long size);
};
/*
@@ -994,6 +997,27 @@ static void super_90_sync(mddev_t *mddev
}
/*
+ * rdev_size_change for 0.90.0
+ */
+static unsigned long long
+super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
+ if (!size || size > rdev->sb_offset*2)
+ size = rdev->sb_offset*2;
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
+
+
+/*
* version 1 superblock
*/
@@ -1310,21 +1334,59 @@ static void super_1_sync(mddev_t *mddev,
sb->sb_csum = calc_sb_1_csum(sb);
}
+static unsigned long long
+super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ struct mdp_superblock_1 *sb;
+ unsigned long long max_size;
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ if (rdev->sb_offset < rdev->data_offset/2) {
+ /* minor versions 1 and 2; superblock before data */
+ max_size = (rdev->bdev->bd_inode->i_size >> 9);
+ max_size -= rdev->data_offset;
+ if (!size || size > max_size)
+ size = max_size;
+ } else if (rdev->mddev->bitmap_offset) {
+ /* minor version 0 with bitmap we can't move */
+ return 0;
+ } else {
+ /* minor version 0; superblock after data */
+ sector_t sb_offset;
+ sb_offset = (rdev->bdev->bd_inode->i_size >> 9) - 8*2;
+ sb_offset &= ~(sector_t)(4*2 - 1);
+ max_size = rdev->size*2 + sb_offset - rdev->sb_offset*2;
+ if (!size || size > max_size)
+ size = max_size;
+ rdev->sb_offset = sb_offset/2;
+ }
+ sb->data_size = cpu_to_le64(size);
+ sb->super_offset = rdev->sb_offset*2;
+ sb->sb_csum = calc_sb_1_csum(sb);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
static struct super_type super_types[] = {
[0] = {
.name = "0.90.0",
.owner = THIS_MODULE,
- .load_super = super_90_load,
- .validate_super = super_90_validate,
- .sync_super = super_90_sync,
+ .load_super = super_90_load,
+ .validate_super = super_90_validate,
+ .sync_super = super_90_sync,
+ .rdev_size_change = super_90_rdev_size_change,
},
[1] = {
.name = "md-1",
.owner = THIS_MODULE,
- .load_super = super_1_load,
- .validate_super = super_1_validate,
- .sync_super = super_1_sync,
+ .load_super = super_1_load,
+ .validate_super = super_1_validate,
+ .sync_super = super_1_sync,
+ .rdev_size_change = super_1_rdev_size_change,
},
};
@@ -1946,8 +2008,14 @@ rdev_size_store(mdk_rdev_t *rdev, const
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ mdp_super_t *sb;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ size = super_types[sb->major_version].
+ rdev_size_change(rdev, size);
+ if (!size)
+ return -EBUSY;
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size
2008-06-24 11:47 ` Chris Webb
@ 2008-06-24 23:19 ` Chris Webb
2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb
2008-10-07 12:40 ` [Resend] " Chris Webb
0 siblings, 2 replies; 18+ messages in thread
From: Chris Webb @ 2008-06-24 23:19 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 237 bytes --]
Chris Webb <chris@arachsys.com> writes:
> Oops.
One final (hopefully!) correction: my patch didn't do the quite right thing
in the case of externally managed metadata, where there's no persistent
superblock to update.
Cheers,
Chris.
[-- Attachment #2: linux-2.6.24.4-md-rdev-size.5.patch --]
[-- Type: text/plain, Size: 4819 bytes --]
From: Chris Webb <chris@arachsys.com>
Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.
Signed-off-by: Chris Webb <chris@arachsys.com>
---
drivers/md/md.c | 100 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 87 insertions(+), 13 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,11 +652,14 @@
*/
struct super_type {
- char *name;
- struct module *owner;
- int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version);
- int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
- void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ char *name;
+ struct module *owner;
+ int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev,
+ int minor_version);
+ int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev);
+ unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev,
+ unsigned long long size);
};
/*
@@ -992,6 +995,27 @@
sb->this_disk = sb->disks[rdev->desc_nr];
sb->sb_csum = calc_sb_csum(sb);
}
+
+/*
+ * rdev_size_change for 0.90.0
+ */
+static unsigned long long
+super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ if (rdev->mddev->bitmap_offset)
+ return 0; /* can't move bitmap */
+ rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
+ if (!size || size > rdev->sb_offset*2)
+ size = rdev->sb_offset*2;
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
+
/*
* version 1 superblock
@@ -1310,21 +1334,59 @@
sb->sb_csum = calc_sb_1_csum(sb);
}
+static unsigned long long
+super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size)
+{
+ struct mdp_superblock_1 *sb;
+ unsigned long long max_size;
+ if (size && size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ size *= 2; /* convert to sectors */
+ sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ if (rdev->sb_offset < rdev->data_offset/2) {
+ /* minor versions 1 and 2; superblock before data */
+ max_size = (rdev->bdev->bd_inode->i_size >> 9);
+ max_size -= rdev->data_offset;
+ if (!size || size > max_size)
+ size = max_size;
+ } else if (rdev->mddev->bitmap_offset) {
+ /* minor version 0 with bitmap we can't move */
+ return 0;
+ } else {
+ /* minor version 0; superblock after data */
+ sector_t sb_offset;
+ sb_offset = (rdev->bdev->bd_inode->i_size >> 9) - 8*2;
+ sb_offset &= ~(sector_t)(4*2 - 1);
+ max_size = rdev->size*2 + sb_offset - rdev->sb_offset*2;
+ if (!size || size > max_size)
+ size = max_size;
+ rdev->sb_offset = sb_offset/2;
+ }
+ sb->data_size = cpu_to_le64(size);
+ sb->super_offset = rdev->sb_offset*2;
+ sb->sb_csum = calc_sb_1_csum(sb);
+ md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size,
+ rdev->sb_page);
+ md_super_wait(rdev->mddev);
+ return size/2; /* kB for sysfs */
+}
static struct super_type super_types[] = {
[0] = {
.name = "0.90.0",
.owner = THIS_MODULE,
- .load_super = super_90_load,
- .validate_super = super_90_validate,
- .sync_super = super_90_sync,
+ .load_super = super_90_load,
+ .validate_super = super_90_validate,
+ .sync_super = super_90_sync,
+ .rdev_size_change = super_90_rdev_size_change,
},
[1] = {
.name = "md-1",
.owner = THIS_MODULE,
- .load_super = super_1_load,
- .validate_super = super_1_validate,
- .sync_super = super_1_sync,
+ .load_super = super_1_load,
+ .validate_super = super_1_validate,
+ .sync_super = super_1_sync,
+ .rdev_size_change = super_1_rdev_size_change,
},
};
@@ -1946,8 +2008,20 @@
unsigned long long size = simple_strtoull(buf, &e, 10);
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
- return -EBUSY;
+ if (rdev->mddev->pers) {
+ if (rdev->mddev->persistent) {
+ mdp_super_t *sb;
+ sb = (mdp_super_t *) page_address(rdev->sb_page);
+ size = super_types[sb->major_version].
+ rdev_size_change(rdev, size);
+ if (!size)
+ return -EBUSY;
+ } else if (!size) {
+ size = (rdev->bdev->bd_inode->i_size >> 10);
+ size -= rdev->data_offset/2;
+ } else if (size < rdev->mddev->size)
+ return 0; /* component must fit device */
+ }
rdev->size = size;
if (size < rdev->mddev->size || rdev->mddev->size == 0)
rdev->mddev->size = size;
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] md: Fix rdev_size_store with size = 0
2008-06-24 23:19 ` Chris Webb
@ 2008-09-17 18:11 ` Chris Webb
2008-10-07 12:40 ` [Resend] " Chris Webb
1 sibling, 0 replies; 18+ messages in thread
From: Chris Webb @ 2008-09-17 18:11 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Hi Neil. You accepted a patch from me back at the end of June which allowed
/sys/block/mdX/md/rdY/size to change on running arrays. It supported a value
of zero to fill all the space on the underlying device. However, a later
patch tidying up rdev_size_store broke the size = 0 functionality by testing
for size < my_mddev->size slightly too early. The following patch fixes
this.
Best wishes,
Chris.
From: Chris Webb <chris@arachsys.com>
Fix rdev_size_store with size = 0
Signed-off-by: Chris Webb <chris@arachsys.com>
diff --git a/drivers/md/md.c b/drivers/md/md.c
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2109,8 +2109,6 @@
if (strict_strtoull(buf, 10, &size) < 0)
return -EINVAL;
- if (size < my_mddev->size)
- return -EINVAL;
if (my_mddev->pers && rdev->raid_disk >= 0) {
if (my_mddev->persistent) {
size = super_types[my_mddev->major_version].
@@ -2121,9 +2119,9 @@
size = (rdev->bdev->bd_inode->i_size >> 10);
size -= rdev->data_offset/2;
}
- if (size < my_mddev->size)
- return -EINVAL; /* component must fit device */
- }
+ }
+ if (size < my_mddev->size)
+ return -EINVAL; /* component must fit device */
rdev->size = size;
if (size > oldsize && my_mddev->external) {
^ permalink raw reply [flat|nested] 18+ messages in thread
* [Resend] [PATCH] md: Fix rdev_size_store with size = 0
2008-06-24 23:19 ` Chris Webb
2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb
@ 2008-10-07 12:40 ` Chris Webb
2008-10-13 0:54 ` Neil Brown
1 sibling, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-10-07 12:40 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Hi Neil. You accepted a patch from me back at the end of June which allowed
/sys/block/mdX/md/rdY/size to change on running arrays. It supported a value
of zero to fill all the space on the underlying device. However, a later
patch tidying up rdev_size_store broke the size = 0 functionality by testing
for size < my_mddev->size slightly too early. The following patch fixes
this.
Best wishes,
Chris.
From: Chris Webb <chris@arachsys.com>
Fix rdev_size_store with size = 0
Signed-off-by: Chris Webb <chris@arachsys.com>
diff --git a/drivers/md/md.c b/drivers/md/md.c
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2109,8 +2109,6 @@
if (strict_strtoull(buf, 10, &size) < 0)
return -EINVAL;
- if (size < my_mddev->size)
- return -EINVAL;
if (my_mddev->pers && rdev->raid_disk >= 0) {
if (my_mddev->persistent) {
size = super_types[my_mddev->major_version].
@@ -2121,9 +2119,9 @@
size = (rdev->bdev->bd_inode->i_size >> 10);
size -= rdev->data_offset/2;
}
- if (size < my_mddev->size)
- return -EINVAL; /* component must fit device */
- }
+ }
+ if (size < my_mddev->size)
+ return -EINVAL; /* component must fit device */
rdev->size = size;
if (size > oldsize && my_mddev->external) {
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Resend] [PATCH] md: Fix rdev_size_store with size = 0
2008-10-07 12:40 ` [Resend] " Chris Webb
@ 2008-10-13 0:54 ` Neil Brown
0 siblings, 0 replies; 18+ messages in thread
From: Neil Brown @ 2008-10-13 0:54 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-raid
On Tuesday October 7, chris@arachsys.com wrote:
> Hi Neil. You accepted a patch from me back at the end of June which allowed
> /sys/block/mdX/md/rdY/size to change on running arrays. It supported a value
> of zero to fill all the space on the underlying device. However, a later
> patch tidying up rdev_size_store broke the size = 0 functionality by testing
> for size < my_mddev->size slightly too early. The following patch fixes
> this.
Thanks - and sorry for breaking this :-(
I'll send this upstream soon and probably aim it for -stable too.
Thanks
NeilBrown
>
> Best wishes,
>
> Chris.
>
>
> From: Chris Webb <chris@arachsys.com>
>
> Fix rdev_size_store with size = 0
>
> Signed-off-by: Chris Webb <chris@arachsys.com>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2109,8 +2109,6 @@
>
> if (strict_strtoull(buf, 10, &size) < 0)
> return -EINVAL;
> - if (size < my_mddev->size)
> - return -EINVAL;
> if (my_mddev->pers && rdev->raid_disk >= 0) {
> if (my_mddev->persistent) {
> size = super_types[my_mddev->major_version].
> @@ -2121,9 +2119,9 @@
> size = (rdev->bdev->bd_inode->i_size >> 10);
> size -= rdev->data_offset/2;
> }
> - if (size < my_mddev->size)
> - return -EINVAL; /* component must fit device */
> - }
> + }
> + if (size < my_mddev->size)
> + return -EINVAL; /* component must fit device */
>
> rdev->size = size;
> if (size > oldsize && my_mddev->external) {
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-10-13 0:54 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-18 18:26 Trouble increasing md component size Chris Webb
2008-06-18 19:22 ` Peter Rabbitson
2008-06-18 20:00 ` Chris Webb
2008-06-19 3:42 ` Neil Brown
2008-06-19 15:45 ` Chris Webb
2008-06-19 23:10 ` Chris Webb
2008-06-19 23:49 ` Chris Webb
2008-06-20 11:13 ` Chris Webb
2008-06-20 14:24 ` Chris Webb
2008-06-23 1:26 ` Neil Brown
2008-06-23 11:18 ` Chris Webb
2008-06-23 22:53 ` Neil Brown
2008-06-24 11:47 ` Chris Webb
2008-06-24 23:19 ` Chris Webb
2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb
2008-10-07 12:40 ` [Resend] " Chris Webb
2008-10-13 0:54 ` Neil Brown
2008-06-24 2:40 ` Trouble increasing md component size Neil Brown
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).