* Trouble increasing md component size
@ 2008-06-18 18:26 Chris Webb
2008-06-18 19:22 ` Peter Rabbitson
0 siblings, 1 reply; 18+ messages in thread
From: Chris Webb @ 2008-06-18 18:26 UTC (permalink / raw)
To: linux-raid
Hi. I'm having some trouble increasing the component size of arrays to match
changes in the underlying block devices. I'm running mdadm 2.6.7 against a
stock 2.6.24.4 kernel.
I'm seeing the problem in a cluster management application I'm writing, but I
can reproduce it simply using two LV block devices, 150M each:
2# blockdev --getsz /dev/disk.4/{one,two}
311296
311296
I build an array that only fills two thirds of the component disks:
2# mdadm --create /dev/md4 --level=1 --raid-disks=2 --size 102400 \
--metadata=1.2 /dev/disk.4/{one,two}
mdadm: largest drive (/dev/disk.4/one) exceed size (102400K) by more than 1%
Continue creating array? y
mdadm: array /dev/md4 started.
2# blockdev --getsz /dev/md4
204800
When the array is fully synced and up [UU], I try to grow the array to use
more of the component disks:
2# mdadm --grow /dev/md4 --size max
2# blockdev --getsz /dev/md4
311272
and so this has worked fine.
However, if I increase the size of the underlying block devices:
2# lvm lvresize -L 200M /dev/disk.4/one
Extending logical volume one to 200.00 MB
Logical volume one successfully resized
2# lvm lvresize -L 200M /dev/disk.4/two
Extending logical volume two to 200.00 MB
Logical volume two successfully resized
2# blockdev --getsz /dev/disk.4/{one,two}
409600
409600
and then try to grow the array:
2# mdadm --grow /dev/md4 --size max
2# blockdev --getsz /dev/md4
311272
2# mdadm --grow /dev/md4 --size 155637 just 1kB more
mdadm: Cannot set device size for /dev/md4: No space left on device
it fails. Somehow, the change in the underlying component device sizes doesn't
seem to be noticed even though the kernel's idea (as returned by blockdev/ioctl
BLKGETSIZE[64]) has correctly changed. I can even remove and re-add both drives
in turn without it noticing the size change and allowing me to grow the array.
Trying to do something like
echo 155637 > /sys/block/md4/md/component_size
doesn't work either. It'll accept values up as far as the size of the block
device when the array was created, but no larger.
The only recipe I've found which works is, for each drive in turn, to fail it,
remove it, zero the superblock and then re-add it. This seems to mean the md
driver sees is as a new component (e.g. different desc_nr in /proc/mdstat) and
therefore re-reads its size? This isn't at all friendly to the raid array,
though. While each device is out, any redundancy has gone. Is there some less
drastic way to persuade the array to grow?
Cheers,
Chris.
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: Trouble increasing md component size 2008-06-18 18:26 Trouble increasing md component size Chris Webb @ 2008-06-18 19:22 ` Peter Rabbitson 2008-06-18 20:00 ` Chris Webb 0 siblings, 1 reply; 18+ messages in thread From: Peter Rabbitson @ 2008-06-18 19:22 UTC (permalink / raw) To: Chris Webb; +Cc: linux-raid Chris Webb wrote: > Hi. I'm having some trouble increasing the component size of arrays to match > changes in the underlying block devices. I'm running mdadm 2.6.7 against a > stock 2.6.24.4 kernel. > > I'm seeing the problem in a cluster management application I'm writing, but I > can reproduce it simply using two LV block devices, 150M each: > > 2# blockdev --getsz /dev/disk.4/{one,two} > 311296 > 311296 > > I build an array that only fills two thirds of the component disks: > > 2# mdadm --create /dev/md4 --level=1 --raid-disks=2 --size 102400 \ > --metadata=1.2 /dev/disk.4/{one,two} > mdadm: largest drive (/dev/disk.4/one) exceed size (102400K) by more than 1% > Continue creating array? y > mdadm: array /dev/md4 started. > 2# blockdev --getsz /dev/md4 > 204800 > > When the array is fully synced and up [UU], I try to grow the array to use > more of the component disks: > > 2# mdadm --grow /dev/md4 --size max > 2# blockdev --getsz /dev/md4 > 311272 > > and so this has worked fine. > > However, if I increase the size of the underlying block devices: > > 2# lvm lvresize -L 200M /dev/disk.4/one > Extending logical volume one to 200.00 MB > Logical volume one successfully resized > 2# lvm lvresize -L 200M /dev/disk.4/two > Extending logical volume two to 200.00 MB > Logical volume two successfully resized > 2# blockdev --getsz /dev/disk.4/{one,two} > 409600 > 409600 > > and then try to grow the array: > > 2# mdadm --grow /dev/md4 --size max > 2# blockdev --getsz /dev/md4 > 311272 > 2# mdadm --grow /dev/md4 --size 155637 just 1kB more > mdadm: Cannot set device size for /dev/md4: No space left on device > > it fails. Somehow, the change in the underlying component device sizes doesn't > seem to be noticed even though the kernel's idea (as returned by blockdev/ioctl > BLKGETSIZE[64]) has correctly changed. I can even remove and re-add both drives > in turn without it noticing the size change and allowing me to grow the array. > > Trying to do something like > > echo 155637 > /sys/block/md4/md/component_size > > doesn't work either. It'll accept values up as far as the size of the block > device when the array was created, but no larger. > > The only recipe I've found which works is, for each drive in turn, to fail it, > remove it, zero the superblock and then re-add it. This seems to mean the md > driver sees is as a new component (e.g. different desc_nr in /proc/mdstat) and > therefore re-reads its size? This isn't at all friendly to the raid array, > though. While each device is out, any redundancy has gone. Is there some less > drastic way to persuade the array to grow? > Hi Chris, I had exactly the same problem some months ago. The key is in the (obscure) assembly option --update=devicesize. Find it in the mdadm manpage and the whole picture will come together instantly. The wisdom is not very widespread as this affects only V1.1 and 1.2 superblocks. Cheers ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-18 19:22 ` Peter Rabbitson @ 2008-06-18 20:00 ` Chris Webb 2008-06-19 3:42 ` Neil Brown 0 siblings, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-06-18 20:00 UTC (permalink / raw) To: Peter Rabbitson; +Cc: linux-raid Peter Rabbitson <rabbit+list@rabbit.us> writes: > I had exactly the same problem some months ago. The key is in the > (obscure) assembly option --update=devicesize. Find it in the mdadm > manpage and the whole picture will come together instantly. The wisdom is > not very widespread as this affects only V1.1 and 1.2 superblocks. Thanks, that's great. I've been using a man page slightly older (2.5.6) than the mdadm binary I'm using (2.6.7), which doesn't mention update=devicesize, although I'd probably have missed it anyway! However, is there a way to do this without stopping the array first? I'm using it as an LVM PV and slicing it up into logical volumes which clients will be running off while I'm wanting to grow it. Cheers, Chris. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-18 20:00 ` Chris Webb @ 2008-06-19 3:42 ` Neil Brown 2008-06-19 15:45 ` Chris Webb 0 siblings, 1 reply; 18+ messages in thread From: Neil Brown @ 2008-06-19 3:42 UTC (permalink / raw) To: Chris Webb; +Cc: Peter Rabbitson, linux-raid On Wednesday June 18, chris@arachsys.com wrote: > Peter Rabbitson <rabbit+list@rabbit.us> writes: > > > I had exactly the same problem some months ago. The key is in the > > (obscure) assembly option --update=devicesize. Find it in the mdadm > > manpage and the whole picture will come together instantly. The wisdom is > > not very widespread as this affects only V1.1 and 1.2 superblocks. > > Thanks, that's great. I've been using a man page slightly older (2.5.6) than > the mdadm binary I'm using (2.6.7), which doesn't mention update=devicesize, > although I'd probably have missed it anyway! > > However, is there a way to do this without stopping the array first? I'm > using it as an LVM PV and slicing it up into logical volumes which clients > will be running off while I'm wanting to grow it. Not currently, no. It would require changing the 'size' as known by md of each device first. i.e. /sys/block/mdX/md/dev-*/size. However this currently cannot be changed while the array is active. I cannot immediately see a good reason why not. It may be as simple as removing if (my_mddev->pers) return -EBUSY; from the start of rdev_size_store, but I wouldn't recommend experimenting with that on a production system.... Then you would need some code to update the device sizes to match their real size (but only if the metadata is at the start of the device .. or maybe relocating the metadata first ...) Then you could "--grow --size=max" happily. I'd put this on my todo list if it wasn't so long already :-) (Well, I've put it there anyway, but no promises). NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-19 3:42 ` Neil Brown @ 2008-06-19 15:45 ` Chris Webb 2008-06-19 23:10 ` Chris Webb 2008-06-19 23:49 ` Chris Webb 0 siblings, 2 replies; 18+ messages in thread From: Chris Webb @ 2008-06-19 15:45 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2361 bytes --] Hi. Thanks for the helpful pointers. Neil Brown <neilb@suse.de> writes: > I'd put this on my todo list if it wasn't so long already :-) > (Well, I've put it there anyway, but no promises). I'll have a stab at implementing this if you like: it's only fair as I'm the one who wants the feature! > It would require changing the 'size' as known by md of each device > first. i.e. /sys/block/mdX/md/dev-*/size. > However this currently cannot be changed while the array is active. > > I cannot immediately see a good reason why not. It may be as simple > as removing > > if (my_mddev->pers) > return -EBUSY; > > from the start of rdev_size_store, but I wouldn't recommend > experimenting with that on a production system.... Looking through the kernel md.c driver, I can't see anywhere that will obviously break if rdev->size increases on a running array; it's used very rarely. Given this, I've tried taking this check out in a scratch virtual machine, and it works nicely: it's sufficient to echo the new size into the sysfs file exactly as you indicate, then --grow --size=max works correctly. I added some checks that the new value of rdev->size won't allow the data to overlap the superblock, with rdev_size_store truncating the size value if it is too large for the device, and filling the device if size = 0 is passed in. The only glitch is that the available dev size doesn't get updated in the metadata, so next time the device is assembled, it fails to add the device with an invalid argument error. Calling --assemble with --update=devicesize fixes this of course. I guess that ideally rdev_size_store should write out the changed rdev->size to the metadata when it's called on a running array, in a similar manner to size_store for component_size? If the superblock is version 1.x, I'm already updating the data_size in the superblock ready to write out: if (sb->major_version == 1) { ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2); /* TODO: write superblock out */ } However, I'm not sure what the correct call to get the superblock written out at this point would be. I tried md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, rdev->sb_page); after updating sb->data_size as above, but that doesn't seem to do the right thing. Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.patch --] [-- Type: text/plain, Size: 1168 bytes --] --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4-mdplay/drivers/md/md.c 2008-06-19 16:39:29.000000000 +0100 @@ -1946,8 +1946,21 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + if (rdev->sb_offset > rdev->data_offset/2) { + if (!size || size > rdev->sb_offset - rdev->data_offset/2) + size = rdev->sb_offset - rdev->data_offset/2; + } else if (!size || 2*size + rdev->data_offset > rdev->bdev->bd_inode->i_size >> 9) + size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2; + sb = (mdp_super_t *) page_address(rdev->sb_page); + if (sb->major_version == 1) { + ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + /* TODO: write superblock out */ + } + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-19 15:45 ` Chris Webb @ 2008-06-19 23:10 ` Chris Webb 2008-06-19 23:49 ` Chris Webb 1 sibling, 0 replies; 18+ messages in thread From: Chris Webb @ 2008-06-19 23:10 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Chris Webb <chris@arachsys.com> writes: > I guess that ideally rdev_size_store should write out the changed rdev->size > to the metadata when it's called on a running array, in a similar manner to > size_store for component_size? If the superblock is version 1.x, I'm already > updating the data_size in the superblock ready to write out: (Also, once I get the superblock writing out okay, I can presumably support metadata format 1.0 too, updating the sb_offset and writing a new copy at the end of the disk.) Cheers, Chris. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-19 15:45 ` Chris Webb 2008-06-19 23:10 ` Chris Webb @ 2008-06-19 23:49 ` Chris Webb 2008-06-20 11:13 ` Chris Webb 1 sibling, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-06-19 23:49 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 530 bytes --] Chris Webb <chris@arachsys.com> writes: > However, I'm not sure what the correct call to get the superblock written > out at this point would be. I tried > > md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, > rdev->sb_page); > > after updating sb->data_size as above, but that doesn't seem to do the right > thing. It works without the typo! > + ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(rdev->size*2); should read size*2 rather than rdev->size*2. Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.patch --] [-- Type: text/plain, Size: 1125 bytes --] --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4-mdplay/drivers/md/md.c 2008-06-20 00:44:50.000000000 +0100 @@ -1946,8 +1946,20 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + if (rdev->sb_offset > rdev->data_offset/2) { + if (!size || size > rdev->sb_offset - rdev->data_offset/2) + size = rdev->sb_offset - rdev->data_offset/2; + } else if (!size || 2*size + rdev->data_offset > rdev->bdev->bd_inode->i_size >> 9) + size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2; + sb = (mdp_super_t *) page_address(rdev->sb_page); + if (sb->major_version == 1) { + ((struct mdp_superblock_1 *) sb)->data_size = cpu_to_le64(size*2); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + } + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-19 23:49 ` Chris Webb @ 2008-06-20 11:13 ` Chris Webb 2008-06-20 14:24 ` Chris Webb 0 siblings, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-06-20 11:13 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2057 bytes --] Here's an updated, cleaned up version of my patch. Since the action is superblock version-specific, it now indirects through super_types[] as other version-specific parts of md.c do. At present it allows the rdev->size to be updated through sysfs for metadata types 1.1 and 1.2, making all the correct checks on the supplied value. It also supports growing rdev->size as large as possible for the block device using echo "0" >/sys/block/mdX/md/rdY/size. If you try to extend the size of 1.0 and 0.90 devices so that the data area would overlap the superblock, rdev_size_store just returns EBUSY, the same behaviour as previously. I'd like to support metadata types 0.90 and 1.0 too, relocating the superblock. I think it'd be sufficient to write a new one at the end of the device then update rdev->sb_offset. I'll try this next week. I'm a bit puzzled about what to do if there's an internal bitmap. I can ignore the issue at present because I only handle 1.1 and 1.2 metadata, where any bitmap is stored before the start of the data area. For metadata at the end of the disk, the bitmap will either be just before or just after the superblock, and so I'll presumably need to move it to the new end of the device along with the superblock. However, it looks at first glance like the bitmap offset is constant across the array, not a per-device setting. This suggests the bitmap has to move when the array is grown, not earlier when rdev->size changes. Unfortunately, that doesn't square with rdev->size being changed for individual disks first! I can't move the bitmap without changing rdev->size on all disks, but I can't increase rdev->size to a valid value on any disk without moving the bitmap at the same point. Either I'm missing something, or changing rdev->size online, drive-by-drive won't be possible if there's an internal bitmap on a 0.90 or 1.0 format array. I'm unlikely to get a chance to work on this again before Monday, but I'd be very grateful for any feedback or pointers. md.c is a fairly involved piece of code! Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.2.patch --] [-- Type: text/plain, Size: 3633 bytes --] --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 12:08:01.000000000 +0100 @@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup */ struct super_type { - char *name; - struct module *owner; - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + char *name; + struct module *owner; + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size); }; /* @@ -994,6 +995,19 @@ static void super_90_sync(mddev_t *mddev } /* + * rdev_size_change for 0.90.0 + */ +static unsigned long long +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + /* TODO: relocate sb to use full device */ + if (!size || size > rdev->sb_offset - rdev->data_offset/2) + size = rdev->sb_offset - rdev->data_offset/2; + return size; +} + + +/* * version 1 superblock */ @@ -1310,21 +1324,52 @@ static void super_1_sync(mddev_t *mddev, sb->sb_csum = calc_sb_1_csum(sb); } +static unsigned long long +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + struct mdp_superblock_1 *sb; + unsigned long long max_size; + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); + if (rdev->sb_offset < rdev->data_offset/2) { + /* minor versions 1 and 2; superblock before data */ + max_size = ((rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset)/2; + if (!size || size > max_size) + size = max_size; + } else { + /* minor version 0; superblock after data */ + if (rdev->mddev->bitmap_offset < 0) { + /* don't overlap data with bitmap */ + max_size = rdev->sb_offset*2 + rdev->mddev->bitmap_offset; + max_size = (max_size - rdev->data_offset)/2; + } else + max_size = rdev->sb_offset - rdev->data_offset/2; + /* TODO: relocate sb to use full device */ + if (!size || size > max_size/2) + size = max_size/2; + } + sb->data_size = cpu_to_le64(size*2); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + return size; +} + static struct super_type super_types[] = { [0] = { .name = "0.90.0", .owner = THIS_MODULE, - .load_super = super_90_load, - .validate_super = super_90_validate, - .sync_super = super_90_sync, + .load_super = super_90_load, + .validate_super = super_90_validate, + .sync_super = super_90_sync, + .rdev_size_change = super_90_rdev_size_change, }, [1] = { .name = "md-1", .owner = THIS_MODULE, - .load_super = super_1_load, - .validate_super = super_1_validate, - .sync_super = super_1_sync, + .load_super = super_1_load, + .validate_super = super_1_validate, + .sync_super = super_1_sync, + .rdev_size_change = super_1_rdev_size_change, }, }; @@ -1946,8 +1991,13 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + sb = (mdp_super_t *) page_address(rdev->sb_page); + size = super_types[sb->major_version].rdev_size_change(rdev, size); + if (!size) + return -EBUSY; + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-20 11:13 ` Chris Webb @ 2008-06-20 14:24 ` Chris Webb 2008-06-23 1:26 ` Neil Brown 0 siblings, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-06-20 14:24 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 546 bytes --] Chris Webb <chris@arachsys.com> writes: > I'd like to support metadata types 0.90 and 1.0 too, relocating the > superblock. I think it'd be sufficient to write a new one at the end of > the device then update rdev->sb_offset. I'll try this next week. I've had a quick stab at this before I leave along with fixing a couple of other oversights. (I just return EBUSY if there's a bitmap present in 0.90 and 1.0.) It appears to work correctly for 0.90 but generates a corrupt superblock for 1.0, which I'll need to fix on Monday. Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.2.patch --] [-- Type: text/plain, Size: 3721 bytes --] --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 15:02:01.000000000 +0100 @@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup */ struct super_type { - char *name; - struct module *owner; - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + char *name; + struct module *owner; + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size); }; /* @@ -994,6 +995,23 @@ static void super_90_sync(mddev_t *mddev } /* + * rdev_size_change for 0.90.0 + */ +static unsigned long long +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = calc_dev_sboffset(rdev->bdev); + if (!size || size > rdev->sb_offset) + size = rdev->sb_offset; + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + return size; +} + + +/* * version 1 superblock */ @@ -1310,21 +1328,49 @@ static void super_1_sync(mddev_t *mddev, sb->sb_csum = calc_sb_1_csum(sb); } +static unsigned long long +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + struct mdp_superblock_1 *sb; + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); + if (rdev->sb_offset < rdev->data_offset/2) { + /* minor versions 1 and 2; superblock before data */ + unsigned long long max_size; + max_size = (rdev->bdev->bd_inode->i_size >> 10) - rdev->data_offset/2; + if (!size || size > max_size) + size = max_size; + } else { + /* minor version 0; superblock after data */ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8; + rdev->sb_offset &= ~(sector_t)(4 - 1); + if (!size || size > rdev->sb_offset - rdev->data_offset/2) + size = rdev->sb_offset - rdev->data_offset/2; + } + sb->data_size = cpu_to_le64(size*2); + sb->sb_csum = calc_sb_1_csum(sb); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + return size; +} static struct super_type super_types[] = { [0] = { .name = "0.90.0", .owner = THIS_MODULE, - .load_super = super_90_load, - .validate_super = super_90_validate, - .sync_super = super_90_sync, + .load_super = super_90_load, + .validate_super = super_90_validate, + .sync_super = super_90_sync, + .rdev_size_change = super_90_rdev_size_change, }, [1] = { .name = "md-1", .owner = THIS_MODULE, - .load_super = super_1_load, - .validate_super = super_1_validate, - .sync_super = super_1_sync, + .load_super = super_1_load, + .validate_super = super_1_validate, + .sync_super = super_1_sync, + .rdev_size_change = super_1_rdev_size_change, }, }; @@ -1946,8 +1992,13 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + sb = (mdp_super_t *) page_address(rdev->sb_page); + size = super_types[sb->major_version].rdev_size_change(rdev, size); + if (!size) + return -EBUSY; + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-20 14:24 ` Chris Webb @ 2008-06-23 1:26 ` Neil Brown 2008-06-23 11:18 ` Chris Webb 0 siblings, 1 reply; 18+ messages in thread From: Neil Brown @ 2008-06-23 1:26 UTC (permalink / raw) To: Chris Webb; +Cc: linux-raid On Friday June 20, chris@arachsys.com wrote: > Chris Webb <chris@arachsys.com> writes: > > > I'd like to support metadata types 0.90 and 1.0 too, relocating the > > superblock. I think it'd be sufficient to write a new one at the end of > > the device then update rdev->sb_offset. I'll try this next week. > > I've had a quick stab at this before I leave along with fixing a couple of > other oversights. (I just return EBUSY if there's a bitmap present in 0.90 > and 1.0.) It appears to work correctly for 0.90 but generates a corrupt > superblock for 1.0, which I'll need to fix on Monday. Thanks for doing this!!! - I am perfectly happy with getting -EBUSY if there is an internal bitmap. It is quite easy to turn off the bitmap, resize the devices, then turn it on again, and there is little cost in doing this. - I would much prefer using "sector" numbers rather than "K" numbers in any new code. I'd eventually like to alway use sector numbers internally, but that's a lower priority. So if you could change the rdev_size_change methods to convert to sectors and then use that I'd appreciate it. - I think you really should call md_super_wait after md_super_write. You really don't want the device to appear bigger until the new metadata really is safe of disk. - I think you need some protection to make sure that size doesn't get set below my_mddev->size while the array is active. That would be bad. What I'd really like is for md to get a call-back when the device size changes, so that the metadata can be relocated immediately. However that is a little way off, and I think this is a useful thing to have now. Thanks, NeilBrown > > Cheers, > > Chris. > --- linux-2.6.24.4/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 > +++ linux-2.6.24.4-cdwmd/drivers/md/md.c 2008-06-20 15:02:01.000000000 +0100 > @@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup > */ > > struct super_type { > - char *name; > - struct module *owner; > - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); > - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); > - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); > + char *name; > + struct module *owner; > + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); > + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); > + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); > + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size); > }; > > /* > @@ -994,6 +995,23 @@ static void super_90_sync(mddev_t *mddev > } > > /* > + * rdev_size_change for 0.90.0 > + */ > +static unsigned long long > +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) > +{ > + if (rdev->mddev->bitmap_offset) > + return 0; /* can't move bitmap */ > + rdev->sb_offset = calc_dev_sboffset(rdev->bdev); > + if (!size || size > rdev->sb_offset) > + size = rdev->sb_offset; > + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, > + rdev->sb_page); > + return size; > +} > + > + > +/* > * version 1 superblock > */ > > @@ -1310,21 +1328,49 @@ static void super_1_sync(mddev_t *mddev, > sb->sb_csum = calc_sb_1_csum(sb); > } > > +static unsigned long long > +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) > +{ > + struct mdp_superblock_1 *sb; > + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); > + if (rdev->sb_offset < rdev->data_offset/2) { > + /* minor versions 1 and 2; superblock before data */ > + unsigned long long max_size; > + max_size = (rdev->bdev->bd_inode->i_size >> 10) - rdev->data_offset/2; > + if (!size || size > max_size) > + size = max_size; > + } else { > + /* minor version 0; superblock after data */ > + if (rdev->mddev->bitmap_offset) > + return 0; /* can't move bitmap */ > + rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8; > + rdev->sb_offset &= ~(sector_t)(4 - 1); > + if (!size || size > rdev->sb_offset - rdev->data_offset/2) > + size = rdev->sb_offset - rdev->data_offset/2; > + } > + sb->data_size = cpu_to_le64(size*2); > + sb->sb_csum = calc_sb_1_csum(sb); > + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, > + rdev->sb_page); > + return size; > +} > > static struct super_type super_types[] = { > [0] = { > .name = "0.90.0", > .owner = THIS_MODULE, > - .load_super = super_90_load, > - .validate_super = super_90_validate, > - .sync_super = super_90_sync, > + .load_super = super_90_load, > + .validate_super = super_90_validate, > + .sync_super = super_90_sync, > + .rdev_size_change = super_90_rdev_size_change, > }, > [1] = { > .name = "md-1", > .owner = THIS_MODULE, > - .load_super = super_1_load, > - .validate_super = super_1_validate, > - .sync_super = super_1_sync, > + .load_super = super_1_load, > + .validate_super = super_1_validate, > + .sync_super = super_1_sync, > + .rdev_size_change = super_1_rdev_size_change, > }, > }; > > @@ -1946,8 +1992,13 @@ rdev_size_store(mdk_rdev_t *rdev, const > unsigned long long size = simple_strtoull(buf, &e, 10); > if (e==buf || (*e && *e != '\n')) > return -EINVAL; > - if (rdev->mddev->pers) > - return -EBUSY; > + if (rdev->mddev->pers) { > + mdp_super_t *sb; > + sb = (mdp_super_t *) page_address(rdev->sb_page); > + size = super_types[sb->major_version].rdev_size_change(rdev, size); > + if (!size) > + return -EBUSY; > + } > rdev->size = size; > if (size < rdev->mddev->size || rdev->mddev->size == 0) > rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-23 1:26 ` Neil Brown @ 2008-06-23 11:18 ` Chris Webb 2008-06-23 22:53 ` Neil Brown 2008-06-24 2:40 ` Trouble increasing md component size Neil Brown 0 siblings, 2 replies; 18+ messages in thread From: Chris Webb @ 2008-06-23 11:18 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2934 bytes --] Hi Neil. Thanks for your advice on this patch. Neil Brown <neilb@suse.de> writes: > Thanks for doing this!!! No problem at all; it's great to be able to help out with the md driver which I use and rely upon so heavily! > - I would much prefer using "sector" numbers rather than "K" numbers > in any new code. I'd eventually like to alway use sector numbers > internally, but that's a lower priority. So if you could change > the rdev_size_change methods to convert to sectors and then use > that I'd appreciate it. I've done this, and commented the conversion so that it'll be easy to remove if you later decide to change rdev->size to work with sectors instead of kB. To be honest, one of the things I found most confusing reading the md driver for the first time was the use of kilobyte blocks for some sizes, and sectors for others. Migrating to sectors throughout would be great for readability, especially with most other kernel code using 512-byte sectors. It looks like quite a big job, though, and changing the sysfs interface is presumably out unless you rename the files? > - I think you really should call md_super_wait after md_super_write. > You really don't want the device to appear bigger until the > new metadata really is safe of disk. Ah, thanks for pointing out this one. I wasn't aware that md_super_write could return before the write had finished. Looking at the description of md_super_wait, it looks like it might even have to retry, so the superblock might never get written at all without it? I've added md_super_wait after both md_super_write calls. > - I think you need some protection to make sure that size doesn't get > set below my_mddev->size while the array is active. That would be > bad. Definitely! I've put in the appropriate size guards for both metadata formats. This morning I've also fixed the mysterious corruption of metadata 1.0 superblocks: turns out I forgot to update the sb->super_offset to match rdev->sb_offset before writing out the superblock. As far as I can tell from repeated testing on scratch arrays here, this patch (attached) now behaves correctly and reliably for all supported metadata formats. > What I'd really like is for md to get a call-back when the device size > changes, so that the metadata can be relocated immediately. However > that is a little way off, and I think this is a useful thing to have > now. If it's easy to register for such a call-back (?), I think it would be sufficient for the call-back to run that new rdev_size_change superblock function as super_types[sb->major_version].rdev_size_change(rdev, 0) to update the rdev->size & superblock, and move the metadata if necessary. For a shrink you probably want to resize before the block device changes size rather than afterwards, although that's presumably not going to be easy/possible to achieve for many block device changes. Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.3.patch --] [-- Type: text/plain, Size: 4795 bytes --] Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the superblock if necessary for this metadata version. We prevent the available space from shrinking to less than the used size, and allow it to be set to zero to fill all the available space on the underlying device. Signed-off-by: Chris Webb <chris@arachsys.com> --- md.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 81 insertions(+), 13 deletions(-) --- linux-2.6.24.4.orig/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4/drivers/md/md.c 2008-06-23 10:29:34.000000000 +0100 @@ -652,11 +652,12 @@ static unsigned int calc_sb_csum(mdp_sup */ struct super_type { - char *name; - struct module *owner; - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + char *name; + struct module *owner; + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, unsigned long long size); }; /* @@ -994,6 +995,27 @@ static void super_90_sync(mddev_t *mddev } /* + * rdev_size_change for 0.90.0 + */ +static unsigned long long +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = calc_dev_sboffset(rdev->bdev); + if (!size || size > rdev->sb_offset*2) + size = rdev->sb_offset*2; + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} + + +/* * version 1 superblock */ @@ -1310,21 +1332,62 @@ static void super_1_sync(mddev_t *mddev, sb->sb_csum = calc_sb_1_csum(sb); } +static unsigned long long +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + struct mdp_superblock_1 *sb; + unsigned long long max_size; + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); + if (rdev->sb_offset < rdev->data_offset/2) { + /* minor versions 1 and 2; superblock before data */ + max_size = (rdev->bdev->bd_inode->i_size >> 9) - rdev->data_offset; + if (!size || size > max_size) + size = max_size; + } else { + /* minor version 0; superblock after data */ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = (rdev->bdev->bd_inode->i_size >> 10) - 8; + rdev->sb_offset &= ~(sector_t)(4 - 1); + max_size = rdev->sb_offset*2 - rdev->data_offset; + /* space reservation for bitmap */ + if (max_size - 64*2 >= 200*1024*1024*2) + max_size -= 128*2; + else if (max_size - 4*2 >= 8*1024*1024*2) + max_size -= 64*2; + else if (max_size >= 64*2) + max_size -= 4*2; + if (!size || size > max_size) + size = max_size; + } + sb->data_size = cpu_to_le64(size); + sb->super_offset = rdev->sb_offset*2; + sb->sb_csum = calc_sb_1_csum(sb); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} static struct super_type super_types[] = { [0] = { .name = "0.90.0", .owner = THIS_MODULE, - .load_super = super_90_load, - .validate_super = super_90_validate, - .sync_super = super_90_sync, + .load_super = super_90_load, + .validate_super = super_90_validate, + .sync_super = super_90_sync, + .rdev_size_change = super_90_rdev_size_change, }, [1] = { .name = "md-1", .owner = THIS_MODULE, - .load_super = super_1_load, - .validate_super = super_1_validate, - .sync_super = super_1_sync, + .load_super = super_1_load, + .validate_super = super_1_validate, + .sync_super = super_1_sync, + .rdev_size_change = super_1_rdev_size_change, }, }; @@ -1946,8 +2009,13 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + sb = (mdp_super_t *) page_address(rdev->sb_page); + size = super_types[sb->major_version].rdev_size_change(rdev, size); + if (!size) + return -EBUSY; + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-23 11:18 ` Chris Webb @ 2008-06-23 22:53 ` Neil Brown 2008-06-24 11:47 ` Chris Webb 2008-06-24 2:40 ` Trouble increasing md component size Neil Brown 1 sibling, 1 reply; 18+ messages in thread From: Neil Brown @ 2008-06-23 22:53 UTC (permalink / raw) To: Chris Webb; +Cc: Neil Brown, linux-raid On Monday June 23, chris@arachsys.com wrote: > Hi Neil. Thanks for your advice on this patch. And thank you for fixing it up. It looks good. I'll try to give is a proper review and do some testing myself in a few days. I'm a bit bothered about the code for leaving space for a bitmap for a 1.0 array. I'd rather avoid too much of this sort of 'policy' in the kernel. We could just leave it out and expect user-space to write a size that leaves room for a bitmap if that is needed, but maybe that is going too far the other way. I thought of just scaling the "space between end of data and start of metadata" by the same amount that the device is scaled, rounding up to the next sector. That feels "right" if it doesn't turn out to be too clumsy. Also, you add 4 lines greater than 80 columns :-) if you run ./scripts/checkpatch.pl name-of-patch-file it will tell you these things. NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-23 22:53 ` Neil Brown @ 2008-06-24 11:47 ` Chris Webb 2008-06-24 23:19 ` Chris Webb 0 siblings, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-06-24 11:47 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2986 bytes --] Neil Brown <neilb@suse.de> writes: > I'm a bit bothered about the code for leaving space for a bitmap for a > 1.0 array. I'd rather avoid too much of this sort of 'policy' in the > kernel. You're right, it does seem a bit arbitrary. I'll take it out. Originally I put it in there to avoid a discrepancy between the array you'd get if you used mdadm to create (which hard-codes this policy), and one reshaped in the kernel. That said, you get discrepancy anyway for v1.1 and v1.2 metadata, where the bitmap reserved area is at the start of the array and can't grow. Maybe the right thing to do is to copy the behaviour for v1.1 and v1.2 arrays, and keep the reserved area constant for v1.0 arrays too, rather than attempting to scale it all? I've implemented this in the attached patch. This also means that v1.0 arrays share the nice property of v0.9, v1.1 and v1.2 arrays that, modulo chunk-size rounding, growing a device by some amount always leads the same growth in the component_size (and hence it isn't too awkward to calculate the resulting change in the available array size). More generally, being a user-space consumer of md arrays which change size isn't especially convenient at the moment. If, say, you use your md as an lvm pv, and build it out of lower-level lvm volumes so you can change the underlying device sizes, growing the top level isn't too painful. You can use the 'maximum size' shortcut: increase the underlying devices, grow the array to max size, then grow the top level structure to max size. However, shrinking an md backed pv is quite a bit harder. You need to calculate the array size for a given underlying device size in advance of actually making the change, so you can shrink the top level pv before resizing the array then the underlying devices. There's no defined route for doing this calculation, either kernel or userspace. The only reasonable way forward is to do everything relative to the existing sizes, comparing against the values in sysfs and relying on the change to device sizes being the same as the change to component sizes (allowing for chunk-size rounding), then using a switch dispatching on RAID level and layout (for raid10). Even using mdadm create --size=X isn't especially easy because the size you're specifying is the usable component size, whose connection to the underlying device size is non-trivial: the metadata offset and size vary depending on type, mdadm will automatically reserve space for a bitmap according to a set of heuristics (which can't be overridden). You end up having to know the details of mdadm's internal operation just to reliable calculate the correct size to ask for to get the right component size. I'd like to do something about this... but it's not yet clear to me what the right thing would be to fit in tidily with the existing interface, especially for the 'dry run' calculation. > Also, you add 4 lines greater than 80 columns :-) Oops. I've fixed these! Best wishes, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.4.patch --] [-- Type: text/plain, Size: 4816 bytes --] From: Chris Webb <chris@arachsys.com> Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the superblock if necessary for this metadata version. We prevent the available space from shrinking to less than the used size, and allow it to be set to zero to fill all the available space on the underlying device. Signed-off-by: Chris Webb <chris@arachsys.com> --- drivers/md/md.c | 94 +++++++++++++++++++++++++++++++++++++++------- 1 file changed, 81 insertions(+), 13 deletions(-) diff -uNpr linux-2.6.24.4.orig/drivers/md/md.c linux-2.6.24.4/drivers/md/md.c --- linux-2.6.24.4.orig/drivers/md/md.c 2008-03-24 18:49:18.000000000 +0000 +++ linux-2.6.24.4/drivers/md/md.c 2008-06-24 11:35:47.000000000 +0100 @@ -652,11 +652,14 @@ static unsigned int calc_sb_csum(mdp_sup */ struct super_type { - char *name; - struct module *owner; - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + char *name; + struct module *owner; + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, + int minor_version); + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, + unsigned long long size); }; /* @@ -994,6 +997,27 @@ static void super_90_sync(mddev_t *mddev } /* + * rdev_size_change for 0.90.0 + */ +static unsigned long long +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = calc_dev_sboffset(rdev->bdev); + if (!size || size > rdev->sb_offset*2) + size = rdev->sb_offset*2; + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} + + +/* * version 1 superblock */ @@ -1310,21 +1334,59 @@ static void super_1_sync(mddev_t *mddev, sb->sb_csum = calc_sb_1_csum(sb); } +static unsigned long long +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + struct mdp_superblock_1 *sb; + unsigned long long max_size; + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); + if (rdev->sb_offset < rdev->data_offset/2) { + /* minor versions 1 and 2; superblock before data */ + max_size = (rdev->bdev->bd_inode->i_size >> 9); + max_size -= rdev->data_offset; + if (!size || size > max_size) + size = max_size; + } else if (rdev->mddev->bitmap_offset) { + /* minor version 0 with bitmap we can't move */ + return 0; + } else { + /* minor version 0; superblock after data */ + sector_t sb_offset; + sb_offset = (rdev->bdev->bd_inode->i_size >> 9) - 8*2; + sb_offset &= ~(sector_t)(4*2 - 1); + max_size = rdev->size*2 + sb_offset - rdev->sb_offset*2; + if (!size || size > max_size) + size = max_size; + rdev->sb_offset = sb_offset/2; + } + sb->data_size = cpu_to_le64(size); + sb->super_offset = rdev->sb_offset*2; + sb->sb_csum = calc_sb_1_csum(sb); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} static struct super_type super_types[] = { [0] = { .name = "0.90.0", .owner = THIS_MODULE, - .load_super = super_90_load, - .validate_super = super_90_validate, - .sync_super = super_90_sync, + .load_super = super_90_load, + .validate_super = super_90_validate, + .sync_super = super_90_sync, + .rdev_size_change = super_90_rdev_size_change, }, [1] = { .name = "md-1", .owner = THIS_MODULE, - .load_super = super_1_load, - .validate_super = super_1_validate, - .sync_super = super_1_sync, + .load_super = super_1_load, + .validate_super = super_1_validate, + .sync_super = super_1_sync, + .rdev_size_change = super_1_rdev_size_change, }, }; @@ -1946,8 +2008,14 @@ rdev_size_store(mdk_rdev_t *rdev, const unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + mdp_super_t *sb; + sb = (mdp_super_t *) page_address(rdev->sb_page); + size = super_types[sb->major_version]. + rdev_size_change(rdev, size); + if (!size) + return -EBUSY; + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-24 11:47 ` Chris Webb @ 2008-06-24 23:19 ` Chris Webb 2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb 2008-10-07 12:40 ` [Resend] " Chris Webb 0 siblings, 2 replies; 18+ messages in thread From: Chris Webb @ 2008-06-24 23:19 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 237 bytes --] Chris Webb <chris@arachsys.com> writes: > Oops. One final (hopefully!) correction: my patch didn't do the quite right thing in the case of externally managed metadata, where there's no persistent superblock to update. Cheers, Chris. [-- Attachment #2: linux-2.6.24.4-md-rdev-size.5.patch --] [-- Type: text/plain, Size: 4819 bytes --] From: Chris Webb <chris@arachsys.com> Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the superblock if necessary for this metadata version. We prevent the available space from shrinking to less than the used size, and allow it to be set to zero to fill all the available space on the underlying device. Signed-off-by: Chris Webb <chris@arachsys.com> --- drivers/md/md.c | 100 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 87 insertions(+), 13 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -652,11 +652,14 @@ */ struct super_type { - char *name; - struct module *owner; - int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version); - int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); - void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + char *name; + struct module *owner; + int (*load_super)(mdk_rdev_t *rdev, mdk_rdev_t *refdev, + int minor_version); + int (*validate_super)(mddev_t *mddev, mdk_rdev_t *rdev); + void (*sync_super)(mddev_t *mddev, mdk_rdev_t *rdev); + unsigned long long (*rdev_size_change)(mdk_rdev_t *rdev, + unsigned long long size); }; /* @@ -992,6 +995,27 @@ sb->this_disk = sb->disks[rdev->desc_nr]; sb->sb_csum = calc_sb_csum(sb); } + +/* + * rdev_size_change for 0.90.0 + */ +static unsigned long long +super_90_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + if (rdev->mddev->bitmap_offset) + return 0; /* can't move bitmap */ + rdev->sb_offset = calc_dev_sboffset(rdev->bdev); + if (!size || size > rdev->sb_offset*2) + size = rdev->sb_offset*2; + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} + /* * version 1 superblock @@ -1310,21 +1334,59 @@ sb->sb_csum = calc_sb_1_csum(sb); } +static unsigned long long +super_1_rdev_size_change(mdk_rdev_t *rdev, unsigned long long size) +{ + struct mdp_superblock_1 *sb; + unsigned long long max_size; + if (size && size < rdev->mddev->size) + return 0; /* component must fit device */ + size *= 2; /* convert to sectors */ + sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page); + if (rdev->sb_offset < rdev->data_offset/2) { + /* minor versions 1 and 2; superblock before data */ + max_size = (rdev->bdev->bd_inode->i_size >> 9); + max_size -= rdev->data_offset; + if (!size || size > max_size) + size = max_size; + } else if (rdev->mddev->bitmap_offset) { + /* minor version 0 with bitmap we can't move */ + return 0; + } else { + /* minor version 0; superblock after data */ + sector_t sb_offset; + sb_offset = (rdev->bdev->bd_inode->i_size >> 9) - 8*2; + sb_offset &= ~(sector_t)(4*2 - 1); + max_size = rdev->size*2 + sb_offset - rdev->sb_offset*2; + if (!size || size > max_size) + size = max_size; + rdev->sb_offset = sb_offset/2; + } + sb->data_size = cpu_to_le64(size); + sb->super_offset = rdev->sb_offset*2; + sb->sb_csum = calc_sb_1_csum(sb); + md_super_write(rdev->mddev, rdev, rdev->sb_offset << 1, rdev->sb_size, + rdev->sb_page); + md_super_wait(rdev->mddev); + return size/2; /* kB for sysfs */ +} static struct super_type super_types[] = { [0] = { .name = "0.90.0", .owner = THIS_MODULE, - .load_super = super_90_load, - .validate_super = super_90_validate, - .sync_super = super_90_sync, + .load_super = super_90_load, + .validate_super = super_90_validate, + .sync_super = super_90_sync, + .rdev_size_change = super_90_rdev_size_change, }, [1] = { .name = "md-1", .owner = THIS_MODULE, - .load_super = super_1_load, - .validate_super = super_1_validate, - .sync_super = super_1_sync, + .load_super = super_1_load, + .validate_super = super_1_validate, + .sync_super = super_1_sync, + .rdev_size_change = super_1_rdev_size_change, }, }; @@ -1946,8 +2008,20 @@ unsigned long long size = simple_strtoull(buf, &e, 10); if (e==buf || (*e && *e != '\n')) return -EINVAL; - if (rdev->mddev->pers) - return -EBUSY; + if (rdev->mddev->pers) { + if (rdev->mddev->persistent) { + mdp_super_t *sb; + sb = (mdp_super_t *) page_address(rdev->sb_page); + size = super_types[sb->major_version]. + rdev_size_change(rdev, size); + if (!size) + return -EBUSY; + } else if (!size) { + size = (rdev->bdev->bd_inode->i_size >> 10); + size -= rdev->data_offset/2; + } else if (size < rdev->mddev->size) + return 0; /* component must fit device */ + } rdev->size = size; if (size < rdev->mddev->size || rdev->mddev->size == 0) rdev->mddev->size = size; ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] md: Fix rdev_size_store with size = 0 2008-06-24 23:19 ` Chris Webb @ 2008-09-17 18:11 ` Chris Webb 2008-10-07 12:40 ` [Resend] " Chris Webb 1 sibling, 0 replies; 18+ messages in thread From: Chris Webb @ 2008-09-17 18:11 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi Neil. You accepted a patch from me back at the end of June which allowed /sys/block/mdX/md/rdY/size to change on running arrays. It supported a value of zero to fill all the space on the underlying device. However, a later patch tidying up rdev_size_store broke the size = 0 functionality by testing for size < my_mddev->size slightly too early. The following patch fixes this. Best wishes, Chris. From: Chris Webb <chris@arachsys.com> Fix rdev_size_store with size = 0 Signed-off-by: Chris Webb <chris@arachsys.com> diff --git a/drivers/md/md.c b/drivers/md/md.c --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2109,8 +2109,6 @@ if (strict_strtoull(buf, 10, &size) < 0) return -EINVAL; - if (size < my_mddev->size) - return -EINVAL; if (my_mddev->pers && rdev->raid_disk >= 0) { if (my_mddev->persistent) { size = super_types[my_mddev->major_version]. @@ -2121,9 +2119,9 @@ size = (rdev->bdev->bd_inode->i_size >> 10); size -= rdev->data_offset/2; } - if (size < my_mddev->size) - return -EINVAL; /* component must fit device */ - } + } + if (size < my_mddev->size) + return -EINVAL; /* component must fit device */ rdev->size = size; if (size > oldsize && my_mddev->external) { ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Resend] [PATCH] md: Fix rdev_size_store with size = 0 2008-06-24 23:19 ` Chris Webb 2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb @ 2008-10-07 12:40 ` Chris Webb 2008-10-13 0:54 ` Neil Brown 1 sibling, 1 reply; 18+ messages in thread From: Chris Webb @ 2008-10-07 12:40 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi Neil. You accepted a patch from me back at the end of June which allowed /sys/block/mdX/md/rdY/size to change on running arrays. It supported a value of zero to fill all the space on the underlying device. However, a later patch tidying up rdev_size_store broke the size = 0 functionality by testing for size < my_mddev->size slightly too early. The following patch fixes this. Best wishes, Chris. From: Chris Webb <chris@arachsys.com> Fix rdev_size_store with size = 0 Signed-off-by: Chris Webb <chris@arachsys.com> diff --git a/drivers/md/md.c b/drivers/md/md.c --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2109,8 +2109,6 @@ if (strict_strtoull(buf, 10, &size) < 0) return -EINVAL; - if (size < my_mddev->size) - return -EINVAL; if (my_mddev->pers && rdev->raid_disk >= 0) { if (my_mddev->persistent) { size = super_types[my_mddev->major_version]. @@ -2121,9 +2119,9 @@ size = (rdev->bdev->bd_inode->i_size >> 10); size -= rdev->data_offset/2; } - if (size < my_mddev->size) - return -EINVAL; /* component must fit device */ - } + } + if (size < my_mddev->size) + return -EINVAL; /* component must fit device */ rdev->size = size; if (size > oldsize && my_mddev->external) { ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Resend] [PATCH] md: Fix rdev_size_store with size = 0 2008-10-07 12:40 ` [Resend] " Chris Webb @ 2008-10-13 0:54 ` Neil Brown 0 siblings, 0 replies; 18+ messages in thread From: Neil Brown @ 2008-10-13 0:54 UTC (permalink / raw) To: Chris Webb; +Cc: linux-raid On Tuesday October 7, chris@arachsys.com wrote: > Hi Neil. You accepted a patch from me back at the end of June which allowed > /sys/block/mdX/md/rdY/size to change on running arrays. It supported a value > of zero to fill all the space on the underlying device. However, a later > patch tidying up rdev_size_store broke the size = 0 functionality by testing > for size < my_mddev->size slightly too early. The following patch fixes > this. Thanks - and sorry for breaking this :-( I'll send this upstream soon and probably aim it for -stable too. Thanks NeilBrown > > Best wishes, > > Chris. > > > From: Chris Webb <chris@arachsys.com> > > Fix rdev_size_store with size = 0 > > Signed-off-by: Chris Webb <chris@arachsys.com> > > diff --git a/drivers/md/md.c b/drivers/md/md.c > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -2109,8 +2109,6 @@ > > if (strict_strtoull(buf, 10, &size) < 0) > return -EINVAL; > - if (size < my_mddev->size) > - return -EINVAL; > if (my_mddev->pers && rdev->raid_disk >= 0) { > if (my_mddev->persistent) { > size = super_types[my_mddev->major_version]. > @@ -2121,9 +2119,9 @@ > size = (rdev->bdev->bd_inode->i_size >> 10); > size -= rdev->data_offset/2; > } > - if (size < my_mddev->size) > - return -EINVAL; /* component must fit device */ > - } > + } > + if (size < my_mddev->size) > + return -EINVAL; /* component must fit device */ > > rdev->size = size; > if (size > oldsize && my_mddev->external) { ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Trouble increasing md component size 2008-06-23 11:18 ` Chris Webb 2008-06-23 22:53 ` Neil Brown @ 2008-06-24 2:40 ` Neil Brown 1 sibling, 0 replies; 18+ messages in thread From: Neil Brown @ 2008-06-24 2:40 UTC (permalink / raw) To: Chris Webb; +Cc: linux-raid On Monday June 23, chris@arachsys.com wrote: > > What I'd really like is for md to get a call-back when the device size > > changes, so that the metadata can be relocated immediately. However > > that is a little way off, and I think this is a useful thing to have > > now. > > If it's easy to register for such a call-back (?), I think it would be > sufficient for the call-back to run that new rdev_size_change superblock > function as > > super_types[sb->major_version].rdev_size_change(rdev, 0) > > to update the rdev->size & superblock, and move the metadata if necessary. > For a shrink you probably want to resize before the block device changes > size rather than afterwards, although that's presumably not going to be > easy/possible to achieve for many block device changes. I'd meant to respond to this bit in my first reply, but got distracted. There currently is no mechanism for registering callbacks. One day I would like to create one. The approach I have in mind involves leveraging the bd_claim/bd_holder stuff. Current when someone "claim"s a block_device, they give a unique (void *) to identify them. My idea is to change that to be a struct with defined contents. e.g. struct bd_holder { struct block_dev_callback_operations *ops; }; Where struct block_dev_callback_operations { int (*size_change_request)(struct block_dev *bdev, sector_t newsize); void (*size_change_commit)(struct block_dev *bdev, sector_t newsize); .... } so if a blockdev wants to change it's size, and someone has claimed it, it first calls bdev->bd_holder->ops->size_change_request() with the new size. If that fails, it has to give up. If it succeeds, it makes the change, the calls ->size_change_commit. I think dm and md are currently the only devices which spontaneously change size, so they would be the first place to make these calls. Possibly we could then get the partition management code to allow size changes of active partitions if there was a size_change_request that could be called and would return success. There are quite a lot of places where bd_claim is called. Filesystems claim the block device they use, md and dm and swap do as well. In the first instance, we could make the "ops" pointer "NULL" and get the calling code to cope with that. Then one by one we could introduce useful functionality. I would then use these callbacks to also implement freeze_bdev. It currently hunts through the mount table for a filesystem on the bdev, and calls the s_op->write_super_lockfs method on that filesystem. This is somewhat ugly. Doing a callback through the bd_holder structure would be much more elegant. The only difficult issue is locking. Exactly what lock should be required when calling various block_dev_callback_operations? The easiest would be to hold the bdev_lock spinlock. That would be enough to make sure the holder doesn't disappear on us. But there isn't much you can do under a spinlock. You certainly cannot write new metadata to a device. Maybe you could get by with ->trylock and ->unlock block_dev_callback_operations which could be called under the spinlock, and all other operations much be called with that lock held. One would probably need to try writing code and see what falls out. And yes, with that in place, rdev_size_change(rdev, 0) would be very close to what you want the size_change_commit to do. NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-10-13 0:54 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-06-18 18:26 Trouble increasing md component size Chris Webb 2008-06-18 19:22 ` Peter Rabbitson 2008-06-18 20:00 ` Chris Webb 2008-06-19 3:42 ` Neil Brown 2008-06-19 15:45 ` Chris Webb 2008-06-19 23:10 ` Chris Webb 2008-06-19 23:49 ` Chris Webb 2008-06-20 11:13 ` Chris Webb 2008-06-20 14:24 ` Chris Webb 2008-06-23 1:26 ` Neil Brown 2008-06-23 11:18 ` Chris Webb 2008-06-23 22:53 ` Neil Brown 2008-06-24 11:47 ` Chris Webb 2008-06-24 23:19 ` Chris Webb 2008-09-17 18:11 ` [PATCH] md: Fix rdev_size_store with size = 0 Chris Webb 2008-10-07 12:40 ` [Resend] " Chris Webb 2008-10-13 0:54 ` Neil Brown 2008-06-24 2:40 ` Trouble increasing md component size Neil Brown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).