* [PATCH 001 of 11] md: Reformat code in raid1_end_write_request to avoid goto
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 002 of 11] md: Remove arbitrary limit on chunk size NeilBrown
` (9 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
A recent change made this goto unnecessary, so reformat the
code to make it clearer what is happening.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid1.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/raid1.c 2006-05-01 15:10:00.000000000 +1000
@@ -374,26 +374,26 @@ static int raid1_end_write_request(struc
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
+ if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
reschedule_retry(r1_bio);
- goto out;
+ else {
+ /* it really is the end of this request */
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
}
- out:
+
if (to_put)
bio_put(to_put);
* [PATCH 002 of 11] md: Remove arbitrary limit on chunk size.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
2006-05-01 5:30 ` [PATCH 001 of 11] md: Reformat code in raid1_end_write_request to avoid goto NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 003 of 11] md: Remove useless ioctl warning NeilBrown
` (8 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
The largest chunk size the code can support without substantial
surgery is 2^30 bytes, so make that the limit instead of an arbitrary
4Meg.
Some day, the 'chunksize' should change to a sector-shift
instead of a byte-count. Then no limit would be needed.
Signed-off-by: Neil Brown <neilb@suse.de>
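One way to see why the read-ahead calculations below are reordered to
divide by PAGE_SIZE before multiplying by the disk count: with chunk_size
now allowed to reach 2^30, the old order can overflow an 'int' in the
intermediate product. A rough standalone illustration of that hazard,
assuming 4096-byte pages and 16 disks, with made-up variable names rather
than the kernel's:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        long long page_size  = 4096;
        long long raid_disks = 16;
        long long chunk_size = 1LL << 30;   /* the new MAX_CHUNK_SIZE */

        /* old order: multiply first - 2^34, far beyond INT_MAX */
        long long intermediate = raid_disks * chunk_size;
        /* patched order: divide first - the result easily fits */
        long long pages = raid_disks * (chunk_size / page_size);

        printf("raid_disks * chunk_size = %lld (INT_MAX = %d)\n",
               intermediate, INT_MAX);
        printf("read-ahead stripe pages = %lld\n", pages);
        return 0;
    }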
### Diffstat output
./drivers/md/raid10.c | 2 +-
./drivers/md/raid5.c | 4 ++--
./drivers/md/raid6main.c | 4 ++--
./include/linux/raid/md_k.h | 3 ++-
4 files changed, 7 insertions(+), 6 deletions(-)
diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/raid10.c 2006-05-01 15:10:17.000000000 +1000
@@ -2050,7 +2050,7 @@ static int run(mddev_t *mddev)
* maybe...
*/
{
- int stripe = conf->raid_disks * mddev->chunk_size / PAGE_SIZE;
+ int stripe = conf->raid_disks * (mddev->chunk_size / PAGE_SIZE);
stripe /= conf->near_copies;
if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
mddev->queue->backing_dev_info.ra_pages = 2* stripe;
diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/raid5.c 2006-05-01 15:10:17.000000000 +1000
@@ -2382,8 +2382,8 @@ static int run(mddev_t *mddev)
* 2 * (n-1) * chunksize where 'n' is the number of raid devices
*/
{
- int stripe = (mddev->raid_disks-1) * mddev->chunk_size
- / PAGE_SIZE;
+ int stripe = (mddev->raid_disks-1) *
+ (mddev->chunk_size / PAGE_SIZE);
if (mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
}
diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/raid6main.c 2006-05-01 15:10:17.000000000 +1000
@@ -2135,8 +2135,8 @@ static int run(mddev_t *mddev)
* 2 * (n-2) * chunksize where 'n' is the number of raid devices
*/
{
- int stripe = (mddev->raid_disks-2) * mddev->chunk_size
- / PAGE_SIZE;
+ int stripe = (mddev->raid_disks-2) *
+ (mddev->chunk_size / PAGE_SIZE);
if (mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
}
diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./include/linux/raid/md_k.h 2006-05-01 15:10:17.000000000 +1000
@@ -40,7 +40,8 @@ typedef struct mdk_rdev_s mdk_rdev_t;
* options passed in raidrun:
*/
-#define MAX_CHUNK_SIZE (4096*1024)
+/* Currently this must fix in an 'int' */
+#define MAX_CHUNK_SIZE (1<<30)
/*
* MD's 'extended' device
* [PATCH 003 of 11] md: Remove useless ioctl warning.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
2006-05-01 5:30 ` [PATCH 001 of 11] md: Reformat code in raid1_end_write_request to avoid goto NeilBrown
2006-05-01 5:30 ` [PATCH 002 of 11] md: Remove arbitrary limit on chunk size NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable NeilBrown
` (7 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
This warning was slightly useful back in 2.2 days, but is more of an
annoyance now. It makes it awkward to add new ioctls (not that we are
likely to do that in the current climate, but it is possible).
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 5 -----
1 file changed, 5 deletions(-)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:10:17.000000000 +1000
@@ -3964,11 +3964,6 @@ static int md_ioctl(struct inode *inode,
goto done_unlock;
default:
- if (_IOC_TYPE(cmd) == MD_MAJOR)
- printk(KERN_WARNING "md: %s(pid %d) used"
- " obsolete MD ioctl, upgrade your"
- " software to use new ictls.\n",
- current->comm, current->pid);
err = -EINVAL;
goto abort_unlock;
}
* [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
` (2 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 003 of 11] md: Remove useless ioctl warning NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:44 ` Andrew Morton
2006-05-02 5:56 ` bert hubert
2006-05-01 5:30 ` [PATCH 006 of 11] md: Remove nuisance message at shutdown NeilBrown
` (6 subsequent siblings)
10 siblings, 2 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
When a md array has been idle (no writes) for 20msecs it is marked as
'clean'. This delay turns out to be too short for some real
workloads. So increase it to 200msec (the time to update the metadata
should be a tiny fraction of that) and make it sysfs-configurable.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./Documentation/md.txt | 9 ++++++++
./drivers/md/md.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 2 deletions(-)
diff ./Documentation/md.txt~current~ ./Documentation/md.txt
--- ./Documentation/md.txt~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./Documentation/md.txt 2006-05-01 15:10:18.000000000 +1000
@@ -207,6 +207,15 @@ All md devices contain:
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.
+ safe_mode_delay
+ When an md array has seen no write requests for a certain period
+ of time, it will be marked as 'clean'. When another write
+ request arrive, the array is marked as 'dirty' before the write
+ commenses. This is known as 'safe_mode'.
+ The 'certain period' is controlled by this file which stores the
+ period as a number of seconds. The default is 200msec (0.200).
+ Writing a value of 0 disables safemode.
+
sync_speed_min
sync_speed_max
This are similar to /proc/sys/dev/raid/speed_limit_{min,max}
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:10:17.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:10:18.000000000 +1000
@@ -43,6 +43,7 @@
#include <linux/suspend.h>
#include <linux/poll.h>
#include <linux/mutex.h>
+#include <linux/ctype.h>
#include <linux/init.h>
@@ -1968,6 +1969,54 @@ static void analyze_sbs(mddev_t * mddev)
}
static ssize_t
+safe_delay_show(mddev_t *mddev, char *page)
+{
+ int msec = (mddev->safemode_delay*1000)/HZ;
+ return sprintf(page, "%d.%03d\n", msec/1000, msec%1000);
+}
+static ssize_t
+safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
+{
+ int scale=1;
+ int dot=0;
+ int i;
+ unsigned long msec;
+ char buf[30];
+ char *e;
+ /* remove a period, and count digits after it */
+ if (len >= sizeof(buf))
+ return -EINVAL;
+ strlcpy(buf, cbuf, len);
+ buf[len] = 0;
+ for (i=0; i<len; i++) {
+ if (dot) {
+ if (isdigit(buf[i])) {
+ buf[i-1] = buf[i];
+ scale *= 10;
+ }
+ buf[i] = 0;
+ } else if (buf[i] == '.') {
+ dot=1;
+ buf[i] = 0;
+ }
+ }
+ msec = simple_strtoul(buf, &e, 10);
+ if (e == buf || (*e && *e != '\n'))
+ return -EINVAL;
+ msec = (msec * 1000) / scale;
+ if (msec == 0)
+ mddev->safemode_delay = 0;
+ else {
+ mddev->safemode_delay = (msec*HZ)/1000;
+ if (mddev->safemode_delay == 0)
+ mddev->safemode_delay = 1;
+ }
+ return len;
+}
+static struct md_sysfs_entry md_safe_delay =
+__ATTR(safe_mode_delay, 0644,safe_delay_show, safe_delay_store);
+
+static ssize_t
level_show(mddev_t *mddev, char *page)
{
struct mdk_personality *p = mddev->pers;
@@ -2423,6 +2472,7 @@ static struct attribute *md_default_attr
&md_size.attr,
&md_metadata.attr,
&md_new_device.attr,
+ &md_safe_delay.attr,
NULL,
};
@@ -2695,7 +2745,7 @@ static int do_md_run(mddev_t * mddev)
mddev->safemode = 0;
mddev->safemode_timer.function = md_safemode_timeout;
mddev->safemode_timer.data = (unsigned long) mddev;
- mddev->safemode_delay = (20 * HZ)/1000 +1; /* 20 msec delay */
+ mddev->safemode_delay = (200 * HZ)/1000 +1; /* 200 msec delay */
mddev->in_sync = 1;
ITERATE_RDEV(mddev,rdev,tmp)
@@ -4581,7 +4631,7 @@ void md_write_end(mddev_t *mddev)
if (atomic_dec_and_test(&mddev->writes_pending)) {
if (mddev->safemode == 2)
md_wakeup_thread(mddev->thread);
- else
+ else if (mddev->safemode_delay)
mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
}
}
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 5:30 ` [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable NeilBrown
@ 2006-05-01 5:44 ` Andrew Morton
2006-05-01 6:02 ` Neil Brown
2006-05-02 5:56 ` bert hubert
1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-05-01 5:44 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel
NeilBrown <neilb@suse.de> wrote:
>
>
> When a md array has been idle (no writes) for 20msecs it is marked as
> 'clean'. This delay turns out to be too short for some real
> workloads. So increase it to 200msec (the time to update the metadata
> should be a tiny fraction of that) and make it sysfs-configurable.
>
>
> ...
>
> + safe_mode_delay
> + When an md array has seen no write requests for a certain period
> + of time, it will be marked as 'clean'. When another write
> + request arrive, the array is marked as 'dirty' before the write
> + commenses. This is known as 'safe_mode'.
> + The 'certain period' is controlled by this file which stores the
> + period as a number of seconds. The default is 200msec (0.200).
> + Writing a value of 0 disables safemode.
> +
Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
to remove any doubt.
> +static ssize_t
> +safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
> +{
> + int scale=1;
> + int dot=0;
> + int i;
> + unsigned long msec;
> + char buf[30];
> + char *e;
> + /* remove a period, and count digits after it */
> + if (len >= sizeof(buf))
> + return -EINVAL;
> + strlcpy(buf, cbuf, len);
> + buf[len] = 0;
> + for (i=0; i<len; i++) {
> + if (dot) {
> + if (isdigit(buf[i])) {
> + buf[i-1] = buf[i];
> + scale *= 10;
> + }
> + buf[i] = 0;
> + } else if (buf[i] == '.') {
> + dot=1;
> + buf[i] = 0;
> + }
> + }
> + msec = simple_strtoul(buf, &e, 10);
> + if (e == buf || (*e && *e != '\n'))
> + return -EINVAL;
> + msec = (msec * 1000) / scale;
> + if (msec == 0)
> + mddev->safemode_delay = 0;
> + else {
> + mddev->safemode_delay = (msec*HZ)/1000;
> + if (mddev->safemode_delay == 0)
> + mddev->safemode_delay = 1;
> + }
> + return len;
And most of that goes away.
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 5:44 ` Andrew Morton
@ 2006-05-01 6:02 ` Neil Brown
2006-05-01 6:13 ` Andrew Morton
2006-05-01 6:15 ` Nick Piggin
0 siblings, 2 replies; 29+ messages in thread
From: Neil Brown @ 2006-05-01 6:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, linux-raid, linux-kernel
On Sunday April 30, akpm@osdl.org wrote:
> NeilBrown <neilb@suse.de> wrote:
> >
> >
> > When a md array has been idle (no writes) for 20msecs it is marked as
> > 'clean'. This delay turns out to be too short for some real
> > workloads. So increase it to 200msec (the time to update the metadata
> > should be a tiny fraction of that) and make it sysfs-configurable.
> >
> >
> > ...
> >
> > + safe_mode_delay
> > + When an md array has seen no write requests for a certain period
> > + of time, it will be marked as 'clean'. When another write
> > + request arrive, the array is marked as 'dirty' before the write
> > + commenses. This is known as 'safe_mode'.
> > + The 'certain period' is controlled by this file which stores the
> > + period as a number of seconds. The default is 200msec (0.200).
> > + Writing a value of 0 disables safemode.
> > +
>
> Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
> to remove any doubt.
Because umpteen years ago when I was adding thread-usage statistics to
/proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
seconds - a much more "obvious" unit. See Email below.
It seems very sensible to me.
...
> > + msec = simple_strtoul(buf, &e, 10);
> > + if (e == buf || (*e && *e != '\n'))
> > + return -EINVAL;
> > + msec = (msec * 1000) / scale;
> > + if (msec == 0)
> > + mddev->safemode_delay = 0;
> > + else {
> > + mddev->safemode_delay = (msec*HZ)/1000;
> > + if (mddev->safemode_delay == 0)
> > + mddev->safemode_delay = 1;
> > + }
> > + return len;
>
> And most of that goes away.
Maybe it could go in a library :-?
NeilBrown
------------------------------------------------------------
From: Linus Torvalds <torvalds@transmeta.com>
To: Neil Brown <neilb@cse.unsw.edu.au>
cc: nfs-devel@linux.kernel.org
Subject: Re: PATCH knfsd - stats tidy up.
Date: Tue, 18 Jul 2000 12:21:12 -0700 (PDT)
Content-Type: TEXT/PLAIN; charset=US-ASCII
On Tue, 18 Jul 2000, Neil Brown wrote:
>
> The following patch converts jiffies to milliseconds for output, and
> also makes the number wrap predicatably at 1,000,000 seconds
> (approximately one fortnight).
If no programs depend on the format, I actually prefer format changes like
this to be of the "obvious" kind. One such obvious kind is the format
0.001
which obviously means 0.001 seconds.
And yes, I'm _really_ sorry that a lot of the old /proc files contain
jiffies. Lazy. Ugly. Bad. Much of it my bad.
Doing 0.001 doesn't mean that you have to use floating point, in fact
you've done most of the work already in your ms patch, just splitting
things out a bit works well:
/* gcc knows to combine / and % - generate one "divl" */
unsigned int sec = time / HZ, msec = time % HZ;
msec = (msec * 1000) / HZ;
sprintf(" %d.%03d", sec, msec)
(It's basically the same thing you already do, except it doesn't
re-combine the seconds and milliseconds but just prints them out
separately.. And it has the advantage that if you want to change it to
microseconds some day, you can do so very trivially without breaking the
format. Plus it's readable as hell.)
Linus
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 6:02 ` Neil Brown
@ 2006-05-01 6:13 ` Andrew Morton
2006-05-01 15:17 ` Linus Torvalds
2006-05-01 6:15 ` Nick Piggin
1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-05-01 6:13 UTC (permalink / raw)
To: Neil Brown; +Cc: torvalds, linux-raid, linux-kernel
Neil Brown <neilb@suse.de> wrote:
>
> On Sunday April 30, akpm@osdl.org wrote:
> > NeilBrown <neilb@suse.de> wrote:
> > >
> > >
> > > When a md array has been idle (no writes) for 20msecs it is marked as
> > > 'clean'. This delay turns out to be too short for some real
> > > workloads. So increase it to 200msec (the time to update the metadata
> > > should be a tiny fraction of that) and make it sysfs-configurable.
> > >
> > >
> > > ...
> > >
> > > + safe_mode_delay
> > > + When an md array has seen no write requests for a certain period
> > > + of time, it will be marked as 'clean'. When another write
> > > + request arrive, the array is marked as 'dirty' before the write
> > > + commenses. This is known as 'safe_mode'.
> > > + The 'certain period' is controlled by this file which stores the
> > > + period as a number of seconds. The default is 200msec (0.200).
> > > + Writing a value of 0 disables safemode.
> > > +
> >
> > Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
> > to remove any doubt.
>
> Because umpteen years ago when I was adding thread-usage statistics to
> /proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
> seconds - a much more "obvious" unit. See Email below.
> It seems very sensible to me.
That's output. It's easier to do the conversion with output. And I guess
one could argue that lots of people read /proc files, but few write to
them.
Generally I don't think we should be teaching the kernel to accept
pretend-floating-point numbers like this, especially when a) "delay in
milliseconds" is such a simple concept and b) it's so easy to go from float
to milliseconds in userspace.
Do you really expect that humans (really dumb ones ;)) will be echoing
numbers into this file? Or will it mainly be a thing for mdadm to fiddle
with?
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 6:13 ` Andrew Morton
@ 2006-05-01 15:17 ` Linus Torvalds
0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2006-05-01 15:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-raid, linux-kernel
On Sun, 30 Apr 2006, Andrew Morton wrote:
>
> Generally I don't think we should be teaching the kernel to accept
> pretend-floating-point numbers like this, especially when a) "delay in
> milliseconds" is such a simple concept and b) it's so easy to go from float
> to milliseconds in userspace.
>
> Do you really expect that humans (really dumb ones ;)) will be echoing
> numbers into this file? Or will it mainly be a thing for mdadm to fiddle
> with?
I generally hate interfaces that have some "random base".
So "delay in seconds" is not a random base, because "seconds" is a good SI
base unit, and there's not a lot of question about it. But once you start
talking milliseconds or microseconds, I'd actually much rather have a
"fake floating point number" over having different files have different
(magic) base constants. How do you remember which are milliseconds, which
are microseconds, and which are just seconds?
It should be easy to have a helper function or two that takes a "struct
timeval" and reads/writes a "float".
Linus
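A rough userspace sketch of the kind of helper pair being suggested here -
printing a millisecond count as "seconds.milliseconds" and parsing it back -
purely to illustrate the interface style. The function names are invented;
the md version is the safe_delay_show/safe_delay_store pair shown in the
patch above:

    #include <stdio.h>
    #include <stdlib.h>

    /* Print a millisecond count as "S.mmm", e.g. 200 -> "0.200". */
    static void msec_to_str(unsigned long msec, char *buf, size_t len)
    {
        snprintf(buf, len, "%lu.%03lu", msec / 1000, msec % 1000);
    }

    /* Parse "S.mmm" (fraction optional, extra digits ignored) into msecs. */
    static unsigned long str_to_msec(const char *s)
    {
        char *end;
        unsigned long msec = strtoul(s, &end, 10) * 1000;
        unsigned long scale = 100;          /* first fractional digit = 100ms */

        if (*end == '.')
            for (end++; *end >= '0' && *end <= '9' && scale; end++, scale /= 10)
                msec += (unsigned long)(*end - '0') * scale;
        return msec;
    }

    int main(void)
    {
        char buf[32];

        msec_to_str(200, buf, sizeof(buf));
        printf("%s -> %lu msec\n", buf, str_to_msec(buf));  /* 0.200 -> 200 */
        return 0;
    }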
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 6:02 ` Neil Brown
2006-05-01 6:13 ` Andrew Morton
@ 2006-05-01 6:15 ` Nick Piggin
1 sibling, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2006-05-01 6:15 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, Linus Torvalds, linux-raid, linux-kernel
Neil Brown wrote:
>On Sunday April 30, akpm@osdl.org wrote:
>
>>NeilBrown <neilb@suse.de> wrote:
>>
>>>
>>>When a md array has been idle (no writes) for 20msecs it is marked as
>>>'clean'. This delay turns out to be too short for some real
>>>workloads. So increase it to 200msec (the time to update the metadata
>>>should be a tiny fraction of that) and make it sysfs-configurable.
>>>
>>>
>>>...
>>>
>>>+ safe_mode_delay
>>>+ When an md array has seen no write requests for a certain period
>>>+ of time, it will be marked as 'clean'. When another write
>>>+ request arrive, the array is marked as 'dirty' before the write
>>>+ commenses. This is known as 'safe_mode'.
>>>+ The 'certain period' is controlled by this file which stores the
>>>+ period as a number of seconds. The default is 200msec (0.200).
>>>+ Writing a value of 0 disables safemode.
>>>+
>>>
>>Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
>>to remove any doubt.
>>
>
>Because umpteen years ago when I was adding thread-usage statistics to
>/proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
>seconds - a much more "obvious" unit. See Email below.
>It seems very sensible to me.
>
Either way, all ambiguity is removed if you put the unit in the name. And
don't use jiffies because that obviously is not portable (which sounds like
it was Linus' biggest concern).
Once you do that, I don't much care whether you use seconds or milliseconds.
Other than to note that many of our units now are ms, especially when
they're measuring things at or around the ms order of magnitude. But I'm
not aware of so many proc values that don't work in integers.
--
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-01 5:30 ` [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable NeilBrown
2006-05-01 5:44 ` Andrew Morton
@ 2006-05-02 5:56 ` bert hubert
2006-05-09 1:40 ` Neil Brown
1 sibling, 1 reply; 29+ messages in thread
From: bert hubert @ 2006-05-02 5:56 UTC (permalink / raw)
To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel
On Mon, May 01, 2006 at 03:30:19PM +1000, NeilBrown wrote:
> When a md array has been idle (no writes) for 20msecs it is marked as
> 'clean'. This delay turns out to be too short for some real
> workloads. So increase it to 200msec (the time to update the metadata
> should be a tiny fraction of that) and make it sysfs-configurable.
What does this mean, 'too short'? What happens in that case, backing block
devices are still busy writing? When making this configurable, the help text
better explain what the trade offs are.
Thanks.
--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
* Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
2006-05-02 5:56 ` bert hubert
@ 2006-05-09 1:40 ` Neil Brown
0 siblings, 0 replies; 29+ messages in thread
From: Neil Brown @ 2006-05-09 1:40 UTC (permalink / raw)
To: bert hubert; +Cc: Andrew Morton, linux-raid, linux-kernel
On Tuesday May 2, bert.hubert@netherlabs.nl wrote:
> On Mon, May 01, 2006 at 03:30:19PM +1000, NeilBrown wrote:
> > When a md array has been idle (no writes) for 20msecs it is marked as
> > 'clean'. This delay turns out to be too short for some real
> > workloads. So increase it to 200msec (the time to update the metadata
> > should be a tiny fraction of that) and make it sysfs-configurable.
>
> What does this mean, 'too short'? What happens in that case, backing block
> devices are still busy writing? When making this configurable, the help text
> better explain what the trade offs are.
"too short" means that the update happens often enough to cause a
noticeable performance degradation.
If an application writes steadily every 21msecs (or maybe 30msecs) then
there will be 2 superblock writes and 1 application write every
21msecs, and this causes enough disk io to slow the app down. - I
guess all the updates fill up the 21msec space.
With a larger delay - 200msec - you could still get bad situations
e.g. with the app writing every 210msecs. However 2 superblock
updates plus one app write is a much smaller fraction of 200msecs, so
there shouldn't be as many problems.
Yes, a more detailed explanation should go in Documentation/md.txt
NeilBrown
* [PATCH 006 of 11] md: Remove nuisance message at shutdown
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
` (3 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 007 of 11] md: Allow checkpoint of recovery with version-1 superblock NeilBrown
` (5 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
At shutdown, we switch all arrays to read-only, which creates
a message for every instantiated array, even those which aren't
actually active.
So remove the message for non-active arrays.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:10:18.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:10:18.000000000 +1000
@@ -2898,7 +2898,7 @@ static int do_md_stop(mddev_t * mddev, i
if (disk)
set_capacity(disk, 0);
mddev->changed = 1;
- } else
+ } else if (mddev->pers)
printk(KERN_INFO "md: %s switched to read-only mode.\n",
mdname(mddev));
err = 0;
* [PATCH 007 of 11] md: Allow checkpoint of recovery with version-1 superblock.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
` (4 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 006 of 11] md: Remove nuisance message at shutdown NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 008 of 11] md: Allow a linear array to have drives added while active NeilBrown
` (4 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
For a while we have had checkpointing of resync.
The version-1 superblock allows recovery to be checkpointed
as well, and this patch implements that.
Due to early carelessness we need to add a feature flag
to signal that the recovery_offset field is in use, otherwise
older kernels would assume that a partially recovered array
is in fact fully recovered.
Signed-off-by: Neil Brown <neilb@suse.de>
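In outline, the on-disk rule this adds is: a kernel reading a version-1
superblock must check the new feature bit before trusting a device's
'active' role slot. A simplified sketch of that decision, using the flag
value from md_p.h in this patch but a cut-down stand-in for
mdp_superblock_1 rather than the real layout:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MD_FEATURE_RECOVERY_OFFSET 2    /* new feature_map bit */

    struct sb_view {                        /* simplified stand-in, not the real struct */
        uint32_t feature_map;
        uint64_t recovery_offset;           /* sectors already rebuilt on this device */
    };

    /* A device listed with an active role: fully in sync, or partly rebuilt? */
    static bool device_in_sync(const struct sb_view *sb)
    {
        if (sb->feature_map & MD_FEATURE_RECOVERY_OFFSET)
            return false;                   /* partially recovered: resume at recovery_offset */
        return true;                        /* old semantics: an active role meant fully recovered */
    }

    int main(void)
    {
        struct sb_view old_style = { 0, 0 };
        struct sb_view partial = { MD_FEATURE_RECOVERY_OFFSET, 12345 };

        printf("plain sb in sync: %d, checkpointed sb in sync: %d\n",
               device_in_sync(&old_style), device_in_sync(&partial));
        return 0;
    }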
### Diffstat output
./drivers/md/md.c | 115 +++++++++++++++++++++++++++++++++++---------
./drivers/md/raid1.c | 3 -
./drivers/md/raid10.c | 3 -
./drivers/md/raid5.c | 1
./include/linux/raid/md_k.h | 6 ++
./include/linux/raid/md_p.h | 5 +
6 files changed, 109 insertions(+), 24 deletions(-)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:10:18.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:12:34.000000000 +1000
@@ -1165,7 +1165,11 @@ static int super_1_validate(mddev_t *mdd
set_bit(Faulty, &rdev->flags);
break;
default:
- set_bit(In_sync, &rdev->flags);
+ if ((le32_to_cpu(sb->feature_map) &
+ MD_FEATURE_RECOVERY_OFFSET))
+ rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
+ else
+ set_bit(In_sync, &rdev->flags);
rdev->raid_disk = role;
break;
}
@@ -1189,6 +1193,7 @@ static void super_1_sync(mddev_t *mddev,
sb->feature_map = 0;
sb->pad0 = 0;
+ sb->recovery_offset = cpu_to_le64(0);
memset(sb->pad1, 0, sizeof(sb->pad1));
memset(sb->pad2, 0, sizeof(sb->pad2));
memset(sb->pad3, 0, sizeof(sb->pad3));
@@ -1209,6 +1214,14 @@ static void super_1_sync(mddev_t *mddev,
sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_offset);
sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
}
+
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset > 0) {
+ sb->feature_map |= cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
+ sb->recovery_offset = cpu_to_le64(rdev->recovery_offset);
+ }
+
if (mddev->reshape_position != MaxSector) {
sb->feature_map |= cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE);
sb->reshape_position = cpu_to_le64(mddev->reshape_position);
@@ -1233,11 +1246,12 @@ static void super_1_sync(mddev_t *mddev,
sb->dev_roles[i] = cpu_to_le16(0xfffe);
else if (test_bit(In_sync, &rdev2->flags))
sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
+ else if (rdev2->raid_disk >= 0 && rdev2->recovery_offset > 0)
+ sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
else
sb->dev_roles[i] = cpu_to_le16(0xffff);
}
- sb->recovery_offset = cpu_to_le64(0); /* not supported yet */
sb->sb_csum = calc_sb_1_csum(sb);
}
@@ -2590,8 +2604,6 @@ static struct kobject *md_probe(dev_t de
return NULL;
}
-void md_wakeup_thread(mdk_thread_t *thread);
-
static void md_safemode_timeout(unsigned long data)
{
mddev_t *mddev = (mddev_t *) data;
@@ -2773,6 +2785,36 @@ static int do_md_run(mddev_t * mddev)
mddev->queue->queuedata = mddev;
mddev->queue->make_request_fn = mddev->pers->make_request;
+ /* If there is a partially-recovered drive we need to
+ * start recovery here. If we leave it to md_check_recovery,
+ * it will remove the drives and not do the right thing
+ */
+ if (mddev->degraded) {
+ struct list_head *rtmp;
+ int spares = 0;
+ ITERATE_RDEV(mddev,rdev,rtmp)
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(In_sync, &rdev->flags) &&
+ !test_bit(Faulty, &rdev->flags))
+ /* complete an interrupted recovery */
+ spares++;
+ if (spares && mddev->pers->sync_request) {
+ mddev->recovery = 0;
+ set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+ mddev->sync_thread = md_register_thread(md_do_sync,
+ mddev,
+ "%s_resync");
+ if (!mddev->sync_thread) {
+ printk(KERN_ERR "%s: could not start resync"
+ " thread...\n",
+ mdname(mddev));
+ /* leave the spares where they are, it shouldn't hurt */
+ mddev->recovery = 0;
+ } else
+ md_wakeup_thread(mddev->sync_thread);
+ }
+ }
+
mddev->changed = 1;
md_new_event(mddev);
return 0;
@@ -2806,6 +2848,7 @@ static int restart_array(mddev_t *mddev)
*/
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
+ md_wakeup_thread(mddev->sync_thread);
err = 0;
} else {
printk(KERN_ERR "md: %s has no personality assigned.\n",
@@ -2829,6 +2872,7 @@ static int do_md_stop(mddev_t * mddev, i
}
if (mddev->sync_thread) {
+ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_unregister_thread(mddev->sync_thread);
mddev->sync_thread = NULL;
@@ -2858,13 +2902,14 @@ static int do_md_stop(mddev_t * mddev, i
if (mddev->ro)
mddev->ro = 0;
}
- if (!mddev->in_sync) {
+ if (!mddev->in_sync || mddev->sb_dirty) {
/* mark array as shutdown cleanly */
mddev->in_sync = 1;
md_update_sb(mddev);
}
if (ro)
set_disk_ro(disk, 1);
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
}
/*
@@ -4652,10 +4697,14 @@ void md_do_sync(mddev_t *mddev)
struct list_head *tmp;
sector_t last_check;
int skipped = 0;
+ struct list_head *rtmp;
+ mdk_rdev_t *rdev;
/* just incase thread restarts... */
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
return;
+ if (mddev->ro) /* never try to sync a read-only array */
+ return;
/* we overload curr_resync somewhat here.
* 0 == not engaged in resync at all
@@ -4714,17 +4763,30 @@ void md_do_sync(mddev_t *mddev)
}
} while (mddev->curr_resync < 2);
+ j = 0;
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
/* resync follows the size requested by the personality,
* which defaults to physical size, but can be virtual size
*/
max_sectors = mddev->resync_max_sectors;
mddev->resync_mismatches = 0;
+ /* we don't use the checkpoint if there's a bitmap */
+ if (!mddev->bitmap &&
+ !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+ j = mddev->recovery_cp;
} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
max_sectors = mddev->size << 1;
- else
+ else {
/* recovery follows the physical size of devices */
max_sectors = mddev->size << 1;
+ j = MaxSector;
+ ITERATE_RDEV(mddev,rdev,rtmp)
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(Faulty, &rdev->flags) &&
+ !test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < j)
+ j = rdev->recovery_offset;
+ }
printk(KERN_INFO "md: syncing RAID array %s\n", mdname(mddev));
printk(KERN_INFO "md: minimum _guaranteed_ reconstruction speed:"
@@ -4734,12 +4796,7 @@ void md_do_sync(mddev_t *mddev)
speed_max(mddev));
is_mddev_idle(mddev); /* this also initializes IO event counters */
- /* we don't use the checkpoint if there's a bitmap */
- if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && !mddev->bitmap
- && ! test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
- j = mddev->recovery_cp;
- else
- j = 0;
+
io_sectors = 0;
for (m = 0; m < SYNC_MARKS; m++) {
mark[m] = jiffies;
@@ -4860,15 +4917,28 @@ void md_do_sync(mddev_t *mddev)
if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
- mddev->curr_resync > 2 &&
- mddev->curr_resync >= mddev->recovery_cp) {
- if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
- printk(KERN_INFO
- "md: checkpointing recovery of %s.\n",
- mdname(mddev));
- mddev->recovery_cp = mddev->curr_resync;
- } else
- mddev->recovery_cp = MaxSector;
+ mddev->curr_resync > 2) {
+ if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+ if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+ if (mddev->curr_resync >= mddev->recovery_cp) {
+ printk(KERN_INFO
+ "md: checkpointing recovery of %s.\n",
+ mdname(mddev));
+ mddev->recovery_cp = mddev->curr_resync;
+ }
+ } else
+ mddev->recovery_cp = MaxSector;
+ } else {
+ if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+ mddev->curr_resync = MaxSector;
+ ITERATE_RDEV(mddev,rdev,rtmp)
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(Faulty, &rdev->flags) &&
+ !test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < mddev->curr_resync)
+ rdev->recovery_offset = mddev->curr_resync;
+ mddev->sb_dirty = 1;
+ }
}
skip:
@@ -4989,6 +5059,8 @@ void md_check_recovery(mddev_t *mddev)
clear_bit(MD_RECOVERY_INTR, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+ if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
+ goto unlock;
/* no recovery is running.
* remove any failed drives, then
* add spares if possible.
@@ -5011,6 +5083,7 @@ void md_check_recovery(mddev_t *mddev)
ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev->raid_disk < 0
&& !test_bit(Faulty, &rdev->flags)) {
+ rdev->recovery_offset = 0;
if (mddev->pers->hot_add_disk(mddev,rdev)) {
char nm[20];
sprintf(nm, "rd%d", rdev->raid_disk);
diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~ 2006-05-01 15:10:00.000000000 +1000
+++ ./drivers/md/raid1.c 2006-05-01 15:12:34.000000000 +1000
@@ -1888,7 +1888,8 @@ static int run(mddev_t *mddev)
disk = conf->mirrors + i;
- if (!disk->rdev) {
+ if (!disk->rdev ||
+ !test_bit(In_sync, &rdev->flags)) {
disk->head_position = 0;
mddev->degraded++;
}
diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~ 2006-05-01 15:10:17.000000000 +1000
+++ ./drivers/md/raid10.c 2006-05-01 15:12:34.000000000 +1000
@@ -2015,7 +2015,8 @@ static int run(mddev_t *mddev)
disk = conf->mirrors + i;
- if (!disk->rdev) {
+ if (!disk->rdev ||
+ !test_bit(In_sync, &rdev->flags)) {
disk->head_position = 0;
mddev->degraded++;
}
diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2006-05-01 15:10:18.000000000 +1000
+++ ./drivers/md/raid5.c 2006-05-01 15:12:34.000000000 +1000
@@ -3555,6 +3555,7 @@ static int raid5_start_reshape(mddev_t *
set_bit(In_sync, &rdev->flags);
conf->working_disks++;
added_devices++;
+ rdev->recovery_offset = 0;
sprintf(nm, "rd%d", rdev->raid_disk);
sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
} else
diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~ 2006-05-01 15:10:17.000000000 +1000
+++ ./include/linux/raid/md_k.h 2006-05-01 15:12:34.000000000 +1000
@@ -88,6 +88,10 @@ struct mdk_rdev_s
* array and could again if we did a partial
* resync from the bitmap
*/
+ sector_t recovery_offset;/* If this device has been partially
+ * recovered, this is where we were
+ * up to.
+ */
atomic_t nr_pending; /* number of pending requests.
* only maintained for arrays that
@@ -183,6 +187,8 @@ struct mddev_s
#define MD_RECOVERY_REQUESTED 6
#define MD_RECOVERY_CHECK 7
#define MD_RECOVERY_RESHAPE 8
+#define MD_RECOVERY_FROZEN 9
+
unsigned long recovery;
int in_sync; /* know to not need resync */
diff ./include/linux/raid/md_p.h~current~ ./include/linux/raid/md_p.h
--- ./include/linux/raid/md_p.h~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./include/linux/raid/md_p.h 2006-05-01 15:12:34.000000000 +1000
@@ -265,9 +265,12 @@ struct mdp_superblock_1 {
/* feature_map bits */
#define MD_FEATURE_BITMAP_OFFSET 1
+#define MD_FEATURE_RECOVERY_OFFSET 2 /* recovery_offset is present and
+ * must be honoured
+ */
#define MD_FEATURE_RESHAPE_ACTIVE 4
-#define MD_FEATURE_ALL 5
+#define MD_FEATURE_ALL (1|2|4)
#endif
* [PATCH 008 of 11] md: Allow a linear array to have drives added while active.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
` (5 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 007 of 11] md: Allow checkpoint of recovery with version-1 superblock NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-01 5:30 ` [PATCH 009 of 11] md: Support stripe/offset mode in raid10 NeilBrown
` (3 subsequent siblings)
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/linear.c | 74 ++++++++++++++++++++++++++++++++++--------
./drivers/md/md.c | 15 +++++++-
./include/linux/raid/linear.h | 2 +
3 files changed, 76 insertions(+), 15 deletions(-)
diff ./drivers/md/linear.c~current~ ./drivers/md/linear.c
--- ./drivers/md/linear.c~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./drivers/md/linear.c 2006-05-01 15:13:14.000000000 +1000
@@ -111,7 +111,7 @@ static int linear_issue_flush(request_qu
return ret;
}
-static int linear_run (mddev_t *mddev)
+static linear_conf_t *linear_conf(mddev_t *mddev, int raid_disks)
{
linear_conf_t *conf;
dev_info_t **table;
@@ -121,20 +121,21 @@ static int linear_run (mddev_t *mddev)
sector_t curr_offset;
struct list_head *tmp;
- conf = kzalloc (sizeof (*conf) + mddev->raid_disks*sizeof(dev_info_t),
+ conf = kzalloc (sizeof (*conf) + raid_disks*sizeof(dev_info_t),
GFP_KERNEL);
if (!conf)
- goto out;
+ return NULL;
+
mddev->private = conf;
cnt = 0;
- mddev->array_size = 0;
+ conf->array_size = 0;
ITERATE_RDEV(mddev,rdev,tmp) {
int j = rdev->raid_disk;
dev_info_t *disk = conf->disks + j;
- if (j < 0 || j > mddev->raid_disks || disk->rdev) {
+ if (j < 0 || j > raid_disks || disk->rdev) {
printk("linear: disk numbering problem. Aborting!\n");
goto out;
}
@@ -152,11 +153,11 @@ static int linear_run (mddev_t *mddev)
blk_queue_max_sectors(mddev->queue, PAGE_SIZE>>9);
disk->size = rdev->size;
- mddev->array_size += rdev->size;
+ conf->array_size += rdev->size;
cnt++;
}
- if (cnt != mddev->raid_disks) {
+ if (cnt != raid_disks) {
printk("linear: not enough drives present. Aborting!\n");
goto out;
}
@@ -200,7 +201,7 @@ static int linear_run (mddev_t *mddev)
unsigned round;
unsigned long base;
- sz = mddev->array_size >> conf->preshift;
+ sz = conf->array_size >> conf->preshift;
sz += 1; /* force round-up */
base = conf->hash_spacing >> conf->preshift;
round = sector_div(sz, base);
@@ -247,14 +248,56 @@ static int linear_run (mddev_t *mddev)
BUG_ON(table - conf->hash_table > nb_zone);
+ return conf;
+
+out:
+ kfree(conf);
+ return NULL;
+}
+
+static int linear_run (mddev_t *mddev)
+{
+ linear_conf_t *conf;
+
+ conf = linear_conf(mddev, mddev->raid_disks);
+
+ if (!conf)
+ return 1;
+ mddev->private = conf;
+ mddev->array_size = conf->array_size;
+
blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
mddev->queue->unplug_fn = linear_unplug;
mddev->queue->issue_flush_fn = linear_issue_flush;
return 0;
+}
-out:
- kfree(conf);
- return 1;
+static int linear_add(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+ /* Adding a drive to a linear array allows the array to grow.
+ * It is permitted if the new drive has a matching superblock
+ * already on it, with raid_disk equal to raid_disks.
+ * It is achieved by creating a new linear_private_data structure
+ * and swapping it in in-place of the current one.
+ * The current one is never freed until the array is stopped.
+ * This avoids races.
+ */
+ linear_conf_t *newconf;
+
+ if (rdev->raid_disk != mddev->raid_disks)
+ return -EINVAL;
+
+ newconf = linear_conf(mddev,mddev->raid_disks+1);
+
+ if (!newconf)
+ return -ENOMEM;
+
+ newconf->prev = mddev_to_conf(mddev);
+ mddev->private = newconf;
+ mddev->raid_disks++;
+ mddev->array_size = newconf->array_size;
+ set_capacity(mddev->gendisk, mddev->array_size << 1);
+ return 0;
}
static int linear_stop (mddev_t *mddev)
@@ -262,8 +305,12 @@ static int linear_stop (mddev_t *mddev)
linear_conf_t *conf = mddev_to_conf(mddev);
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
- kfree(conf->hash_table);
- kfree(conf);
+ do {
+ linear_conf_t *t = conf->prev;
+ kfree(conf->hash_table);
+ kfree(conf);
+ conf = t;
+ } while (conf);
return 0;
}
@@ -360,6 +407,7 @@ static struct mdk_personality linear_per
.run = linear_run,
.stop = linear_stop,
.status = linear_status,
+ .hot_add_disk = linear_add,
};
static int __init linear_init (void)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:12:34.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:13:14.000000000 +1000
@@ -807,8 +807,8 @@ static int super_90_validate(mddev_t *md
if (desc->state & (1<<MD_DISK_FAULTY))
set_bit(Faulty, &rdev->flags);
- else if (desc->state & (1<<MD_DISK_SYNC) &&
- desc->raid_disk < mddev->raid_disks) {
+ else if (desc->state & (1<<MD_DISK_SYNC) /* &&
+ desc->raid_disk < mddev->raid_disks */) {
set_bit(In_sync, &rdev->flags);
rdev->raid_disk = desc->raid_disk;
}
@@ -3346,6 +3346,17 @@ static int add_new_disk(mddev_t * mddev,
rdev->raid_disk = -1;
err = bind_rdev_to_array(rdev, mddev);
+ if (!err && !mddev->pers->hot_remove_disk) {
+ /* If there is hot_add_disk but no hot_remove_disk
+ * then added disks for geometry changes,
+ * and should be added immediately.
+ */
+ super_types[mddev->major_version].
+ validate_super(mddev, rdev);
+ err = mddev->pers->hot_add_disk(mddev, rdev);
+ if (err)
+ unbind_rdev_from_array(rdev);
+ }
if (err)
export_rdev(rdev);
diff ./include/linux/raid/linear.h~current~ ./include/linux/raid/linear.h
--- ./include/linux/raid/linear.h~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./include/linux/raid/linear.h 2006-05-01 15:13:14.000000000 +1000
@@ -13,8 +13,10 @@ typedef struct dev_info dev_info_t;
struct linear_private_data
{
+ struct linear_private_data *prev; /* earlier version */
dev_info_t **hash_table;
sector_t hash_spacing;
+ sector_t array_size;
int preshift; /* shift before dividing by hash_spacing */
dev_info_t disks[0];
};
* [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assorted md enhancements for 2.6.18 NeilBrown
` (6 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 008 of 11] md: Allow a linear array to have drives added while active NeilBrown
@ 2006-05-01 5:30 ` NeilBrown
2006-05-02 16:38 ` Al Boldi
2006-05-01 5:31 ` [PATCH 010 of 11] md: make md_print_devices() static NeilBrown
` (2 subsequent siblings)
10 siblings, 1 reply; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
The "industry standard" DDF format allows for a stripe/offset layout
where data is duplicated on different stripes. e.g.
A B C D
D A B C
E F G H
H E F G
(columns are drives, rows are stripes, LETTERS are chunks of data).
This is similar to raid10's 'far' mode, but not quite the same. So
enhance 'far' mode with a 'far/offset' option which follows the layout
of DDF's stripe/offset.
Signed-off-by: Neil Brown <neilb@suse.de>
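A quick way to see the placement described above: with 4 drives,
far_copies=2 and near_copies assumed to be 1, the offset layout repeats
each data stripe on the very next stripe with the devices rotated right by
one. The standalone sketch below just prints that table; it illustrates
the geometry and is not the kernel's raid10_find_phys code:

    #include <stdio.h>

    int main(void)
    {
        int raid_disks = 4, far_copies = 2, chunks = 8;  /* A..H as above */

        /* every group of far_copies consecutive stripes holds far_copies
         * copies of one data stripe; copy c is rotated right by c devices */
        for (int stripe = 0; stripe < (chunks / raid_disks) * far_copies; stripe++) {
            int data_stripe = stripe / far_copies;
            int copy = stripe % far_copies;
            for (int dev = 0; dev < raid_disks; dev++) {
                int chunk = data_stripe * raid_disks +
                            (dev - copy + raid_disks) % raid_disks;
                printf(" %c", 'A' + chunk);
            }
            printf("\n");
        }
        return 0;   /* prints the A B C D / D A B C / E F G H / H E F G table */
    }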
### Diffstat output
./drivers/md/raid10.c | 64 ++++++++++++++++++++++++++++--------------
./include/linux/raid/raid10.h | 7 +++-
2 files changed, 49 insertions(+), 22 deletions(-)
diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~ 2006-05-01 15:12:34.000000000 +1000
+++ ./drivers/md/raid10.c 2006-05-01 15:13:22.000000000 +1000
@@ -29,6 +29,7 @@
* raid_disks
* near_copies (stored in low byte of layout)
* far_copies (stored in second byte of layout)
+ * far_offset (stored in bit 16 of layout )
*
* The data to be stored is divided into chunks using chunksize.
* Each device is divided into far_copies sections.
@@ -36,10 +37,14 @@
* near_copies copies of each chunk is stored (each on a different drive).
* The starting device for each section is offset near_copies from the starting
* device of the previous section.
- * Thus there are (near_copies*far_copies) of each chunk, and each is on a different
+ * Thus they are (near_copies*far_copies) of each chunk, and each is on a different
* drive.
* near_copies and far_copies must be at least one, and their product is at most
* raid_disks.
+ *
+ * If far_offset is true, then the far_copies are handled a bit differently.
+ * The copies are still in different stripes, but instead of be very far apart
+ * on disk, there are adjacent stripes.
*/
/*
@@ -357,8 +362,7 @@ static int raid10_end_write_request(stru
* With this layout, and block is never stored twice on the one device.
*
* raid10_find_phys finds the sector offset of a given virtual sector
- * on each device that it is on. If a block isn't on a device,
- * that entry in the array is set to MaxSector.
+ * on each device that it is on.
*
* raid10_find_virt does the reverse mapping, from a device and a
* sector offset to a virtual address
@@ -381,6 +385,8 @@ static void raid10_find_phys(conf_t *con
chunk *= conf->near_copies;
stripe = chunk;
dev = sector_div(stripe, conf->raid_disks);
+ if (conf->far_offset)
+ stripe *= conf->far_copies;
sector += stripe << conf->chunk_shift;
@@ -414,16 +420,24 @@ static sector_t raid10_find_virt(conf_t
{
sector_t offset, chunk, vchunk;
- while (sector > conf->stride) {
- sector -= conf->stride;
- if (dev < conf->near_copies)
- dev += conf->raid_disks - conf->near_copies;
- else
- dev -= conf->near_copies;
- }
-
offset = sector & conf->chunk_mask;
- chunk = sector >> conf->chunk_shift;
+ if (conf->far_offset) {
+ int fc;
+ chunk = sector >> conf->chunk_shift;
+ fc = sector_div(chunk, conf->far_copies);
+ dev -= fc * conf->near_copies;
+ if (dev < 0)
+ dev += conf->raid_disks;
+ } else {
+ while (sector > conf->stride) {
+ sector -= conf->stride;
+ if (dev < conf->near_copies)
+ dev += conf->raid_disks - conf->near_copies;
+ else
+ dev -= conf->near_copies;
+ }
+ chunk = sector >> conf->chunk_shift;
+ }
vchunk = chunk * conf->raid_disks + dev;
sector_div(vchunk, conf->near_copies);
return (vchunk << conf->chunk_shift) + offset;
@@ -900,9 +914,12 @@ static void status(struct seq_file *seq,
seq_printf(seq, " %dK chunks", mddev->chunk_size/1024);
if (conf->near_copies > 1)
seq_printf(seq, " %d near-copies", conf->near_copies);
- if (conf->far_copies > 1)
- seq_printf(seq, " %d far-copies", conf->far_copies);
-
+ if (conf->far_copies > 1) {
+ if (conf->far_offset)
+ seq_printf(seq, " %d offset-copies", conf->far_copies);
+ else
+ seq_printf(seq, " %d far-copies", conf->far_copies);
+ }
seq_printf(seq, " [%d/%d] [", conf->raid_disks,
conf->working_disks);
for (i = 0; i < conf->raid_disks; i++)
@@ -1915,7 +1932,7 @@ static int run(mddev_t *mddev)
mirror_info_t *disk;
mdk_rdev_t *rdev;
struct list_head *tmp;
- int nc, fc;
+ int nc, fc, fo;
sector_t stride, size;
if (mddev->chunk_size == 0) {
@@ -1925,8 +1942,9 @@ static int run(mddev_t *mddev)
nc = mddev->layout & 255;
fc = (mddev->layout >> 8) & 255;
+ fo = mddev->layout & (1<<16);
if ((nc*fc) <2 || (nc*fc) > mddev->raid_disks ||
- (mddev->layout >> 16)) {
+ (mddev->layout >> 17)) {
printk(KERN_ERR "raid10: %s: unsupported raid10 layout: 0x%8x\n",
mdname(mddev), mddev->layout);
goto out;
@@ -1958,12 +1976,16 @@ static int run(mddev_t *mddev)
conf->near_copies = nc;
conf->far_copies = fc;
conf->copies = nc*fc;
+ conf->far_offset = fo;
conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
- stride = mddev->size >> (conf->chunk_shift-1);
- sector_div(stride, fc);
- conf->stride = stride << conf->chunk_shift;
-
+ if (fo)
+ conf->stride = 1 << conf->chunk_shift;
+ else {
+ stride = mddev->size >> (conf->chunk_shift-1);
+ sector_div(stride, fc);
+ conf->stride = stride << conf->chunk_shift;
+ }
conf->r10bio_pool = mempool_create(NR_RAID10_BIOS, r10bio_pool_alloc,
r10bio_pool_free, conf);
if (!conf->r10bio_pool) {
diff ./include/linux/raid/raid10.h~current~ ./include/linux/raid/raid10.h
--- ./include/linux/raid/raid10.h~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./include/linux/raid/raid10.h 2006-05-01 15:13:22.000000000 +1000
@@ -24,11 +24,16 @@ struct r10_private_data_s {
int far_copies; /* number of copies layed out
* at large strides across drives
*/
+ int far_offset; /* far_copies are offset by 1 stripe
+ * instead of many
+ */
int copies; /* near_copies * far_copies.
* must be <= raid_disks
*/
sector_t stride; /* distance between far copies.
- * This is size / far_copies
+ * This is size / far_copies unless
+ * far_offset, in which case it is
+ * 1 stripe.
*/
int chunk_shift; /* shift from chunks to sectors */
* Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-01 5:30 ` [PATCH 009 of 11] md: Support stripe/offset mode in raid10 NeilBrown
@ 2006-05-02 16:38 ` Al Boldi
2006-05-03 0:05 ` Neil Brown
0 siblings, 1 reply; 29+ messages in thread
From: Al Boldi @ 2006-05-02 16:38 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, linux-kernel, Andrew Morton
NeilBrown wrote:
> The "industry standard" DDF format allows for a stripe/offset layout
> where data is duplicated on different stripes. e.g.
>
> A B C D
> D A B C
> E F G H
> H E F G
>
> (columns are drives, rows are stripes, LETTERS are chunks of data).
Presumably, this is the case for --layout=f2 ?
If so, would --layout=f4 result in a 4-mirror/striped array?
Also, would it be possible to have a staged write-back mechanism across
multiple stripes?
Thanks!
--
Al
* Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-02 16:38 ` Al Boldi
@ 2006-05-03 0:05 ` Neil Brown
2006-05-03 4:00 ` Al Boldi
0 siblings, 1 reply; 29+ messages in thread
From: Neil Brown @ 2006-05-03 0:05 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-raid, linux-kernel, Andrew Morton
On Tuesday May 2, a1426z@gawab.com wrote:
> NeilBrown wrote:
> > The "industry standard" DDF format allows for a stripe/offset layout
> > where data is duplicated on different stripes. e.g.
> >
> > A B C D
> > D A B C
> > E F G H
> > H E F G
> >
> > (columns are drives, rows are stripes, LETTERS are chunks of data).
>
> Presumably, this is the case for --layout=f2 ?
Almost. mdadm doesn't support this layout yet.
'f2' is a similar layout, but the offset stripes are a lot further
down the drives.
It will possibly be called 'o2' or 'offset2'.
> If so, would --layout=f4 result in a 4-mirror/striped array?
o4 on a 4 drive array would be
A B C D
D A B C
C D A B
B C D A
E F G H
....
>
> Also, would it be possible to have a staged write-back mechanism across
> multiple stripes?
What exactly would that mean? And what would be the advantage?
NeilBrown
* Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-03 0:05 ` Neil Brown
@ 2006-05-03 4:00 ` Al Boldi
2006-05-08 7:17 ` Neil Brown
0 siblings, 1 reply; 29+ messages in thread
From: Al Boldi @ 2006-05-03 4:00 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, linux-kernel, Andrew Morton
Neil Brown wrote:
> On Tuesday May 2, a1426z@gawab.com wrote:
> > NeilBrown wrote:
> > > The "industry standard" DDF format allows for a stripe/offset layout
> > > where data is duplicated on different stripes. e.g.
> > >
> > > A B C D
> > > D A B C
> > > E F G H
> > > H E F G
> > >
> > > (columns are drives, rows are stripes, LETTERS are chunks of data).
> >
> > Presumably, this is the case for --layout=f2 ?
>
> Almost. mdadm doesn't support this layout yet.
> 'f2' is a similar layout, but the offset stripes are a lot further
> down the drives.
> It will possibly be called 'o2' or 'offset2'.
>
> > If so, would --layout=f4 result in a 4-mirror/striped array?
>
> o4 on a 4 drive array would be
>
> A B C D
> D A B C
> C D A B
> B C D A
> E F G H
> ....
Yes, so would this give us 4 physically duplicate mirrors?
If not, would it be possible to add a far-offset mode to yield such a layout?
> > Also, would it be possible to have a staged write-back mechanism across
> > multiple stripes?
>
> What exactly would that mean?
Write the first stripe, then write subsequent duplicate stripes based on idle
with a max delay for each delayed stripe.
> And what would be the advantage?
Faster burst writes, probably.
Thanks!
--
Al
* Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-03 4:00 ` Al Boldi
@ 2006-05-08 7:17 ` Neil Brown
2006-05-08 16:59 ` Al Boldi
2006-05-17 21:32 ` Raid5 resize "testing opportunity" Patrik Jonsson
0 siblings, 2 replies; 29+ messages in thread
From: Neil Brown @ 2006-05-08 7:17 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-raid, linux-kernel
On Wednesday May 3, a1426z@gawab.com wrote:
> Neil Brown wrote:
> > On Tuesday May 2, a1426z@gawab.com wrote:
> > > NeilBrown wrote:
> > > > The "industry standard" DDF format allows for a stripe/offset layout
> > > > where data is duplicated on different stripes. e.g.
> > > >
> > > > A B C D
> > > > D A B C
> > > > E F G H
> > > > H E F G
> > > >
> > > > (columns are drives, rows are stripes, LETTERS are chunks of data).
> > >
> > > Presumably, this is the case for --layout=f2 ?
> >
> > Almost. mdadm doesn't support this layout yet.
> > 'f2' is a similar layout, but the offset stripes are a lot further
> > down the drives.
> > It will possibly be called 'o2' or 'offset2'.
> >
> > > If so, would --layout=f4 result in a 4-mirror/striped array?
> >
> > o4 on a 4 drive array would be
> >
> > A B C D
> > D A B C
> > C D A B
> > B C D A
> > E F G H
> > ....
>
> Yes, so would this give us 4 physically duplicate mirrors?
It would give 4 devices each containing the same data, but in a
different layout - much as the picture shows....
> If not, would it be possible to add a far-offset mode to yield such
> a layout?
Exactly what sort of layout do you want?
>
> > > Also, would it be possible to have a staged write-back mechanism across
> > > multiple stripes?
> >
> > What exactly would that mean?
>
> Write the first stripe, then write subsequent duplicate stripes when the array
> is idle, with a max delay for each delayed stripe.
>
> > And what would be the advantage?
>
> Faster burst writes, probably.
I still don't get what you are after.
You always need to wait for writes of all copies to complete before
acknowledging the write to the filesystem, otherwise you risk
corruption if there is a crash and a device failure.
So inserting any delays (other than the per-device plugging which
helps to group adjacent requests) isn't going to make things go
faster.
NeilBrown
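The rule can be sketched in a few lines (this is an illustration only, not
the raid1/raid10 code; the names are invented): each request carries a count
of outstanding copies, and only the completion that drops that count to zero
may report the write back up, successfully or otherwise.

/* Sketch of the completion rule described above; not the md code.
 * A mirrored write is only acknowledged to the upper layer once
 * every copy has completed; any failed copy is remembered so the
 * final status reflects it.
 */
#include <stdatomic.h>
#include <stdio.h>

struct mirrored_write {
        atomic_int remaining;   /* copies still in flight */
        atomic_int error;       /* non-zero if any copy failed */
};

/* called from each per-device completion path */
static void copy_done(struct mirrored_write *w, int err)
{
        if (err)
                atomic_store(&w->error, 1);
        if (atomic_fetch_sub(&w->remaining, 1) == 1) {
                /* the last copy just finished: only now may the write
                 * be acknowledged to the filesystem */
                printf("write acknowledged, error=%d\n",
                       atomic_load(&w->error));
        }
}

int main(void)
{
        struct mirrored_write w;

        atomic_init(&w.remaining, 2);   /* two mirrors */
        atomic_init(&w.error, 0);

        copy_done(&w, 0);       /* first mirror completes: no ack yet */
        copy_done(&w, 0);       /* second mirror completes: ack */
        return 0;
}

Acknowledging any earlier than that is exactly the window described above:
a crash combined with a device failure could then expose data the filesystem
already believes is stable.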
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
2006-05-08 7:17 ` Neil Brown
@ 2006-05-08 16:59 ` Al Boldi
2006-05-17 21:32 ` Raid5 resize "testing opportunity" Patrik Jonsson
1 sibling, 0 replies; 29+ messages in thread
From: Al Boldi @ 2006-05-08 16:59 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, linux-kernel
Neil Brown wrote:
> On Wednesday May 3, a1426z@gawab.com wrote:
> > Neil Brown wrote:
> > > On Tuesday May 2, a1426z@gawab.com wrote:
> > > > NeilBrown wrote:
> > > > > The "industry standard" DDF format allows for a stripe/offset
> > > > > layout where data is duplicated on different stripes. e.g.
> > > > >
> > > > > A B C D
> > > > > D A B C
> > > > > E F G H
> > > > > H E F G
> > > > >
> > > > > (columns are drives, rows are stripes, LETTERS are chunks of
> > > > > data).
> > > >
> > > > Presumably, this is the case for --layout=f2 ?
> > >
> > > Almost. mdadm doesn't support this layout yet.
> > > 'f2' is a similar layout, but the offset stripes are a lot further
> > > down the drives.
> > > It will possibly be called 'o2' or 'offset2'.
> > >
> > > > If so, would --layout=f4 result in a 4-mirror/striped array?
> > >
> > > o4 on a 4 drive array would be
> > >
> > > A B C D
> > > D A B C
> > > C D A B
> > > B C D A
> > > E F G H
> > > ....
> >
> > Yes, so would this give us 4 physically duplicate mirrors?
>
> It would give 4 devices each containing the same data, but in a
> different layout - much as the picture shows....
>
> > If not, would it be possible to add a far-offset mode to yield such
> > a layout?
>
> Exactly what sort of layout do you want?
Something like this:
o1  A1 B1 C1 D1
o2  A1 A2 B1 B2
    C2 C1 D2 D1
o3  A1 A2 A3 B1
    B2 B3 C1 C2
    C3 D1 D2 D3
o4  A1 A2 A3 A4
    B2 B3 B4 B1
    C3 C4 C1 C2
    D4 D1 D2 D3
(columns are drives, numbers are stripes, LETTERS are chunks of data).
> > > > Also, would it be possible to have a staged write-back mechanism
> > > > across multiple stripes?
> > >
> > > What exactly would that mean?
> >
> > Write the first stripe, then write subsequent duplicate stripes when the
> > array is idle, with a max delay for each delayed stripe.
> >
> > > And what would be the advantage?
> >
> > Faster burst writes, probably.
>
> I still don't get what you are after.
> You always need to wait for writes of all copies to complete before
> acknowledging the write to the filesystem, otherwise you risk
> corruption if there is a crash and a device failure.
Yes, some people cannot stomach running degraded at any time, so they could
set the delay for the first duplicate stripe to 0, and subsequent stripes at
their leisure.
Thanks!
--
Al
^ permalink raw reply [flat|nested] 29+ messages in thread
* Raid5 resize "testing opportunity"
2006-05-08 7:17 ` Neil Brown
2006-05-08 16:59 ` Al Boldi
@ 2006-05-17 21:32 ` Patrik Jonsson
2006-05-17 23:49 ` Neil Brown
1 sibling, 1 reply; 29+ messages in thread
From: Patrik Jonsson @ 2006-05-17 21:32 UTC (permalink / raw)
Cc: linux-raid
Hi all,
For Neil's benefit (:-) I'm about to test the raid5 resize code by
trying to grow our 2TB raid5 from 8 to 10 devices. Currently, I'm
running a 2.6.16-rc4-mm2 kernel. Is this current enough to support the
resize? (I suspect not.) If I upgrade to 2.6.17-rc4-mm1, would that do
it, or is it even in stable 2.6.16.16?
Thanks,
/Patrik
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Raid5 resize "testing opportunity"
2006-05-17 21:32 ` Raid5 resize "testing opportunity" Patrik Jonsson
@ 2006-05-17 23:49 ` Neil Brown
2006-05-19 0:40 ` Patrik Jonsson
0 siblings, 1 reply; 29+ messages in thread
From: Neil Brown @ 2006-05-17 23:49 UTC (permalink / raw)
To: Patrik Jonsson; +Cc: linux-raid
On Wednesday May 17, patrik@ucolick.org wrote:
> Hi all,
>
> For Neil's benefit (:-) I'm about to test the raid5 resize code by
> trying to grow our 2TB raid5 from 8 to 10 devices. Currently, I'm
> running a 2.6.16-rc4-mm2 kernel. Is this current enough to support the
> resize? (I suspect not.) If I upgrade to 2.6.17-rc4-mm1, would that do
> it, or is it even in stable 2.6.16.16?
Thanks!
You need at least 2.6.17-rc1. I would suggest the latest -rc:
2.6.17-rc4
Don't use -mm. It could have new bugs, and you don't want them to
trouble you when you are growing your array.
I look forward to the results!
NeilBrown
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Raid5 resize "testing opportunity"
2006-05-17 23:49 ` Neil Brown
@ 2006-05-19 0:40 ` Patrik Jonsson
2006-05-19 0:44 ` Neil Brown
0 siblings, 1 reply; 29+ messages in thread
From: Patrik Jonsson @ 2006-05-19 0:40 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Hi Neil,
The raid5 reshape seems to have gone smoothly (nice job!), though it
took 11 hours! Are there any pieces of info you would like about the array?
cheers,
/Patrik
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Raid5 resize "testing opportunity"
2006-05-19 0:40 ` Patrik Jonsson
@ 2006-05-19 0:44 ` Neil Brown
2006-05-19 20:11 ` Per Lindstrand
0 siblings, 1 reply; 29+ messages in thread
From: Neil Brown @ 2006-05-19 0:44 UTC (permalink / raw)
To: Patrik Jonsson; +Cc: linux-raid
On Thursday May 18, patrik@ucolick.org wrote:
> Hi Neil,
>
> The raid5 reshape seems to have gone smoothly (nice job!), though it
> took 11 hours! Are there any pieces of info you would like about the array?
Excellent!
No, no other information would be useful.
This is the first real-life example I know of where 2 devices were added
at once. That should be no more difficult, but it is good to know
that it works in practice as well as in theory.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Raid5 resize "testing opportunity"
2006-05-19 0:44 ` Neil Brown
@ 2006-05-19 20:11 ` Per Lindstrand
0 siblings, 0 replies; 29+ messages in thread
From: Per Lindstrand @ 2006-05-19 20:11 UTC (permalink / raw)
To: Neil Brown; +Cc: Patrik Jonsson, linux-raid
Hi Neil,
I'm currently running an active raid5 array of 12 x 300GB SATA devices.
During the last couple of months I have grown my raid two times (from 4
to 8 to 12). I was using a 2.6.16-rc1 kernel with the (at that time)
latest md-patch.
I'm happy to say that both times the growing procedure completed
successfully!
This is how I did it:
At first I had 4 devices ( /dev/sd{a,b,c,d} ) running in an active raid5
array (chunk-size 256). When I bought 4 more I thought I’d try to grow
the existing array instead of running another one. I assembled my array
with the 4 original drives and made sure that it started without problems
(checked /proc/mdstat). After that I used cfdisk to give each of the 4 new
devices one huge partition of type FD (Linux raid autodetect) and added
them as spares with the command:
# mdadm --add /dev/md0 /dev/sd{e,f,g,h}1
After that I checked /proc/mdstat to confirm that they had been
successfully added and then executed the grow command:
# mdadm -Gv /dev/md0 -n8
which started the whole growing procedure. After that I waited (it took
about 6 hours to rebuild from 4 to 8 and almost 11 hours from 8 to 12).
The following information might not belong on the raid-list, but I
thought it might be useful to someone:
---------------------------------------------------------------------
The raid is encrypted with LUKS aes-cbc-essiv:sha256 and has an ext3
filesystem formatted with '-T largefile', -m0 and '-R stride=64'. After
I had successfully grown the raid5 array I managed to resize the LUKS
mapping and the ext3 filesystem with the following commands:
(After decrypting the raid using standard luksOpen procedure)
# cryptsetup resize cmd0
(no, I didn't forget the <size> argument - without it, cryptsetup resizes
the mapping to fill the underlying device)
# resize2fs -p /dev/mapper/cmd0
seemed to do the trick with the ext3 filesystem.
---------------------------------------------------------------------
This is how I did it both times and I must say, even though it was scary
as hell growing a raid of 2.1TB with need-to-have data, it was really
interesting and boy am I glad it worked! =)
I just thought I’d contribute my grow-story to the raid-list. It can
be nice to hear from those who succeed too, and not only when people have
accidents. =)
Thanks for the great work on the growing code!
Best regards
Per Lindstrand, Sweden
Neil Brown wrote:
> On Thursday May 18, patrik@ucolick.org wrote:
>> Hi Neil,
>>
>> The raid5 reshape seems to have gone smoothly (nice job!), though it
>> took 11 hours! Are there any pieces of info you would like about the array?
>
> Excellent!
>
> No, no other information would be useful.
> This is the first real-life example I know of where 2 devices were added
> at once. That should be no more difficult, but it is good to know
> that it works in practice as well as in theory.
>
> Thanks,
> NeilBrown
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 010 of 11] md: make md_print_devices() static
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assort md enhancements for 2.6.18 NeilBrown
` (7 preceding siblings ...)
2006-05-01 5:30 ` [PATCH 009 of 11] md: Support stripe/offset mode in raid10 NeilBrown
@ 2006-05-01 5:31 ` NeilBrown
2006-05-01 5:31 ` [PATCH 011 of 11] md: Split reshape portion of raid5 sync_request into a separate function NeilBrown
[not found] ` <1060501053025.22961@suse.de>
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Adrian Bunk
From: Adrian Bunk <bunk@stusta.de>
This patch makes the needlessly global md_print_devices() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 7 +++++--
./include/linux/raid/md.h | 4 ----
2 files changed, 5 insertions(+), 6 deletions(-)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:13:14.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:13:29.000000000 +1000
@@ -72,6 +72,10 @@ static void autostart_arrays (int part);
static LIST_HEAD(pers_list);
static DEFINE_SPINLOCK(pers_lock);
+static void md_print_devices(void);
+
+#define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
+
/*
* Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
* is 1000 KB/sec, so the extra system load does not show up that much.
@@ -1512,7 +1516,7 @@ static void print_rdev(mdk_rdev_t *rdev)
printk(KERN_INFO "md: no rdev superblock!\n");
}
-void md_print_devices(void)
+static void md_print_devices(void)
{
struct list_head *tmp, *tmp2;
mdk_rdev_t *rdev;
@@ -5310,7 +5314,6 @@ EXPORT_SYMBOL(md_write_end);
EXPORT_SYMBOL(md_register_thread);
EXPORT_SYMBOL(md_unregister_thread);
EXPORT_SYMBOL(md_wakeup_thread);
-EXPORT_SYMBOL(md_print_devices);
EXPORT_SYMBOL(md_check_recovery);
MODULE_LICENSE("GPL");
MODULE_ALIAS("md");
diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./include/linux/raid/md.h 2006-05-01 15:13:29.000000000 +1000
@@ -85,8 +85,6 @@ extern void md_done_sync(mddev_t *mddev,
extern void md_error (mddev_t *mddev, mdk_rdev_t *rdev);
extern void md_unplug_mddev(mddev_t *mddev);
-extern void md_print_devices (void);
-
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
@@ -97,7 +95,5 @@ extern void md_new_event(mddev_t *mddev)
extern void md_update_sb(mddev_t * mddev);
-#define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
-
#endif
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 011 of 11] md: Split reshape portion of raid5 sync_request into a separate function.
2006-05-01 5:29 [PATCH 000 of 11] md: Introduction - assort md enhancements for 2.6.18 NeilBrown
` (8 preceding siblings ...)
2006-05-01 5:31 ` [PATCH 010 of 11] md: make md_print_devices() static NeilBrown
@ 2006-05-01 5:31 ` NeilBrown
[not found] ` <1060501053025.22961@suse.de>
10 siblings, 0 replies; 29+ messages in thread
From: NeilBrown @ 2006-05-01 5:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
... as raid5 sync_request is WAY too big.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 244 ++++++++++++++++++++++++++-------------------------
1 file changed, 127 insertions(+), 117 deletions(-)
diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2006-05-01 15:12:34.000000000 +1000
+++ ./drivers/md/raid5.c 2006-05-01 15:13:41.000000000 +1000
@@ -2696,13 +2696,136 @@ static int make_request(request_queue_t
return 0;
}
-/* FIXME go_faster isn't used */
-static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
+static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped)
{
+ /* reshaping is quite different to recovery/resync so it is
+ * handled quite separately ... here.
+ *
+ * On each call to sync_request, we gather one chunk worth of
+ * destination stripes and flag them as expanding.
+ * Then we find all the source stripes and request reads.
+ * As the reads complete, handle_stripe will copy the data
+ * into the destination stripe and release that stripe.
+ */
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
struct stripe_head *sh;
int pd_idx;
sector_t first_sector, last_sector;
+ int raid_disks;
+ int data_disks;
+ int i;
+ int dd_idx;
+ sector_t writepos, safepos, gap;
+
+ if (sector_nr == 0 &&
+ conf->expand_progress != 0) {
+ /* restarting in the middle, skip the initial sectors */
+ sector_nr = conf->expand_progress;
+ sector_div(sector_nr, conf->raid_disks-1);
+ *skipped = 1;
+ return sector_nr;
+ }
+
+ /* we update the metadata when there is more than 3Meg
+ * in the block range (that is rather arbitrary, should
+ * probably be time based) or when the data about to be
+ * copied would over-write the source of the data at
+ * the front of the range.
+ * i.e. one new_stripe forward from expand_progress new_maps
+ * to after where expand_lo old_maps to
+ */
+ writepos = conf->expand_progress +
+ conf->chunk_size/512*(conf->raid_disks-1);
+ sector_div(writepos, conf->raid_disks-1);
+ safepos = conf->expand_lo;
+ sector_div(safepos, conf->previous_raid_disks-1);
+ gap = conf->expand_progress - conf->expand_lo;
+
+ if (writepos >= safepos ||
+ gap > (conf->raid_disks-1)*3000*2 /*3Meg*/) {
+ /* Cannot proceed until we've updated the superblock... */
+ wait_event(conf->wait_for_overlap,
+ atomic_read(&conf->reshape_stripes)==0);
+ mddev->reshape_position = conf->expand_progress;
+ mddev->sb_dirty = 1;
+ md_wakeup_thread(mddev->thread);
+ wait_event(mddev->sb_wait, mddev->sb_dirty == 0 ||
+ kthread_should_stop());
+ spin_lock_irq(&conf->device_lock);
+ conf->expand_lo = mddev->reshape_position;
+ spin_unlock_irq(&conf->device_lock);
+ wake_up(&conf->wait_for_overlap);
+ }
+
+ for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
+ int j;
+ int skipped = 0;
+ pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
+ sh = get_active_stripe(conf, sector_nr+i,
+ conf->raid_disks, pd_idx, 0);
+ set_bit(STRIPE_EXPANDING, &sh->state);
+ atomic_inc(&conf->reshape_stripes);
+ /* If any of this stripe is beyond the end of the old
+ * array, then we need to zero those blocks
+ */
+ for (j=sh->disks; j--;) {
+ sector_t s;
+ if (j == sh->pd_idx)
+ continue;
+ s = compute_blocknr(sh, j);
+ if (s < (mddev->array_size<<1)) {
+ skipped = 1;
+ continue;
+ }
+ memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
+ set_bit(R5_Expanded, &sh->dev[j].flags);
+ set_bit(R5_UPTODATE, &sh->dev[j].flags);
+ }
+ if (!skipped) {
+ set_bit(STRIPE_EXPAND_READY, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ }
+ release_stripe(sh);
+ }
+ spin_lock_irq(&conf->device_lock);
+ conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
+ spin_unlock_irq(&conf->device_lock);
+ /* Ok, those stripe are ready. We can start scheduling
+ * reads on the source stripes.
+ * The source stripes are determined by mapping the first and last
+ * block on the destination stripes.
+ */
+ raid_disks = conf->previous_raid_disks;
+ data_disks = raid_disks - 1;
+ first_sector =
+ raid5_compute_sector(sector_nr*(conf->raid_disks-1),
+ raid_disks, data_disks,
+ &dd_idx, &pd_idx, conf);
+ last_sector =
+ raid5_compute_sector((sector_nr+conf->chunk_size/512)
+ *(conf->raid_disks-1) -1,
+ raid_disks, data_disks,
+ &dd_idx, &pd_idx, conf);
+ if (last_sector >= (mddev->size<<1))
+ last_sector = (mddev->size<<1)-1;
+ while (first_sector <= last_sector) {
+ pd_idx = stripe_to_pdidx(first_sector, conf, conf->previous_raid_disks);
+ sh = get_active_stripe(conf, first_sector,
+ conf->previous_raid_disks, pd_idx, 0);
+ set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ release_stripe(sh);
+ first_sector += STRIPE_SECTORS;
+ }
+ return conf->chunk_size>>9;
+}
+
+/* FIXME go_faster isn't used */
+static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, int go_faster)
+{
+ raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
+ struct stripe_head *sh;
+ int pd_idx;
int raid_disks = conf->raid_disks;
int data_disks = raid_disks - conf->max_degraded;
sector_t max_sector = mddev->size << 1;
@@ -2728,122 +2851,9 @@ static sector_t sync_request(mddev_t *md
return 0;
}
- if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
- /* reshaping is quite different to recovery/resync so it is
- * handled quite separately ... here.
- *
- * On each call to sync_request, we gather one chunk worth of
- * destination stripes and flag them as expanding.
- * Then we find all the source stripes and request reads.
- * As the reads complete, handle_stripe will copy the data
- * into the destination stripe and release that stripe.
- */
- int i;
- int dd_idx;
- sector_t writepos, safepos, gap;
-
- if (sector_nr == 0 &&
- conf->expand_progress != 0) {
- /* restarting in the middle, skip the initial sectors */
- sector_nr = conf->expand_progress;
- sector_div(sector_nr, conf->raid_disks-1);
- *skipped = 1;
- return sector_nr;
- }
+ if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
+ return reshape_request(mddev, sector_nr, skipped);
- /* we update the metadata when there is more than 3Meg
- * in the block range (that is rather arbitrary, should
- * probably be time based) or when the data about to be
- * copied would over-write the source of the data at
- * the front of the range.
- * i.e. one new_stripe forward from expand_progress new_maps
- * to after where expand_lo old_maps to
- */
- writepos = conf->expand_progress +
- conf->chunk_size/512*(conf->raid_disks-1);
- sector_div(writepos, conf->raid_disks-1);
- safepos = conf->expand_lo;
- sector_div(safepos, conf->previous_raid_disks-1);
- gap = conf->expand_progress - conf->expand_lo;
-
- if (writepos >= safepos ||
- gap > (conf->raid_disks-1)*3000*2 /*3Meg*/) {
- /* Cannot proceed until we've updated the superblock... */
- wait_event(conf->wait_for_overlap,
- atomic_read(&conf->reshape_stripes)==0);
- mddev->reshape_position = conf->expand_progress;
- mddev->sb_dirty = 1;
- md_wakeup_thread(mddev->thread);
- wait_event(mddev->sb_wait, mddev->sb_dirty == 0 ||
- kthread_should_stop());
- spin_lock_irq(&conf->device_lock);
- conf->expand_lo = mddev->reshape_position;
- spin_unlock_irq(&conf->device_lock);
- wake_up(&conf->wait_for_overlap);
- }
-
- for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
- int j;
- int skipped = 0;
- pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
- sh = get_active_stripe(conf, sector_nr+i,
- conf->raid_disks, pd_idx, 0);
- set_bit(STRIPE_EXPANDING, &sh->state);
- atomic_inc(&conf->reshape_stripes);
- /* If any of this stripe is beyond the end of the old
- * array, then we need to zero those blocks
- */
- for (j=sh->disks; j--;) {
- sector_t s;
- if (j == sh->pd_idx)
- continue;
- s = compute_blocknr(sh, j);
- if (s < (mddev->array_size<<1)) {
- skipped = 1;
- continue;
- }
- memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
- set_bit(R5_Expanded, &sh->dev[j].flags);
- set_bit(R5_UPTODATE, &sh->dev[j].flags);
- }
- if (!skipped) {
- set_bit(STRIPE_EXPAND_READY, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- }
- release_stripe(sh);
- }
- spin_lock_irq(&conf->device_lock);
- conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
- spin_unlock_irq(&conf->device_lock);
- /* Ok, those stripe are ready. We can start scheduling
- * reads on the source stripes.
- * The source stripes are determined by mapping the first and last
- * block on the destination stripes.
- */
- raid_disks = conf->previous_raid_disks;
- data_disks = raid_disks - 1;
- first_sector =
- raid5_compute_sector(sector_nr*(conf->raid_disks-1),
- raid_disks, data_disks,
- &dd_idx, &pd_idx, conf);
- last_sector =
- raid5_compute_sector((sector_nr+conf->chunk_size/512)
- *(conf->raid_disks-1) -1,
- raid_disks, data_disks,
- &dd_idx, &pd_idx, conf);
- if (last_sector >= (mddev->size<<1))
- last_sector = (mddev->size<<1)-1;
- while (first_sector <= last_sector) {
- pd_idx = stripe_to_pdidx(first_sector, conf, conf->previous_raid_disks);
- sh = get_active_stripe(conf, first_sector,
- conf->previous_raid_disks, pd_idx, 0);
- set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- release_stripe(sh);
- first_sector += STRIPE_SECTORS;
- }
- return conf->chunk_size>>9;
- }
/* if there is too many failed drives and we are trying
* to resync, then assert that we are finished, because there is
* nothing we can do.
^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <1060501053025.22961@suse.de>]
* Re: [PATCH 005 of 11] md: Merge raid5 and raid6 code
[not found] ` <1060501053025.22961@suse.de>
@ 2006-05-01 5:40 ` H. Peter Anvin
0 siblings, 0 replies; 29+ messages in thread
From: H. Peter Anvin @ 2006-05-01 5:40 UTC (permalink / raw)
To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel
NeilBrown wrote:
> There is a lot of commonality between raid5.c and raid6main.c. This
> patches merges both into one module called raid456. This saves a lot
> of code, and paves the way for online raid5->raid6 migrations.
>
> There is still duplication, e.g. between handle_stripe5 and
> handle_stripe6. This will probably be cleaned up later.
>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
Wonderful! Thank you for doing this :)
-hpa
^ permalink raw reply [flat|nested] 29+ messages in thread