* [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
@ 2007-11-03 10:08 Konstantin Sharlaimov
2007-11-03 14:29 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Konstantin Sharlaimov @ 2007-11-03 10:08 UTC (permalink / raw)
To: linux-raid; +Cc: Rik van Riel, ; Ingo Molnar, ; Neil Brown
This patch adds RAID1 read balancing to device mapper. A read operation
that is close (in terms of sectors) to a previous read or write goes to
the same mirror.
Signed-off-by: Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com>
---
Please give it a try, it works for me, yet my results might be system-specific.
Any feedback (bug-reports, suggestions) will be greatly appreciated.
--- linux-2.6.23.1/drivers/md/dm-raid1.c.old 2007-11-03 18:47:10.000000000 +1000
+++ linux-2.6.23.1/drivers/md/dm-raid1.c 2007-11-03 19:54:35.000000000 +1000
@@ -19,6 +19,7 @@
#include <linux/time.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>
+#include <linux/random.h>
#define DM_MSG_PREFIX "raid1"
#define DM_IO_PAGES 64
@@ -26,6 +27,9 @@
#define DM_RAID1_HANDLE_ERRORS 0x01
#define errors_handled(p) ((p)->features & DM_RAID1_HANDLE_ERRORS)
+/* Read balancing max hdd head distance */
+#define DM_RAID1_BALANCE_MAX_IO_DISTANCE (256)
+
static DECLARE_WAIT_QUEUE_HEAD(_kmirrord_recovery_stopped);
/*-----------------------------------------------------------------
@@ -116,6 +120,7 @@ struct mirror {
atomic_t error_count;
struct dm_dev *dev;
sector_t offset;
+ sector_t last_io_sector;
};
struct mirror_set {
@@ -741,13 +746,51 @@ static void do_recovery(struct mirror_se
}
}
+static void set_mirror_last_io_sector(struct mirror *m, sector_t sector)
+{
+ /* FIXME: Probably some more work is needed here, however this is unlikely */
+ m->last_io_sector = sector;
+}
+
/*-----------------------------------------------------------------
* Reads
*---------------------------------------------------------------*/
+/*
+ * There is a per-array 'last IO operation' sector number maintained by
+ * read and write handlers for the region. When balancing reads we pick
+ * the disk whose IO operation (HDD head position) is closest.
+ */
static struct mirror *choose_mirror(struct mirror_set *ms, sector_t sector)
{
- /* FIXME: add read balancing */
- return ms->default_mirror;
+ /* If we got here, then the array is in sync and we can pick any mirror */
+
+ unsigned int i;
+ struct mirror *use_mirror;
+ sector_t use_distance, new_distance;
+
+ use_mirror = &ms->mirror[0];
+ use_distance = abs(sector - ms->mirror[0].last_io_sector);
+
+ for (i = 1; i < ms->nr_mirrors; i++) {
+ new_distance = abs(sector - ms->mirror[i].last_io_sector);
+ if (new_distance < use_distance) {
+ use_distance = new_distance;
+ use_mirror = &ms->mirror[i];
+ }
+ }
+
+ /*
+ * If the HDD head is too far from the needed sector then we do stochastic
+ * balancing - chose the mirror randomly. This appers to have a better
+ * chance of chosing an idle disk in case of two or more regions residing
+ * on the same physical disk.
+ *
+ * TODO: Gather more statistical data and verify that the above is correct
+ */
+ if (use_distance > DM_RAID1_BALANCE_MAX_IO_DISTANCE)
+ return &ms->mirror[random32() % ms->nr_mirrors];
+ else
+ return use_mirror;
}
/*
@@ -776,6 +819,9 @@ static void do_reads(struct mirror_set *
else
m = ms->default_mirror;
+ /* Set last IO position for chosen mirror */
+ set_mirror_last_io_sector(m, bio->bi_sector);
+
map_bio(ms, m, bio);
generic_make_request(bio);
}
@@ -800,6 +846,21 @@ static void write_callback(unsigned long
ms = bio_get_ms(bio);
bio_set_ms(bio, NULL);
+
+ /*
+ * Things might be different for various region states:
+ * SYNC: writing is done to all mirrors, reading is balanced
+ * RECOVERING: writing is delayed, reading is done from the default
+ * NOSYNC: writing to default only, reading from the default
+ *
+ * In any case, if we update last IO sector at all mirrors, we will use
+ * the up-to-date data when doing read balancing
+ *
+ * FIXME: update write position only on the region being written
+ */
+
+ for (i = 0; i < ms->nr_mirrors; i++)
+ set_mirror_last_io_sector(&ms->mirror[i], bio->bi_sector);
/*
* NOTE: We don't decrement the pending count here,
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-03 10:08 [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing Konstantin Sharlaimov
@ 2007-11-03 14:29 ` Rik van Riel
2007-11-07 9:15 ` Was: " Goswin von Brederlow
2007-11-11 23:36 ` Samuel Tardieu
2 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2007-11-03 14:29 UTC (permalink / raw)
To: konstantin.sharlaimov
Cc: linux-raid, Ingo Molnar <mingo@redhat.com>, Neil Brown
On Sat, 03 Nov 2007 20:08:42 +1000
Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> wrote:
> This patch adds RAID1 read balancing to device mapper. A read
> operation that is close (in terms of sectors) to a previous read or
> write goes to the same mirror.
>
> Signed-off-by: Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com>
> ---
> Please give it a try, it works for me, yet my results might be
> system-specific. Any feedback (bug-reports, suggestions) will be
> greatly appreciated.
I like your approach. I have no dm-raid1 array here to test with,
but I am curious what your results are :)
The code looks nice.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-03 10:08 [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing Konstantin Sharlaimov
2007-11-03 14:29 ` Rik van Riel
@ 2007-11-07 9:15 ` Goswin von Brederlow
2007-11-08 11:06 ` Konstantin Sharlaimov
2007-11-11 23:36 ` Samuel Tardieu
2 siblings, 1 reply; 10+ messages in thread
From: Goswin von Brederlow @ 2007-11-07 9:15 UTC (permalink / raw)
To: konstantin.sharlaimov
Cc: linux-raid, Rik van Riel, ; Ingo Molnar, ; Neil Brown
Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> writes:
> This patch adds RAID1 read balancing to device mapper. A read operation
> that is close (in terms of sectors) to a previous read or write goes to
> the same mirror.
I wonder if there shouldn't be a way to turn this off (or if there
already is one).
Or more generaly an option to say what is "near". Specifically I would
like to teach the raid1 layer that I have 2 external raid boxes with a
16k chunk size. So read/write within a 16k chunk will be the same disk
but the next 16k are a different disk and "near" doesn't apply
anymore.
MfG
Goswin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-07 9:15 ` Was: " Goswin von Brederlow
@ 2007-11-08 11:06 ` Konstantin Sharlaimov
2007-11-08 16:28 ` Goswin von Brederlow
0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Sharlaimov @ 2007-11-08 11:06 UTC (permalink / raw)
To: Goswin von Brederlow
Cc: linux-raid, Rik van Riel, ; Ingo Molnar, ; Neil Brown
On Wed, 2007-11-07 at 10:15 +0100, Goswin von Brederlow wrote:
> I wonder if there shouldn't be a way to turn this off (or if there
> already is one).
>
> Or more generaly an option to say what is "near". Specifically I would
> like to teach the raid1 layer that I have 2 external raid boxes with a
> 16k chunk size. So read/write within a 16k chunk will be the same disk
> but the next 16k are a different disk and "near" doesn't apply
> anymore.
Currently there is no way to turn this feature off (this is only a
"request for comments" patch), but I'm planning to make it configurable
via sysfs and module parameters.
Thanks for suggestion for the "near" definition. What do you think about
adding the "chunk_size" parameter (with the default value of 1 chunk = 1
sector). Setting it to 32 will make all reads within 16k chunk to be
considered "near" (with zero distance) so they will go to the same disk.
Max distance will also be configurable (after this distance the "read"
operation is considered "far" and will go to randomly chosen disk)
Regards,
Konstantin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-08 11:06 ` Konstantin Sharlaimov
@ 2007-11-08 16:28 ` Goswin von Brederlow
2007-11-08 17:06 ` Rik van Riel
2007-11-08 17:35 ` Bill Davidsen
0 siblings, 2 replies; 10+ messages in thread
From: Goswin von Brederlow @ 2007-11-08 16:28 UTC (permalink / raw)
To: konstantin.sharlaimov
Cc: Goswin von Brederlow, linux-raid, Rik van Riel, ; Ingo Molnar,
; Neil Brown
Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> writes:
> On Wed, 2007-11-07 at 10:15 +0100, Goswin von Brederlow wrote:
>> I wonder if there shouldn't be a way to turn this off (or if there
>> already is one).
>>
>> Or more generaly an option to say what is "near". Specifically I would
>> like to teach the raid1 layer that I have 2 external raid boxes with a
>> 16k chunk size. So read/write within a 16k chunk will be the same disk
>> but the next 16k are a different disk and "near" doesn't apply
>> anymore.
>
> Currently there is no way to turn this feature off (this is only a
> "request for comments" patch), but I'm planning to make it configurable
> via sysfs and module parameters.
>
> Thanks for suggestion for the "near" definition. What do you think about
> adding the "chunk_size" parameter (with the default value of 1 chunk = 1
> sector). Setting it to 32 will make all reads within 16k chunk to be
> considered "near" (with zero distance) so they will go to the same disk.
>
> Max distance will also be configurable (after this distance the "read"
> operation is considered "far" and will go to randomly chosen disk)
>
> Regards,
> Konstantin
Maybe you need more parameter:
chunk_size - size of a continious chunk on the (multi disk) device
stripe_size - size of a stripe of chunks spanning all disks
rotation_size - size of multiple stripes before parity rotates to a
new disk (sign gives direction of rotation)
near_size - size that is considered to be near on a disk
I would give all sizes in blocks of 512 bytes or bytes.
Default would be:
chunk_size = 1 (block)
stripe_size = 1 (block)
rotation_size = 0 (no rotation)
near_size = 256
That would reflect that you have all chunks continious on a normal
disk and read/writes are done in 128K chunks.
For raid 1 on raid 0:
chunk_size = raid chunk size
stripe_size = num disks * chunk_size
rotation_size = 0
near_size = 256
For raid 1 on raid 5:
chunk_size = raid chunk size
stripe_size = (num disks - 1) * chunk_size
rotation_size = (num disks - 1) * chunk_size (?)
near_size = 256
and so on.
MfG
Goswin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-08 16:28 ` Goswin von Brederlow
@ 2007-11-08 17:06 ` Rik van Riel
2007-11-08 17:24 ` Bill Davidsen
2007-11-08 19:43 ` Goswin von Brederlow
2007-11-08 17:35 ` Bill Davidsen
1 sibling, 2 replies; 10+ messages in thread
From: Rik van Riel @ 2007-11-08 17:06 UTC (permalink / raw)
Cc: konstantin.sharlaimov, Goswin von Brederlow, linux-raid,
; Ingo Molnar, ; Neil Brown
On Thu, 08 Nov 2007 17:28:37 +0100
Goswin von Brederlow <brederlo@informatik.uni-tuebingen.de> wrote:
> Maybe you need more parameter:
Generally a bad idea, unless you can come up with sane defaults (which
do not need tuning 99% of the time) or you can derive these parameters
automatically from the RAID configuration (unlikely with RAID 1?).
Before turning Konstantin's nice and simple code into something complex,
it would be good to see benchmark results show that the complexity is
actually needed.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-08 17:06 ` Rik van Riel
@ 2007-11-08 17:24 ` Bill Davidsen
2007-11-08 19:43 ` Goswin von Brederlow
1 sibling, 0 replies; 10+ messages in thread
From: Bill Davidsen @ 2007-11-08 17:24 UTC (permalink / raw)
To: Rik van Riel
Cc: Goswin von Brederlow, konstantin.sharlaimov, linux-raid,
; Ingo Molnar, ; Neil Brown
Rik van Riel wrote:
> On Thu, 08 Nov 2007 17:28:37 +0100
> Goswin von Brederlow <brederlo@informatik.uni-tuebingen.de> wrote:
>
>
>> Maybe you need more parameter:
>>
>
> Generally a bad idea, unless you can come up with sane defaults (which
> do not need tuning 99% of the time) or you can derive these parameters
> automatically from the RAID configuration (unlikely with RAID 1?).
>
> Before turning Konstantin's nice and simple code into something complex,
> it would be good to see benchmark results show that the complexity is
> actually needed.
>
>
I was about to post a question about the benefit of all this logic, I'll
trim before posting...
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-08 16:28 ` Goswin von Brederlow
2007-11-08 17:06 ` Rik van Riel
@ 2007-11-08 17:35 ` Bill Davidsen
1 sibling, 0 replies; 10+ messages in thread
From: Bill Davidsen @ 2007-11-08 17:35 UTC (permalink / raw)
To: Goswin von Brederlow
Cc: konstantin.sharlaimov, linux-raid, Rik van Riel, ; Ingo Molnar,
; Neil Brown
Goswin von Brederlow wrote:
> Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com> writes:
>
>
>> On Wed, 2007-11-07 at 10:15 +0100, Goswin von Brederlow wrote:
>>
>>> I wonder if there shouldn't be a way to turn this off (or if there
>>> already is one).
>>>
>>> Or more generaly an option to say what is "near". Specifically I would
>>> like to teach the raid1 layer that I have 2 external raid boxes with a
>>> 16k chunk size. So read/write within a 16k chunk will be the same disk
>>> but the next 16k are a different disk and "near" doesn't apply
>>> anymore.
>>>
>> Currently there is no way to turn this feature off (this is only a
>> "request for comments" patch), but I'm planning to make it configurable
>> via sysfs and module parameters.
>>
>> Thanks for suggestion for the "near" definition. What do you think about
>> adding the "chunk_size" parameter (with the default value of 1 chunk = 1
>> sector). Setting it to 32 will make all reads within 16k chunk to be
>> considered "near" (with zero distance) so they will go to the same disk.
>>
>> Max distance will also be configurable (after this distance the "read"
>> operation is considered "far" and will go to randomly chosen disk)
>>
>> Regards,
>> Konstantin
>>
>
> Maybe you need more parameter:
>
> chunk_size - size of a continious chunk on the (multi disk) device
> stripe_size - size of a stripe of chunks spanning all disks
> rotation_size - size of multiple stripes before parity rotates to a
> new disk (sign gives direction of rotation)
> near_size - size that is considered to be near on a disk
>
> I would give all sizes in blocks of 512 bytes or bytes.
>
Why? Would there ever be a time when there would be a significant (or
any) benefit from a size other than a multiple of chunk size? If you
give the rest of the sizes in multiples of chunk size it invites less
human math.
> Default would be:
>
Default would be "zero" to indicate that the raid system should figure
out what to use, allowing the value of "one" to actually mean what it
says. It also invites use of zero for the rest of the calculated sizes,
indicating the raid subsystem should select values. With coming SSD
hardware you may actually want one to mean one.
> chunk_size = 1 (block)
> stripe_size = 1 (block)
> rotation_size = 0 (no rotation)
> near_size = 256
>
> That would reflect that you have all chunks continious on a normal
> disk and read/writes are done in 128K chunks.
>
> For raid 1 on raid 0:
>
> chunk_size = raid chunk size
> stripe_size = num disks * chunk_size
> rotation_size = 0
> near_size = 256
>
> For raid 1 on raid 5:
>
> chunk_size = raid chunk size
> stripe_size = (num disks - 1) * chunk_size
> rotation_size = (num disks - 1) * chunk_size (?)
> near_size = 256
>
> and so on.
>
> MfG
> Goswin
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-08 17:06 ` Rik van Riel
2007-11-08 17:24 ` Bill Davidsen
@ 2007-11-08 19:43 ` Goswin von Brederlow
1 sibling, 0 replies; 10+ messages in thread
From: Goswin von Brederlow @ 2007-11-08 19:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Goswin von Brederlow, konstantin.sharlaimov, linux-raid,
; Ingo Molnar, ; Neil Brown
Rik van Riel <riel@redhat.com> writes:
> On Thu, 08 Nov 2007 17:28:37 +0100
> Goswin von Brederlow <brederlo@informatik.uni-tuebingen.de> wrote:
>
>> Maybe you need more parameter:
>
> Generally a bad idea, unless you can come up with sane defaults (which
> do not need tuning 99% of the time) or you can derive these parameters
> automatically from the RAID configuration (unlikely with RAID 1?).
>
> Before turning Konstantin's nice and simple code into something complex,
> it would be good to see benchmark results show that the complexity is
> actually needed.
A default of chunk=1, stripe=1, rotate=0 is what you have now. SO
there is no regression. It is also the right choice for any device
that isn't a (hardware) raid. I don't think 99% of all users have
hardware raid.
For stacked software raids the raid1 modul should detect the
underlying raid and default to values reflecting the raid disk below
(assuming the mirrors have reasonable equal layouts). That would
truely only leave hardware raids to be manually tuned.
MfG
Goswin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
2007-11-03 10:08 [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing Konstantin Sharlaimov
2007-11-03 14:29 ` Rik van Riel
2007-11-07 9:15 ` Was: " Goswin von Brederlow
@ 2007-11-11 23:36 ` Samuel Tardieu
2 siblings, 0 replies; 10+ messages in thread
From: Samuel Tardieu @ 2007-11-11 23:36 UTC (permalink / raw)
To: linux-raid
>>>>> "Konstantin" == Konstantin Sharlaimov
>>>>> <konstantin.sharlaimov@gmail.com> writes:
Konstantin> This patch adds RAID1 read balancing to device mapper. A
Konstantin> read operation that is close (in terms of sectors) to a
Konstantin> previous read or write goes to the same mirror.
I am currently running it on top of a 2.6.24-rc2 kernel and it works
fine so far.
Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2007-11-11 23:36 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-03 10:08 [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing Konstantin Sharlaimov
2007-11-03 14:29 ` Rik van Riel
2007-11-07 9:15 ` Was: " Goswin von Brederlow
2007-11-08 11:06 ` Konstantin Sharlaimov
2007-11-08 16:28 ` Goswin von Brederlow
2007-11-08 17:06 ` Rik van Riel
2007-11-08 17:24 ` Bill Davidsen
2007-11-08 19:43 ` Goswin von Brederlow
2007-11-08 17:35 ` Bill Davidsen
2007-11-11 23:36 ` Samuel Tardieu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).