From: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
To: Alasdair Kergon <agk@redhat.com>
Cc: device-mapper development <dm-devel@redhat.com>
Subject: Re: [PATCH 3/3] dm-mpath: add service-time oriented dynamic load balancer
Date: Fri, 15 May 2009 12:09:21 +0900 [thread overview]
Message-ID: <4A0CDCE1.50409@ct.jp.nec.com> (raw)
In-Reply-To: <20090509004959.GJ1320@agk-dp.fab.redhat.com>
[-- Attachment #1: Type: text/plain, Size: 2323 bytes --]
Hi Alasdair,
Thank you for the review and comments.
I'm totally agree with your comments.
I found that you have already updated the patch in your editing tree, thanks.
But I'm considering to update the patch to be simpler.
Please see below.
On 05/09/2009 09:49 AM +0900, Alasdair G Kergon wrote:
> On Fri, Apr 24, 2009 at 05:08:02PM +0900, Kiyoshi Ueda wrote:
>> + struct selector *s = (struct selector *) ps->context;
>> + status_type_t type, char *result, unsigned int maxlen)
>> + int sz = 0;
>> + DMEMIT("%u %lu ", atomic_read(&pi->in_flight_size),
>
> (similar comments)
OK, you have already fixed the pointer cast, the sz type and
the print format for in_flight_size.
I'll fix the type of maxlen.
>> +/*
>> + * Returns:
>> + * < 0 : pi1 is better
>> + * 0 : no difference between pi1 and pi2
>> + * > 0 : pi2 is better
>> + */
>
> Please document the parameters and algorithm.
> Nothing explains what the performance value is for and examples of anticipated values.
OK, I'll add comments and documentation like the attached patch.
>> + DMEMIT("%u %lu ", pi->repeat_count, pi->perf);
>
> Need to handle both 32 and 64 bit archs: cast to llu.
>
>> + if ((argc == 2) && (sscanf(argv[1], "%lu", &perf) != 1)) {
>
> Please do sscanf for size_t the same way as dm-crypt does.
> (Or scan for lu, if the intent is to impose a size limit ahead of the later
> do_div() then cast explicitly.)
>
>> + do_div(sz1, pi1->perf);
>> + do_div(sz2, pi2->perf);
>
> Or sector_div ?
>
> But what's the performance of those divisions like?
> Is there a better way of getting the same answer?
> Is there validation limiting the size of 'perf'?
I tried to use multiplication before, but it didn't work well,
because overflow happened when 'in_flight_size' and 'perf' had
pretty big values, and then I got wrong comparison results.
So I decided to use division.
However, if I limit the 'perf' value in smaller range (e.g. 0 to 100),
such overflow shouldn't happen. So I should be able to use
multiplication. Also, I don't need to use size_t for 'perf',
because 'unsinged' should be enough for such small values.
> (Also, did you check there's adequate locking & memory barriers around all the
> atomics?)
Yes, I did and I found no problem in both dm-queue-length and
dm-service-time.
Thanks,
Kiyoshi Ueda
[-- Attachment #2: rqdm-dlb-04-service-time-dlb-document.patch --]
[-- Type: text/plain, Size: 5834 bytes --]
Add documents/comments for dm-service-time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
Documentation/device-mapper/dm-service-time.txt | 87 ++++++++++++++++++++++++
drivers/md/dm-service-time.c | 26 +++++--
2 files changed, 107 insertions(+), 6 deletions(-)
Index: 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
===================================================================
--- /dev/null
+++ 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
@@ -0,0 +1,87 @@
+dm-service-time
+===============
+
+dm-service-time is a path selector module for device-mapper targets,
+which selects a path with the shortest estimated service time for
+the incoming I/O.
+
+The service time for each path is estimated by dividing the total size
+of in-flight I/Os on a path with the performance value of the path.
+The performance value is a relative throughput value among all paths
+in a path-group, and it can be specified as a table argument.
+
+The path selector name is 'service-time'.
+
+Table parameters for each path: [<repeat_count> [<performance>]]
+ <repeat_count>: The number of I/Os to dispatch using the selected
+ path before switching to the next path.
+ If not given, internal default is used. To check
+ the default value, see the activated table.
+ <performance>: The relative throughput value of the path
+ among all paths in the path-group.
+ If not given, minimum value '1' is used.
+ If '0' is given, the path isn't selected while
+ other paths having a positive value are available.
+
+Status for each path: <status> <fail-count> <in-flight-size> <performance>
+ <status>: 'A' if the path is active, 'F' if the path is failed.
+ <fail-count>: The number of path failures.
+ <in-flight-size>: The size of in-flight I/Os on the path.
+ <performance>: The relative throughput value of the path
+ among all paths in the path-group.
+
+
+Algorithm
+=========
+
+dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
+dispatched and substracts when completed.
+Basically, dm-service-time selects a path having minimum service time
+which is calculated by:
+
+ ('in-flight-size' + 'size-of-incoming-io') / 'performance'
+
+However, some optimizations below are used to reduce the calculation
+as much as possible.
+
+ 1. If the paths have the same 'performance', skip the division
+ and just compare the 'in-flight-size'.
+
+ 2. If the paths have the same 'in-flight-size', skip the division
+ and just compare the 'performance'.
+
+ 3. If some paths have non-zero 'performance' and others have zero
+ 'performance', ignore those paths with zero 'performance'.
+
+If such optimizations can't be applied, calculate service time, and
+compare service time.
+If calculated service time is equal, the path having maximum 'performance'
+may be better. So compare 'performance' then.
+
+
+Examples
+========
+In case that 2 paths (sda and sdb) are used with repeat_count == 128
+and sda has an average throughput 1GB/s and sdb has 4GB/s, 'performance'
+value may be '1' for sda and '4' for sdb.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
+ dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
+
+
+Or '2' for sda and '8' for sdb would be also true.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
+ dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
Index: 2.6.30-rc5/drivers/md/dm-service-time.c
===================================================================
--- 2.6.30-rc5.orig/drivers/md/dm-service-time.c
+++ 2.6.30-rc5/drivers/md/dm-service-time.c
@@ -105,22 +105,27 @@ static int st_add_path(struct path_selec
unsigned repeat_count = ST_MIN_IO;
unsigned long long tmpll = 1;
+ /*
+ * Arguments: [<repeat_count> [<performance>]]
+ * <repeat_count>: The number of I/Os before switching path.
+ * If not given, default (ST_MIN_IO) is used.
+ * <performance>: The relative throughput value of the path
+ * among all paths in the path-group.
+ * If not given, minimum value '1' is used.
+ * If '0' is given, the path isn't selected while
+ * other paths having a positive value are
+ * available.
+ */
if (argc > 2) {
*error = "service-time ps: incorrect number of arguments";
return -EINVAL;
}
- /* First path argument is number of I/Os before switching path. */
if ((argc > 0) && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
*error = "service-time ps: invalid repeat count";
return -EINVAL;
}
- /*
- * Second path argument is a relative performance value.
- * If 0 is given, the path isn't used while other paths having
- * a positive value are available.
- */
if ((argc == 2) && (sscanf(argv[1], "%llu", &tmpll) != 1)) {
*error = "service-time ps: invalid performance value";
return -EINVAL;
@@ -164,10 +169,19 @@ static int st_reinstate_path(struct path
}
/*
+ * Compare the estimated service time of 2 paths, pi1 and pi2,
+ * for the incoming I/O.
+ *
* Returns:
* < 0 : pi1 is better
* 0 : no difference between pi1 and pi2
* > 0 : pi2 is better
+ *
+ * Description:
+ * Basically, the service time is estimated by:
+ * ('pi->in-flight-size' + 'incoming') / 'pi->perf'
+ * To reduce the calculation, some optimizations are made.
+ * (See comments inline)
*/
static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
size_t incoming)
[-- Attachment #3: Type: text/plain, Size: 0 bytes --]
next prev parent reply other threads:[~2009-05-15 3:09 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-04-24 8:04 [PATCH 0/3] dm-mpath: dynamic load balancers (v2) Kiyoshi Ueda
2009-04-24 8:06 ` [PATCH 1/3] dm-mpath: interface change for dynamic load balancers Kiyoshi Ueda
2009-04-24 8:07 ` [PATCH 2/3] dm-mpath: add queue-length oriented dynamic load balancer Kiyoshi Ueda
2009-05-09 0:31 ` Alasdair G Kergon
2009-05-14 6:14 ` Kiyoshi Ueda
2009-05-14 12:34 ` Alasdair G Kergon
2009-04-24 8:08 ` [PATCH 3/3] dm-mpath: add service-time " Kiyoshi Ueda
2009-05-09 0:49 ` Alasdair G Kergon
2009-05-15 3:09 ` Kiyoshi Ueda [this message]
2009-05-18 11:46 ` Alasdair G Kergon
2009-05-19 2:59 ` Kiyoshi Ueda
2009-05-19 8:13 ` Kiyoshi Ueda
-- strict thread matches above, loose matches on Subject: below --
2009-03-18 8:34 Subject: [PATCH 0/3] dm-mpath: dynamic load balancers (v1) Kiyoshi Ueda
2009-03-18 8:40 ` [PATCH 3/3] dm-mpath: add service-time oriented dynamic load balancer Kiyoshi Ueda
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A0CDCE1.50409@ct.jp.nec.com \
--to=k-ueda@ct.jp.nec.com \
--cc=agk@redhat.com \
--cc=dm-devel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.