* [PATCH v2 0/3] Historical Service Time Path Selector
@ 2020-04-28 0:51 Gabriel Krisman Bertazi
2020-04-28 0:51 ` [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors Gabriel Krisman Bertazi
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-04-28 0:51 UTC (permalink / raw)
To: agk, snitzer; +Cc: dm-devel, kernel, Gabriel Krisman Bertazi, khazhy
Hi Mike,
Please find an updated version of HST integrating the change you
requested to also support BIO based multipath. I hope you don't mind me
folding the function you implemented into patch 2. If you prefer, I can
integrate a patch you provide into the series.
One interesting data point is that the selection performance on
BIO-based is worse when compared to request-based. I tested a bit and I
believe the reason is that the paths are more sticky because the bucket
is too large, and it takes longer for HST to detect differences. By
changing the bucket_size indirectly through the bucket_shift, the
bio-based multipath performance improved, but I'm not sure this is a
knob we want to expose to users. For now, please consider the patches
below, for review.
By the way, the exercise to support bio-based mpath uncovered the mpath
initialization bug that I fixed in the previous patch I sent today, so
it was worth it.
Gabriel Krisman Bertazi (2):
md: multipath: Encapsulate parameters passed to selectors
md: multipath: Pass io_start_time to the path selector
Khazhismel Kumykov (1):
md: Add Historical Service Time Path Selector
drivers/md/Kconfig | 11 +
drivers/md/Makefile | 1 +
drivers/md/dm-historical-service-time.c | 561 ++++++++++++++++++++++++
drivers/md/dm-mpath.c | 30 +-
drivers/md/dm-path-selector.h | 9 +-
drivers/md/dm-queue-length.c | 4 +-
drivers/md/dm-service-time.c | 8 +-
drivers/md/dm.c | 10 +
include/linux/device-mapper.h | 2 +
9 files changed, 623 insertions(+), 13 deletions(-)
create mode 100644 drivers/md/dm-historical-service-time.c
--
2.26.2
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors
2020-04-28 0:51 [PATCH v2 0/3] Historical Service Time Path Selector Gabriel Krisman Bertazi
@ 2020-04-28 0:51 ` Gabriel Krisman Bertazi
2020-04-28 17:20 ` Mike Snitzer
2020-04-28 0:51 ` [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector Gabriel Krisman Bertazi
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-04-28 0:51 UTC (permalink / raw)
To: agk, snitzer; +Cc: dm-devel, kernel, Gabriel Krisman Bertazi, khazhy
Different selector will use different parameters, which means .io_start
and .io_end will get their signatures modified to include more and more
parameters. This encapsulates the data in a structure so we can
simplify the interface for future users. For now it only passes
nr_bytes, but HST will require start_time already.
Cc: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
drivers/md/dm-mpath.c | 25 ++++++++++++++++++++-----
drivers/md/dm-path-selector.h | 8 ++++++--
drivers/md/dm-queue-length.c | 4 ++--
drivers/md/dm-service-time.c | 8 ++++----
4 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f5e7f8e88767..1ef4fc2e745b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -500,6 +500,9 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
struct dm_mpath_io *mpio = get_mpio(map_context);
struct request_queue *q;
struct request *clone;
+ struct path_selector_io_data io_data = {
+ .nr_bytes = nr_bytes,
+ };
/* Do we need to select a new pgpath? */
pgpath = READ_ONCE(m->current_pgpath);
@@ -549,7 +552,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
&pgpath->path,
- nr_bytes);
+ &io_data);
return DM_MAPIO_REMAPPED;
}
@@ -563,11 +566,14 @@ static void multipath_release_clone(struct request *clone,
*/
struct dm_mpath_io *mpio = get_mpio(map_context);
struct pgpath *pgpath = mpio->pgpath;
+ struct path_selector_io_data ps_io_data = {
+ .nr_bytes = mpio->nr_bytes,
+ };
if (pgpath && pgpath->pg->ps.type->end_io)
pgpath->pg->ps.type->end_io(&pgpath->pg->ps,
&pgpath->path,
- mpio->nr_bytes);
+ &ps_io_data);
}
blk_put_request(clone);
@@ -617,6 +623,9 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
struct dm_mpath_io *mpio)
{
struct pgpath *pgpath = __map_bio(m, bio);
+ struct path_selector_io_data io_data = {
+ .nr_bytes = mpio->nr_bytes,
+ };
if (IS_ERR(pgpath))
return DM_MAPIO_SUBMITTED;
@@ -637,7 +646,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
&pgpath->path,
- mpio->nr_bytes);
+ &io_data);
return DM_MAPIO_REMAPPED;
}
@@ -1618,9 +1627,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
if (pgpath) {
struct path_selector *ps = &pgpath->pg->ps;
+ struct path_selector_io_data io_data = {
+ .nr_bytes = mpio->nr_bytes
+ };
if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
+ ps->type->end_io(ps, &pgpath->path, &io_data);
}
return r;
@@ -1662,9 +1674,12 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
done:
if (pgpath) {
struct path_selector *ps = &pgpath->pg->ps;
+ struct path_selector_io_data io_data = {
+ .nr_bytes = mpio->nr_bytes
+ };
if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
+ ps->type->end_io(ps, &pgpath->path, &io_data);
}
return r;
diff --git a/drivers/md/dm-path-selector.h b/drivers/md/dm-path-selector.h
index b6eb5365b1a4..fb582a943234 100644
--- a/drivers/md/dm-path-selector.h
+++ b/drivers/md/dm-path-selector.h
@@ -26,6 +26,10 @@ struct path_selector {
void *context;
};
+struct path_selector_io_data {
+ size_t nr_bytes;
+};
+
/* Information about a path selector type */
struct path_selector_type {
char *name;
@@ -72,9 +76,9 @@ struct path_selector_type {
status_type_t type, char *result, unsigned int maxlen);
int (*start_io) (struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes);
+ const struct path_selector_io_data *io_data);
int (*end_io) (struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes);
+ const struct path_selector_io_data *io_data);
};
/* Register a path selector */
diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c
index 969c4f1a3633..eeaa038a1dbb 100644
--- a/drivers/md/dm-queue-length.c
+++ b/drivers/md/dm-queue-length.c
@@ -217,7 +217,7 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
}
static int ql_start_io(struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes)
+ const struct path_selector_io_data *io_data)
{
struct path_info *pi = path->pscontext;
@@ -227,7 +227,7 @@ static int ql_start_io(struct path_selector *ps, struct dm_path *path,
}
static int ql_end_io(struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes)
+ const struct path_selector_io_data *io_data)
{
struct path_info *pi = path->pscontext;
diff --git a/drivers/md/dm-service-time.c b/drivers/md/dm-service-time.c
index f006a9005593..d751f26b61af 100644
--- a/drivers/md/dm-service-time.c
+++ b/drivers/md/dm-service-time.c
@@ -299,21 +299,21 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
}
static int st_start_io(struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes)
+ const struct path_selector_io_data *io_data)
{
struct path_info *pi = path->pscontext;
- atomic_add(nr_bytes, &pi->in_flight_size);
+ atomic_add(io_data->nr_bytes, &pi->in_flight_size);
return 0;
}
static int st_end_io(struct path_selector *ps, struct dm_path *path,
- size_t nr_bytes)
+ const struct path_selector_io_data *io_data)
{
struct path_info *pi = path->pscontext;
- atomic_sub(nr_bytes, &pi->in_flight_size);
+ atomic_sub(io_data->nr_bytes, &pi->in_flight_size);
return 0;
}
--
2.26.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector
2020-04-28 0:51 [PATCH v2 0/3] Historical Service Time Path Selector Gabriel Krisman Bertazi
2020-04-28 0:51 ` [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors Gabriel Krisman Bertazi
@ 2020-04-28 0:51 ` Gabriel Krisman Bertazi
2020-04-28 17:25 ` Mike Snitzer
2020-04-28 0:51 ` [PATCH v2 3/3] md: Add Historical Service Time Path Selector Gabriel Krisman Bertazi
2020-04-30 17:49 ` [PATCH v2 0/3] " Gabriel Krisman Bertazi
3 siblings, 1 reply; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-04-28 0:51 UTC (permalink / raw)
To: agk, snitzer; +Cc: dm-devel, kernel, Gabriel Krisman Bertazi, khazhy
HST need to know the IO start time in order to predict path
performance. For request-based multipath use the block layer
io_start_time, while for BIO use the dm_io start_time.
The dm_start_time_ns_from_clone function was suggested and implemented
by Mike Snitzer <snitzer@redhat.com>.
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
drivers/md/dm-mpath.c | 25 +++++++++++++++----------
drivers/md/dm-path-selector.h | 1 +
drivers/md/dm.c | 10 ++++++++++
include/linux/device-mapper.h | 2 ++
4 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 1ef4fc2e745b..7af3249948be 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -500,8 +500,9 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
struct dm_mpath_io *mpio = get_mpio(map_context);
struct request_queue *q;
struct request *clone;
- struct path_selector_io_data io_data = {
+ struct path_selector_io_data ps_io_data = {
.nr_bytes = nr_bytes,
+ .io_start_time = rq->io_start_time_ns
};
/* Do we need to select a new pgpath? */
@@ -552,7 +553,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
&pgpath->path,
- &io_data);
+ &ps_io_data);
return DM_MAPIO_REMAPPED;
}
@@ -568,6 +569,7 @@ static void multipath_release_clone(struct request *clone,
struct pgpath *pgpath = mpio->pgpath;
struct path_selector_io_data ps_io_data = {
.nr_bytes = mpio->nr_bytes,
+ .io_start_time = clone->io_start_time_ns,
};
if (pgpath && pgpath->pg->ps.type->end_io)
@@ -623,8 +625,9 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
struct dm_mpath_io *mpio)
{
struct pgpath *pgpath = __map_bio(m, bio);
- struct path_selector_io_data io_data = {
+ struct path_selector_io_data ps_io_data = {
.nr_bytes = mpio->nr_bytes,
+ .io_start_time = dm_start_time_ns_from_clone(bio)
};
if (IS_ERR(pgpath))
@@ -646,7 +649,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
&pgpath->path,
- &io_data);
+ &ps_io_data);
return DM_MAPIO_REMAPPED;
}
@@ -1627,12 +1630,13 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
if (pgpath) {
struct path_selector *ps = &pgpath->pg->ps;
- struct path_selector_io_data io_data = {
- .nr_bytes = mpio->nr_bytes
+ struct path_selector_io_data ps_io_data = {
+ .nr_bytes = mpio->nr_bytes,
+ .io_start_time = clone->io_start_time_ns
};
if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, &io_data);
+ ps->type->end_io(ps, &pgpath->path, &ps_io_data);
}
return r;
@@ -1674,12 +1678,13 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
done:
if (pgpath) {
struct path_selector *ps = &pgpath->pg->ps;
- struct path_selector_io_data io_data = {
- .nr_bytes = mpio->nr_bytes
+ struct path_selector_io_data ps_io_data = {
+ .nr_bytes = mpio->nr_bytes,
+ .io_start_time = dm_start_time_ns_from_clone(clone)
};
if (ps->type->end_io)
- ps->type->end_io(ps, &pgpath->path, &io_data);
+ ps->type->end_io(ps, &pgpath->path, &ps_io_data);
}
return r;
diff --git a/drivers/md/dm-path-selector.h b/drivers/md/dm-path-selector.h
index fb582a943234..4c5fa6a2efe3 100644
--- a/drivers/md/dm-path-selector.h
+++ b/drivers/md/dm-path-selector.h
@@ -28,6 +28,7 @@ struct path_selector {
struct path_selector_io_data {
size_t nr_bytes;
+ u64 io_start_time;
};
/* Information about a path selector type */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index df13fdebe21f..2e0637a6de9d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -674,6 +674,16 @@ static bool md_in_flight(struct mapped_device *md)
return md_in_flight_bios(md);
}
+u64 dm_start_time_ns_from_clone(struct bio *bio)
+{
+ struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
+ struct dm_io *io = tio->io;
+
+ /* FIXME: convert io->start_time from jiffies to nanoseconds */
+ return (u64)jiffies_to_msecs(io->start_time) * NSEC_PER_MSEC;
+}
+EXPORT_SYMBOL_GPL(dm_start_time_ns_from_clone);
+
static void start_io_acct(struct dm_io *io)
{
struct mapped_device *md = io->md;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 475668c69dbc..e2d506dd805e 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -329,6 +329,8 @@ void *dm_per_bio_data(struct bio *bio, size_t data_size);
struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
unsigned dm_bio_get_target_bio_nr(const struct bio *bio);
+u64 dm_start_time_ns_from_clone(struct bio *bio);
+
int dm_register_target(struct target_type *t);
void dm_unregister_target(struct target_type *t);
--
2.26.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 3/3] md: Add Historical Service Time Path Selector
2020-04-28 0:51 [PATCH v2 0/3] Historical Service Time Path Selector Gabriel Krisman Bertazi
2020-04-28 0:51 ` [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors Gabriel Krisman Bertazi
2020-04-28 0:51 ` [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector Gabriel Krisman Bertazi
@ 2020-04-28 0:51 ` Gabriel Krisman Bertazi
2020-04-30 17:49 ` [PATCH v2 0/3] " Gabriel Krisman Bertazi
3 siblings, 0 replies; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-04-28 0:51 UTC (permalink / raw)
To: agk, snitzer; +Cc: dm-devel, kernel, Gabriel Krisman Bertazi, khazhy
From: Khazhismel Kumykov <khazhy@google.com>
This new selector keeps an exponential moving average of the service
time for each path (losely defined as delta between start_io and
end_io), and uses this along with the number of inflight requests to
estimate future service time for a path. Since we don't have a prober
to account for temporally slow paths, re-try "slow" paths every once in
a while (num_paths * historical_service_time). To account for fast paths
transitioning to slow, if a path has not completed any request within
(num_paths * historical_service_time), limit the number of outstanding
requests. To account for low volume situations where number of
inflights would be zero, the last finish time of each path is factored
in.
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
drivers/md/Kconfig | 11 +
drivers/md/Makefile | 1 +
drivers/md/dm-historical-service-time.c | 561 ++++++++++++++++++++++++
3 files changed, 573 insertions(+)
create mode 100644 drivers/md/dm-historical-service-time.c
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 37f05707c59d..5669eeb5cf6f 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -452,6 +452,17 @@ config DM_MULTIPATH_ST
If unsure, say N.
+config DM_MULTIPATH_HST
+ tristate "I/O Path Selector based on historical service time"
+ depends on DM_MULTIPATH
+ help
+ This path selector is a dynamic load balancer which selects
+ the path expected to complete the incoming I/O in the shortest
+ time by comparing estimated service time (based on historical
+ service time).
+
+ If unsure, say N.
+
config DM_DELAY
tristate "I/O delaying target"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 9a2d673f94bc..31840f95cd40 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
+obj-$(CONFIG_DM_MULTIPATH_HST) += dm-historical-service-time.o
obj-$(CONFIG_DM_SWITCH) += dm-switch.o
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/
diff --git a/drivers/md/dm-historical-service-time.c b/drivers/md/dm-historical-service-time.c
new file mode 100644
index 000000000000..866195048ef9
--- /dev/null
+++ b/drivers/md/dm-historical-service-time.c
@@ -0,0 +1,561 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Historical Service Time
+ *
+ * Keeps a time-weighted exponential moving average of the historical
+ * service time. Estimates future service time based on the historical
+ * service time and the number of outstanding requests.
+ *
+ * Marks paths stale if they have not finished within hst *
+ * num_paths. If a path is stale and unused, we will send a single
+ * request to probe in case the path has improved. This situation
+ * generally arises if the path is so much worse than others that it
+ * will never have the best estimated service time, or if the entire
+ * multipath device is unused. If a path is stale and in use, limit the
+ * number of requests it can receive with the assumption that the path
+ * has become degraded.
+ *
+ * To avoid repeatedly calculating exponents for time weighting, times
+ * are split into HST_WEIGHT_COUNT buckets each (1 >> HST_BUCKET_SHIFT)
+ * ns, and the weighting is pre-calculated.
+ *
+ */
+
+#include "dm.h"
+#include "dm-path-selector.h"
+
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+
+
+#define DM_MSG_PREFIX "multipath historical-service-time"
+#define HST_MIN_IO 1
+#define HST_VERSION "0.1.1"
+
+#define HST_FIXED_SHIFT 10 /* 10 bits of decimal precision */
+#define HST_FIXED_MAX (ULLONG_MAX >> HST_FIXED_SHIFT)
+#define HST_FIXED_1 (1 << HST_FIXED_SHIFT)
+#define HST_FIXED_95 972
+#define HST_MAX_INFLIGHT HST_FIXED_1
+#define HST_BUCKET_SHIFT 24 /* Buckets are ~ 16ms */
+#define HST_WEIGHT_COUNT 64ULL
+
+struct selector {
+ struct list_head valid_paths;
+ struct list_head failed_paths;
+ int valid_count;
+ spinlock_t lock;
+
+ unsigned int weights[HST_WEIGHT_COUNT];
+ unsigned int threshold_multiplier;
+};
+
+struct path_info {
+ struct list_head list;
+ struct dm_path *path;
+ unsigned int repeat_count;
+
+ spinlock_t lock;
+
+ u64 historical_service_time; /* Fixed point */
+
+ u64 stale_after;
+ u64 last_finish;
+
+ u64 outstanding;
+};
+
+/**
+ * fixed_power - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ *
+ * (see: kernel/sched/loadavg.c)
+ */
+static u64 fixed_power(u64 x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ if (n) {
+ for (;;) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ if (!n)
+ break;
+ x *= x;
+ x += 1UL << (frac_bits - 1);
+ x >>= frac_bits;
+ }
+ }
+
+ return result;
+}
+
+/*
+ * Calculate the next value of an exponential moving average
+ * a_1 = a_0 * e + a * (1 - e)
+ *
+ * @last: [0, ULLONG_MAX >> HST_FIXED_SHIFT]
+ * @next: [0, ULLONG_MAX >> HST_FIXED_SHIFT]
+ * @weight: [0, HST_FIXED_1]
+ *
+ * Note:
+ * To account for multiple periods in the same calculation,
+ * a_n = a_0 * e^n + a * (1 - e^n),
+ * so call fixed_ema(last, next, pow(weight, N))
+ */
+static u64 fixed_ema(u64 last, u64 next, u64 weight)
+{
+ last *= weight;
+ last += next * (HST_FIXED_1 - weight);
+ last += 1ULL << (HST_FIXED_SHIFT - 1);
+ return last >> HST_FIXED_SHIFT;
+}
+
+static struct selector *alloc_selector(void)
+{
+ struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
+
+ if (s) {
+ INIT_LIST_HEAD(&s->valid_paths);
+ INIT_LIST_HEAD(&s->failed_paths);
+ spin_lock_init(&s->lock);
+ s->valid_count = 0;
+ }
+
+ return s;
+}
+
+/*
+ * Get the weight for a given time span.
+ */
+static u64 hst_weight(struct path_selector *ps, u64 delta)
+{
+ struct selector *s = ps->context;
+ int bucket = clamp(delta >> HST_BUCKET_SHIFT, 0ULL,
+ HST_WEIGHT_COUNT - 1);
+
+ return s->weights[bucket];
+}
+
+/*
+ * Set up the weights array.
+ *
+ * weights[len-1] = 0
+ * weights[n] = base ^ (n + 1)
+ */
+static void hst_set_weights(struct path_selector *ps, unsigned int base)
+{
+ struct selector *s = ps->context;
+ int i;
+
+ if (base >= HST_FIXED_1)
+ return;
+
+ for (i = 0; i < HST_WEIGHT_COUNT - 1; i++)
+ s->weights[i] = fixed_power(base, HST_FIXED_SHIFT, i + 1);
+ s->weights[HST_WEIGHT_COUNT - 1] = 0;
+}
+
+static int hst_create(struct path_selector *ps, unsigned int argc, char **argv)
+{
+ struct selector *s;
+ unsigned int base_weight = HST_FIXED_95;
+ unsigned int threshold_multiplier = 0;
+ char dummy;
+
+ /*
+ * Arguments: [<base_weight> [<threshold_multiplier>]]
+ * <base_weight>: Base weight for ema [0, 1024) 10-bit fixed point. A
+ * value of 0 will completely ignore any history.
+ * If not given, default (HST_FIXED_95) is used.
+ * <threshold_multiplier>: Minimum threshold multiplier for paths to
+ * be considered different. That is, a path is
+ * considered different iff (p1 > N * p2) where p1
+ * is the path with higher service time. A threshold
+ * of 1 or 0 has no effect. Defaults to 0.
+ */
+ if (argc > 2)
+ return -EINVAL;
+
+ if (argc && (sscanf(argv[0], "%u%c", &base_weight, &dummy) != 1 ||
+ base_weight >= HST_FIXED_1)) {
+ return -EINVAL;
+ }
+
+ if (argc > 1 && (sscanf(argv[1], "%u%c",
+ &threshold_multiplier, &dummy) != 1)) {
+ return -EINVAL;
+ }
+
+ s = alloc_selector();
+ if (!s)
+ return -ENOMEM;
+
+ ps->context = s;
+
+ hst_set_weights(ps, base_weight);
+ s->threshold_multiplier = threshold_multiplier;
+ return 0;
+}
+
+static void free_paths(struct list_head *paths)
+{
+ struct path_info *pi, *next;
+
+ list_for_each_entry_safe(pi, next, paths, list) {
+ list_del(&pi->list);
+ kfree(pi);
+ }
+}
+
+static void hst_destroy(struct path_selector *ps)
+{
+ struct selector *s = ps->context;
+
+ free_paths(&s->valid_paths);
+ free_paths(&s->failed_paths);
+ kfree(s);
+ ps->context = NULL;
+}
+
+static int hst_status(struct path_selector *ps, struct dm_path *path,
+ status_type_t type, char *result, unsigned int maxlen)
+{
+ unsigned int sz = 0;
+ struct path_info *pi;
+
+ if (!path) {
+ struct selector *s = ps->context;
+
+ DMEMIT("2 %u %u ", s->weights[0], s->threshold_multiplier);
+ } else {
+ pi = path->pscontext;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%llu %llu %llu ", pi->historical_service_time,
+ pi->outstanding, pi->stale_after);
+ break;
+ case STATUSTYPE_TABLE:
+ DMEMIT("0 ");
+ break;
+ }
+ }
+
+ return sz;
+}
+
+static int hst_add_path(struct path_selector *ps, struct dm_path *path,
+ int argc, char **argv, char **error)
+{
+ struct selector *s = ps->context;
+ struct path_info *pi;
+ unsigned int repeat_count = HST_MIN_IO;
+ char dummy;
+ unsigned long flags;
+
+ /*
+ * Arguments: [<repeat_count>]
+ * <repeat_count>: The number of I/Os before switching path.
+ * If not given, default (HST_MIN_IO) is used.
+ */
+ if (argc > 1) {
+ *error = "historical-service-time ps: incorrect number of arguments";
+ return -EINVAL;
+ }
+
+ if (argc && (sscanf(argv[0], "%u%c", &repeat_count, &dummy) != 1)) {
+ *error = "historical-service-time ps: invalid repeat count";
+ return -EINVAL;
+ }
+
+ /* allocate the path */
+ pi = kmalloc(sizeof(*pi), GFP_KERNEL);
+ if (!pi) {
+ *error = "historical-service-time ps: Error allocating path context";
+ return -ENOMEM;
+ }
+
+ pi->path = path;
+ pi->repeat_count = repeat_count;
+
+ pi->historical_service_time = HST_FIXED_1;
+
+ spin_lock_init(&pi->lock);
+ pi->outstanding = 0;
+
+ pi->stale_after = 0;
+ pi->last_finish = 0;
+
+ path->pscontext = pi;
+
+ spin_lock_irqsave(&s->lock, flags);
+ list_add_tail(&pi->list, &s->valid_paths);
+ s->valid_count++;
+ spin_unlock_irqrestore(&s->lock, flags);
+
+ return 0;
+}
+
+static void hst_fail_path(struct path_selector *ps, struct dm_path *path)
+{
+ struct selector *s = ps->context;
+ struct path_info *pi = path->pscontext;
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->lock, flags);
+ list_move(&pi->list, &s->failed_paths);
+ s->valid_count--;
+ spin_unlock_irqrestore(&s->lock, flags);
+}
+
+static int hst_reinstate_path(struct path_selector *ps, struct dm_path *path)
+{
+ struct selector *s = ps->context;
+ struct path_info *pi = path->pscontext;
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->lock, flags);
+ list_move_tail(&pi->list, &s->valid_paths);
+ s->valid_count++;
+ spin_unlock_irqrestore(&s->lock, flags);
+
+ return 0;
+}
+
+static void hst_fill_compare(struct path_info *pi, u64 *hst,
+ u64 *out, u64 *stale)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&pi->lock, flags);
+ *hst = pi->historical_service_time;
+ *out = pi->outstanding;
+ *stale = pi->stale_after;
+ spin_unlock_irqrestore(&pi->lock, flags);
+}
+
+/*
+ * Compare the estimated service time of 2 paths, pi1 and pi2,
+ * for the incoming I/O.
+ *
+ * Returns:
+ * < 0 : pi1 is better
+ * 0 : no difference between pi1 and pi2
+ * > 0 : pi2 is better
+ *
+ */
+static long long hst_compare(struct path_info *pi1, struct path_info *pi2,
+ u64 time_now, struct path_selector *ps)
+{
+ struct selector *s = ps->context;
+ u64 hst1, hst2;
+ long long out1, out2, stale1, stale2;
+ int pi2_better, over_threshold;
+
+ hst_fill_compare(pi1, &hst1, &out1, &stale1);
+ hst_fill_compare(pi2, &hst2, &out2, &stale2);
+
+ /* Check here if estimated latency for two paths are too similar.
+ * If this is the case, we skip extra calculation and just compare
+ * outstanding requests. In this case, any unloaded paths will
+ * be preferred.
+ */
+ if (hst1 > hst2)
+ over_threshold = hst1 > (s->threshold_multiplier * hst2);
+ else
+ over_threshold = hst2 > (s->threshold_multiplier * hst1);
+
+ if (!over_threshold)
+ return out1 - out2;
+
+ /*
+ * If an unloaded path is stale, choose it. If both paths are unloaded,
+ * choose path that is the most stale.
+ * (If one path is loaded, choose the other)
+ */
+ if ((!out1 && stale1 < time_now) || (!out2 && stale2 < time_now) ||
+ (!out1 && !out2))
+ return (!out2 * stale1) - (!out1 * stale2);
+
+ /* Compare estimated service time. If outstanding is the same, we
+ * don't need to multiply
+ */
+ if (out1 == out2) {
+ pi2_better = hst1 > hst2;
+ } else {
+ /* Potential overflow with out >= 1024 */
+ if (unlikely(out1 >= HST_MAX_INFLIGHT ||
+ out2 >= HST_MAX_INFLIGHT)) {
+ /* If over 1023 in-flights, we may overflow if hst
+ * is at max. (With this shift we still overflow at
+ * 1048576 in-flights, which is high enough).
+ */
+ hst1 >>= HST_FIXED_SHIFT;
+ hst2 >>= HST_FIXED_SHIFT;
+ }
+ pi2_better = (1 + out1) * hst1 > (1 + out2) * hst2;
+ }
+
+ /* In the case that the 'winner' is stale, limit to equal usage. */
+ if (pi2_better) {
+ if (stale2 < time_now)
+ return out1 - out2;
+ return 1;
+ }
+ if (stale1 < time_now)
+ return out1 - out2;
+ return -1;
+}
+
+static struct dm_path *hst_select_path(struct path_selector *ps,
+ size_t nr_bytes)
+{
+ struct selector *s = ps->context;
+ struct path_info *pi = NULL, *best = NULL;
+ u64 time_now = sched_clock();
+ struct dm_path *ret = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->lock, flags);
+ if (list_empty(&s->valid_paths))
+ goto out;
+
+ list_for_each_entry(pi, &s->valid_paths, list) {
+ if (!best || (hst_compare(pi, best, time_now, ps) < 0))
+ best = pi;
+ }
+
+ if (!best)
+ goto out;
+
+ /* Move last used path to end (least preferred in case of ties) */
+ list_move_tail(&best->list, &s->valid_paths);
+
+ ret = best->path;
+
+out:
+ spin_unlock_irqrestore(&s->lock, flags);
+ return ret;
+}
+
+static int hst_start_io(struct path_selector *ps, struct dm_path *path,
+ const struct path_selector_io_data *io_data)
+{
+ struct path_info *pi = path->pscontext;
+ unsigned long flags;
+
+ spin_lock_irqsave(&pi->lock, flags);
+ pi->outstanding++;
+ spin_unlock_irqrestore(&pi->lock, flags);
+
+ return 0;
+}
+
+static u64 path_service_time(struct path_info *pi, u64 start_time)
+{
+ u64 sched_now = ktime_get_ns();
+
+ /* if a previous disk request has finished after this IO was
+ * sent to the hardware, pretend the submission happened
+ * serially.
+ */
+ if (time_after64(pi->last_finish, start_time))
+ start_time = pi->last_finish;
+
+ pi->last_finish = sched_now;
+ if (time_before64(sched_now, start_time))
+ return 0;
+
+ return sched_now - start_time;
+}
+
+static int hst_end_io(struct path_selector *ps, struct dm_path *path,
+ const struct path_selector_io_data *io_data)
+{
+ struct path_info *pi = path->pscontext;
+ struct selector *s = ps->context;
+ unsigned long flags;
+ u64 st;
+
+ spin_lock_irqsave(&pi->lock, flags);
+
+ st = path_service_time(pi, io_data->io_start_time);
+ pi->outstanding--;
+ pi->historical_service_time =
+ fixed_ema(pi->historical_service_time,
+ min(st * HST_FIXED_1, HST_FIXED_MAX),
+ hst_weight(ps, st));
+
+ /*
+ * On request end, mark path as fresh. If a path hasn't
+ * finished any requests within the fresh period, the estimated
+ * service time is considered too optimistic and we limit the
+ * maximum requests on that path.
+ */
+ pi->stale_after = pi->last_finish +
+ (s->valid_count * (pi->historical_service_time >> HST_FIXED_SHIFT));
+
+ spin_unlock_irqrestore(&pi->lock, flags);
+
+ return 0;
+}
+
+static struct path_selector_type hst_ps = {
+ .name = "historical-service-time",
+ .module = THIS_MODULE,
+ .table_args = 1,
+ .info_args = 3,
+ .create = hst_create,
+ .destroy = hst_destroy,
+ .status = hst_status,
+ .add_path = hst_add_path,
+ .fail_path = hst_fail_path,
+ .reinstate_path = hst_reinstate_path,
+ .select_path = hst_select_path,
+ .start_io = hst_start_io,
+ .end_io = hst_end_io,
+};
+
+static int __init dm_hst_init(void)
+{
+ int r = dm_register_path_selector(&hst_ps);
+
+ if (r < 0)
+ DMERR("register failed %d", r);
+
+ DMINFO("version " HST_VERSION " loaded");
+
+ return r;
+}
+
+static void __exit dm_hst_exit(void)
+{
+ int r = dm_unregister_path_selector(&hst_ps);
+
+ if (r < 0)
+ DMERR("unregister failed %d", r);
+}
+
+module_init(dm_hst_init);
+module_exit(dm_hst_exit);
+
+MODULE_DESCRIPTION(DM_NAME " measured service time oriented path selector");
+MODULE_AUTHOR("Khazhismel Kumykov <khazhy@google.com>");
+MODULE_LICENSE("GPL");
--
2.26.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors
2020-04-28 0:51 ` [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors Gabriel Krisman Bertazi
@ 2020-04-28 17:20 ` Mike Snitzer
0 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2020-04-28 17:20 UTC (permalink / raw)
To: Gabriel Krisman Bertazi; +Cc: dm-devel, kernel, khazhy, agk
On Mon, Apr 27 2020 at 8:51pm -0400,
Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> Different selector will use different parameters, which means .io_start
> and .io_end will get their signatures modified to include more and more
> parameters. This encapsulates the data in a structure so we can
> simplify the interface for future users. For now it only passes
> nr_bytes, but HST will require start_time already.
>
> Cc: Khazhismel Kumykov <khazhy@google.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
I really don't see HST's need for start_time_ns in the path selector's
end_io hook as a solid justification for this encapsulation.
Especially in that the parameters needed for ps's start_io and end_io
really aren't symmetric. Imposing that they are just causes needless
code (an example of that is in patch 2/3).
So please drop this encapsulation.
Thanks,
Mike
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector
2020-04-28 0:51 ` [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector Gabriel Krisman Bertazi
@ 2020-04-28 17:25 ` Mike Snitzer
0 siblings, 0 replies; 9+ messages in thread
From: Mike Snitzer @ 2020-04-28 17:25 UTC (permalink / raw)
To: Gabriel Krisman Bertazi; +Cc: dm-devel, kernel, khazhy, agk
On Mon, Apr 27 2020 at 8:51pm -0400,
Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> HST need to know the IO start time in order to predict path
> performance. For request-based multipath use the block layer
> io_start_time, while for BIO use the dm_io start_time.
>
> The dm_start_time_ns_from_clone function was suggested and implemented
> by Mike Snitzer <snitzer@redhat.com>.
>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Khazhismel Kumykov <khazhy@google.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
> drivers/md/dm-mpath.c | 25 +++++++++++++++----------
> drivers/md/dm-path-selector.h | 1 +
> drivers/md/dm.c | 10 ++++++++++
> include/linux/device-mapper.h | 2 ++
> 4 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 1ef4fc2e745b..7af3249948be 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -500,8 +500,9 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> struct dm_mpath_io *mpio = get_mpio(map_context);
> struct request_queue *q;
> struct request *clone;
> - struct path_selector_io_data io_data = {
> + struct path_selector_io_data ps_io_data = {
> .nr_bytes = nr_bytes,
> + .io_start_time = rq->io_start_time_ns
> };
>
> /* Do we need to select a new pgpath? */
> @@ -552,7 +553,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> if (pgpath->pg->ps.type->start_io)
> pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
> &pgpath->path,
> - &io_data);
> + &ps_io_data);
> return DM_MAPIO_REMAPPED;
> }
>
> @@ -568,6 +569,7 @@ static void multipath_release_clone(struct request *clone,
> struct pgpath *pgpath = mpio->pgpath;
> struct path_selector_io_data ps_io_data = {
> .nr_bytes = mpio->nr_bytes,
> + .io_start_time = clone->io_start_time_ns,
> };
>
> if (pgpath && pgpath->pg->ps.type->end_io)
> @@ -623,8 +625,9 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
> struct dm_mpath_io *mpio)
> {
> struct pgpath *pgpath = __map_bio(m, bio);
> - struct path_selector_io_data io_data = {
> + struct path_selector_io_data ps_io_data = {
> .nr_bytes = mpio->nr_bytes,
> + .io_start_time = dm_start_time_ns_from_clone(bio)
> };
>
> if (IS_ERR(pgpath))
> @@ -646,7 +649,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio,
> if (pgpath->pg->ps.type->start_io)
> pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
> &pgpath->path,
> - &io_data);
> + &ps_io_data);
> return DM_MAPIO_REMAPPED;
> }
>
io_start_time_ns isn't needed by any path selector's start_io method.
Please drop that from start_io and only pass it to the end_io hook.
(the need for dm_start_time_ns_from_clone() to access the more nested
nature of a bio clone's start_time showcases why we should avoid
needless collection of parameters that aren't required).
Thanks,
Mike
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 0/3] Historical Service Time Path Selector
2020-04-28 0:51 [PATCH v2 0/3] Historical Service Time Path Selector Gabriel Krisman Bertazi
` (2 preceding siblings ...)
2020-04-28 0:51 ` [PATCH v2 3/3] md: Add Historical Service Time Path Selector Gabriel Krisman Bertazi
@ 2020-04-30 17:49 ` Gabriel Krisman Bertazi
2020-04-30 18:33 ` Mike Snitzer
3 siblings, 1 reply; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-04-30 17:49 UTC (permalink / raw)
To: agk; +Cc: dm-devel, kernel, khazhy, snitzer
Gabriel Krisman Bertazi <krisman@collabora.com> writes:
> Hi Mike,
>
> Please find an updated version of HST integrating the change you
> requested to also support BIO based multipath. I hope you don't mind me
> folding the function you implemented into patch 2. If you prefer, I can
> integrate a patch you provide into the series.
Hi Mike,
Sorry for the encapsulation patches, I'll pass the parameter directly on
v3.
> One interesting data point is that the selection performance on
> BIO-based is worse when compared to request-based. I tested a bit and I
> believe the reason is that the paths are more sticky because the bucket
> is too large, and it takes longer for HST to detect differences. By
> changing the bucket_size indirectly through the bucket_shift, the
> bio-based multipath performance improved, but I'm not sure this is a
> knob we want to expose to users. For now, please consider the patches
> below, for review.
>
The reason for the BIO-based being worse than request-based was that we
are comparing the jiffies_to_nsecs(jiffies) with ktime_get_ns(), which is not
accurate, given the different precision of the clocks. Problem is,
request-based mpath uses ktime_get_ns in the block layer, while
dm_io->start_time uses jiffies, and inside the path selector, we don't
have a way to distinguish those paths. I see a few ways forward, but in
the best interest of getting it right early, I'd like to hear what you
think it is best:
1) My best suggestion would be converting dm_io->start_time to use
ktime_get_ns. This has the advantage of simplifying dm_stats_account_io
and wouldn't affect the ABI of the accounting histogram. The only
downside, from what I can see is that ktime_get is slightly more
expensive than reading jiffies, which might be a problem according to
Documentation/admin-guide/device-mapper/statistics.rst. Is that really
a problem? I see your FIXME note on the function you implemented, but
are you suggesting exactly this or simply storing
jiffies_to_nsecs(jiffies) in dm_io->start_time?
2) Alternatively, pass end_io_time to the path selector as well, and do
the time collection outside the path selector. This makes the interface
look ugly, adds the cost anyway of doing ktime operations.
3) Alternatively, collect ktime start/end costs inside HST. This means
we need to match the IO on start_io and end_io from inside HST, which
doesn't seem like a good design either.
4) ?
As you can see, I'm leaning towards option 1. Would you be open to a
patch like the following? Completely untested, still need some work,
just to show the idea.
Khazhy, any thoughts?
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..eb7c7fae4bc6 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -23,7 +23,7 @@ struct dm_rq_target_io {
blk_status_t error;
union map_info info;
struct dm_stats_aux stats_aux;
- unsigned long duration_jiffies;
+ u64 duration_ns;
unsigned n_sectors;
unsigned completed;
};
@@ -132,10 +132,10 @@ static void rq_end_stats(struct mapped_device *md, struct request *orig)
{
if (unlikely(dm_stats_used(&md->stats))) {
struct dm_rq_target_io *tio = tio_from_request(orig);
- tio->duration_jiffies = jiffies - tio->duration_jiffies;
+ tio->duration_ns = ktime_get_ns() - tio->duration_ns;
dm_stats_account_io(&md->stats, rq_data_dir(orig),
blk_rq_pos(orig), tio->n_sectors, true,
- tio->duration_jiffies, &tio->stats_aux);
+ tio->duration_ns, &tio->stats_aux);
}
}
@@ -451,7 +451,7 @@ static void dm_start_request(struct mapped_device *md, struct request *orig)
if (unlikely(dm_stats_used(&md->stats))) {
struct dm_rq_target_io *tio = tio_from_request(orig);
- tio->duration_jiffies = jiffies;
+ tio->duration_ns = ktime_get_ns();
tio->n_sectors = blk_rq_sectors(orig);
dm_stats_account_io(&md->stats, rq_data_dir(orig),
blk_rq_pos(orig), tio->n_sectors, false, 0,
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index 71417048256a..4353dd04ec42 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -514,7 +514,7 @@ static void dm_stat_round(struct dm_stat *s, struct dm_stat_shared *shared,
static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
int idx, sector_t len,
struct dm_stats_aux *stats_aux, bool end,
- unsigned long duration_jiffies)
+ u64 duration_ns)
{
struct dm_stat_shared *shared = &s->stat_shared[entry];
struct dm_stat_percpu *p;
@@ -553,11 +553,11 @@ static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
p->ios[idx] += 1;
p->merges[idx] += stats_aux->merged;
if (!(s->stat_flags & STAT_PRECISE_TIMESTAMPS)) {
- p->ticks[idx] += duration_jiffies;
- duration = jiffies_to_msecs(duration_jiffies);
+ p->ticks[idx] += nsecs_to_jiffies(duration_ns);
+ duration = div64_u64(duration_ns, 1000000);
} else {
- p->ticks[idx] += stats_aux->duration_ns;
- duration = stats_aux->duration_ns;
+ p->ticks[idx] += duration_ns;
+ duration = duration_ns;
}
if (s->n_histogram_entries) {
unsigned lo = 0, hi = s->n_histogram_entries + 1;
@@ -583,7 +583,7 @@ static void dm_stat_for_entry(struct dm_stat *s, size_t entry,
static void __dm_stat_bio(struct dm_stat *s, int bi_rw,
sector_t bi_sector, sector_t end_sector,
- bool end, unsigned long duration_jiffies,
+ bool end, u64 duration_ns,
struct dm_stats_aux *stats_aux)
{
sector_t rel_sector, offset, todo, fragment_len;
@@ -612,7 +612,7 @@ static void __dm_stat_bio(struct dm_stat *s, int bi_rw,
if (fragment_len > s->step - offset)
fragment_len = s->step - offset;
dm_stat_for_entry(s, entry, bi_rw, fragment_len,
- stats_aux, end, duration_jiffies);
+ stats_aux, end, duration_ns);
todo -= fragment_len;
entry++;
offset = 0;
@@ -621,13 +621,11 @@ static void __dm_stat_bio(struct dm_stat *s, int bi_rw,
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
- unsigned long duration_jiffies,
- struct dm_stats_aux *stats_aux)
+ u64 duration_ns, struct dm_stats_aux *stats_aux)
{
struct dm_stat *s;
sector_t end_sector;
struct dm_stats_last_position *last;
- bool got_precise_time;
if (unlikely(!bi_sectors))
return;
@@ -651,17 +649,9 @@ void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
rcu_read_lock();
- got_precise_time = false;
- list_for_each_entry_rcu(s, &stats->list, list_entry) {
- if (s->stat_flags & STAT_PRECISE_TIMESTAMPS && !got_precise_time) {
- if (!end)
- stats_aux->duration_ns = ktime_to_ns(ktime_get());
- else
- stats_aux->duration_ns = ktime_to_ns(ktime_get()) - stats_aux->duration_ns;
- got_precise_time = true;
- }
- __dm_stat_bio(s, bi_rw, bi_sector, end_sector, end, duration_jiffies, stats_aux);
- }
+ list_for_each_entry_rcu(s, &stats->list, list_entry)
+ __dm_stat_bio(s, bi_rw, bi_sector, end_sector, end,
+ duration_ns, stats_aux);
rcu_read_unlock();
}
diff --git a/drivers/md/dm-stats.h b/drivers/md/dm-stats.h
index 2ddfae678f32..e9755f14ce68 100644
--- a/drivers/md/dm-stats.h
+++ b/drivers/md/dm-stats.h
@@ -19,7 +19,6 @@ struct dm_stats {
struct dm_stats_aux {
bool merged;
- unsigned long long duration_ns;
};
void dm_stats_init(struct dm_stats *st);
@@ -32,8 +31,7 @@ int dm_stats_message(struct mapped_device *md, unsigned argc, char **argv,
void dm_stats_account_io(struct dm_stats *stats, unsigned long bi_rw,
sector_t bi_sector, unsigned bi_sectors, bool end,
- unsigned long duration_jiffies,
- struct dm_stats_aux *aux);
+ u64 duration_ns, struct dm_stats_aux *aux);
static inline bool dm_stats_used(struct dm_stats *st)
{
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c5deee1bea67..10ca0d0c3c46 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -689,7 +689,7 @@ static void start_io_acct(struct dm_io *io)
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
- io->start_time = jiffies;
+ io->start_time = ktime_get_ns();
generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
&dm_disk(md)->part0);
@@ -704,10 +704,10 @@ static void end_io_acct(struct dm_io *io)
{
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
- unsigned long duration = jiffies - io->start_time;
+ u64 duration = ktime_get_ns() - io->start_time;
generic_end_io_acct(md->queue, bio_op(bio), &dm_disk(md)->part0,
- io->start_time);
+ nsecs_to_jiffies(io->start_time));
if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
--
Gabriel Krisman Bertazi
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 0/3] Historical Service Time Path Selector
2020-04-30 17:49 ` [PATCH v2 0/3] " Gabriel Krisman Bertazi
@ 2020-04-30 18:33 ` Mike Snitzer
2020-05-01 18:03 ` Mikulas Patocka
0 siblings, 1 reply; 9+ messages in thread
From: Mike Snitzer @ 2020-04-30 18:33 UTC (permalink / raw)
To: Gabriel Krisman Bertazi
Cc: Bryn Reeves, khazhy, dm-devel, Mikulas Patocka, kernel, agk
On Thu, Apr 30 2020 at 1:49pm -0400,
Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> Gabriel Krisman Bertazi <krisman@collabora.com> writes:
>
> > Hi Mike,
> >
> > Please find an updated version of HST integrating the change you
> > requested to also support BIO based multipath. I hope you don't mind me
> > folding the function you implemented into patch 2. If you prefer, I can
> > integrate a patch you provide into the series.
>
> Hi Mike,
>
> Sorry for the encapsulation patches, I'll pass the parameter directly on
> v3.
Great, thanks.
> > One interesting data point is that the selection performance on
> > BIO-based is worse when compared to request-based. I tested a bit and I
> > believe the reason is that the paths are more sticky because the bucket
> > is too large, and it takes longer for HST to detect differences. By
> > changing the bucket_size indirectly through the bucket_shift, the
> > bio-based multipath performance improved, but I'm not sure this is a
> > knob we want to expose to users. For now, please consider the patches
> > below, for review.
> >
>
> The reason for the BIO-based being worse than request-based was that we
> are comparing the jiffies_to_nsecs(jiffies) with ktime_get_ns(), which is not
> accurate, given the different precision of the clocks. Problem is,
> request-based mpath uses ktime_get_ns in the block layer, while
> dm_io->start_time uses jiffies, and inside the path selector, we don't
> have a way to distinguish those paths. I see a few ways forward, but in
> the best interest of getting it right early, I'd like to hear what you
> think it is best:
>
> 1) My best suggestion would be converting dm_io->start_time to use
> ktime_get_ns. This has the advantage of simplifying dm_stats_account_io
> and wouldn't affect the ABI of the accounting histogram. The only
> downside, from what I can see is that ktime_get is slightly more
> expensive than reading jiffies, which might be a problem according to
> Documentation/admin-guide/device-mapper/statistics.rst. Is that really
> a problem?
We should check with Mikulas (now cc'd) but given the speed improvements
of storage we'll likely need to use "precise_timestamps" going forward
anyway.
So I'm inclined to just take the hit of ktime_get_ns(). Mikulas would
you be OK with this?
> I see your FIXME note on the function you implemented, but
> are you suggesting exactly this or simply storing
> jiffies_to_nsecs(jiffies) in dm_io->start_time?
Yes, I am suggesting exactly this (1) in that FIXME.
> As you can see, I'm leaning towards option 1. Would you be open to a
> patch like the following? Completely untested, still need some work,
> just to show the idea.
Yes, follow-on work would be something like the patch you provided.
Only nit I see is we should rename io->start_time to io->start_time_ns
But please keep this patch (that does the work to address the "FIXME" I
added) separate from your HST patchset. The need to improve bio-based
HST is a secondary concern given bio-based multipath isn't widely
deployed AFAICT. So we can address the conversion to nanoseconds with
later followon work that builds on your initial HST patchset.
Thanks,
Mike
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 0/3] Historical Service Time Path Selector
2020-04-30 18:33 ` Mike Snitzer
@ 2020-05-01 18:03 ` Mikulas Patocka
0 siblings, 0 replies; 9+ messages in thread
From: Mikulas Patocka @ 2020-05-01 18:03 UTC (permalink / raw)
To: Mike Snitzer
Cc: Bryn Reeves, khazhy, dm-devel, kernel, Gabriel Krisman Bertazi,
agk
On Thu, 30 Apr 2020, Mike Snitzer wrote:
> On Thu, Apr 30 2020 at 1:49pm -0400,
> Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
> > Gabriel Krisman Bertazi <krisman@collabora.com> writes:
> >
> > > Hi Mike,
> > >
> > > Please find an updated version of HST integrating the change you
> > > requested to also support BIO based multipath. I hope you don't mind me
> > > folding the function you implemented into patch 2. If you prefer, I can
> > > integrate a patch you provide into the series.
> >
> > Hi Mike,
> >
> > Sorry for the encapsulation patches, I'll pass the parameter directly on
> > v3.
>
> Great, thanks.
>
> > > One interesting data point is that the selection performance on
> > > BIO-based is worse when compared to request-based. I tested a bit and I
> > > believe the reason is that the paths are more sticky because the bucket
> > > is too large, and it takes longer for HST to detect differences. By
> > > changing the bucket_size indirectly through the bucket_shift, the
> > > bio-based multipath performance improved, but I'm not sure this is a
> > > knob we want to expose to users. For now, please consider the patches
> > > below, for review.
> > >
> >
> > The reason for the BIO-based being worse than request-based was that we
> > are comparing the jiffies_to_nsecs(jiffies) with ktime_get_ns(), which is not
> > accurate, given the different precision of the clocks. Problem is,
> > request-based mpath uses ktime_get_ns in the block layer, while
> > dm_io->start_time uses jiffies, and inside the path selector, we don't
> > have a way to distinguish those paths. I see a few ways forward, but in
> > the best interest of getting it right early, I'd like to hear what you
> > think it is best:
> >
> > 1) My best suggestion would be converting dm_io->start_time to use
> > ktime_get_ns. This has the advantage of simplifying dm_stats_account_io
> > and wouldn't affect the ABI of the accounting histogram. The only
> > downside, from what I can see is that ktime_get is slightly more
> > expensive than reading jiffies, which might be a problem according to
> > Documentation/admin-guide/device-mapper/statistics.rst. Is that really
> > a problem?
>
> We should check with Mikulas (now cc'd) but given the speed improvements
> of storage we'll likely need to use "precise_timestamps" going forward
> anyway.
>
> So I'm inclined to just take the hit of ktime_get_ns(). Mikulas would
> you be OK with this?
You can use ktime_get_ns() inside the multipath target. But I wouldn't use
it in the general device mapper code because it would slow down
everything.
I suggest to use ktime_get_ns() in the multipath map routine and in the
the end_io routine and subract these two values to get precise time. I
think that we don't need to hack generic dm code with that.
Mikulas
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-05-01 18:03 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-28 0:51 [PATCH v2 0/3] Historical Service Time Path Selector Gabriel Krisman Bertazi
2020-04-28 0:51 ` [PATCH v2 1/3] md: multipath: Encapsulate parameters passed to selectors Gabriel Krisman Bertazi
2020-04-28 17:20 ` Mike Snitzer
2020-04-28 0:51 ` [PATCH v2 2/3] md: multipath: Pass io_start_time to the path selector Gabriel Krisman Bertazi
2020-04-28 17:25 ` Mike Snitzer
2020-04-28 0:51 ` [PATCH v2 3/3] md: Add Historical Service Time Path Selector Gabriel Krisman Bertazi
2020-04-30 17:49 ` [PATCH v2 0/3] " Gabriel Krisman Bertazi
2020-04-30 18:33 ` Mike Snitzer
2020-05-01 18:03 ` Mikulas Patocka
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.