* [PATCH 1/5] crush: fix off-by-one errors in total_tries refactor
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
@ 2014-03-19 16:09 ` Ilya Dryomov
2014-03-19 16:09 ` [PATCH 2/5] crush: allow crush rules to set (re)tries counts to 0 Ilya Dryomov
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Ilya Dryomov @ 2014-03-19 16:09 UTC (permalink / raw)
To: ceph-devel
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH
code to allow adjustment of the retry counts on a per-pool basis. That
commit had an off-by-one bug: the previous "tries" counter was a *retry*
count, not a *try* count, but the new code was passing in 1 meaning
there should be no retries.
Fix the ftotal vs tries comparison to use < instead of <= to fix the
problem. Note that the original code used <= here, which means the
global "choose_total_tries" tunable is actually counting retries.
Compensate for that by adding 1 in crush_do_rule when we pull the tunable
into the local variable.
This was noticed looking at output from a user provided osdmap.
Unfortunately the map doesn't illustrate the change in mapping behavior
and I haven't managed to construct one yet that does. Inspection of the
crush debug output now aligns with prior versions, though.
Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
net/ceph/crush/mapper.c | 46 +++++++++++++++++++++++++++-------------------
1 file changed, 27 insertions(+), 19 deletions(-)
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index b703790b4e44..074bb2a5e675 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -292,8 +292,8 @@ static int is_out(const struct crush_map *map,
* @outpos: our position in that vector
* @tries: number of attempts to make
* @recurse_tries: number of attempts to have recursive chooseleaf make
- * @local_tries: localized retries
- * @local_fallback_tries: localized fallback retries
+ * @local_retries: localized retries
+ * @local_fallback_retries: localized fallback retries
* @recurse_to_leaf: true if we want one device under each item of given type (chooseleaf instead of choose)
* @out2: second output vector for leaf items (if @recurse_to_leaf)
*/
@@ -304,8 +304,8 @@ static int crush_choose_firstn(const struct crush_map *map,
int *out, int outpos,
unsigned int tries,
unsigned int recurse_tries,
- unsigned int local_tries,
- unsigned int local_fallback_tries,
+ unsigned int local_retries,
+ unsigned int local_fallback_retries,
int recurse_to_leaf,
int *out2)
{
@@ -344,9 +344,9 @@ static int crush_choose_firstn(const struct crush_map *map,
reject = 1;
goto reject;
}
- if (local_fallback_tries > 0 &&
+ if (local_fallback_retries > 0 &&
flocal >= (in->size>>1) &&
- flocal > local_fallback_tries)
+ flocal > local_fallback_retries)
item = bucket_perm_choose(in, x, r);
else
item = crush_bucket_choose(in, x, r);
@@ -393,8 +393,8 @@ static int crush_choose_firstn(const struct crush_map *map,
x, outpos+1, 0,
out2, outpos,
recurse_tries, 0,
- local_tries,
- local_fallback_tries,
+ local_retries,
+ local_fallback_retries,
0,
NULL) <= outpos)
/* didn't get leaf */
@@ -420,14 +420,14 @@ reject:
ftotal++;
flocal++;
- if (collide && flocal <= local_tries)
+ if (collide && flocal <= local_retries)
/* retry locally a few times */
retry_bucket = 1;
- else if (local_fallback_tries > 0 &&
- flocal <= in->size + local_fallback_tries)
+ else if (local_fallback_retries > 0 &&
+ flocal <= in->size + local_fallback_retries)
/* exhaustive bucket search */
retry_bucket = 1;
- else if (ftotal <= tries)
+ else if (ftotal < tries)
/* then retry descent */
retry_descent = 1;
else
@@ -640,10 +640,18 @@ int crush_do_rule(const struct crush_map *map,
__u32 step;
int i, j;
int numrep;
- int choose_tries = map->choose_total_tries;
- int choose_local_tries = map->choose_local_tries;
- int choose_local_fallback_tries = map->choose_local_fallback_tries;
+ /*
+ * the original choose_total_tries value was off by one (it
+ * counted "retries" and not "tries"). add one.
+ */
+ int choose_tries = map->choose_total_tries + 1;
int choose_leaf_tries = 0;
+ /*
+ * the local tries values were counted as "retries", though,
+ * and need no adjustment
+ */
+ int choose_local_retries = map->choose_local_tries;
+ int choose_local_fallback_retries = map->choose_local_fallback_tries;
if ((__u32)ruleno >= map->max_rules) {
dprintk(" bad ruleno %d\n", ruleno);
@@ -677,12 +685,12 @@ int crush_do_rule(const struct crush_map *map,
case CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES:
if (curstep->arg1 > 0)
- choose_local_tries = curstep->arg1;
+ choose_local_retries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES:
if (curstep->arg1 > 0)
- choose_local_fallback_tries = curstep->arg1;
+ choose_local_fallback_retries = curstep->arg1;
break;
case CRUSH_RULE_CHOOSELEAF_FIRSTN:
@@ -734,8 +742,8 @@ int crush_do_rule(const struct crush_map *map,
o+osize, j,
choose_tries,
recurse_tries,
- choose_local_tries,
- choose_local_fallback_tries,
+ choose_local_retries,
+ choose_local_fallback_retries,
recurse_to_leaf,
c+osize);
} else {
--
1.7.10.4
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 2/5] crush: allow crush rules to set (re)tries counts to 0
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
2014-03-19 16:09 ` [PATCH 1/5] crush: fix off-by-one errors in total_tries refactor Ilya Dryomov
@ 2014-03-19 16:09 ` Ilya Dryomov
2014-03-19 16:09 ` [PATCH 3/5] crush: add chooseleaf_vary_r tunable Ilya Dryomov
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Ilya Dryomov @ 2014-03-19 16:09 UTC (permalink / raw)
To: ceph-devel
These two fields are misnomers; they are *retry* counts.
Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
net/ceph/crush/mapper.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 074bb2a5e675..b3fb84903b30 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -684,12 +684,12 @@ int crush_do_rule(const struct crush_map *map,
break;
case CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES:
- if (curstep->arg1 > 0)
+ if (curstep->arg1 >= 0)
choose_local_retries = curstep->arg1;
break;
case CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES:
- if (curstep->arg1 > 0)
+ if (curstep->arg1 >= 0)
choose_local_fallback_retries = curstep->arg1;
break;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 3/5] crush: add chooseleaf_vary_r tunable
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
2014-03-19 16:09 ` [PATCH 1/5] crush: fix off-by-one errors in total_tries refactor Ilya Dryomov
2014-03-19 16:09 ` [PATCH 2/5] crush: allow crush rules to set (re)tries counts to 0 Ilya Dryomov
@ 2014-03-19 16:09 ` Ilya Dryomov
2014-03-19 16:09 ` [PATCH 4/5] crush: add SET_CHOOSELEAF_VARY_R step Ilya Dryomov
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Ilya Dryomov @ 2014-03-19 16:09 UTC (permalink / raw)
To: ceph-devel
The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call. That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.
Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.
Note that this was done from the get-go for the new crush_choose_indep
algorithm.
This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.
Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
include/linux/crush/crush.h | 6 ++++++
net/ceph/crush/mapper.c | 30 ++++++++++++++++++++++++------
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index acaa5615d634..75f36a6c7f67 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -173,6 +173,12 @@ struct crush_map {
* apply to a collision: in that case we will retry as we used
* to. */
__u32 chooseleaf_descend_once;
+
+ /* if non-zero, feed r into chooseleaf, bit-shifted right by (r-1)
+ * bits. a value of 1 is best for new clusters. for legacy clusters
+ * that want to limit reshuffling, a value of 3 or 4 will make the
+ * mappings line up a bit better with previous mappings. */
+ __u8 chooseleaf_vary_r;
};
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index b3fb84903b30..947150cde297 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -295,7 +295,9 @@ static int is_out(const struct crush_map *map,
* @local_retries: localized retries
* @local_fallback_retries: localized fallback retries
* @recurse_to_leaf: true if we want one device under each item of given type (chooseleaf instead of choose)
+ * @vary_r: pass r to recursive calls
* @out2: second output vector for leaf items (if @recurse_to_leaf)
+ * @parent_r: r value passed from the parent
*/
static int crush_choose_firstn(const struct crush_map *map,
struct crush_bucket *bucket,
@@ -307,7 +309,9 @@ static int crush_choose_firstn(const struct crush_map *map,
unsigned int local_retries,
unsigned int local_fallback_retries,
int recurse_to_leaf,
- int *out2)
+ unsigned int vary_r,
+ int *out2,
+ int parent_r)
{
int rep;
unsigned int ftotal, flocal;
@@ -319,8 +323,11 @@ static int crush_choose_firstn(const struct crush_map *map,
int itemtype;
int collide, reject;
- dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d\n", recurse_to_leaf ? "_LEAF" : "",
- bucket->id, x, outpos, numrep);
+ dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d tries %d recurse_tries %d local_retries %d local_fallback_retries %d parent_r %d\n",
+ recurse_to_leaf ? "_LEAF" : "",
+ bucket->id, x, outpos, numrep,
+ tries, recurse_tries, local_retries, local_fallback_retries,
+ parent_r);
for (rep = outpos; rep < numrep; rep++) {
/* keep trying until we get a non-out, non-colliding item */
@@ -335,7 +342,7 @@ static int crush_choose_firstn(const struct crush_map *map,
do {
collide = 0;
retry_bucket = 0;
- r = rep;
+ r = rep + parent_r;
/* r' = r + f_total */
r += ftotal;
@@ -387,6 +394,11 @@ static int crush_choose_firstn(const struct crush_map *map,
reject = 0;
if (!collide && recurse_to_leaf) {
if (item < 0) {
+ int sub_r;
+ if (vary_r)
+ sub_r = r >> (vary_r-1);
+ else
+ sub_r = 0;
if (crush_choose_firstn(map,
map->buckets[-1-item],
weight, weight_max,
@@ -396,7 +408,9 @@ static int crush_choose_firstn(const struct crush_map *map,
local_retries,
local_fallback_retries,
0,
- NULL) <= outpos)
+ vary_r,
+ NULL,
+ sub_r) <= outpos)
/* didn't get leaf */
reject = 1;
} else {
@@ -653,6 +667,8 @@ int crush_do_rule(const struct crush_map *map,
int choose_local_retries = map->choose_local_tries;
int choose_local_fallback_retries = map->choose_local_fallback_tries;
+ int vary_r = map->chooseleaf_vary_r;
+
if ((__u32)ruleno >= map->max_rules) {
dprintk(" bad ruleno %d\n", ruleno);
return 0;
@@ -745,7 +761,9 @@ int crush_do_rule(const struct crush_map *map,
choose_local_retries,
choose_local_fallback_retries,
recurse_to_leaf,
- c+osize);
+ vary_r,
+ c+osize,
+ 0);
} else {
crush_choose_indep(
map,
--
1.7.10.4
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 4/5] crush: add SET_CHOOSELEAF_VARY_R step
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
` (2 preceding siblings ...)
2014-03-19 16:09 ` [PATCH 3/5] crush: add chooseleaf_vary_r tunable Ilya Dryomov
@ 2014-03-19 16:09 ` Ilya Dryomov
2014-03-19 16:09 ` [PATCH 5/5] crush: support chooseleaf_vary_r tunable (tunables3) by default Ilya Dryomov
2014-03-27 23:39 ` [PATCH 0/5] wip-tunables3 Josh Durgin
5 siblings, 0 replies; 7+ messages in thread
From: Ilya Dryomov @ 2014-03-19 16:09 UTC (permalink / raw)
To: ceph-devel
This lets you adjust the vary_r tunable on a per-rule basis.
Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
include/linux/crush/crush.h | 1 +
net/ceph/crush/mapper.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index 75f36a6c7f67..4fad5f8ee01d 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -51,6 +51,7 @@ enum {
CRUSH_RULE_SET_CHOOSELEAF_TRIES = 9, /* override chooseleaf_descend_once */
CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES = 10,
CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES = 11,
+ CRUSH_RULE_SET_CHOOSELEAF_VARY_R = 12
};
/*
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 947150cde297..a1ef53c04415 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -709,6 +709,11 @@ int crush_do_rule(const struct crush_map *map,
choose_local_fallback_retries = curstep->arg1;
break;
+ case CRUSH_RULE_SET_CHOOSELEAF_VARY_R:
+ if (curstep->arg1 >= 0)
+ vary_r = curstep->arg1;
+ break;
+
case CRUSH_RULE_CHOOSELEAF_FIRSTN:
case CRUSH_RULE_CHOOSE_FIRSTN:
firstn = 1;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 5/5] crush: support chooseleaf_vary_r tunable (tunables3) by default
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
` (3 preceding siblings ...)
2014-03-19 16:09 ` [PATCH 4/5] crush: add SET_CHOOSELEAF_VARY_R step Ilya Dryomov
@ 2014-03-19 16:09 ` Ilya Dryomov
2014-03-27 23:39 ` [PATCH 0/5] wip-tunables3 Josh Durgin
5 siblings, 0 replies; 7+ messages in thread
From: Ilya Dryomov @ 2014-03-19 16:09 UTC (permalink / raw)
To: ceph-devel
Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
include/linux/ceph/ceph_features.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 138448f766b4..77c097fe9ea9 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -43,6 +43,13 @@
#define CEPH_FEATURE_CRUSH_V2 (1ULL<<36) /* new indep; SET_* steps */
#define CEPH_FEATURE_EXPORT_PEER (1ULL<<37)
#define CEPH_FEATURE_OSD_ERASURE_CODES (1ULL<<38)
+#define CEPH_FEATURE_OSD_TMAP2OMAP (1ULL<<38) /* overlap with EC */
+/* The process supports new-style OSDMap encoding. Monitors also use
+ this bit to determine if peers support NAK messages. */
+#define CEPH_FEATURE_OSDMAP_ENC (1ULL<<39)
+#define CEPH_FEATURE_MDS_INLINE_DATA (1ULL<<40)
+#define CEPH_FEATURE_CRUSH_TUNABLES3 (1ULL<<41)
+#define CEPH_FEATURE_OSD_PRIMARY_AFFINITY (1ULL<<41) /* overlap w/ tunables3 */
/*
* The introduction of CEPH_FEATURE_OSD_SNAPMAPPER caused the feature
@@ -82,7 +89,8 @@ static inline u64 ceph_sanitize_features(u64 features)
CEPH_FEATURE_OSDHASHPSPOOL | \
CEPH_FEATURE_OSD_CACHEPOOL | \
CEPH_FEATURE_CRUSH_V2 | \
- CEPH_FEATURE_EXPORT_PEER)
+ CEPH_FEATURE_EXPORT_PEER | \
+ CEPH_FEATURE_CRUSH_TUNABLES3)
#define CEPH_FEATURES_REQUIRED_DEFAULT \
(CEPH_FEATURE_NOSRCADDR | \
--
1.7.10.4
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH 0/5] wip-tunables3
2014-03-19 16:09 [PATCH 0/5] wip-tunables3 Ilya Dryomov
` (4 preceding siblings ...)
2014-03-19 16:09 ` [PATCH 5/5] crush: support chooseleaf_vary_r tunable (tunables3) by default Ilya Dryomov
@ 2014-03-27 23:39 ` Josh Durgin
5 siblings, 0 replies; 7+ messages in thread
From: Josh Durgin @ 2014-03-27 23:39 UTC (permalink / raw)
To: Ilya Dryomov, ceph-devel
On 03/19/2014 09:09 AM, Ilya Dryomov wrote:
> Hello,
>
> This series updates the kernel implementation of CRUSH with a couple of
> fixes and a new chooseleaf_vary_r tunable (all ported from ceph.git).
> TUNABLES3 feature bit is shared with PRIMARY_AFFINITY, which will also
> be posted in a couple of days in order to get everything into 3.15.
>
> Thanks,
>
> Ilya
>
>
> Ilya Dryomov (5):
> crush: fix off-by-one errors in total_tries refactor
> crush: allow crush rules to set (re)tries counts to 0
> crush: add chooseleaf_vary_r tunable
> crush: add SET_CHOOSELEAF_VARY_R step
> crush: support chooseleaf_vary_r tunable (tunables3) by default
>
> include/linux/ceph/ceph_features.h | 10 ++++-
> include/linux/crush/crush.h | 7 +++
> net/ceph/crush/mapper.c | 85 ++++++++++++++++++++++++------------
> 3 files changed, 74 insertions(+), 28 deletions(-)
>
These all look good.
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
^ permalink raw reply [flat|nested] 7+ messages in thread