* [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY
@ 2014-03-27 18:17 Ilya Dryomov
  2014-03-27 18:17 ` [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show() Ilya Dryomov
                   ` (32 more replies)
  0 siblings, 33 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Hello,

This is on top of wip-tunables3, which I posted a week ago, and brings
support for the new osdmap encoding (OSDMAP_ENC feature bit),
primary_temp and primary affinity (PRIMARY_AFFINITY feature bit) to the
kernel client, along with some cleanups.  The PRIMARY_AFFINITY feature
bit is shared with CRUSH_TUNABLES3, so the wip-primary branch contains
both this series and the chooseleaf_vary_r changes.

- 01-16/33, 24-25/33 common ground + misc fixes and cleanups
- 17-23/33 new osdmap encoding + infrastructure
- 26-28/33, 32/33 refactor pg -> (osd set, primary) code paths
- 29-31/33, 33/33 primary_temp and primary affinity logic

Thanks,

                Ilya



Ilya Dryomov (33):
  libceph: refer to osdmap directly in osdmap_show()
  libceph: do not prefix osd lines with \t in debugfs output
  libceph: dump pg_temp mappings to debugfs
  libceph: dump osdmap and enhance output on decode errors
  libceph: split osdmap allocation and decode steps
  libceph: fixup error handling in osdmap_decode()
  libceph: safely decode max_osd value in osdmap_decode()
  libceph: assert length of osdmap osd arrays
  libceph: fix crush_decode() call site in osdmap_decode()
  libceph: fixup error handling in osdmap_apply_incremental()
  libceph: nuke bogus encoding version check in
    osdmap_apply_incremental()
  libceph: fix and clarify ceph_decode_need() sizes
  libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
  libceph: introduce decode{,_new}_pools() and switch to them
  libceph: switch osdmap_set_max_osd() to krealloc()
  libceph: introduce decode{,_new}_pg_temp() and switch to them
  libceph: introduce get_osdmap_client_data_v()
  libceph: generalize ceph_pg_mapping
  libceph: primary_temp infrastructure
  libceph: primary_temp decode bits
  libceph: primary_affinity infrastructure
  libceph: primary_affinity decode bits
  libceph: enable OSDMAP_ENC feature bit
  libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  libceph: ceph_can_shift_osds(pool) and pool type defines
  libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  libceph: introduce apply_temps() helper
  libceph: switch ceph_calc_pg_acting() to new helpers
  libceph: return primary from ceph_calc_pg_acting()
  libceph: add support for primary_temp mappings
  libceph: add support for osd primary affinity
  libceph: redo ceph_calc_pg_primary() in terms of
    ceph_calc_pg_acting()
  libceph: enable PRIMARY_AFFINITY feature bit

 include/linux/ceph/ceph_features.h |    4 +-
 include/linux/ceph/osdmap.h        |   47 +-
 include/linux/ceph/rados.h         |    9 +-
 net/ceph/debugfs.c                 |   49 +-
 net/ceph/osd_client.c              |   12 +-
 net/ceph/osdmap.c                  |  963 ++++++++++++++++++++++++++----------
 6 files changed, 797 insertions(+), 287 deletions(-)

-- 
1.7.10.4



* [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:09   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output Ilya Dryomov
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

To make it more readable and save screen space.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/debugfs.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 258a382e75ed..d225842c7b41 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -53,34 +53,36 @@ static int osdmap_show(struct seq_file *s, void *p)
 {
 	int i;
 	struct ceph_client *client = s->private;
+	struct ceph_osdmap *map = client->osdc.osdmap;
 	struct rb_node *n;
 
-	if (client->osdc.osdmap == NULL)
+	if (map == NULL)
 		return 0;
-	seq_printf(s, "epoch %d\n", client->osdc.osdmap->epoch);
+
+	seq_printf(s, "epoch %d\n", map->epoch);
 	seq_printf(s, "flags%s%s\n",
-		   (client->osdc.osdmap->flags & CEPH_OSDMAP_NEARFULL) ?
-		   " NEARFULL" : "",
-		   (client->osdc.osdmap->flags & CEPH_OSDMAP_FULL) ?
-		   " FULL" : "");
-	for (n = rb_first(&client->osdc.osdmap->pg_pools); n; n = rb_next(n)) {
+		   (map->flags & CEPH_OSDMAP_NEARFULL) ?  " NEARFULL" : "",
+		   (map->flags & CEPH_OSDMAP_FULL) ?  " FULL" : "");
+
+	for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
 		struct ceph_pg_pool_info *pool =
 			rb_entry(n, struct ceph_pg_pool_info, node);
+
 		seq_printf(s, "pg_pool %llu pg_num %d / %d\n",
 			   (unsigned long long)pool->id, pool->pg_num,
 			   pool->pg_num_mask);
 	}
-	for (i = 0; i < client->osdc.osdmap->max_osd; i++) {
-		struct ceph_entity_addr *addr =
-			&client->osdc.osdmap->osd_addr[i];
-		int state = client->osdc.osdmap->osd_state[i];
+	for (i = 0; i < map->max_osd; i++) {
+		struct ceph_entity_addr *addr = &map->osd_addr[i];
+		int state = map->osd_state[i];
 		char sb[64];
 
 		seq_printf(s, "\tosd%d\t%s\t%3d%%\t(%s)\n",
 			   i, ceph_pr_addr(&addr->in_addr),
-			   ((client->osdc.osdmap->osd_weight[i]*100) >> 16),
+			   ((map->osd_weight[i]*100) >> 16),
 			   ceph_osdmap_state_str(sb, sizeof(sb), state));
 	}
+
 	return 0;
 }
 
-- 
1.7.10.4



* [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
  2014-03-27 18:17 ` [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show() Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:10   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 03/33] libceph: dump pg_temp mappings to debugfs Ilya Dryomov
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

To save screen space in anticipation of more fields (e.g. primary
affinity).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/debugfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index d225842c7b41..112d98edb156 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -77,7 +77,7 @@ static int osdmap_show(struct seq_file *s, void *p)
 		int state = map->osd_state[i];
 		char sb[64];
 
-		seq_printf(s, "\tosd%d\t%s\t%3d%%\t(%s)\n",
+		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\n",
 			   i, ceph_pr_addr(&addr->in_addr),
 			   ((map->osd_weight[i]*100) >> 16),
 			   ceph_osdmap_state_str(sb, sizeof(sb), state));
-- 
1.7.10.4



* [PATCH 03/33] libceph: dump pg_temp mappings to debugfs
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
  2014-03-27 18:17 ` [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show() Ilya Dryomov
  2014-03-27 18:17 ` [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:11   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors Ilya Dryomov
                   ` (29 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:

    pg_temp 2.6 [2,3,4]

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/debugfs.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 112d98edb156..c45d235e774e 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -82,6 +82,17 @@ static int osdmap_show(struct seq_file *s, void *p)
 			   ((map->osd_weight[i]*100) >> 16),
 			   ceph_osdmap_state_str(sb, sizeof(sb), state));
 	}
+	for (n = rb_first(&map->pg_temp); n; n = rb_next(n)) {
+		struct ceph_pg_mapping *pg =
+			rb_entry(n, struct ceph_pg_mapping, node);
+
+		seq_printf(s, "pg_temp %llu.%x [", pg->pgid.pool,
+			   pg->pgid.seed);
+		for (i = 0; i < pg->len; i++)
+			seq_printf(s, "%s%d", (i == 0 ? "" : ","),
+				   pg->osds[i]);
+		seq_printf(s, "]\n");
+	}
 
 	return 0;
 }
-- 
1.7.10.4
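
The pg_temp line format described in patch 03 can be illustrated with a
small userspace sketch.  This is not the kernel code: format_pg_temp()
is a hypothetical helper that uses snprintf() in place of seq_printf(),
and plain arguments in place of struct ceph_pg_mapping.  As in the
patch, the pool is printed in decimal and the seed in hex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper: format one pg_temp line (without the
 * trailing newline) into buf.  Returns the number of characters written,
 * or -1 if buf is too small.
 */
int format_pg_temp(char *buf, size_t size, uint64_t pool, uint32_t seed,
		   const int *osds, int len)
{
	size_t off;
	int i, n;

	assert(buf != NULL);

	/* "<pool>.<seed>": pool in decimal, seed in hex, as in the patch */
	n = snprintf(buf, size, "pg_temp %llu.%x [",
		     (unsigned long long)pool, (unsigned int)seed);
	if (n < 0 || (size_t)n >= size)
		return -1;
	off = (size_t)n;

	/* comma-separated osd ids, no separator before the first one */
	for (i = 0; i < len; i++) {
		n = snprintf(buf + off, size - off, "%s%d",
			     i == 0 ? "" : ",", osds[i]);
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}

	n = snprintf(buf + off, size - off, "]");
	if (n < 0 || (size_t)n >= size - off)
		return -1;

	return (int)(off + (size_t)n);
}
```

Calling it with pool 2, seed 6 and osds {2, 3, 4} produces the line
shown in the commit message.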



* [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (2 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 03/33] libceph: dump pg_temp mappings to debugfs Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:15   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 05/33] libceph: split osdmap allocation and decode steps Ilya Dryomov
                   ` (28 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Dump osdmap in hex on both full and incremental decode errors, to make
it easier to match the contents with the error offset.  Also dout() the
map epoch and max_osd value on success.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 9d1aaa24def6..4dd000d128fd 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -690,6 +690,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	u16 version;
 	u32 len, max, i;
 	int err = -EINVAL;
+	u32 epoch = 0;
 	void *start = *p;
 	struct ceph_pg_pool_info *pi;
 
@@ -714,7 +715,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 
 	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), bad);
 	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
-	map->epoch = ceph_decode_32(p);
+	epoch = map->epoch = ceph_decode_32(p);
 	ceph_decode_copy(p, &map->created, sizeof(map->created));
 	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
 
@@ -814,14 +815,18 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 		goto bad;
 	}
 
-	/* ignore the rest of the map */
+	/* ignore the rest */
 	*p = end;
 
-	dout("osdmap_decode done %p %p\n", *p, end);
+	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
 	return map;
 
 bad:
-	dout("osdmap_decode fail err %d\n", err);
+	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
+	       err, epoch, (int)(*p - start), *p, start, end);
+	print_hex_dump(KERN_DEBUG, "osdmap: ",
+		       DUMP_PREFIX_OFFSET, 16, 1,
+		       start, end - start, true);
 	ceph_osdmap_destroy(map);
 	return ERR_PTR(err);
 }
@@ -845,6 +850,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	int err = -EINVAL;
 	u16 version;
 
+	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
+
 	ceph_decode_16_safe(p, end, version, bad);
 	if (version != 6) {
 		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
@@ -1032,11 +1039,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 
 	/* ignore the rest */
 	*p = end;
+
+	dout("inc osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
 	return map;
 
 bad:
-	pr_err("corrupt inc osdmap epoch %d off %d (%p of %p-%p)\n",
-	       epoch, (int)(*p - start), *p, start, end);
+	pr_err("corrupt inc osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
+	       err, epoch, (int)(*p - start), *p, start, end);
 	print_hex_dump(KERN_DEBUG, "osdmap: ",
 		       DUMP_PREFIX_OFFSET, 16, 1,
 		       start, end - start, true);
-- 
1.7.10.4



* [PATCH 05/33] libceph: split osdmap allocation and decode steps
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (3 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:18   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 06/33] libceph: fixup error handling in osdmap_decode() Ilya Dryomov
                   ` (27 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |    2 +-
 net/ceph/osd_client.c       |    2 +-
 net/ceph/osdmap.c           |   44 ++++++++++++++++++++++++++++---------------
 3 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 8c8b3cefc28b..46c3e304c3d8 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -156,7 +156,7 @@ static inline int ceph_decode_pgid(void **p, void *end, struct ceph_pg *pgid)
 	return 0;
 }
 
-extern struct ceph_osdmap *osdmap_decode(void **p, void *end);
+extern struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end);
 extern struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 					    struct ceph_osdmap *map,
 					    struct ceph_messenger *msgr);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 71830d79b0f4..6f64eec18851 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2062,7 +2062,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 			int skipped_map = 0;
 
 			dout("taking full map %u len %d\n", epoch, maplen);
-			newmap = osdmap_decode(&p, p+maplen);
+			newmap = ceph_osdmap_decode(&p, p+maplen);
 			if (IS_ERR(newmap)) {
 				err = PTR_ERR(newmap);
 				goto bad;
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 4dd000d128fd..a82df6ea0749 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -684,9 +684,8 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 /*
  * decode a full map.
  */
-struct ceph_osdmap *osdmap_decode(void **p, void *end)
+static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 {
-	struct ceph_osdmap *map;
 	u16 version;
 	u32 len, max, i;
 	int err = -EINVAL;
@@ -694,14 +693,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	void *start = *p;
 	struct ceph_pg_pool_info *pi;
 
-	dout("osdmap_decode %p to %p len %d\n", *p, end, (int)(end - *p));
-
-	map = kzalloc(sizeof(*map), GFP_NOFS);
-	if (map == NULL)
-		return ERR_PTR(-ENOMEM);
-
-	map->pg_temp = RB_ROOT;
-	mutex_init(&map->crush_scratch_mutex);
+	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
 	ceph_decode_16_safe(p, end, version, bad);
 	if (version > 6) {
@@ -751,7 +743,6 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	err = osdmap_set_max_osd(map, max);
 	if (err < 0)
 		goto bad;
-	dout("osdmap_decode max_osd = %d\n", map->max_osd);
 
 	/* osds */
 	err = -EINVAL;
@@ -819,7 +810,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	*p = end;
 
 	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
-	return map;
+	return 0;
 
 bad:
 	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
@@ -827,8 +818,31 @@ bad:
 	print_hex_dump(KERN_DEBUG, "osdmap: ",
 		       DUMP_PREFIX_OFFSET, 16, 1,
 		       start, end - start, true);
-	ceph_osdmap_destroy(map);
-	return ERR_PTR(err);
+	return err;
+}
+
+/*
+ * Allocate and decode a full map.
+ */
+struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end)
+{
+	struct ceph_osdmap *map;
+	int ret;
+
+	map = kzalloc(sizeof(*map), GFP_NOFS);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	map->pg_temp = RB_ROOT;
+	mutex_init(&map->crush_scratch_mutex);
+
+	ret = osdmap_decode(p, end, map);
+	if (ret) {
+		ceph_osdmap_destroy(map);
+		return ERR_PTR(ret);
+	}
+
+	return map;
 }
 
 /*
@@ -872,7 +886,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	if (len > 0) {
 		dout("apply_incremental full map len %d, %p to %p\n",
 		     len, *p, end);
-		return osdmap_decode(p, min(*p+len, end));
+		return ceph_osdmap_decode(p, min(*p+len, end));
 	}
 
 	/* new crush? */
-- 
1.7.10.4
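
The allocate/decode split from patch 05 can be sketched in userspace as
follows.  Everything here is an assumption made for illustration:
struct toy_map is a two-field stand-in for struct ceph_osdmap,
calloc()/free() replace kzalloc()/ceph_osdmap_destroy(), and the error
is returned through an out-parameter because userspace has no
ERR_PTR().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct ceph_osdmap. */
struct toy_map {
	uint32_t epoch;
	uint32_t max_osd;
};

static uint32_t read_le32(const unsigned char *b)
{
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Decode step only: fills in a map that the caller already allocated. */
int toy_map_decode(const unsigned char **p, const unsigned char *end,
		   struct toy_map *map)
{
	assert(*p <= end);

	if ((size_t)(end - *p) < 2 * sizeof(uint32_t))
		return -EINVAL;

	map->epoch = read_le32(*p);
	map->max_osd = read_le32(*p + 4);
	*p += 8;
	return 0;
}

/*
 * Allocation wrapper: owns the map until decode succeeds, so the decode
 * step never has to free anything on its error paths.
 */
struct toy_map *toy_map_alloc_decode(const unsigned char **p,
				     const unsigned char *end, int *err)
{
	struct toy_map *map = calloc(1, sizeof(*map));

	if (!map) {
		*err = -ENOMEM;
		return NULL;
	}

	*err = toy_map_decode(p, end, map);
	if (*err) {
		free(map);
		return NULL;
	}

	return map;
}
```

With this shape the decode function returns a plain int and cleanup
lives in exactly one place, which is what lets osdmap_decode() drop its
ceph_osdmap_destroy() call in the patch.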



* [PATCH 06/33] libceph: fixup error handling in osdmap_decode()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (4 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 05/33] libceph: split osdmap allocation and decode steps Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:25   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 07/33] libceph: safely decode max_osd value " Ilya Dryomov
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Fix this by adding a special e_inval label to
be used by all ceph_decode_* macros.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   53 +++++++++++++++++++++++++++--------------------------
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index a82df6ea0749..298d076eee89 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -688,36 +688,37 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 {
 	u16 version;
 	u32 len, max, i;
-	int err = -EINVAL;
 	u32 epoch = 0;
 	void *start = *p;
+	int err;
 	struct ceph_pg_pool_info *pi;
 
 	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
-	ceph_decode_16_safe(p, end, version, bad);
+	ceph_decode_16_safe(p, end, version, e_inval);
 	if (version > 6) {
 		pr_warning("got unknown v %d > 6 of osdmap\n", version);
-		goto bad;
+		goto e_inval;
 	}
 	if (version < 6) {
 		pr_warning("got old v %d < 6 of osdmap\n", version);
-		goto bad;
+		goto e_inval;
 	}
 
-	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), bad);
+	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), e_inval);
 	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
 	epoch = map->epoch = ceph_decode_32(p);
 	ceph_decode_copy(p, &map->created, sizeof(map->created));
 	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
 
-	ceph_decode_32_safe(p, end, max, bad);
+	ceph_decode_32_safe(p, end, max, e_inval);
 	while (max--) {
-		ceph_decode_need(p, end, 8 + 2, bad);
-		err = -ENOMEM;
+		ceph_decode_need(p, end, 8 + 2, e_inval);
 		pi = kzalloc(sizeof(*pi), GFP_NOFS);
-		if (!pi)
+		if (!pi) {
+			err = -ENOMEM;
 			goto bad;
+		}
 		pi->id = ceph_decode_64(p);
 		err = __decode_pool(p, end, pi);
 		if (err < 0) {
@@ -728,27 +729,25 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	}
 
 	err = __decode_pool_names(p, end, map);
-	if (err < 0) {
-		dout("fail to decode pool names");
+	if (err)
 		goto bad;
-	}
 
-	ceph_decode_32_safe(p, end, map->pool_max, bad);
+	ceph_decode_32_safe(p, end, map->pool_max, e_inval);
 
-	ceph_decode_32_safe(p, end, map->flags, bad);
+	ceph_decode_32_safe(p, end, map->flags, e_inval);
 
 	max = ceph_decode_32(p);
 
 	/* (re)alloc osd arrays */
 	err = osdmap_set_max_osd(map, max);
-	if (err < 0)
+	if (err)
 		goto bad;
 
 	/* osds */
-	err = -EINVAL;
 	ceph_decode_need(p, end, 3*sizeof(u32) +
 			 map->max_osd*(1 + sizeof(*map->osd_weight) +
-				       sizeof(*map->osd_addr)), bad);
+				       sizeof(*map->osd_addr)), e_inval);
+
 	*p += 4; /* skip length field (should match max) */
 	ceph_decode_copy(p, map->osd_state, map->max_osd);
 
@@ -762,7 +761,7 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		ceph_decode_addr(&map->osd_addr[i]);
 
 	/* pg_temp */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	for (i = 0; i < len; i++) {
 		int n, j;
 		struct ceph_pg pgid;
@@ -771,16 +770,16 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		err = ceph_decode_pgid(p, end, &pgid);
 		if (err)
 			goto bad;
-		ceph_decode_need(p, end, sizeof(u32), bad);
+		ceph_decode_need(p, end, sizeof(u32), e_inval);
 		n = ceph_decode_32(p);
-		err = -EINVAL;
 		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
-			goto bad;
-		ceph_decode_need(p, end, n * sizeof(u32), bad);
-		err = -ENOMEM;
+			goto e_inval;
+		ceph_decode_need(p, end, n * sizeof(u32), e_inval);
 		pg = kmalloc(sizeof(*pg) + n*sizeof(u32), GFP_NOFS);
-		if (!pg)
+		if (!pg) {
+			err = -ENOMEM;
 			goto bad;
+		}
 		pg->pgid = pgid;
 		pg->len = n;
 		for (j = 0; j < n; j++)
@@ -794,10 +793,10 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	}
 
 	/* crush */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	dout("osdmap_decode crush len %d from off 0x%x\n", len,
 	     (int)(*p - start));
-	ceph_decode_need(p, end, len, bad);
+	ceph_decode_need(p, end, len, e_inval);
 	map->crush = crush_decode(*p, end);
 	*p += len;
 	if (IS_ERR(map->crush)) {
@@ -812,6 +811,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
 	return 0;
 
+e_inval:
+	err = -EINVAL;
 bad:
 	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
 	       err, epoch, (int)(*p - start), *p, start, end);
-- 
1.7.10.4
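
The e_inval scheme from patch 06 can be sketched in userspace as
follows.  The names are hypothetical: decode_32_safe() is a simplified
stand-in for ceph_decode_32_safe(), and decode_u32_array() is a toy
decoder for "count, then count little-endian u32 values".  The point is
only the label layout: decode failures jump to e_inval, which sets
-EINVAL and falls through to bad, while other failures set err
themselves and jump to bad directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static uint32_t read_le32(const unsigned char *b)
{
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/*
 * Hypothetical stand-in for ceph_decode_32_safe(): check that 4 bytes
 * remain, read a little-endian u32, advance the cursor.  Jumps to
 * `label` on underrun; assumes *p <= end.
 */
#define decode_32_safe(p, end, v, label)				\
	do {								\
		if ((size_t)((end) - *(p)) < sizeof(uint32_t))		\
			goto label;					\
		(v) = read_le32(*(p));					\
		*(p) += sizeof(uint32_t);				\
	} while (0)

/* Returns 0 on success or a negative errno; *out is malloc()ed. */
int decode_u32_array(const unsigned char **p, const unsigned char *end,
		     uint32_t **out, uint32_t *out_len)
{
	uint32_t *vals = NULL;
	uint32_t len, i;
	int err;	/* no initializer: every path to "bad" sets it */

	assert(*p <= end);

	decode_32_safe(p, end, len, e_inval);
	if (len > (size_t)(end - *p) / sizeof(uint32_t))
		goto e_inval;

	vals = malloc((len ? len : 1) * sizeof(*vals));
	if (!vals) {
		err = -ENOMEM;	/* non-decode error: set err explicitly */
		goto bad;
	}

	for (i = 0; i < len; i++)
		decode_32_safe(p, end, vals[i], e_inval);

	*out = vals;
	*out_len = len;
	return 0;

e_inval:
	err = -EINVAL;
bad:
	free(vals);
	return err;
}
```

Because no decode macro depends on the current value of err, a missing
"err = -EINVAL" reset can no longer turn a decode failure into a 0
return.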



* [PATCH 07/33] libceph: safely decode max_osd value in osdmap_decode()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (5 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 06/33] libceph: fixup error handling in osdmap_decode() Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:27   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 08/33] libceph: assert length of osdmap osd arrays Ilya Dryomov
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

The max_osd value is not covered by any ceph_decode_need().  Use the
safe version of the ceph_decode_* macro to decode it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 298d076eee89..ec06010657b3 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -687,9 +687,10 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 {
 	u16 version;
-	u32 len, max, i;
 	u32 epoch = 0;
 	void *start = *p;
+	u32 max;
+	u32 len, i;
 	int err;
 	struct ceph_pg_pool_info *pi;
 
@@ -736,7 +737,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 
 	ceph_decode_32_safe(p, end, map->flags, e_inval);
 
-	max = ceph_decode_32(p);
+	/* max_osd */
+	ceph_decode_32_safe(p, end, max, e_inval);
 
 	/* (re)alloc osd arrays */
 	err = osdmap_set_max_osd(map, max);
-- 
1.7.10.4



* [PATCH 08/33] libceph: assert length of osdmap osd arrays
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (6 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 07/33] libceph: safely decode max_osd value " Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:30   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode() Ilya Dryomov
                   ` (24 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Assert length of osd_state, osd_weight and osd_addr arrays.  They
should all have exactly max_osd elements after the call to
osdmap_set_max_osd().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index ec06010657b3..19aca4d3c5dd 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -745,19 +745,19 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	if (err)
 		goto bad;
 
-	/* osds */
+	/* osd_state, osd_weight, osd_addrs->client_addr */
 	ceph_decode_need(p, end, 3*sizeof(u32) +
 			 map->max_osd*(1 + sizeof(*map->osd_weight) +
 				       sizeof(*map->osd_addr)), e_inval);
 
-	*p += 4; /* skip length field (should match max) */
+	BUG_ON(ceph_decode_32(p) != map->max_osd);
 	ceph_decode_copy(p, map->osd_state, map->max_osd);
 
-	*p += 4; /* skip length field (should match max) */
+	BUG_ON(ceph_decode_32(p) != map->max_osd);
 	for (i = 0; i < map->max_osd; i++)
 		map->osd_weight[i] = ceph_decode_32(p);
 
-	*p += 4; /* skip length field (should match max) */
+	BUG_ON(ceph_decode_32(p) != map->max_osd);
 	ceph_decode_copy(p, map->osd_addr, map->max_osd*sizeof(*map->osd_addr));
 	for (i = 0; i < map->max_osd; i++)
 		ceph_decode_addr(&map->osd_addr[i]);
-- 
1.7.10.4



* [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (7 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 08/33] libceph: assert length of osdmap osd arrays Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:45   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental() Ilya Dryomov
                   ` (23 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

The size of the memory area fed to crush_decode() should be limited
not only by osdmap end, but also by the crush map length.  Also, drop
unnecessary dout() (dout() in crush_decode() conveys the same info) and
step past crush map only if it is decoded successfully.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 19aca4d3c5dd..b70357adbdc0 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -796,16 +796,13 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 
 	/* crush */
 	ceph_decode_32_safe(p, end, len, e_inval);
-	dout("osdmap_decode crush len %d from off 0x%x\n", len,
-	     (int)(*p - start));
-	ceph_decode_need(p, end, len, e_inval);
-	map->crush = crush_decode(*p, end);
-	*p += len;
+	map->crush = crush_decode(*p, min(*p + len, end));
 	if (IS_ERR(map->crush)) {
 		err = PTR_ERR(map->crush);
 		map->crush = NULL;
 		goto bad;
 	}
+	*p += len;
 
 	/* ignore the rest */
 	*p = end;
-- 
1.7.10.4
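
The bounding done in patch 09, min(*p + len, end), can be sketched in
userspace as follows.  The names are hypothetical: clamp_sub_end()
computes the end of the window handed to a nested decoder, and
toy_nested_decode() stands in for crush_decode() by reporting how many
bytes it would be allowed to read.  Unlike the kernel expression, the
sketch compares lengths rather than pointers, so it never forms a
pointer past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the end of a sub-buffer that starts at p and is limited both by
 * the declared length `len` and by the end of the enclosing buffer.
 */
const unsigned char *clamp_sub_end(const unsigned char *p,
				   const unsigned char *end, uint32_t len)
{
	size_t avail;

	assert(p <= end);
	avail = (size_t)(end - p);

	return p + (len < avail ? (size_t)len : avail);
}

/*
 * Hypothetical nested decoder standing in for crush_decode(): reports
 * how many bytes it is allowed to consume from [p, end).
 */
size_t toy_nested_decode(const unsigned char *p, const unsigned char *end)
{
	return (size_t)(end - p);
}
```

A declared length shorter than the remaining buffer narrows the window;
a declared length that overruns the buffer is clamped to the enclosing
end, so the nested decoder cannot read past either limit.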



* [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (8 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode() Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:49   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 11/33] libceph: nuke bogus encoding version check " Ilya Dryomov
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

The existing error handling scheme requires resetting err to -EINVAL
prior to calling any ceph_decode_* macro.  This is ugly and fragile,
and there already are a few places where we would return 0 on error,
due to a missing reset.  Follow osdmap_decode() and fix this by adding
a special e_inval label to be used by all ceph_decode_* macros.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   66 +++++++++++++++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index b70357adbdc0..0fc29a930c06 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -861,19 +861,19 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	__s64 new_pool_max;
 	__s32 new_flags, max;
 	void *start = *p;
-	int err = -EINVAL;
+	int err;
 	u16 version;
 
 	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
-	ceph_decode_16_safe(p, end, version, bad);
+	ceph_decode_16_safe(p, end, version, e_inval);
 	if (version != 6) {
 		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
-		goto bad;
+		goto e_inval;
 	}
 
 	ceph_decode_need(p, end, sizeof(fsid)+sizeof(modified)+2*sizeof(u32),
-			 bad);
+			 e_inval);
 	ceph_decode_copy(p, &fsid, sizeof(fsid));
 	epoch = ceph_decode_32(p);
 	BUG_ON(epoch != map->epoch+1);
@@ -882,7 +882,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	new_flags = ceph_decode_32(p);
 
 	/* full map? */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	if (len > 0) {
 		dout("apply_incremental full map len %d, %p to %p\n",
 		     len, *p, end);
@@ -890,13 +890,14 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new crush? */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	if (len > 0) {
-		dout("apply_incremental new crush map len %d, %p to %p\n",
-		     len, *p, end);
 		newcrush = crush_decode(*p, min(*p+len, end));
-		if (IS_ERR(newcrush))
-			return ERR_CAST(newcrush);
+		if (IS_ERR(newcrush)) {
+			err = PTR_ERR(newcrush);
+			newcrush = NULL;
+			goto bad;
+		}
 		*p += len;
 	}
 
@@ -906,13 +907,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	if (new_pool_max >= 0)
 		map->pool_max = new_pool_max;
 
-	ceph_decode_need(p, end, 5*sizeof(u32), bad);
+	ceph_decode_need(p, end, 5*sizeof(u32), e_inval);
 
 	/* new max? */
 	max = ceph_decode_32(p);
 	if (max >= 0) {
 		err = osdmap_set_max_osd(map, max);
-		if (err < 0)
+		if (err)
 			goto bad;
 	}
 
@@ -926,11 +927,11 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new_pool */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		struct ceph_pg_pool_info *pi;
 
-		ceph_decode_64_safe(p, end, pool, bad);
+		ceph_decode_64_safe(p, end, pool, e_inval);
 		pi = __lookup_pg_pool(&map->pg_pools, pool);
 		if (!pi) {
 			pi = kzalloc(sizeof(*pi), GFP_NOFS);
@@ -947,29 +948,28 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 	if (version >= 5) {
 		err = __decode_pool_names(p, end, map);
-		if (err < 0)
+		if (err)
 			goto bad;
 	}
 
 	/* old_pool */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		struct ceph_pg_pool_info *pi;
 
-		ceph_decode_64_safe(p, end, pool, bad);
+		ceph_decode_64_safe(p, end, pool, e_inval);
 		pi = __lookup_pg_pool(&map->pg_pools, pool);
 		if (pi)
 			__remove_pg_pool(&map->pg_pools, pi);
 	}
 
 	/* new_up */
-	err = -EINVAL;
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		u32 osd;
 		struct ceph_entity_addr addr;
-		ceph_decode_32_safe(p, end, osd, bad);
-		ceph_decode_copy_safe(p, end, &addr, sizeof(addr), bad);
+		ceph_decode_32_safe(p, end, osd, e_inval);
+		ceph_decode_copy_safe(p, end, &addr, sizeof(addr), e_inval);
 		ceph_decode_addr(&addr);
 		pr_info("osd%d up\n", osd);
 		BUG_ON(osd >= map->max_osd);
@@ -978,11 +978,11 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new_state */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		u32 osd;
 		u8 xorstate;
-		ceph_decode_32_safe(p, end, osd, bad);
+		ceph_decode_32_safe(p, end, osd, e_inval);
 		xorstate = **(u8 **)p;
 		(*p)++;  /* clean flag */
 		if (xorstate == 0)
@@ -994,10 +994,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new_weight */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		u32 osd, off;
-		ceph_decode_need(p, end, sizeof(u32)*2, bad);
+		ceph_decode_need(p, end, sizeof(u32)*2, e_inval);
 		osd = ceph_decode_32(p);
 		off = ceph_decode_32(p);
 		pr_info("osd%d weight 0x%x %s\n", osd, off,
@@ -1008,7 +1008,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new_pg_temp */
-	ceph_decode_32_safe(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, e_inval);
 	while (len--) {
 		struct ceph_pg_mapping *pg;
 		int j;
@@ -1018,22 +1018,22 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 		err = ceph_decode_pgid(p, end, &pgid);
 		if (err)
 			goto bad;
-		ceph_decode_need(p, end, sizeof(u32), bad);
+		ceph_decode_need(p, end, sizeof(u32), e_inval);
 		pglen = ceph_decode_32(p);
 		if (pglen) {
-			ceph_decode_need(p, end, pglen*sizeof(u32), bad);
+			ceph_decode_need(p, end, pglen*sizeof(u32), e_inval);
 
 			/* removing existing (if any) */
 			(void) __remove_pg_mapping(&map->pg_temp, pgid);
 
 			/* insert */
-			err = -EINVAL;
 			if (pglen > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
-				goto bad;
-			err = -ENOMEM;
+				goto e_inval;
 			pg = kmalloc(sizeof(*pg) + sizeof(u32)*pglen, GFP_NOFS);
-			if (!pg)
+			if (!pg) {
+				err = -ENOMEM;
 				goto bad;
+			}
 			pg->pgid = pgid;
 			pg->len = pglen;
 			for (j = 0; j < pglen; j++)
@@ -1057,6 +1057,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	dout("inc osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
 	return map;
 
+e_inval:
+	err = -EINVAL;
 bad:
 	pr_err("corrupt inc osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
 	       err, epoch, (int)(*p - start), *p, start, end);
-- 
1.7.10.4



* [PATCH 11/33] libceph: nuke bogus encoding version check in osdmap_apply_incremental()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (9 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental() Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:50   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes Ilya Dryomov
                   ` (21 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Only version 6 of the osdmap encoding is supported; any other version
results in an error and halts the decoding process.  Checking whether
version is >= 5 is therefore bogus.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 0fc29a930c06..75e192e99173 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -946,11 +946,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 		if (err < 0)
 			goto bad;
 	}
-	if (version >= 5) {
-		err = __decode_pool_names(p, end, map);
-		if (err)
-			goto bad;
-	}
+
+	err = __decode_pool_names(p, end, map);
+	if (err)
+		goto bad;
 
 	/* old_pool */
 	ceph_decode_32_safe(p, end, len, e_inval);
-- 
1.7.10.4



* [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (10 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 11/33] libceph: nuke bogus encoding version check " Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:53   ` Alex Elder
  2014-03-27 18:17 ` [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() Ilya Dryomov
                   ` (20 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

Sum up sizeof(...) results instead of (incorrectly) hard-coding the
number of bytes, expressed in ints and longs.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
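Illustration only, not part of the patch: a minimal userspace sketch of
the "sum up sizeof(...)" idea from the commit message above.  The
demo_* names and the field sizes (16-byte fsid, two 32-bit words per
timestamp) are assumptions made for the example.  With those sizes the
sum comes to 36 bytes, whereas the removed
2*sizeof(u64)+6*sizeof(u32) expression evaluates to 40.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the on-wire types; the sizes are assumptions. */
struct demo_fsid { uint8_t b[16]; };
struct demo_timespec { uint32_t sec, nsec; };

struct demo_hdr {
	struct demo_fsid fsid;
	uint32_t epoch;
	struct demo_timespec created;
	struct demo_timespec modified;
};

/* Bytes needed for fsid, epoch, created, modified. */
static size_t demo_hdr_need(const struct demo_hdr *h)
{
	return sizeof(h->fsid) + sizeof(uint32_t) +
	       sizeof(h->created) + sizeof(h->modified);
}

/* Copy n bytes out of the buffer and advance the cursor. */
static void demo_copy(const uint8_t **p, void *dst, size_t n)
{
	memcpy(dst, *p, n);
	*p += n;
}

/* Return 0 on success, -1 if fewer than demo_hdr_need() bytes remain. */
static int demo_hdr_decode(const uint8_t **p, const uint8_t *end,
			   struct demo_hdr *h)
{
	if ((size_t)(end - *p) < demo_hdr_need(h))
		return -1;

	demo_copy(p, &h->fsid, sizeof(h->fsid));
	demo_copy(p, &h->epoch, sizeof(h->epoch)); /* host byte order, for brevity */
	demo_copy(p, &h->created, sizeof(h->created));
	demo_copy(p, &h->modified, sizeof(h->modified));
	return 0;
}
```

Tying the size check to the same sizeof() expressions used by the copies
keeps the two from drifting apart when a field changes.
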
 net/ceph/osdmap.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 75e192e99173..6dd083906a1e 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -706,7 +706,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		goto e_inval;
 	}
 
-	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), e_inval);
+	/* fsid, epoch, created, modified */
+	ceph_decode_need(p, end, sizeof(map->fsid) + sizeof(u32) +
+			 sizeof(map->created) + sizeof(map->modified), e_inval);
 	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
 	epoch = map->epoch = ceph_decode_32(p);
 	ceph_decode_copy(p, &map->created, sizeof(map->created));
@@ -872,8 +874,9 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 		goto e_inval;
 	}
 
-	ceph_decode_need(p, end, sizeof(fsid)+sizeof(modified)+2*sizeof(u32),
-			 e_inval);
+	/* fsid, epoch, modified, new_pool_max, new_flags */
+	ceph_decode_need(p, end, sizeof(fsid) + sizeof(u32) + sizeof(modified) +
+			 sizeof(u64) + sizeof(u32), e_inval);
 	ceph_decode_copy(p, &fsid, sizeof(fsid));
 	epoch = ceph_decode_32(p);
 	BUG_ON(epoch != map->epoch+1);
@@ -907,10 +910,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	if (new_pool_max >= 0)
 		map->pool_max = new_pool_max;
 
-	ceph_decode_need(p, end, 5*sizeof(u32), e_inval);
-
 	/* new max? */
-	max = ceph_decode_32(p);
+	ceph_decode_32_safe(p, end, max, e_inval);
 	if (max >= 0) {
 		err = osdmap_set_max_osd(map, max);
 		if (err)
-- 
1.7.10.4



* [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (11 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes Ilya Dryomov
@ 2014-03-27 18:17 ` Ilya Dryomov
  2014-03-27 19:54   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them Ilya Dryomov
                   ` (19 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:17 UTC (permalink / raw)
  To: ceph-devel

To be in line with all the other osdmap decode helpers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 6dd083906a1e..cd8f34abe7b7 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -506,7 +506,7 @@ static void __remove_pg_pool(struct rb_root *root, struct ceph_pg_pool_info *pi)
 	kfree(pi);
 }
 
-static int __decode_pool(void **p, void *end, struct ceph_pg_pool_info *pi)
+static int decode_pool(void **p, void *end, struct ceph_pg_pool_info *pi)
 {
 	u8 ev, cv;
 	unsigned len, num;
@@ -587,7 +587,7 @@ bad:
 	return -EINVAL;
 }
 
-static int __decode_pool_names(void **p, void *end, struct ceph_osdmap *map)
+static int decode_pool_names(void **p, void *end, struct ceph_osdmap *map)
 {
 	struct ceph_pg_pool_info *pi;
 	u32 num, len;
@@ -723,7 +723,7 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 			goto bad;
 		}
 		pi->id = ceph_decode_64(p);
-		err = __decode_pool(p, end, pi);
+		err = decode_pool(p, end, pi);
 		if (err < 0) {
 			kfree(pi);
 			goto bad;
@@ -731,7 +731,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		__insert_pg_pool(&map->pg_pools, pi);
 	}
 
-	err = __decode_pool_names(p, end, map);
+	/* pool_name */
+	err = decode_pool_names(p, end, map);
 	if (err)
 		goto bad;
 
@@ -943,12 +944,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 			pi->id = pool;
 			__insert_pg_pool(&map->pg_pools, pi);
 		}
-		err = __decode_pool(p, end, pi);
+		err = decode_pool(p, end, pi);
 		if (err < 0)
 			goto bad;
 	}
 
-	err = __decode_pool_names(p, end, map);
+	/* new_pool_names */
+	err = decode_pool_names(p, end, map);
 	if (err)
 		goto bad;
 
-- 
1.7.10.4



* [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (12 preceding siblings ...)
  2014-03-27 18:17 ` [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 19:56   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc() Ilya Dryomov
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   94 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 57 insertions(+), 37 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index cd8f34abe7b7..d6a569c5508f 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -681,6 +681,55 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 	return 0;
 }
 
+static int __decode_pools(void **p, void *end, struct ceph_osdmap *map,
+			  bool incremental)
+{
+	u32 n;
+
+	ceph_decode_32_safe(p, end, n, e_inval);
+	while (n--) {
+		struct ceph_pg_pool_info *pi;
+		u64 pool;
+		int ret;
+
+		ceph_decode_64_safe(p, end, pool, e_inval);
+
+		pi = __lookup_pg_pool(&map->pg_pools, pool);
+		if (!incremental || !pi) {
+			pi = kzalloc(sizeof(*pi), GFP_NOFS);
+			if (!pi)
+				return -ENOMEM;
+
+			pi->id = pool;
+
+			ret = __insert_pg_pool(&map->pg_pools, pi);
+			if (ret) {
+				kfree(pi);
+				return ret;
+			}
+		}
+
+		ret = decode_pool(p, end, pi);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
+static int decode_pools(void **p, void *end, struct ceph_osdmap *map)
+{
+	return __decode_pools(p, end, map, false);
+}
+
+static int decode_new_pools(void **p, void *end, struct ceph_osdmap *map)
+{
+	return __decode_pools(p, end, map, true);
+}
+
 /*
  * decode a full map.
  */
@@ -692,7 +741,6 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	u32 max;
 	u32 len, i;
 	int err;
-	struct ceph_pg_pool_info *pi;
 
 	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
@@ -714,22 +762,10 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	ceph_decode_copy(p, &map->created, sizeof(map->created));
 	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
 
-	ceph_decode_32_safe(p, end, max, e_inval);
-	while (max--) {
-		ceph_decode_need(p, end, 8 + 2, e_inval);
-		pi = kzalloc(sizeof(*pi), GFP_NOFS);
-		if (!pi) {
-			err = -ENOMEM;
-			goto bad;
-		}
-		pi->id = ceph_decode_64(p);
-		err = decode_pool(p, end, pi);
-		if (err < 0) {
-			kfree(pi);
-			goto bad;
-		}
-		__insert_pg_pool(&map->pg_pools, pi);
-	}
+	/* pools */
+	err = decode_pools(p, end, map);
+	if (err)
+		goto bad;
 
 	/* pool_name */
 	err = decode_pool_names(p, end, map);
@@ -928,26 +964,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 		newcrush = NULL;
 	}
 
-	/* new_pool */
-	ceph_decode_32_safe(p, end, len, e_inval);
-	while (len--) {
-		struct ceph_pg_pool_info *pi;
-
-		ceph_decode_64_safe(p, end, pool, e_inval);
-		pi = __lookup_pg_pool(&map->pg_pools, pool);
-		if (!pi) {
-			pi = kzalloc(sizeof(*pi), GFP_NOFS);
-			if (!pi) {
-				err = -ENOMEM;
-				goto bad;
-			}
-			pi->id = pool;
-			__insert_pg_pool(&map->pg_pools, pi);
-		}
-		err = decode_pool(p, end, pi);
-		if (err < 0)
-			goto bad;
-	}
+	/* new_pools */
+	err = decode_new_pools(p, end, map);
+	if (err)
+		goto bad;
 
 	/* new_pool_names */
 	err = decode_pool_names(p, end, map);
-- 
1.7.10.4



* [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (13 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 19:59   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them Ilya Dryomov
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Use krealloc() instead of rolling our own.  (krealloc() with a NULL
first argument acts as a kmalloc()).  Properly initialize the new array
elements.  This is needed to make future additions to osdmap easier.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
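Illustration only, not part of the patch: a userspace analogue of the
"grow with realloc, then initialize the new tail" pattern described in
the commit message above, using realloc() in place of krealloc().
struct demo_osds, demo_set_max() and DEMO_OSD_OUT are hypothetical
names; DEMO_OSD_OUT merely stands in for CEPH_OSD_OUT and its value is
an assumption.  Unlike the patch, the sketch stores each realloc()
result back immediately, so a later allocation failure leaves the
struct pointing at valid memory.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_OSD_OUT 0u	/* stand-in for CEPH_OSD_OUT; value assumed */

struct demo_osds {
	uint8_t *state;
	uint32_t *weight;
	int max;
};

/* Resize both arrays to max entries; entries past the old size are reset. */
static int demo_set_max(struct demo_osds *o, int max)
{
	uint8_t *state;
	uint32_t *weight;
	int i;

	/* realloc(NULL, n) behaves like malloc(n), so first use needs no special case */
	state = realloc(o->state, (size_t)max * sizeof(*state));
	if (!state)
		return -1;
	o->state = state;

	weight = realloc(o->weight, (size_t)max * sizeof(*weight));
	if (!weight)
		return -1;
	o->weight = weight;

	/* realloc() leaves the new tail uninitialized; fill it explicitly */
	for (i = o->max; i < max; i++) {
		state[i] = 0;
		weight[i] = DEMO_OSD_OUT;
	}

	o->max = max;
	return 0;
}
```

Existing entries survive the resize; only indices from the old max up to
the new one are touched by the loop.
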
 net/ceph/osdmap.c |   32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index d6a569c5508f..4565c72fec5c 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -646,38 +646,40 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
 }
 
 /*
- * adjust max osd value.  reallocate arrays.
+ * Adjust max_osd value, (re)allocate arrays.
+ *
+ * The new elements are properly initialized.
  */
 static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 {
 	u8 *state;
-	struct ceph_entity_addr *addr;
 	u32 *weight;
+	struct ceph_entity_addr *addr;
+	int i;
 
-	state = kcalloc(max, sizeof(*state), GFP_NOFS);
-	addr = kcalloc(max, sizeof(*addr), GFP_NOFS);
-	weight = kcalloc(max, sizeof(*weight), GFP_NOFS);
-	if (state == NULL || addr == NULL || weight == NULL) {
+	state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
+	weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
+	addr = krealloc(map->osd_addr, max*sizeof(*addr), GFP_NOFS);
+	if (!state || !weight || !addr) {
 		kfree(state);
-		kfree(addr);
 		kfree(weight);
+		kfree(addr);
+
 		return -ENOMEM;
 	}
 
-	/* copy old? */
-	if (map->osd_state) {
-		memcpy(state, map->osd_state, map->max_osd*sizeof(*state));
-		memcpy(addr, map->osd_addr, map->max_osd*sizeof(*addr));
-		memcpy(weight, map->osd_weight, map->max_osd*sizeof(*weight));
-		kfree(map->osd_state);
-		kfree(map->osd_addr);
-		kfree(map->osd_weight);
+	for (i = map->max_osd; i < max; i++) {
+		state[i] = 0;
+		weight[i] = CEPH_OSD_OUT;
+		memset(addr + i, 0, sizeof(*addr));
 	}
 
 	map->osd_state = state;
 	map->osd_weight = weight;
 	map->osd_addr = addr;
+
 	map->max_osd = max;
+
 	return 0;
 }
 
-- 
1.7.10.4



* [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (14 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc() Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:05   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 17/33] libceph: introduce get_osdmap_client_data_v() Ilya Dryomov
                   ` (16 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
(inc map, same) decoding logic into a common helper and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
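Illustration only, not part of the patch: a userspace sketch of the
length check that guards the variable-sized pg_temp allocation in the
helper below.  struct demo_mapping and demo_mapping_alloc() are
hypothetical names, and calloc() stands in for kzalloc().  The point is
that the untrusted length is bounded before it is multiplied, so the
size computation cannot wrap.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapping with a trailing variable-length osd list. */
struct demo_mapping {
	uint32_t len;
	uint32_t osds[];
};

/* Allocate a zeroed mapping for len osds, or return NULL. */
static struct demo_mapping *demo_mapping_alloc(uint32_t len)
{
	struct demo_mapping *m;

	/* reject lengths for which sizeof(*m) + len * 4 would exceed UINT_MAX */
	if (len > (UINT_MAX - sizeof(*m)) / sizeof(uint32_t))
		return NULL;

	m = calloc(1, sizeof(*m) + len * sizeof(uint32_t));
	if (m)
		m->len = len;
	return m;
}
```

A caller would fill m->osds[0..len-1] from the wire only after this
returns non-NULL and after checking that len * sizeof(u32) bytes are
actually left in the buffer.
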
 net/ceph/osdmap.c |  139 ++++++++++++++++++++++++++---------------------------
 1 file changed, 67 insertions(+), 72 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 4565c72fec5c..0134df3639d2 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -732,6 +732,67 @@ static int decode_new_pools(void **p, void *end, struct ceph_osdmap *map)
 	return __decode_pools(p, end, map, true);
 }
 
+static int __decode_pg_temp(void **p, void *end, struct ceph_osdmap *map,
+			    bool incremental)
+{
+	u32 n;
+
+	ceph_decode_32_safe(p, end, n, e_inval);
+	while (n--) {
+		struct ceph_pg pgid;
+		u32 len, i;
+		int ret;
+
+		ret = ceph_decode_pgid(p, end, &pgid);
+		if (ret)
+			return ret;
+
+		ceph_decode_32_safe(p, end, len, e_inval);
+
+		ret = __remove_pg_mapping(&map->pg_temp, pgid);
+		BUG_ON(!incremental && ret != -ENOENT);
+
+		if (!incremental || len > 0) {
+			struct ceph_pg_mapping *pg;
+
+			ceph_decode_need(p, end, len*sizeof(u32), e_inval);
+
+			if (len > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
+				return -EINVAL;
+
+			pg = kzalloc(sizeof(*pg) + len*sizeof(u32), GFP_NOFS);
+			if (!pg)
+				return -ENOMEM;
+
+			pg->pgid = pgid;
+			pg->len = len;
+			for (i = 0; i < len; i++)
+				pg->osds[i] = ceph_decode_32(p);
+
+			ret = __insert_pg_mapping(pg, &map->pg_temp);
+			if (ret) {
+				kfree(pg);
+				return ret;
+			}
+		}
+	}
+
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
+static int decode_pg_temp(void **p, void *end, struct ceph_osdmap *map)
+{
+	return __decode_pg_temp(p, end, map, false);
+}
+
+static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
+{
+	return __decode_pg_temp(p, end, map, true);
+}
+
 /*
  * decode a full map.
  */
@@ -804,36 +865,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		ceph_decode_addr(&map->osd_addr[i]);
 
 	/* pg_temp */
-	ceph_decode_32_safe(p, end, len, e_inval);
-	for (i = 0; i < len; i++) {
-		int n, j;
-		struct ceph_pg pgid;
-		struct ceph_pg_mapping *pg;
-
-		err = ceph_decode_pgid(p, end, &pgid);
-		if (err)
-			goto bad;
-		ceph_decode_need(p, end, sizeof(u32), e_inval);
-		n = ceph_decode_32(p);
-		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
-			goto e_inval;
-		ceph_decode_need(p, end, n * sizeof(u32), e_inval);
-		pg = kmalloc(sizeof(*pg) + n*sizeof(u32), GFP_NOFS);
-		if (!pg) {
-			err = -ENOMEM;
-			goto bad;
-		}
-		pg->pgid = pgid;
-		pg->len = n;
-		for (j = 0; j < n; j++)
-			pg->osds[j] = ceph_decode_32(p);
-
-		err = __insert_pg_mapping(pg, &map->pg_temp);
-		if (err)
-			goto bad;
-		dout(" added pg_temp %lld.%x len %d\n", pgid.pool, pgid.seed,
-		     len);
-	}
+	err = decode_pg_temp(p, end, map);
+	if (err)
+		goto bad;
 
 	/* crush */
 	ceph_decode_32_safe(p, end, len, e_inval);
@@ -1032,48 +1066,9 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	}
 
 	/* new_pg_temp */
-	ceph_decode_32_safe(p, end, len, e_inval);
-	while (len--) {
-		struct ceph_pg_mapping *pg;
-		int j;
-		struct ceph_pg pgid;
-		u32 pglen;
-
-		err = ceph_decode_pgid(p, end, &pgid);
-		if (err)
-			goto bad;
-		ceph_decode_need(p, end, sizeof(u32), e_inval);
-		pglen = ceph_decode_32(p);
-		if (pglen) {
-			ceph_decode_need(p, end, pglen*sizeof(u32), e_inval);
-
-			/* removing existing (if any) */
-			(void) __remove_pg_mapping(&map->pg_temp, pgid);
-
-			/* insert */
-			if (pglen > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
-				goto e_inval;
-			pg = kmalloc(sizeof(*pg) + sizeof(u32)*pglen, GFP_NOFS);
-			if (!pg) {
-				err = -ENOMEM;
-				goto bad;
-			}
-			pg->pgid = pgid;
-			pg->len = pglen;
-			for (j = 0; j < pglen; j++)
-				pg->osds[j] = ceph_decode_32(p);
-			err = __insert_pg_mapping(pg, &map->pg_temp);
-			if (err) {
-				kfree(pg);
-				goto bad;
-			}
-			dout(" added pg_temp %lld.%x len %d\n", pgid.pool,
-			     pgid.seed, pglen);
-		} else {
-			/* remove */
-			__remove_pg_mapping(&map->pg_temp, pgid);
-		}
-	}
+	err = decode_new_pg_temp(p, end, map);
+	if (err)
+		goto bad;
 
 	/* ignore the rest */
 	*p = end;
-- 
1.7.10.4



* [PATCH 17/33] libceph: introduce get_osdmap_client_data_v()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (15 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:17   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 18/33] libceph: generalize ceph_pg_mapping Ilya Dryomov
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Full and incremental osdmaps are structured identically and have
identical headers.  Add a helper to decode both "old" (16-bit version,
v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
encoding headers and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
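Illustration only, not part of the patch: a userspace sketch of how the
two header layouts described above can be told apart by peeking at the
first byte.  An old header is a little-endian 16-bit version, so for v6
its first byte is below 7; a new header starts with an 8-bit struct_v
of 7 or more.  All demo_* names are hypothetical.  Unlike the patch,
the sketch also bounds-checks the struct_len fields it skips.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_WRAPPER_COMPAT_VER		7
#define DEMO_CLIENT_DATA_COMPAT_VER	1
#define DEMO_HDR_LEN			6 /* struct_v + struct_compat + u32 struct_len */

/*
 * Return 0 or -1.  On success *v is 0 for an old header, or struct_v of
 * the client data section for a new one, and *p is advanced past the
 * header(s).
 */
static int demo_get_client_data_v(const uint8_t **p, const uint8_t *end,
				  uint8_t *v)
{
	const uint8_t *q = *p;
	uint8_t struct_v;

	if (end - q < 1)
		return -1;
	struct_v = q[0];

	if (struct_v >= 7) {
		/* wrapper header */
		if (end - q < DEMO_HDR_LEN)
			return -1;
		if (q[1] > DEMO_WRAPPER_COMPAT_VER)
			return -1;
		q += DEMO_HDR_LEN;

		/* client data header */
		if (end - q < DEMO_HDR_LEN)
			return -1;
		struct_v = q[0];
		if (q[1] > DEMO_CLIENT_DATA_COMPAT_VER)
			return -1;
		q += DEMO_HDR_LEN;
	} else {
		uint16_t version;

		/* old layout: re-read the first byte as part of a le16 */
		if (end - q < 2)
			return -1;
		version = (uint16_t)(q[0] | (q[1] << 8));
		if (version < 6)
			return -1;
		q += 2;
		struct_v = 0;
	}

	*p = q;
	*v = struct_v;
	return 0;
}
```
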
 net/ceph/osdmap.c |   81 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 16 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 0134df3639d2..ae96c73aff71 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -683,6 +683,63 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 	return 0;
 }
 
+#define OSDMAP_WRAPPER_COMPAT_VER	7
+#define OSDMAP_CLIENT_DATA_COMPAT_VER	1
+
+/*
+ * Return 0 or error.  On success, *v is set to 0 for old (v6) osdmaps,
+ * to struct_v of the client_data section for new (v7 and above)
+ * osdmaps.
+ */
+static int get_osdmap_client_data_v(void **p, void *end,
+				    const char *s, u8 *v)
+{
+	u8 struct_v;
+
+	ceph_decode_8_safe(p, end, struct_v, e_inval);
+	if (struct_v >= 7) {
+		u8 struct_compat;
+
+		ceph_decode_8_safe(p, end, struct_compat, e_inval);
+		if (struct_compat > OSDMAP_WRAPPER_COMPAT_VER) {
+			pr_warning("got v %d cv %d > %d of %s ceph_osdmap\n",
+				   struct_v, struct_compat,
+				   OSDMAP_WRAPPER_COMPAT_VER, s);
+			return -EINVAL;
+		}
+		*p += 4; /* ignore wrapper struct_len */
+
+		ceph_decode_8_safe(p, end, struct_v, e_inval);
+		ceph_decode_8_safe(p, end, struct_compat, e_inval);
+		if (struct_compat > OSDMAP_CLIENT_DATA_COMPAT_VER) {
+			pr_warning("got v %d cv %d > %d of %s ceph_osdmap client data\n",
+				   struct_v, struct_compat,
+				   OSDMAP_CLIENT_DATA_COMPAT_VER, s);
+			return -EINVAL;
+		}
+		*p += 4; /* ignore client data struct_len */
+	} else {
+		u16 version;
+
+		*p -= 1;
+		ceph_decode_16_safe(p, end, version, e_inval);
+		if (version < 6) {
+			pr_warning("got v %d < 6 of %s ceph_osdmap\n", version,
+				   s);
+			return -EINVAL;
+		}
+
+		/* old osdmap encoding */
+		struct_v = 0;
+	}
+
+	*v = struct_v;
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
 static int __decode_pools(void **p, void *end, struct ceph_osdmap *map,
 			  bool incremental)
 {
@@ -798,7 +855,7 @@ static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
  */
 static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 {
-	u16 version;
+	u8 struct_v;
 	u32 epoch = 0;
 	void *start = *p;
 	u32 max;
@@ -807,15 +864,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 
 	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
-	ceph_decode_16_safe(p, end, version, e_inval);
-	if (version > 6) {
-		pr_warning("got unknown v %d > 6 of osdmap\n", version);
-		goto e_inval;
-	}
-	if (version < 6) {
-		pr_warning("got old v %d < 6 of osdmap\n", version);
-		goto e_inval;
-	}
+	err = get_osdmap_client_data_v(p, end, "full", &struct_v);
+	if (err)
+		goto bad;
 
 	/* fsid, epoch, created, modified */
 	ceph_decode_need(p, end, sizeof(map->fsid) + sizeof(u32) +
@@ -937,15 +988,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	__s32 new_flags, max;
 	void *start = *p;
 	int err;
-	u16 version;
+	u8 struct_v;
 
 	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
 
-	ceph_decode_16_safe(p, end, version, e_inval);
-	if (version != 6) {
-		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
-		goto e_inval;
-	}
+	err = get_osdmap_client_data_v(p, end, "inc", &struct_v);
+	if (err)
+		goto bad;
 
 	/* fsid, epoch, modified, new_pool_max, new_flags */
 	ceph_decode_need(p, end, sizeof(fsid) + sizeof(u32) + sizeof(modified) +
-- 
1.7.10.4



* [PATCH 18/33] libceph: generalize ceph_pg_mapping
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (16 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 17/33] libceph: introduce get_osdmap_client_data_v() Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 18:18 ` [PATCH 19/33] libceph: primary_temp infrastructure Ilya Dryomov
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |    9 +++++++--
 net/ceph/debugfs.c          |    4 ++--
 net/ceph/osdmap.c           |    8 ++++----
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 46c3e304c3d8..4837e58e3203 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -60,8 +60,13 @@ struct ceph_object_id {
 struct ceph_pg_mapping {
 	struct rb_node node;
 	struct ceph_pg pgid;
-	int len;
-	int osds[];
+
+	union {
+		struct {
+			int len;
+			int osds[];
+		} pg_temp;
+	};
 };
 
 struct ceph_osdmap {
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index c45d235e774e..5865f2c9580a 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -88,9 +88,9 @@ static int osdmap_show(struct seq_file *s, void *p)
 
 		seq_printf(s, "pg_temp %llu.%x [", pg->pgid.pool,
 			   pg->pgid.seed);
-		for (i = 0; i < pg->len; i++)
+		for (i = 0; i < pg->pg_temp.len; i++)
 			seq_printf(s, "%s%d", (i == 0 ? "" : ","),
-				   pg->osds[i]);
+				   pg->pg_temp.osds[i]);
 		seq_printf(s, "]\n");
 	}
 
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index ae96c73aff71..401af78ad741 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -822,9 +822,9 @@ static int __decode_pg_temp(void **p, void *end, struct ceph_osdmap *map,
 				return -ENOMEM;
 
 			pg->pgid = pgid;
-			pg->len = len;
+			pg->pg_temp.len = len;
 			for (i = 0; i < len; i++)
-				pg->osds[i] = ceph_decode_32(p);
+				pg->pg_temp.osds[i] = ceph_decode_32(p);
 
 			ret = __insert_pg_mapping(pg, &map->pg_temp);
 			if (ret) {
@@ -1275,8 +1275,8 @@ static int *calc_pg_raw(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
 				    pool->pg_num_mask);
 	pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);
 	if (pg) {
-		*num = pg->len;
-		return pg->osds;
+		*num = pg->pg_temp.len;
+		return pg->pg_temp.osds;
 	}
 
 	/* crush */
-- 
1.7.10.4



* [PATCH 19/33] libceph: primary_temp infrastructure
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (17 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 18/33] libceph: generalize ceph_pg_mapping Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:21   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 20/33] libceph: primary_temp decode bits Ilya Dryomov
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Add primary_temp mappings infrastructure.  struct ceph_pg_mapping is
overloaded: primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.

Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g.:

    primary_temp 2.6 4

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
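Illustration only, not part of the patch: a userspace sketch of the
line format described above, with snprintf() standing in for
seq_printf().  struct demo_pgid and demo_format_primary_temp() are
hypothetical names.  As in the debugfs hunk below, the pgid is printed
as <pool>.<seed> with the seed in hex.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct ceph_pg. */
struct demo_pgid {
	uint64_t pool;
	uint32_t seed;
};

/* Format one 'primary_temp <pgid> <osd>' line into buf. */
static int demo_format_primary_temp(char *buf, size_t size,
				    struct demo_pgid pgid, int osd)
{
	return snprintf(buf, size, "primary_temp %llu.%x %d\n",
			(unsigned long long)pgid.pool, pgid.seed, osd);
}
```
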
 include/linux/ceph/osdmap.h |    5 +++++
 net/ceph/debugfs.c          |    7 +++++++
 net/ceph/osdmap.c           |   10 +++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 4837e58e3203..db4fb6322aae 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -66,6 +66,9 @@ struct ceph_pg_mapping {
 			int len;
 			int osds[];
 		} pg_temp;
+		struct {
+			int osd;
+		} primary_temp;
 	};
 };
 
@@ -83,6 +86,8 @@ struct ceph_osdmap {
 	struct ceph_entity_addr *osd_addr;
 
 	struct rb_root pg_temp;
+	struct rb_root primary_temp;
+
 	struct rb_root pg_pools;
 	u32 pool_max;
 
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 5865f2c9580a..612bf55e6a8b 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -93,6 +93,13 @@ static int osdmap_show(struct seq_file *s, void *p)
 				   pg->pg_temp.osds[i]);
 		seq_printf(s, "]\n");
 	}
+	for (n = rb_first(&map->primary_temp); n; n = rb_next(n)) {
+		struct ceph_pg_mapping *pg =
+			rb_entry(n, struct ceph_pg_mapping, node);
+
+		seq_printf(s, "primary_temp %llu.%x %d\n", pg->pgid.pool,
+			   pg->pgid.seed, pg->primary_temp.osd);
+	}
 
 	return 0;
 }
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 401af78ad741..d78c3e5d60f7 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -343,7 +343,7 @@ bad:
 
 /*
  * rbtree of pg_mapping for handling pg_temp (explicit mapping of pgid
- * to a set of osds)
+ * to a set of osds) and primary_temp (explicit primary setting)
  */
 static int pgid_cmp(struct ceph_pg l, struct ceph_pg r)
 {
@@ -633,6 +633,13 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
 		rb_erase(&pg->node, &map->pg_temp);
 		kfree(pg);
 	}
+	while (!RB_EMPTY_ROOT(&map->primary_temp)) {
+		struct ceph_pg_mapping *pg =
+			rb_entry(rb_first(&map->primary_temp),
+				 struct ceph_pg_mapping, node);
+		rb_erase(&pg->node, &map->primary_temp);
+		kfree(pg);
+	}
 	while (!RB_EMPTY_ROOT(&map->pg_pools)) {
 		struct ceph_pg_pool_info *pi =
 			rb_entry(rb_first(&map->pg_pools),
@@ -960,6 +967,7 @@ struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end)
 		return ERR_PTR(-ENOMEM);
 
 	map->pg_temp = RB_ROOT;
+	map->primary_temp = RB_ROOT;
 	mutex_init(&map->crush_scratch_mutex);
 
 	ret = osdmap_decode(p, end, map);
-- 
1.7.10.4



* [PATCH 20/33] libceph: primary_temp decode bits
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (18 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 19/33] libceph: primary_temp infrastructure Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:23   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 21/33] libceph: primary_affinity infrastructure Ilya Dryomov
                   ` (12 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Add a common helper to decode both primary_temp (full map, map<pg_t,
u32>) and new_primary_temp (inc map, same) and switch to it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index d78c3e5d60f7..0ca7f36e88b4 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -857,6 +857,61 @@ static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
 	return __decode_pg_temp(p, end, map, true);
 }
 
+static int __decode_primary_temp(void **p, void *end, struct ceph_osdmap *map,
+				 bool incremental)
+{
+	u32 n;
+
+	ceph_decode_32_safe(p, end, n, e_inval);
+	while (n--) {
+		struct ceph_pg pgid;
+		u32 osd;
+		int ret;
+
+		ret = ceph_decode_pgid(p, end, &pgid);
+		if (ret)
+			return ret;
+
+		ceph_decode_32_safe(p, end, osd, e_inval);
+
+		ret = __remove_pg_mapping(&map->primary_temp, pgid);
+		BUG_ON(!incremental && ret != -ENOENT);
+
+		if (!incremental || osd != (u32)-1) {
+			struct ceph_pg_mapping *pg;
+
+			pg = kzalloc(sizeof(*pg), GFP_NOFS);
+			if (!pg)
+				return -ENOMEM;
+
+			pg->pgid = pgid;
+			pg->primary_temp.osd = osd;
+
+			ret = __insert_pg_mapping(pg, &map->primary_temp);
+			if (ret) {
+				kfree(pg);
+				return ret;
+			}
+		}
+	}
+
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
+static int decode_primary_temp(void **p, void *end, struct ceph_osdmap *map)
+{
+	return __decode_primary_temp(p, end, map, false);
+}
+
+static int decode_new_primary_temp(void **p, void *end,
+				   struct ceph_osdmap *map)
+{
+	return __decode_primary_temp(p, end, map, true);
+}
+
 /*
  * decode a full map.
  */
@@ -927,6 +982,13 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 	if (err)
 		goto bad;
 
+	/* primary_temp */
+	if (struct_v >= 1) {
+		err = decode_primary_temp(p, end, map);
+		if (err)
+			goto bad;
+	}
+
 	/* crush */
 	ceph_decode_32_safe(p, end, len, e_inval);
 	map->crush = crush_decode(*p, min(*p + len, end));
@@ -1127,6 +1189,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	if (err)
 		goto bad;
 
+	/* new_primary_temp */
+	if (struct_v >= 1) {
+		err = decode_new_primary_temp(p, end, map);
+		if (err)
+			goto bad;
+	}
+
 	/* ignore the rest */
 	*p = end;
 
-- 
1.7.10.4
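
The remove-then-insert pattern used by __decode_primary_temp() above can be sketched in userspace as follows.  This is only an illustration under stated assumptions: it uses a simplified wire layout (little-endian u32 count, then per entry a u64 pool, a u32 seed and a u32 osd) instead of the real pgid encoding handled by ceph_decode_pgid(), and a small fixed array instead of the kernel rbtree.  The names pt_table, pt_remove(), pt_decode(), get_le32() and get_le64() are hypothetical and do not exist in libceph.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PT_MAX 16	/* capacity of the toy table (assumption) */

struct pt_entry { uint64_t pool; uint32_t seed; uint32_t osd; };
struct pt_table { struct pt_entry e[PT_MAX]; int n; };

/* Read a little-endian u32 and advance *p; false if the buffer is short. */
static bool get_le32(const uint8_t **p, const uint8_t *end, uint32_t *v)
{
	if (end - *p < 4)
		return false;
	*v = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	     (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return true;
}

/* Read a little-endian u64 and advance *p; false if the buffer is short. */
static bool get_le64(const uint8_t **p, const uint8_t *end, uint64_t *v)
{
	uint32_t lo, hi;

	if (end - *p < 8)	/* check up front, consume nothing on failure */
		return false;
	get_le32(p, end, &lo);
	get_le32(p, end, &hi);
	*v = (uint64_t)hi << 32 | lo;
	return true;
}

/* Drop the mapping for (pool, seed), if there is one. */
static void pt_remove(struct pt_table *t, uint64_t pool, uint32_t seed)
{
	int i;

	for (i = 0; i < t->n; i++) {
		if (t->e[i].pool == pool && t->e[i].seed == seed) {
			t->e[i] = t->e[t->n - 1];
			t->n--;
			return;
		}
	}
}

/* Return 0 on success, -1 on short input or a full table. */
static int pt_decode(const uint8_t **p, const uint8_t *end,
		     struct pt_table *t, bool incremental)
{
	uint32_t n;

	if (!get_le32(p, end, &n))
		return -1;
	while (n--) {
		struct pt_entry ent;

		if (!get_le64(p, end, &ent.pool) ||
		    !get_le32(p, end, &ent.seed) ||
		    !get_le32(p, end, &ent.osd))
			return -1;

		pt_remove(t, ent.pool, ent.seed);

		/* in an incremental, osd == (u32)-1 means "remove only" */
		if (!incremental || ent.osd != UINT32_MAX) {
			if (t->n == PT_MAX)
				return -1;
			t->e[t->n++] = ent;
		}
	}
	return 0;
}
```

Decoding a full map with one entry populates the table; replaying an incremental that carries the same pgid with osd set to 0xffffffff removes it again.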



* [PATCH 21/33] libceph: primary_affinity infrastructure
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (19 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 20/33] libceph: primary_temp decode bits Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:26   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 22/33] libceph: primary_affinity decode bits Ilya Dryomov
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Add primary_affinity infrastructure.  primary_affinity values are
stored in a max_osd-sized array hanging off ceph_osdmap, similar to
the osd_weight array.

Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |    3 +++
 include/linux/ceph/rados.h  |    4 ++++
 net/ceph/debugfs.c          |    5 +++--
 net/ceph/osdmap.c           |   47 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index db4fb6322aae..6e030cb3c9ca 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -88,6 +88,8 @@ struct ceph_osdmap {
 	struct rb_root pg_temp;
 	struct rb_root primary_temp;
 
+	u32 *osd_primary_affinity;
+
 	struct rb_root pg_pools;
 	u32 pool_max;
 
@@ -134,6 +136,7 @@ static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
 }
 
 extern char *ceph_osdmap_state_str(char *str, int len, int state);
+extern u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd);
 
 static inline struct ceph_entity_addr *ceph_osd_addr(struct ceph_osdmap *map,
 						     int osd)
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 2caabef8d369..bb6f40c9cb0f 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -133,6 +133,10 @@ extern const char *ceph_osd_state_name(int s);
 #define CEPH_OSD_IN  0x10000
 #define CEPH_OSD_OUT 0
 
+/* osd primary-affinity.  fixed point value: 0x10000 == baseline */
+#define CEPH_OSD_MAX_PRIMARY_AFFINITY 0x10000
+#define CEPH_OSD_DEFAULT_PRIMARY_AFFINITY 0x10000
+
 
 /*
  * osd map flag bits
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 612bf55e6a8b..34453a2b4b4d 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -77,10 +77,11 @@ static int osdmap_show(struct seq_file *s, void *p)
 		int state = map->osd_state[i];
 		char sb[64];
 
-		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\n",
+		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\t%3d%%\n",
 			   i, ceph_pr_addr(&addr->in_addr),
 			   ((map->osd_weight[i]*100) >> 16),
-			   ceph_osdmap_state_str(sb, sizeof(sb), state));
+			   ceph_osdmap_state_str(sb, sizeof(sb), state),
+			   ((ceph_get_primary_affinity(map, i)*100) >> 16));
 	}
 	for (n = rb_first(&map->pg_temp); n; n = rb_next(n)) {
 		struct ceph_pg_mapping *pg =
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 0ca7f36e88b4..538b8dd341e8 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -649,6 +649,7 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
 	kfree(map->osd_state);
 	kfree(map->osd_weight);
 	kfree(map->osd_addr);
+	kfree(map->osd_primary_affinity);
 	kfree(map);
 }
 
@@ -685,6 +686,20 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 	map->osd_weight = weight;
 	map->osd_addr = addr;
 
+	if (map->osd_primary_affinity) {
+		u32 *affinity;
+
+		affinity = krealloc(map->osd_primary_affinity,
+				    max*sizeof(*affinity), GFP_NOFS);
+		if (!affinity)
+			return -ENOMEM;
+
+		for (i = map->max_osd; i < max; i++)
+			affinity[i] = CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
+
+		map->osd_primary_affinity = affinity;
+	}
+
 	map->max_osd = max;
 
 	return 0;
@@ -912,6 +927,38 @@ static int decode_new_primary_temp(void **p, void *end,
 	return __decode_primary_temp(p, end, map, true);
 }
 
+u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd)
+{
+	BUG_ON(osd >= map->max_osd);
+
+	if (!map->osd_primary_affinity)
+		return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
+
+	return map->osd_primary_affinity[osd];
+}
+
+static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
+{
+	BUG_ON(osd >= map->max_osd);
+
+	if (!map->osd_primary_affinity) {
+		int i;
+
+		map->osd_primary_affinity = kmalloc(map->max_osd*sizeof(u32),
+						    GFP_NOFS);
+		if (!map->osd_primary_affinity)
+			return -ENOMEM;
+
+		for (i = 0; i < map->max_osd; i++)
+			map->osd_primary_affinity[i] =
+			    CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
+	}
+
+	map->osd_primary_affinity[osd] = aff;
+
+	return 0;
+}
+
 /*
  * decode a full map.
  */
-- 
1.7.10.4
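
The lazily allocated affinity array and the 16.16 fixed-point percentage printed in debugfs can be illustrated with a small userspace sketch.  It is a simplified model, not the kernel code: affinity_map, get_affinity(), set_affinity() and affinity_percent() are hypothetical names, malloc() stands in for kmalloc(), and out-of-range ids return an error instead of triggering BUG_ON().

```c
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_AFFINITY 0x10000u	/* 16.16 fixed point, 0x10000 == 100% */

struct affinity_map {
	int max_osd;
	uint32_t *affinity;	/* NULL until a value is first set */
};

/* Caller must pass a valid osd id (0 <= osd < max_osd). */
static uint32_t get_affinity(const struct affinity_map *m, int osd)
{
	if (!m->affinity)
		return DEFAULT_AFFINITY;
	return m->affinity[osd];
}

/* Allocate the array on first use, filled with the default value. */
static int set_affinity(struct affinity_map *m, int osd, uint32_t aff)
{
	if (osd < 0 || osd >= m->max_osd)
		return -1;

	if (!m->affinity) {
		int i;

		m->affinity = malloc((size_t)m->max_osd * sizeof(*m->affinity));
		if (!m->affinity)
			return -1;
		for (i = 0; i < m->max_osd; i++)
			m->affinity[i] = DEFAULT_AFFINITY;
	}

	m->affinity[osd] = aff;
	return 0;
}

/* Same arithmetic as the debugfs output: (value * 100) >> 16. */
static unsigned int affinity_percent(uint32_t aff)
{
	return (unsigned int)(((uint64_t)aff * 100) >> 16);
}
```

With this layout a map that never receives a non-default value pays no memory cost, and 0x8000 is reported as 50%.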



* [PATCH 22/33] libceph: primary_affinity decode bits
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (20 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 21/33] libceph: primary_affinity infrastructure Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:31   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit Ilya Dryomov
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Add two helpers to decode primary_affinity (full map, vector<u32>) and
new_primary_affinity (inc map, map<u32, u32>) and switch to them.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 538b8dd341e8..3ac2098972ea 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -959,6 +959,59 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
 	return 0;
 }
 
+static int decode_primary_affinity(void **p, void *end,
+				   struct ceph_osdmap *map)
+{
+	u32 len, i;
+
+	ceph_decode_32_safe(p, end, len, e_inval);
+	if (len == 0) {
+		kfree(map->osd_primary_affinity);
+		map->osd_primary_affinity = NULL;
+		return 0;
+	}
+
+	ceph_decode_need(p, end, map->max_osd*sizeof(u32), e_inval);
+
+	BUG_ON(len != map->max_osd);
+	for (i = 0; i < map->max_osd; i++) {
+		int ret;
+
+		ret = set_primary_affinity(map, i, ceph_decode_32(p));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
+static int decode_new_primary_affinity(void **p, void *end,
+				       struct ceph_osdmap *map)
+{
+	u32 n;
+
+	ceph_decode_32_safe(p, end, n, e_inval);
+	while (n--) {
+		u32 osd, aff;
+		int ret;
+
+		ceph_decode_32_safe(p, end, osd, e_inval);
+		ceph_decode_32_safe(p, end, aff, e_inval);
+
+		ret = set_primary_affinity(map, osd, aff);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+
+e_inval:
+	return -EINVAL;
+}
+
 /*
  * decode a full map.
  */
@@ -1036,6 +1089,17 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 			goto bad;
 	}
 
+	/* primary_affinity */
+	if (struct_v >= 2) {
+		err = decode_primary_affinity(p, end, map);
+		if (err)
+			goto bad;
+	} else {
+		/* XXX can this happen? */
+		kfree(map->osd_primary_affinity);
+		map->osd_primary_affinity = NULL;
+	}
+
 	/* crush */
 	ceph_decode_32_safe(p, end, len, e_inval);
 	map->crush = crush_decode(*p, min(*p + len, end));
@@ -1243,6 +1307,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 			goto bad;
 	}
 
+	/* new_primary_affinity */
+	if (struct_v >= 2) {
+		err = decode_new_primary_affinity(p, end, map);
+		if (err)
+			goto bad;
+	}
+
 	/* ignore the rest */
 	*p = end;
 
-- 
1.7.10.4



* [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (21 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 22/33] libceph: primary_affinity decode bits Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:32   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions Ilya Dryomov
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Announce our support for the "new" osdmap encoding.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/ceph_features.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 77c097fe9ea9..7a4cab50b2cd 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -90,6 +90,7 @@ static inline u64 ceph_sanitize_features(u64 features)
 	 CEPH_FEATURE_OSD_CACHEPOOL |		\
 	 CEPH_FEATURE_CRUSH_V2 |		\
 	 CEPH_FEATURE_EXPORT_PEER |		\
+	 CEPH_FEATURE_OSDMAP_ENC |		\
 	 CEPH_FEATURE_CRUSH_TUNABLES3)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
-- 
1.7.10.4



* [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (22 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:33   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines Ilya Dryomov
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Sync up with ceph.git definitions.  Bring in ceph_osd_is_down().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 6e030cb3c9ca..0895797b9e28 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -125,9 +125,21 @@ static inline void ceph_oid_copy(struct ceph_object_id *dest,
 	dest->name_len = src->name_len;
 }
 
+static inline int ceph_osd_exists(struct ceph_osdmap *map, int osd)
+{
+	return osd >= 0 && osd < map->max_osd &&
+	       (map->osd_state[osd] & CEPH_OSD_EXISTS);
+}
+
 static inline int ceph_osd_is_up(struct ceph_osdmap *map, int osd)
 {
-	return (osd < map->max_osd) && (map->osd_state[osd] & CEPH_OSD_UP);
+	return ceph_osd_exists(map, osd) &&
+	       (map->osd_state[osd] & CEPH_OSD_UP);
+}
+
+static inline int ceph_osd_is_down(struct ceph_osdmap *map, int osd)
+{
+	return !ceph_osd_is_up(map, osd);
 }
 
 static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
-- 
1.7.10.4
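
The practical effect of routing ceph_osd_is_up() through ceph_osd_exists() can be seen in a short userspace sketch.  The flag values below (exists == bit 0, up == bit 1) and the names state_map, osd_exists(), osd_is_up() and osd_is_down() are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define OSD_EXISTS 0x1	/* assumed flag values for this sketch */
#define OSD_UP     0x2

struct state_map {
	int max_osd;
	const uint8_t *osd_state;
};

static bool osd_exists(const struct state_map *m, int osd)
{
	return osd >= 0 && osd < m->max_osd &&
	       (m->osd_state[osd] & OSD_EXISTS);
}

/* An osd counts as up only if it also exists. */
static bool osd_is_up(const struct state_map *m, int osd)
{
	return osd_exists(m, osd) && (m->osd_state[osd] & OSD_UP);
}

/* Everything that is not up is down, including out-of-range ids. */
static bool osd_is_down(const struct state_map *m, int osd)
{
	return !osd_is_up(m, osd);
}
```

An osd with the up bit set but the exists bit clear is reported as down, as are negative and out-of-range ids.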



* [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (23 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:34   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers Ilya Dryomov
                   ` (7 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Bring in pg_pool_t::can_shift_osds() counterpart along with pool type
defines.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |   12 ++++++++++++
 include/linux/ceph/rados.h  |    5 +++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 0895797b9e28..4e28c1e5d62f 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -41,6 +41,18 @@ struct ceph_pg_pool_info {
 	char *name;
 };
 
+static inline bool ceph_can_shift_osds(struct ceph_pg_pool_info *pool)
+{
+	switch (pool->type) {
+	case CEPH_POOL_TYPE_REP:
+		return true;
+	case CEPH_POOL_TYPE_EC:
+		return false;
+	default:
+		BUG_ON(1);
+	}
+}
+
 struct ceph_object_locator {
 	s64 pool;
 };
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index bb6f40c9cb0f..f20e0d8a2155 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -81,8 +81,9 @@ struct ceph_pg_v1 {
  */
 #define CEPH_NOPOOL  ((__u64) (-1))  /* pool id not defined */
 
-#define CEPH_PG_TYPE_REP     1
-#define CEPH_PG_TYPE_RAID4   2
+#define CEPH_POOL_TYPE_REP     1
+#define CEPH_POOL_TYPE_RAID4   2 /* never implemented */
+#define CEPH_POOL_TYPE_EC      3
 
 /*
  * stable_mod func is used to control number of placement groups.
-- 
1.7.10.4



* [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (24 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:36   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 27/33] libceph: introduce apply_temps() helper Ilya Dryomov
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

pg_to_raw_osds() helper for computing a raw (crush) set, which can
contain nonexistent and down osds.

raw_to_up_osds() helper for pruning nonexistent and down osds from the
raw set, thereby transforming it into an up set, and determining the up
primary.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 3ac2098972ea..ee095e07cf98 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1514,6 +1514,82 @@ static int *calc_pg_raw(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
 }
 
 /*
+ * Calculate raw (crush) set for given pgid.
+ *
+ * Return raw set length, or error.
+ */
+static int pg_to_raw_osds(struct ceph_osdmap *osdmap,
+			  struct ceph_pg_pool_info *pool,
+			  struct ceph_pg pgid, u32 pps, int *osds)
+{
+	int ruleno;
+	int len;
+
+	/* crush */
+	ruleno = crush_find_rule(osdmap->crush, pool->crush_ruleset,
+				 pool->type, pool->size);
+	if (ruleno < 0) {
+		pr_err("no crush rule: pool %lld ruleset %d type %d size %d\n",
+		       pgid.pool, pool->crush_ruleset, pool->type,
+		       pool->size);
+		return -ENOENT;
+	}
+
+	len = do_crush(osdmap, ruleno, pps, osds,
+		       min_t(int, pool->size, CEPH_PG_MAX_SIZE),
+		       osdmap->osd_weight, osdmap->max_osd);
+	if (len < 0) {
+		pr_err("error %d from crush rule %d: pool %lld ruleset %d type %d size %d\n",
+		       len, ruleno, pgid.pool, pool->crush_ruleset,
+		       pool->type, pool->size);
+		return len;
+	}
+
+	return len;
+}
+
+/*
+ * Given raw set, calculate up set and up primary.
+ *
+ * Return up set length.  *primary is set to up primary osd id, or -1
+ * if up set is empty.
+ */
+static int raw_to_up_osds(struct ceph_osdmap *osdmap,
+			  struct ceph_pg_pool_info *pool,
+			  int *osds, int len, int *primary)
+{
+	int up_primary = -1;
+	int i;
+
+	if (ceph_can_shift_osds(pool)) {
+		int removed = 0;
+
+		for (i = 0; i < len; i++) {
+			if (ceph_osd_is_down(osdmap, osds[i])) {
+				removed++;
+				continue;
+			}
+			if (removed)
+				osds[i - removed] = osds[i];
+		}
+
+		len -= removed;
+		if (len > 0)
+			up_primary = osds[0];
+	} else {
+		for (i = len - 1; i >= 0; i--) {
+			if (ceph_osd_is_down(osdmap, osds[i]))
+				osds[i] = CRUSH_ITEM_NONE;
+			else
+				up_primary = osds[i];
+		}
+	}
+
+	*primary = up_primary;
+	return len;
+}
+
+/*
  * Return acting set for given pgid.
  */
 int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
-- 
1.7.10.4
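
To make the difference between the two pool types concrete, here is a minimal userspace sketch of the pruning step performed by raw_to_up_osds().  It assumes a plain bool table indexed by osd id in place of the osdmap state array; is_down(), raw_to_up() and ITEM_NONE are hypothetical stand-ins for ceph_osd_is_down(), raw_to_up_osds() and CRUSH_ITEM_NONE.

```c
#include <stdbool.h>

#define ITEM_NONE 0x7fffffff	/* placeholder for a hole in the set */

/* up[] is a per-osd liveness table; ids outside [0, nr) count as down. */
static bool is_down(const bool *up, int nr, int osd)
{
	return osd < 0 || osd >= nr || !up[osd];
}

/*
 * Prune down osds from a raw set in place.  Returns the new length and
 * stores the first up osd in *primary (-1 if there is none).
 */
static int raw_to_up(const bool *up, int nr, bool can_shift,
		     int *osds, int len, int *primary)
{
	int up_primary = -1;
	int i;

	if (can_shift) {
		/* replicated: compact the array, order is preserved */
		int removed = 0;

		for (i = 0; i < len; i++) {
			if (is_down(up, nr, osds[i])) {
				removed++;
				continue;
			}
			if (removed)
				osds[i - removed] = osds[i];
		}
		len -= removed;
		if (len > 0)
			up_primary = osds[0];
	} else {
		/* erasure coded: positions matter, leave holes */
		for (i = len - 1; i >= 0; i--) {
			if (is_down(up, nr, osds[i]))
				osds[i] = ITEM_NONE;
			else
				up_primary = osds[i];
		}
	}

	*primary = up_primary;
	return len;
}
```

For a raw set {1, 3, 2} with osd 1 down, a replicated pool ends up with {3, 2}, while an erasure-coded pool keeps three slots with a hole in the first one; the up primary is osd 3 in both cases.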



* [PATCH 27/33] libceph: introduce apply_temps() helper
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (25 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:41   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers Ilya Dryomov
                   ` (5 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

apply_temps() helper for applying various temporary mappings (at this
point only pg_temp mappings) to the up set, thereby transforming it
into an acting set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index ee095e07cf98..6d418433d80d 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1590,6 +1590,58 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
 }
 
 /*
+ * Given up set, apply pg_temp mapping.
+ *
+ * Return acting set length.  *primary is set to acting primary osd id,
+ * or -1 if acting set is empty.
+ */
+static int apply_temps(struct ceph_osdmap *osdmap,
+		       struct ceph_pg_pool_info *pool, struct ceph_pg pgid,
+		       int *osds, int len, int *primary)
+{
+	struct ceph_pg_mapping *pg;
+	int temp_len;
+	int temp_primary;
+	int i;
+
+	/* raw_pg -> pg */
+	pgid.seed = ceph_stable_mod(pgid.seed, pool->pg_num,
+				    pool->pg_num_mask);
+
+	/* pg_temp? */
+	pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);
+	if (pg) {
+		temp_len = 0;
+		temp_primary = -1;
+
+		for (i = 0; i < pg->pg_temp.len; i++) {
+			if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
+				if (ceph_can_shift_osds(pool))
+					continue;
+				else
+					osds[temp_len++] = CRUSH_ITEM_NONE;
+			} else {
+				osds[temp_len++] = pg->pg_temp.osds[i];
+			}
+		}
+
+		/* apply pg_temp's primary */
+		for (i = 0; i < temp_len; i++) {
+			if (osds[i] != CRUSH_ITEM_NONE) {
+				temp_primary = osds[i];
+				break;
+			}
+		}
+	} else {
+		temp_len = len;
+		temp_primary = *primary;
+	}
+
+	*primary = temp_primary;
+	return temp_len;
+}
+
+/*
  * Return acting set for given pgid.
  */
 int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
-- 
1.7.10.4
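
The pg_temp half of apply_temps() can be sketched in userspace along the same lines.  The sketch below takes the pg_temp list as an optional array (NULL meaning "no mapping for this pg") instead of looking it up in an rbtree, and it omits the ceph_stable_mod() step applied to the seed before the lookup.  apply_pg_temp(), is_down() and ITEM_NONE are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define ITEM_NONE 0x7fffffff	/* placeholder for a hole in the set */

static bool is_down(const bool *up, int nr, int osd)
{
	return osd < 0 || osd >= nr || !up[osd];
}

/*
 * Replace the up set in osds[] with the pg_temp list, if there is one.
 * osds[] must have room for temp_cnt entries.  Returns the acting set
 * length; *primary becomes the first usable osd, or -1 if none.
 */
static int apply_pg_temp(const bool *up, int nr, bool can_shift,
			 const int *temp, int temp_cnt,
			 int *osds, int len, int *primary)
{
	int out = 0;
	int i;

	if (!temp)
		return len;	/* up set and up primary pass through */

	for (i = 0; i < temp_cnt; i++) {
		if (is_down(up, nr, temp[i])) {
			if (can_shift)
				continue;	/* replicated: drop it */
			osds[out++] = ITEM_NONE;	/* EC: keep the slot */
		} else {
			osds[out++] = temp[i];
		}
	}

	*primary = -1;
	for (i = 0; i < out; i++) {
		if (osds[i] != ITEM_NONE) {
			*primary = osds[i];
			break;
		}
	}
	return out;
}
```

Note that a pg_temp mapping replaces the up set wholesale: the acting set may be longer or shorter than the up set it was given.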



* [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (26 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 27/33] libceph: introduce apply_temps() helper Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:49   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting() Ilya Dryomov
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |    2 +-
 net/ceph/osdmap.c           |   51 ++++++++++++++++++++++++++++++++-----------
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index 4e28c1e5d62f..b0c8f8490663 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -212,7 +212,7 @@ extern int ceph_oloc_oid_to_pg(struct ceph_osdmap *osdmap,
 
 extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap,
 			       struct ceph_pg pgid,
-			       int *acting);
+			       int *osds);
 extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
 				struct ceph_pg pgid);
 
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 6d418433d80d..1963623bd488 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1642,24 +1642,49 @@ static int apply_temps(struct ceph_osdmap *osdmap,
 }
 
 /*
- * Return acting set for given pgid.
+ * Calculate acting set for given pgid.
+ *
+ * Return acting set length, or error.
  */
 int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
-			int *acting)
+			int *osds)
 {
-	int rawosds[CEPH_PG_MAX_SIZE], *osds;
-	int i, o, num = CEPH_PG_MAX_SIZE;
+	struct ceph_pg_pool_info *pool;
+	u32 pps;
+	int len;
+	int primary;
 
-	osds = calc_pg_raw(osdmap, pgid, rawosds, &num);
-	if (!osds)
-		return -1;
+	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
+	if (!pool)
+		return 0;
 
-	/* primary is first up osd */
-	o = 0;
-	for (i = 0; i < num; i++)
-		if (ceph_osd_is_up(osdmap, osds[i]))
-			acting[o++] = osds[i];
-	return o;
+	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
+		/* hash pool id and seed so that pool PGs do not overlap */
+		pps = crush_hash32_2(CRUSH_HASH_RJENKINS1,
+				     ceph_stable_mod(pgid.seed, pool->pgp_num,
+						     pool->pgp_num_mask),
+				     pgid.pool);
+	} else {
+		/*
+		 * legacy behavior: add ps and pool together.  this is
+		 * not a great approach because the PGs from each pool
+		 * will overlap on top of each other: 0.5 == 1.4 ==
+		 * 2.3 == ...
+		 */
+		pps = ceph_stable_mod(pgid.seed, pool->pgp_num,
+				      pool->pgp_num_mask) +
+			(unsigned)pgid.pool;
+	}
+
+	len = pg_to_raw_osds(osdmap, pool, pgid, pps, osds);
+	if (len < 0)
+		return len;
+
+	len = raw_to_up_osds(osdmap, pool, osds, len, &primary);
+
+	len = apply_temps(osdmap, pool, pgid, osds, len, &primary);
+
+	return len;
 }
 
 /*
-- 
1.7.10.4
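
The overlap mentioned in the legacy-behavior comment (0.5 == 1.4 == 2.3) is easy to reproduce in userspace.  The sketch below mirrors the arithmetic of the non-HASHPSPOOL branch only; stable_mod() follows the usual ceph_stable_mod() definition, and legacy_pps() is a hypothetical helper name.  The HASHPSPOOL branch is not reproduced here because it depends on the CRUSH hash.

```c
#include <stdint.h>

/*
 * Stable modulo: behaves like x % b, but when b is not a power of two
 * the values that would fall outside [0, b) fold into the lower half.
 * bmask is the next power of two at or above b, minus one.
 */
static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
{
	if ((x & bmask) < b)
		return x & bmask;
	return x & (bmask >> 1);
}

/* Legacy placement seed: simply add the pool id to the folded seed. */
static uint32_t legacy_pps(uint64_t pool, uint32_t seed,
			   uint32_t pgp_num, uint32_t pgp_num_mask)
{
	return stable_mod(seed, pgp_num, pgp_num_mask) + (uint32_t)pool;
}
```

With pgp_num == 8, pgs 0.5, 1.4 and 2.3 all produce a placement seed of 5 and therefore feed identical input to CRUSH, which is the overlap the HASHPSPOOL flag avoids.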



* [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (27 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:50   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 30/33] libceph: add support for primary_temp mappings Ilya Dryomov
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly.  Primary is now specified separately
from the order of osds in the set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osdmap.h |    2 +-
 net/ceph/osd_client.c       |   10 ++++------
 net/ceph/osdmap.c           |   20 ++++++++++++--------
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index b0c8f8490663..561ea896c657 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -212,7 +212,7 @@ extern int ceph_oloc_oid_to_pg(struct ceph_osdmap *osdmap,
 
 extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap,
 			       struct ceph_pg pgid,
-			       int *osds);
+			       int *osds, int *primary);
 extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
 				struct ceph_pg pgid);
 
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 6f64eec18851..b4157dc22199 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1333,7 +1333,7 @@ static int __map_request(struct ceph_osd_client *osdc,
 {
 	struct ceph_pg pgid;
 	int acting[CEPH_PG_MAX_SIZE];
-	int o = -1, num = 0;
+	int num, o;
 	int err;
 	bool was_paused;
 
@@ -1346,11 +1346,9 @@ static int __map_request(struct ceph_osd_client *osdc,
 	}
 	req->r_pgid = pgid;
 
-	err = ceph_calc_pg_acting(osdc->osdmap, pgid, acting);
-	if (err > 0) {
-		o = acting[0];
-		num = err;
-	}
+	num = ceph_calc_pg_acting(osdc->osdmap, pgid, acting, &o);
+	if (num < 0)
+		num = 0;
 
 	was_paused = req->r_paused;
 	req->r_paused = __req_should_be_paused(osdc, req);
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 1963623bd488..7193b012ee02 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1644,19 +1644,21 @@ static int apply_temps(struct ceph_osdmap *osdmap,
 /*
  * Calculate acting set for given pgid.
  *
- * Return acting set length, or error.
+ * Return acting set length, or error.  *primary is set to acting
+ * primary osd id, or -1 if acting set is empty or on error.
  */
 int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
-			int *osds)
+			int *osds, int *primary)
 {
 	struct ceph_pg_pool_info *pool;
 	u32 pps;
 	int len;
-	int primary;
 
 	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
-	if (!pool)
-		return 0;
+	if (!pool) {
+		*primary = -1;
+		return -ENOENT;
+	}
 
 	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
 		/* hash pool id and seed so that pool PGs do not overlap */
@@ -1677,12 +1679,14 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
 	}
 
 	len = pg_to_raw_osds(osdmap, pool, pgid, pps, osds);
-	if (len < 0)
+	if (len < 0) {
+		*primary = -1;
 		return len;
+	}
 
-	len = raw_to_up_osds(osdmap, pool, osds, len, &primary);
+	len = raw_to_up_osds(osdmap, pool, osds, len, primary);
 
-	len = apply_temps(osdmap, pool, pgid, osds, len, &primary);
+	len = apply_temps(osdmap, pool, pgid, osds, len, primary);
 
 	return len;
 }
-- 
1.7.10.4



* [PATCH 30/33] libceph: add support for primary_temp mappings
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (28 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting() Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:51   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 31/33] libceph: add support for osd primary affinity Ilya Dryomov
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Change apply_temps() to override the primary in the same way pg_temp
overrides the osd set.  primary_temp overrides the pg_temp primary too.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 7193b012ee02..ed52b47d0ddb 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1590,7 +1590,7 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
 }
 
 /*
- * Given up set, apply pg_temp mapping.
+ * Given up set, apply pg_temp and primary_temp mappings.
  *
  * Return acting set length.  *primary is set to acting primary osd id,
  * or -1 if acting set is empty.
@@ -1637,6 +1637,11 @@ static int apply_temps(struct ceph_osdmap *osdmap,
 		temp_primary = *primary;
 	}
 
+	/* primary_temp? */
+	pg = __lookup_pg_mapping(&osdmap->primary_temp, pgid);
+	if (pg)
+		temp_primary = pg->primary_temp.osd;
+
 	*primary = temp_primary;
 	return temp_len;
 }
-- 
1.7.10.4
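
The resulting precedence between the three possible sources of the primary can be summarized in a few lines of userspace C.  This is a deliberately reduced sketch: acting_primary() is a hypothetical helper, and each override is passed as an optional pointer (NULL when no mapping exists for the pg) rather than being looked up in the osdmap.

```c
#include <stddef.h>

/*
 * Pick the acting primary.  The up primary is the baseline, a pg_temp
 * mapping replaces it with its own first usable osd, and a primary_temp
 * mapping wins over both.
 */
static int acting_primary(int up_primary, const int *pg_temp_primary,
			  const int *primary_temp)
{
	int primary = up_primary;

	if (pg_temp_primary)
		primary = *pg_temp_primary;
	if (primary_temp)
		primary = *primary_temp;

	return primary;
}
```

Only the primary changes when primary_temp applies; the order of osds in the acting set is left as pg_temp (or the up set) produced it.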



* [PATCH 31/33] libceph: add support for osd primary affinity
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (29 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 30/33] libceph: add support for primary_temp mappings Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 20:59   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() Ilya Dryomov
  2014-03-27 18:18 ` [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit Ilya Dryomov
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Respond to non-default primary_affinity values accordingly.  (Primary
affinity allows the admin to shift 'primary responsibility' away from
specific osds, effectively shifting around the read side of the
workload and whatever overhead is incurred by peering and writes by
virtue of being the primary).
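
As an illustration of the selection rule, here is a minimal userspace sketch.  It is not the kernel implementation: mix32() is an arbitrary stand-in mixer and NOT the CRUSH rjenkins hash, so actual placements will differ, and pick_primary() and ITEM_NONE are hypothetical names.  The affinity table is assumed to be indexed by osd id and to cover every id in the set.

```c
#include <stdbool.h>
#include <stdint.h>

#define ITEM_NONE    0x7fffffff	/* hole in an erasure-coded set */
#define MAX_AFFINITY 0x10000u	/* 16.16 fixed point, 100% */

/* Stand-in 32-bit mixer; NOT the hash used by CRUSH. */
static uint32_t mix32(uint32_t a, uint32_t b)
{
	uint32_t h = (a * 0x9e3779b1u) ^ (b + 0x7f4a7c15u);

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	return h;
}

/*
 * Walk the set in order.  An osd is rejected as primary when its
 * affinity is below the maximum and the top 16 bits of hash(pps, osd)
 * are >= its affinity.  The first rejected osd is kept as a fallback
 * in case every candidate is rejected.
 */
static void pick_primary(const uint32_t *affinity, uint32_t pps,
			 bool can_shift, int *osds, int len, int *primary)
{
	int pos = -1;
	int i;

	for (i = 0; i < len; i++) {
		int o = osds[i];
		uint32_t a;

		if (o == ITEM_NONE)
			continue;

		a = affinity[o];
		if (a < MAX_AFFINITY && (mix32(pps, (uint32_t)o) >> 16) >= a) {
			if (pos < 0)
				pos = i;	/* rejected, remember as fallback */
		} else {
			pos = i;	/* accepted */
			break;
		}
	}
	if (pos < 0)
		return;

	*primary = osds[pos];

	if (can_shift && pos > 0) {
		/* replicated: move the new primary to the front */
		for (i = pos; i > 0; i--)
			osds[i] = osds[i - 1];
		osds[0] = *primary;
	}
}
```

Because the hash takes both the placement seed and the osd id, an osd with affinity 0x8000 is rejected for roughly half of its pgs, while an affinity of 0 rejects it whenever another candidate is available.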

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index ed52b47d0ddb..8c596a13c60f 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1589,6 +1589,72 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
 	return len;
 }
 
+static void apply_primary_affinity(struct ceph_osdmap *osdmap, u32 pps,
+				   struct ceph_pg_pool_info *pool,
+				   int *osds, int len, int *primary)
+{
+	int i;
+	int pos = -1;
+
+	/*
+	 * Do we have any non-default primary_affinity values for these
+	 * osds?
+	 */
+	if (!osdmap->osd_primary_affinity)
+		return;
+
+	for (i = 0; i < len; i++) {
+		if (osds[i] != CRUSH_ITEM_NONE &&
+		    osdmap->osd_primary_affinity[osds[i]] !=
+					CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {
+			break;
+		}
+	}
+	if (i == len)
+		return;
+
+	/*
+	 * Pick the primary.  Feed both the seed (for the pg) and the
+	 * osd into the hash/rng so that a proportional fraction of an
+	 * osd's pgs get rejected as primary.
+	 */
+	for (i = 0; i < len; i++) {
+		int o;
+		u32 a;
+
+		o = osds[i];
+		if (o == CRUSH_ITEM_NONE)
+			continue;
+
+		a = osdmap->osd_primary_affinity[o];
+		if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
+		    (crush_hash32_2(CRUSH_HASH_RJENKINS1,
+				    pps, o) >> 16) >= a) {
+			/*
+			 * We chose not to use this primary.  Note it
+			 * anyway as a fallback in case we don't pick
+			 * anyone else, but keep looking.
+			 */
+			if (pos < 0)
+				pos = i;
+		} else {
+			pos = i;
+			break;
+		}
+	}
+	if (pos < 0)
+		return;
+
+	*primary = osds[pos];
+
+	if (ceph_can_shift_osds(pool) && pos > 0) {
+		/* move the new primary to the front */
+		for (i = pos; i > 0; i--)
+			osds[i] = osds[i - 1];
+		osds[0] = *primary;
+	}
+}
+
 /*
  * Given up set, apply pg_temp and primary_temp mappings.
  *
@@ -1691,6 +1757,8 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
 
 	len = raw_to_up_osds(osdmap, pool, osds, len, primary);
 
+	apply_primary_affinity(osdmap, pps, pool, osds, len, primary);
+
 	len = apply_temps(osdmap, pool, pgid, osds, len, primary);
 
 	return len;
-- 
1.7.10.4
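
The selection rule in apply_primary_affinity() above can be sketched in
userspace.  This is a minimal illustration under stated assumptions, not
the kernel code: sk_hash2() is a hypothetical stand-in mixer (the patch
uses crush_hash32_2() with CRUSH_HASH_RJENKINS1), the SK_* constants are
local assumptions (0x10000 for the maximum affinity, 0x7fffffff for
CRUSH_ITEM_NONE), and the affinity array is indexed by osd id.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ITEM_NONE	0x7fffffff	/* assumed value of CRUSH_ITEM_NONE */
#define SK_MAX_AFFINITY	0x10000u	/* assumed 16.16 fixed-point 1.0 */

/* Hypothetical stand-in mixer; NOT the rjenkins hash used by CRUSH. */
static uint32_t sk_hash2(uint32_t a, uint32_t b)
{
	uint32_t h = (a * 0x9e3779b1u) ^ (b + 0x85ebca6bu);

	h ^= h >> 16;
	h *= 0x7feb352du;
	h ^= h >> 15;
	return h;
}

/*
 * Return the position in osds[] of the chosen primary, or -1 if the set
 * holds no real osd.  affinity[] is indexed by osd id.
 */
static int sk_pick_primary(const uint32_t *affinity, uint32_t pps,
			   const int *osds, int len)
{
	int pos = -1;
	int i;

	for (i = 0; i < len; i++) {
		int o = osds[i];
		uint32_t a;

		if (o == SK_ITEM_NONE)
			continue;

		a = affinity[o];
		if (a < SK_MAX_AFFINITY &&
		    (sk_hash2(pps, (uint32_t)o) >> 16) >= a) {
			/* rejected; remember the first one as a fallback */
			if (pos < 0)
				pos = i;
		} else {
			/* accepted; stop looking */
			pos = i;
			break;
		}
	}
	return pos;
}
```

With an affinity of 0 an osd is always rejected, so it ends up as primary
only through the fallback, when every other candidate is rejected too.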



* [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (30 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 31/33] libceph: add support for osd primary affinity Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 21:04   ` Alex Elder
  2014-03-27 18:18 ` [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit Ilya Dryomov
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
and get rid of the now unused calc_pg_raw().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 net/ceph/osdmap.c |   79 +++--------------------------------------------------
 1 file changed, 4 insertions(+), 75 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 8c596a13c60f..f0567d8ca683 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1449,71 +1449,6 @@ static int do_crush(struct ceph_osdmap *map, int ruleno, int x,
 }
 
 /*
- * Calculate raw osd vector for the given pgid.  Return pointer to osd
- * array, or NULL on failure.
- */
-static int *calc_pg_raw(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
-			int *osds, int *num)
-{
-	struct ceph_pg_mapping *pg;
-	struct ceph_pg_pool_info *pool;
-	int ruleno;
-	int r;
-	u32 pps;
-
-	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
-	if (!pool)
-		return NULL;
-
-	/* pg_temp? */
-	pgid.seed = ceph_stable_mod(pgid.seed, pool->pg_num,
-				    pool->pg_num_mask);
-	pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);
-	if (pg) {
-		*num = pg->pg_temp.len;
-		return pg->pg_temp.osds;
-	}
-
-	/* crush */
-	ruleno = crush_find_rule(osdmap->crush, pool->crush_ruleset,
-				 pool->type, pool->size);
-	if (ruleno < 0) {
-		pr_err("no crush rule pool %lld ruleset %d type %d size %d\n",
-		       pgid.pool, pool->crush_ruleset, pool->type,
-		       pool->size);
-		return NULL;
-	}
-
-	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
-		/* hash pool id and seed sothat pool PGs do not overlap */
-		pps = crush_hash32_2(CRUSH_HASH_RJENKINS1,
-				     ceph_stable_mod(pgid.seed, pool->pgp_num,
-						     pool->pgp_num_mask),
-				     pgid.pool);
-	} else {
-		/*
-		 * legacy ehavior: add ps and pool together.  this is
-		 * not a great approach because the PGs from each pool
-		 * will overlap on top of each other: 0.5 == 1.4 ==
-		 * 2.3 == ...
-		 */
-		pps = ceph_stable_mod(pgid.seed, pool->pgp_num,
-				      pool->pgp_num_mask) +
-			(unsigned)pgid.pool;
-	}
-	r = do_crush(osdmap, ruleno, pps, osds, min_t(int, pool->size, *num),
-		     osdmap->osd_weight, osdmap->max_osd);
-	if (r < 0) {
-		pr_err("error %d from crush rule: pool %lld ruleset %d type %d"
-		       " size %d\n", r, pgid.pool, pool->crush_ruleset,
-		       pool->type, pool->size);
-		return NULL;
-	}
-	*num = r;
-	return osds;
-}
-
-/*
  * Calculate raw (crush) set for given pgid.
  *
  * Return raw set length, or error.
@@ -1769,17 +1704,11 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
  */
 int ceph_calc_pg_primary(struct ceph_osdmap *osdmap, struct ceph_pg pgid)
 {
-	int rawosds[CEPH_PG_MAX_SIZE], *osds;
-	int i, num = CEPH_PG_MAX_SIZE;
+	int osds[CEPH_PG_MAX_SIZE];
+	int primary;
 
-	osds = calc_pg_raw(osdmap, pgid, rawosds, &num);
-	if (!osds)
-		return -1;
+	ceph_calc_pg_acting(osdmap, pgid, osds, &primary);
 
-	/* primary is first up osd */
-	for (i = 0; i < num; i++)
-		if (ceph_osd_is_up(osdmap, osds[i]))
-			return osds[i];
-	return -1;
+	return primary;
 }
 EXPORT_SYMBOL(ceph_calc_pg_primary);
-- 
1.7.10.4
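
The comment removed along with calc_pg_raw() notes that the legacy
placement seed makes PGs from different pools land on top of each other
(0.5 == 1.4 == 2.3).  A small userspace sketch of that arithmetic
follows.  The sk_ names are hypothetical, sk_stable_mod() is assumed to
mirror ceph_stable_mod(), and the HASHPSPOOL branch is left out because
it needs the real rjenkins hash.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed equivalent of ceph_stable_mod(): stable modulo for pg splits. */
static uint32_t sk_stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
{
	if ((x & bmask) < b)
		return x & bmask;
	return x & (bmask >> 1);
}

/* Legacy (non-HASHPSPOOL) placement seed: folded seed plus pool id. */
static uint32_t sk_legacy_pps(uint64_t pool, uint32_t seed,
			      uint32_t pgp_num, uint32_t pgp_num_mask)
{
	return sk_stable_mod(seed, pgp_num, pgp_num_mask) + (uint32_t)pool;
}
```

For three pools with pgp_num 8, pgs 0.5, 1.4 and 2.3 all produce the
same seed, which is the overlap the HASHPSPOOL flag is meant to avoid.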



* [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit
  2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
                   ` (31 preceding siblings ...)
  2014-03-27 18:18 ` [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() Ilya Dryomov
@ 2014-03-27 18:18 ` Ilya Dryomov
  2014-03-27 21:04   ` Alex Elder
  32 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-27 18:18 UTC (permalink / raw)
  To: ceph-devel

Announce our support for osdmaps with non-default primary affinity
values.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/ceph_features.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 7a4cab50b2cd..d12659ce550d 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -91,7 +91,8 @@ static inline u64 ceph_sanitize_features(u64 features)
 	 CEPH_FEATURE_CRUSH_V2 |		\
 	 CEPH_FEATURE_EXPORT_PEER |		\
 	 CEPH_FEATURE_OSDMAP_ENC |		\
-	 CEPH_FEATURE_CRUSH_TUNABLES3)
+	 CEPH_FEATURE_CRUSH_TUNABLES3 |		\
+	 CEPH_FEATURE_OSD_PRIMARY_AFFINITY)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
 	(CEPH_FEATURE_NOSRCADDR |	 \
-- 
1.7.10.4



* Re: [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show()
  2014-03-27 18:17 ` [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show() Ilya Dryomov
@ 2014-03-27 19:09   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:09 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> To make it more readable and save screen space.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/debugfs.c |   26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 258a382e75ed..d225842c7b41 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -53,34 +53,36 @@ static int osdmap_show(struct seq_file *s, void *p)
>  {
>  	int i;
>  	struct ceph_client *client = s->private;
> +	struct ceph_osdmap *map = client->osdc.osdmap;
>  	struct rb_node *n;
>  
> -	if (client->osdc.osdmap == NULL)
> +	if (map == NULL)
>  		return 0;
> -	seq_printf(s, "epoch %d\n", client->osdc.osdmap->epoch);
> +
> +	seq_printf(s, "epoch %d\n", map->epoch);
>  	seq_printf(s, "flags%s%s\n",
> -		   (client->osdc.osdmap->flags & CEPH_OSDMAP_NEARFULL) ?
> -		   " NEARFULL" : "",
> -		   (client->osdc.osdmap->flags & CEPH_OSDMAP_FULL) ?
> -		   " FULL" : "");
> -	for (n = rb_first(&client->osdc.osdmap->pg_pools); n; n = rb_next(n)) {
> +		   (map->flags & CEPH_OSDMAP_NEARFULL) ?  " NEARFULL" : "",
> +		   (map->flags & CEPH_OSDMAP_FULL) ?  " FULL" : "");
> +
> +	for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
>  		struct ceph_pg_pool_info *pool =
>  			rb_entry(n, struct ceph_pg_pool_info, node);
> +
>  		seq_printf(s, "pg_pool %llu pg_num %d / %d\n",
>  			   (unsigned long long)pool->id, pool->pg_num,
>  			   pool->pg_num_mask);
>  	}
> -	for (i = 0; i < client->osdc.osdmap->max_osd; i++) {
> -		struct ceph_entity_addr *addr =
> -			&client->osdc.osdmap->osd_addr[i];
> -		int state = client->osdc.osdmap->osd_state[i];
> +	for (i = 0; i < map->max_osd; i++) {
> +		struct ceph_entity_addr *addr = &map->osd_addr[i];
> +		int state = map->osd_state[i];
>  		char sb[64];
>  
>  		seq_printf(s, "\tosd%d\t%s\t%3d%%\t(%s)\n",
>  			   i, ceph_pr_addr(&addr->in_addr),
> -			   ((client->osdc.osdmap->osd_weight[i]*100) >> 16),
> +			   ((map->osd_weight[i]*100) >> 16),
>  			   ceph_osdmap_state_str(sb, sizeof(sb), state));
>  	}
> +
>  	return 0;
>  }
>  
> 



* Re: [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output
  2014-03-27 18:17 ` [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output Ilya Dryomov
@ 2014-03-27 19:10   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:10 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> To save screen space in anticipation of more fields (e.g. primary
> affinity).
> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

Looks good.

If there are lots of these little trivial transformations
they could probably be consolidated into a smaller set
of patches.

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  net/ceph/debugfs.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index d225842c7b41..112d98edb156 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -77,7 +77,7 @@ static int osdmap_show(struct seq_file *s, void *p)
>  		int state = map->osd_state[i];
>  		char sb[64];
>  
> -		seq_printf(s, "\tosd%d\t%s\t%3d%%\t(%s)\n",
> +		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\n",
>  			   i, ceph_pr_addr(&addr->in_addr),
>  			   ((map->osd_weight[i]*100) >> 16),
>  			   ceph_osdmap_state_str(sb, sizeof(sb), state));
> 



* Re: [PATCH 03/33] libceph: dump pg_temp mappings to debugfs
  2014-03-27 18:17 ` [PATCH 03/33] libceph: dump pg_temp mappings to debugfs Ilya Dryomov
@ 2014-03-27 19:11   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:11 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
> one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g.:
> 
>     pg_temp 2.6 [2,3,4]
> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

I didn't look at the broader context, but the new code looks good.

Reviewed-by: Alex Elder <elder@linaro.org>
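
For reference, the line format described in the commit message can be
reproduced in userspace roughly as follows.  This is only a sketch:
sk_format_pg_temp() is a hypothetical helper built on snprintf(),
whereas the patch itself writes through seq_printf().  As in the patch,
the pool is printed in decimal and the seed in hex.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "pg_temp <pool>.<seed> [<osd>,...,<osd>]" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int sk_format_pg_temp(char *buf, size_t size,
			     unsigned long long pool, unsigned int seed,
			     const int *osds, int len)
{
	int off;
	int i;

	off = snprintf(buf, size, "pg_temp %llu.%x [", pool, seed);
	for (i = 0; i < len; i++) {
		if (off < 0 || (size_t)off >= size)
			return -1;
		off += snprintf(buf + off, size - off, "%s%d",
				i == 0 ? "" : ",", osds[i]);
	}
	if (off < 0 || (size_t)off >= size)
		return -1;
	off += snprintf(buf + off, size - off, "]");

	return (size_t)off < size ? off : -1;
}
```

Called with pool 2, seed 6 and osds {2, 3, 4}, this yields the example
line from the commit message.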

> ---
>  net/ceph/debugfs.c |   11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 112d98edb156..c45d235e774e 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -82,6 +82,17 @@ static int osdmap_show(struct seq_file *s, void *p)
>  			   ((map->osd_weight[i]*100) >> 16),
>  			   ceph_osdmap_state_str(sb, sizeof(sb), state));
>  	}
> +	for (n = rb_first(&map->pg_temp); n; n = rb_next(n)) {
> +		struct ceph_pg_mapping *pg =
> +			rb_entry(n, struct ceph_pg_mapping, node);
> +
> +		seq_printf(s, "pg_temp %llu.%x [", pg->pgid.pool,
> +			   pg->pgid.seed);
> +		for (i = 0; i < pg->len; i++)
> +			seq_printf(s, "%s%d", (i == 0 ? "" : ","),
> +				   pg->osds[i]);
> +		seq_printf(s, "]\n");
> +	}
>  
>  	return 0;
>  }
> 



* Re: [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors
  2014-03-27 18:17 ` [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors Ilya Dryomov
@ 2014-03-27 19:15   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:15 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Dump osdmap in hex on both full and incremental decode errors, to make
> it easier to match the contents with error offset.  dout() map epoch
> and max_osd value on success.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 9d1aaa24def6..4dd000d128fd 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -690,6 +690,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  	u16 version;
>  	u32 len, max, i;
>  	int err = -EINVAL;
> +	u32 epoch = 0;
>  	void *start = *p;
>  	struct ceph_pg_pool_info *pi;
>  
> @@ -714,7 +715,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  
>  	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), bad);
>  	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
> -	map->epoch = ceph_decode_32(p);
> +	epoch = map->epoch = ceph_decode_32(p);
>  	ceph_decode_copy(p, &map->created, sizeof(map->created));
>  	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
>  
> @@ -814,14 +815,18 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  		goto bad;
>  	}
>  
> -	/* ignore the rest of the map */
> +	/* ignore the rest */
>  	*p = end;
>  
> -	dout("osdmap_decode done %p %p\n", *p, end);
> +	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
>  	return map;
>  
>  bad:
> -	dout("osdmap_decode fail err %d\n", err);
> +	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
> +	       err, epoch, (int)(*p - start), *p, start, end);
> +	print_hex_dump(KERN_DEBUG, "osdmap: ",
> +		       DUMP_PREFIX_OFFSET, 16, 1,
> +		       start, end - start, true);
>  	ceph_osdmap_destroy(map);
>  	return ERR_PTR(err);
>  }
> @@ -845,6 +850,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	int err = -EINVAL;
>  	u16 version;
>  
> +	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
> +
>  	ceph_decode_16_safe(p, end, version, bad);
>  	if (version != 6) {
>  		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
> @@ -1032,11 +1039,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  
>  	/* ignore the rest */
>  	*p = end;
> +
> +	dout("inc osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
>  	return map;
>  
>  bad:
> -	pr_err("corrupt inc osdmap epoch %d off %d (%p of %p-%p)\n",
> -	       epoch, (int)(*p - start), *p, start, end);
> +	pr_err("corrupt inc osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
> +	       err, epoch, (int)(*p - start), *p, start, end);
>  	print_hex_dump(KERN_DEBUG, "osdmap: ",
>  		       DUMP_PREFIX_OFFSET, 16, 1,
>  		       start, end - start, true);
> 



* Re: [PATCH 05/33] libceph: split osdmap allocation and decode steps
  2014-03-27 18:17 ` [PATCH 05/33] libceph: split osdmap allocation and decode steps Ilya Dryomov
@ 2014-03-27 19:18   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:18 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Split osdmap allocation and initialization into a separate function,
> ceph_osdmap_decode().

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |    2 +-
>  net/ceph/osd_client.c       |    2 +-
>  net/ceph/osdmap.c           |   44 ++++++++++++++++++++++++++++---------------
>  3 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 8c8b3cefc28b..46c3e304c3d8 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -156,7 +156,7 @@ static inline int ceph_decode_pgid(void **p, void *end, struct ceph_pg *pgid)
>  	return 0;
>  }
>  
> -extern struct ceph_osdmap *osdmap_decode(void **p, void *end);
> +extern struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end);
>  extern struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  					    struct ceph_osdmap *map,
>  					    struct ceph_messenger *msgr);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 71830d79b0f4..6f64eec18851 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2062,7 +2062,7 @@ void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>  			int skipped_map = 0;
>  
>  			dout("taking full map %u len %d\n", epoch, maplen);
> -			newmap = osdmap_decode(&p, p+maplen);
> +			newmap = ceph_osdmap_decode(&p, p+maplen);
>  			if (IS_ERR(newmap)) {
>  				err = PTR_ERR(newmap);
>  				goto bad;
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 4dd000d128fd..a82df6ea0749 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -684,9 +684,8 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  /*
>   * decode a full map.
>   */
> -struct ceph_osdmap *osdmap_decode(void **p, void *end)
> +static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  {
> -	struct ceph_osdmap *map;
>  	u16 version;
>  	u32 len, max, i;
>  	int err = -EINVAL;
> @@ -694,14 +693,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  	void *start = *p;
>  	struct ceph_pg_pool_info *pi;
>  
> -	dout("osdmap_decode %p to %p len %d\n", *p, end, (int)(end - *p));
> -
> -	map = kzalloc(sizeof(*map), GFP_NOFS);
> -	if (map == NULL)
> -		return ERR_PTR(-ENOMEM);
> -
> -	map->pg_temp = RB_ROOT;
> -	mutex_init(&map->crush_scratch_mutex);
> +	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
>  	ceph_decode_16_safe(p, end, version, bad);
>  	if (version > 6) {
> @@ -751,7 +743,6 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  	err = osdmap_set_max_osd(map, max);
>  	if (err < 0)
>  		goto bad;
> -	dout("osdmap_decode max_osd = %d\n", map->max_osd);
>  
>  	/* osds */
>  	err = -EINVAL;
> @@ -819,7 +810,7 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
>  	*p = end;
>  
>  	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
> -	return map;
> +	return 0;
>  
>  bad:
>  	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
> @@ -827,8 +818,31 @@ bad:
>  	print_hex_dump(KERN_DEBUG, "osdmap: ",
>  		       DUMP_PREFIX_OFFSET, 16, 1,
>  		       start, end - start, true);
> -	ceph_osdmap_destroy(map);
> -	return ERR_PTR(err);
> +	return err;
> +}
> +
> +/*
> + * Allocate and decode a full map.
> + */
> +struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end)
> +{
> +	struct ceph_osdmap *map;
> +	int ret;
> +
> +	map = kzalloc(sizeof(*map), GFP_NOFS);
> +	if (!map)
> +		return ERR_PTR(-ENOMEM);
> +
> +	map->pg_temp = RB_ROOT;
> +	mutex_init(&map->crush_scratch_mutex);
> +
> +	ret = osdmap_decode(p, end, map);
> +	if (ret) {
> +		ceph_osdmap_destroy(map);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return map;
>  }
>  
>  /*
> @@ -872,7 +886,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	if (len > 0) {
>  		dout("apply_incremental full map len %d, %p to %p\n",
>  		     len, *p, end);
> -		return osdmap_decode(p, min(*p+len, end));
> +		return ceph_osdmap_decode(p, min(*p+len, end));
>  	}
>  
>  	/* new crush? */
> 



* Re: [PATCH 06/33] libceph: fixup error handling in osdmap_decode()
  2014-03-27 18:17 ` [PATCH 06/33] libceph: fixup error handling in osdmap_decode() Ilya Dryomov
@ 2014-03-27 19:25   ` Alex Elder
  2014-03-28 14:56     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:25 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> The existing error handling scheme requires resetting err to -EINVAL
> prior to calling any ceph_decode_* macro.  This is ugly and fragile,
> and there already are a few places where we would return 0 on error,
> due to a missing reset.  Fix this by adding a special e_inval label to
> be used by all ceph_decode_* macros.

I don't see where it's returning 0 on error, but I think this
is a good change anyway.

I'd use "einval" or "err_inval" instead of "e_inval".  But
no matter.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>
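
To illustrate the two-label scheme from the commit message outside the
kernel: every bounds failure jumps to e_inval, which sets err and falls
through to bad, while paths that already know their error code jump to
bad directly.  The sketch below is an assumption-laden userspace
example; sk_decode_u32_array() and its host-endian wire format are
hypothetical and stand in for the ceph_decode_* macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode a host-endian u32 count followed by that many u32 values.
 * Hypothetical wire format, for illustration only.
 */
static int sk_decode_u32_array(const uint8_t *p, const uint8_t *end,
			       uint32_t **out, uint32_t *out_len)
{
	uint32_t *vals;
	uint32_t n;
	int err;

	/* short buffer: no need to set err at the call site */
	if ((size_t)(end - p) < sizeof(n))
		goto e_inval;
	memcpy(&n, p, sizeof(n));
	p += sizeof(n);

	/* count must fit in what is left of the buffer */
	if (n > (size_t)(end - p) / sizeof(*vals))
		goto e_inval;

	vals = malloc(n ? n * sizeof(*vals) : 1);
	if (!vals) {
		/* non-EINVAL errors set err explicitly */
		err = -ENOMEM;
		goto bad;
	}
	memcpy(vals, p, n * sizeof(*vals));

	*out = vals;
	*out_len = n;
	return 0;

e_inval:
	err = -EINVAL;
bad:
	return err;
}
```

Since err is never pre-loaded, there is no stale 0 left over from an
earlier successful call to return by accident.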


> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   53 +++++++++++++++++++++++++++--------------------------
>  1 file changed, 27 insertions(+), 26 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index a82df6ea0749..298d076eee89 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -688,36 +688,37 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  {
>  	u16 version;
>  	u32 len, max, i;
> -	int err = -EINVAL;
>  	u32 epoch = 0;
>  	void *start = *p;
> +	int err;
>  	struct ceph_pg_pool_info *pi;
>  
>  	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
> -	ceph_decode_16_safe(p, end, version, bad);
> +	ceph_decode_16_safe(p, end, version, e_inval);
>  	if (version > 6) {
>  		pr_warning("got unknown v %d > 6 of osdmap\n", version);
> -		goto bad;
> +		goto e_inval;
>  	}
>  	if (version < 6) {
>  		pr_warning("got old v %d < 6 of osdmap\n", version);
> -		goto bad;
> +		goto e_inval;
>  	}
>  
> -	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), bad);
> +	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), e_inval);
>  	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
>  	epoch = map->epoch = ceph_decode_32(p);
>  	ceph_decode_copy(p, &map->created, sizeof(map->created));
>  	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
>  
> -	ceph_decode_32_safe(p, end, max, bad);
> +	ceph_decode_32_safe(p, end, max, e_inval);
>  	while (max--) {
> -		ceph_decode_need(p, end, 8 + 2, bad);
> -		err = -ENOMEM;
> +		ceph_decode_need(p, end, 8 + 2, e_inval);
>  		pi = kzalloc(sizeof(*pi), GFP_NOFS);
> -		if (!pi)
> +		if (!pi) {
> +			err = -ENOMEM;
>  			goto bad;
> +		}
>  		pi->id = ceph_decode_64(p);
>  		err = __decode_pool(p, end, pi);
>  		if (err < 0) {
> @@ -728,27 +729,25 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	}
>  
>  	err = __decode_pool_names(p, end, map);
> -	if (err < 0) {
> -		dout("fail to decode pool names");
> +	if (err)
>  		goto bad;
> -	}
>  
> -	ceph_decode_32_safe(p, end, map->pool_max, bad);
> +	ceph_decode_32_safe(p, end, map->pool_max, e_inval);
>  
> -	ceph_decode_32_safe(p, end, map->flags, bad);
> +	ceph_decode_32_safe(p, end, map->flags, e_inval);
>  
>  	max = ceph_decode_32(p);
>  
>  	/* (re)alloc osd arrays */
>  	err = osdmap_set_max_osd(map, max);
> -	if (err < 0)
> +	if (err)
>  		goto bad;
>  
>  	/* osds */
> -	err = -EINVAL;
>  	ceph_decode_need(p, end, 3*sizeof(u32) +
>  			 map->max_osd*(1 + sizeof(*map->osd_weight) +
> -				       sizeof(*map->osd_addr)), bad);
> +				       sizeof(*map->osd_addr)), e_inval);
> +
>  	*p += 4; /* skip length field (should match max) */
>  	ceph_decode_copy(p, map->osd_state, map->max_osd);
>  
> @@ -762,7 +761,7 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		ceph_decode_addr(&map->osd_addr[i]);
>  
>  	/* pg_temp */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	for (i = 0; i < len; i++) {
>  		int n, j;
>  		struct ceph_pg pgid;
> @@ -771,16 +770,16 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		err = ceph_decode_pgid(p, end, &pgid);
>  		if (err)
>  			goto bad;
> -		ceph_decode_need(p, end, sizeof(u32), bad);
> +		ceph_decode_need(p, end, sizeof(u32), e_inval);
>  		n = ceph_decode_32(p);
> -		err = -EINVAL;
>  		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
> -			goto bad;
> -		ceph_decode_need(p, end, n * sizeof(u32), bad);
> -		err = -ENOMEM;
> +			goto e_inval;
> +		ceph_decode_need(p, end, n * sizeof(u32), e_inval);
>  		pg = kmalloc(sizeof(*pg) + n*sizeof(u32), GFP_NOFS);
> -		if (!pg)
> +		if (!pg) {
> +			err = -ENOMEM;
>  			goto bad;
> +		}
>  		pg->pgid = pgid;
>  		pg->len = n;
>  		for (j = 0; j < n; j++)
> @@ -794,10 +793,10 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	}
>  
>  	/* crush */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	dout("osdmap_decode crush len %d from off 0x%x\n", len,
>  	     (int)(*p - start));
> -	ceph_decode_need(p, end, len, bad);
> +	ceph_decode_need(p, end, len, e_inval);
>  	map->crush = crush_decode(*p, end);
>  	*p += len;
>  	if (IS_ERR(map->crush)) {
> @@ -812,6 +811,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	dout("full osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
>  	return 0;
>  
> +e_inval:
> +	err = -EINVAL;
>  bad:
>  	pr_err("corrupt full osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
>  	       err, epoch, (int)(*p - start), *p, start, end);
> 



* Re: [PATCH 07/33] libceph: safely decode max_osd value in osdmap_decode()
  2014-03-27 18:17 ` [PATCH 07/33] libceph: safely decode max_osd value " Ilya Dryomov
@ 2014-03-27 19:27   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:27 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> max_osd value is not covered by any ceph_decode_need().  Use a safe
> version of ceph_decode_* macro to decode it.

I know it's slightly more efficient, but I never liked those
ceph_decode_need() statements that added together a bunch
of things you're about to go decode...

Anyway, this is the right thing to do.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 298d076eee89..ec06010657b3 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -687,9 +687,10 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  {
>  	u16 version;
> -	u32 len, max, i;
>  	u32 epoch = 0;
>  	void *start = *p;
> +	u32 max;
> +	u32 len, i;
>  	int err;
>  	struct ceph_pg_pool_info *pi;
>  
> @@ -736,7 +737,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  
>  	ceph_decode_32_safe(p, end, map->flags, e_inval);
>  
> -	max = ceph_decode_32(p);
> +	/* max_osd */
> +	ceph_decode_32_safe(p, end, max, e_inval);
>  
>  	/* (re)alloc osd arrays */
>  	err = osdmap_set_max_osd(map, max);
> 



* Re: [PATCH 08/33] libceph: assert length of osdmap osd arrays
  2014-03-27 18:17 ` [PATCH 08/33] libceph: assert length of osdmap osd arrays Ilya Dryomov
@ 2014-03-27 19:30   ` Alex Elder
  2014-03-28 14:57     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:30 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Assert length of osd_state, osd_weight and osd_addr arrays.  They
> should all have exactly max_osd elements after the call to
> osdmap_set_max_osd().

Since this function is allowed to fail, could these
conditions lead to returning an error code rather than
killing the machine?

You're testing incoming data (which you can't necessarily
trust), not a fundamental assumption of the code, so
a BUG() seems harsh.

Checking is absolutely the right thing to do.

Switch it to return an error if you can.  If you feel
BUG() is right, so be it.  Either way:

Reviewed-by: Alex Elder <elder@linaro.org>
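
A minimal userspace sketch of the error-returning alternative suggested
above, assuming a host-endian length prefix: sk_check_len() is a
hypothetical helper, not something from the patch, and in osdmap_decode()
the failure would presumably be routed to the e_inval label instead of
hitting BUG_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Consume a u32 length prefix at *p and compare it with the expected
 * element count.  Returns 0 on a match, -EINVAL on a short buffer or a
 * mismatch.  *p is advanced only when the prefix could be read.
 */
static int sk_check_len(const uint8_t **p, const uint8_t *end,
			uint32_t expected)
{
	uint32_t len;

	if ((size_t)(end - *p) < sizeof(len))
		return -EINVAL;

	memcpy(&len, *p, sizeof(len));
	*p += sizeof(len);

	return len == expected ? 0 : -EINVAL;
}
```

A corrupt or malicious map then fails the decode with an error code
rather than taking the machine down.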

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index ec06010657b3..19aca4d3c5dd 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -745,19 +745,19 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	if (err)
>  		goto bad;
>  
> -	/* osds */
> +	/* osd_state, osd_weight, osd_addrs->client_addr */
>  	ceph_decode_need(p, end, 3*sizeof(u32) +
>  			 map->max_osd*(1 + sizeof(*map->osd_weight) +
>  				       sizeof(*map->osd_addr)), e_inval);
>  
> -	*p += 4; /* skip length field (should match max) */
> +	BUG_ON(ceph_decode_32(p) != map->max_osd);
>  	ceph_decode_copy(p, map->osd_state, map->max_osd);
>  
> -	*p += 4; /* skip length field (should match max) */
> +	BUG_ON(ceph_decode_32(p) != map->max_osd);
>  	for (i = 0; i < map->max_osd; i++)
>  		map->osd_weight[i] = ceph_decode_32(p);
>  
> -	*p += 4; /* skip length field (should match max) */
> +	BUG_ON(ceph_decode_32(p) != map->max_osd);
>  	ceph_decode_copy(p, map->osd_addr, map->max_osd*sizeof(*map->osd_addr));
>  	for (i = 0; i < map->max_osd; i++)
>  		ceph_decode_addr(&map->osd_addr[i]);
> 



* Re: [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode()
  2014-03-27 18:17 ` [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode() Ilya Dryomov
@ 2014-03-27 19:45   ` Alex Elder
  2014-03-28 14:57     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:45 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> The size of the memory area fed to crush_decode() should be limited
> not only by osdmap end, but also by the crush map length.  Also, drop

You're also letting crush_decode() verify it has the buffer space
it needs internally, rather than checking it before making the call,
which is good.  (Though I guess you don't have to mention it.)

> unnecessary dout() (dout() in crush_decode() conveys the same info) and
> step past crush map only if it is decoded successfully.

I actually think crush_decode() should take a (void **)
instead, as its first argument and advance the pointer
by as much as it uses (like most of the other routines do).
That's a suggestion, but I don't really care, this is fine.

Reviewed-by: Alex Elder <elder@linaro.org>
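
The bound passed to crush_decode() in this patch is the smaller of "end
of the embedded crush map" and "end of the osdmap".  A userspace sketch
of that clamp, with the hypothetical helper sk_sub_end() standing in for
the min(*p + len, end) expression:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the end of a len-byte sub-buffer starting at p, clamped so it
 * never extends past the end of the enclosing buffer.
 */
static const uint8_t *sk_sub_end(const uint8_t *p, const uint8_t *end,
				 size_t len)
{
	size_t avail = (size_t)(end - p);

	return len < avail ? p + len : end;
}
```

The sketch compares lengths instead of pointers so that a huge len never
forms an out-of-range pointer; the decoder given the clamped end is then
responsible for noticing that the map is shorter than its header claims.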

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |    7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 19aca4d3c5dd..b70357adbdc0 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -796,16 +796,13 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  
>  	/* crush */
>  	ceph_decode_32_safe(p, end, len, e_inval);
> -	dout("osdmap_decode crush len %d from off 0x%x\n", len,
> -	     (int)(*p - start));
> -	ceph_decode_need(p, end, len, e_inval);
> -	map->crush = crush_decode(*p, end);
> -	*p += len;
> +	map->crush = crush_decode(*p, min(*p + len, end));
>  	if (IS_ERR(map->crush)) {
>  		err = PTR_ERR(map->crush);
>  		map->crush = NULL;
>  		goto bad;
>  	}
> +	*p += len;
>  
>  	/* ignore the rest */
>  	*p = end;
> 



* Re: [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental()
  2014-03-27 18:17 ` [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental() Ilya Dryomov
@ 2014-03-27 19:49   ` Alex Elder
  2014-03-28 14:58     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:49 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> The existing error handling scheme requires resetting err to -EINVAL
> prior to calling any ceph_decode_* macro.  This is ugly and fragile,
> and there already are a few places where we would return 0 on error,
> due to a missing reset.  Follow osdmap_decode() and fix this by adding
> a special e_inval label to be used by all ceph_decode_* macros.

Same comments as last time.  Otherwise, looks good.
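
The two-label scheme from the commit message can be sketched in
userspace as follows.  decode_32_safe() here is a hypothetical stand-in
for the ceph_decode_*_safe() macros (it reads host-endian values, unlike
the kernel helpers), and decode_values() is an invented example
function.  Bounds failures jump to e_inval, which is the only place
that sets -EINVAL; every other failure sets err itself and jumps to bad:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ceph_decode_32_safe(); host byte order. */
#define decode_32_safe(p, end, v, label)			\
	do {							\
		if ((size_t)((end) - *(p)) < sizeof(uint32_t))	\
			goto label;				\
		memcpy(&(v), *(p), sizeof(uint32_t));		\
		*(p) += sizeof(uint32_t);			\
	} while (0)

/* Decode "count, then count values" into a freshly allocated array. */
static int decode_values(const uint8_t **p, const uint8_t *end,
			 uint32_t **out, uint32_t *out_n)
{
	uint32_t *vals = NULL;
	uint32_t n, i;
	int err;

	assert(*p <= end);

	decode_32_safe(p, end, n, e_inval);
	if (n > (size_t)(end - *p) / sizeof(uint32_t))
		goto e_inval;		/* count exceeds remaining input */

	vals = malloc(n ? n * sizeof(*vals) : 1);
	if (!vals) {
		err = -ENOMEM;		/* non-EINVAL errors set err here */
		goto bad;
	}
	for (i = 0; i < n; i++)
		decode_32_safe(p, end, vals[i], e_inval);

	*out = vals;
	*out_n = n;
	return 0;

e_inval:
	err = -EINVAL;			/* single place that sets -EINVAL */
bad:
	free(vals);
	return err;
}
```

Because err is never pre-loaded, there is no stale value to forget to
reset before the next decode macro.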

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   66 +++++++++++++++++++++++++++--------------------------
>  1 file changed, 34 insertions(+), 32 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index b70357adbdc0..0fc29a930c06 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -861,19 +861,19 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	__s64 new_pool_max;
>  	__s32 new_flags, max;
>  	void *start = *p;
> -	int err = -EINVAL;
> +	int err;
>  	u16 version;
>  
>  	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
> -	ceph_decode_16_safe(p, end, version, bad);
> +	ceph_decode_16_safe(p, end, version, e_inval);
>  	if (version != 6) {
>  		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
> -		goto bad;
> +		goto e_inval;
>  	}
>  
>  	ceph_decode_need(p, end, sizeof(fsid)+sizeof(modified)+2*sizeof(u32),
> -			 bad);
> +			 e_inval);
>  	ceph_decode_copy(p, &fsid, sizeof(fsid));
>  	epoch = ceph_decode_32(p);
>  	BUG_ON(epoch != map->epoch+1);
> @@ -882,7 +882,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	new_flags = ceph_decode_32(p);
>  
>  	/* full map? */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	if (len > 0) {
>  		dout("apply_incremental full map len %d, %p to %p\n",
>  		     len, *p, end);
> @@ -890,13 +890,14 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new crush? */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	if (len > 0) {
> -		dout("apply_incremental new crush map len %d, %p to %p\n",
> -		     len, *p, end);
>  		newcrush = crush_decode(*p, min(*p+len, end));
> -		if (IS_ERR(newcrush))
> -			return ERR_CAST(newcrush);
> +		if (IS_ERR(newcrush)) {
> +			err = PTR_ERR(newcrush);
> +			newcrush = NULL;
> +			goto bad;
> +		}
>  		*p += len;
>  	}
>  
> @@ -906,13 +907,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	if (new_pool_max >= 0)
>  		map->pool_max = new_pool_max;
>  
> -	ceph_decode_need(p, end, 5*sizeof(u32), bad);
> +	ceph_decode_need(p, end, 5*sizeof(u32), e_inval);
>  
>  	/* new max? */
>  	max = ceph_decode_32(p);
>  	if (max >= 0) {
>  		err = osdmap_set_max_osd(map, max);
> -		if (err < 0)
> +		if (err)
>  			goto bad;
>  	}
>  
> @@ -926,11 +927,11 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new_pool */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		struct ceph_pg_pool_info *pi;
>  
> -		ceph_decode_64_safe(p, end, pool, bad);
> +		ceph_decode_64_safe(p, end, pool, e_inval);
>  		pi = __lookup_pg_pool(&map->pg_pools, pool);
>  		if (!pi) {
>  			pi = kzalloc(sizeof(*pi), GFP_NOFS);
> @@ -947,29 +948,28 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  	if (version >= 5) {
>  		err = __decode_pool_names(p, end, map);
> -		if (err < 0)
> +		if (err)
>  			goto bad;
>  	}
>  
>  	/* old_pool */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		struct ceph_pg_pool_info *pi;
>  
> -		ceph_decode_64_safe(p, end, pool, bad);
> +		ceph_decode_64_safe(p, end, pool, e_inval);
>  		pi = __lookup_pg_pool(&map->pg_pools, pool);
>  		if (pi)
>  			__remove_pg_pool(&map->pg_pools, pi);
>  	}
>  
>  	/* new_up */
> -	err = -EINVAL;
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		u32 osd;
>  		struct ceph_entity_addr addr;
> -		ceph_decode_32_safe(p, end, osd, bad);
> -		ceph_decode_copy_safe(p, end, &addr, sizeof(addr), bad);
> +		ceph_decode_32_safe(p, end, osd, e_inval);
> +		ceph_decode_copy_safe(p, end, &addr, sizeof(addr), e_inval);
>  		ceph_decode_addr(&addr);
>  		pr_info("osd%d up\n", osd);
>  		BUG_ON(osd >= map->max_osd);
> @@ -978,11 +978,11 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new_state */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		u32 osd;
>  		u8 xorstate;
> -		ceph_decode_32_safe(p, end, osd, bad);
> +		ceph_decode_32_safe(p, end, osd, e_inval);
>  		xorstate = **(u8 **)p;
>  		(*p)++;  /* clean flag */
>  		if (xorstate == 0)
> @@ -994,10 +994,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new_weight */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		u32 osd, off;
> -		ceph_decode_need(p, end, sizeof(u32)*2, bad);
> +		ceph_decode_need(p, end, sizeof(u32)*2, e_inval);
>  		osd = ceph_decode_32(p);
>  		off = ceph_decode_32(p);
>  		pr_info("osd%d weight 0x%x %s\n", osd, off,
> @@ -1008,7 +1008,7 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new_pg_temp */
> -	ceph_decode_32_safe(p, end, len, bad);
> +	ceph_decode_32_safe(p, end, len, e_inval);
>  	while (len--) {
>  		struct ceph_pg_mapping *pg;
>  		int j;
> @@ -1018,22 +1018,22 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  		err = ceph_decode_pgid(p, end, &pgid);
>  		if (err)
>  			goto bad;
> -		ceph_decode_need(p, end, sizeof(u32), bad);
> +		ceph_decode_need(p, end, sizeof(u32), e_inval);
>  		pglen = ceph_decode_32(p);
>  		if (pglen) {
> -			ceph_decode_need(p, end, pglen*sizeof(u32), bad);
> +			ceph_decode_need(p, end, pglen*sizeof(u32), e_inval);
>  
>  			/* removing existing (if any) */
>  			(void) __remove_pg_mapping(&map->pg_temp, pgid);
>  
>  			/* insert */
> -			err = -EINVAL;
>  			if (pglen > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
> -				goto bad;
> -			err = -ENOMEM;
> +				goto e_inval;
>  			pg = kmalloc(sizeof(*pg) + sizeof(u32)*pglen, GFP_NOFS);
> -			if (!pg)
> +			if (!pg) {
> +				err = -ENOMEM;
>  				goto bad;
> +			}
>  			pg->pgid = pgid;
>  			pg->len = pglen;
>  			for (j = 0; j < pglen; j++)
> @@ -1057,6 +1057,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	dout("inc osdmap epoch %d max_osd %d\n", map->epoch, map->max_osd);
>  	return map;
>  
> +e_inval:
> +	err = -EINVAL;
>  bad:
>  	pr_err("corrupt inc osdmap (%d) epoch %d off %d (%p of %p-%p)\n",
>  	       err, epoch, (int)(*p - start), *p, start, end);
> 



* Re: [PATCH 11/33] libceph: nuke bogus encoding version check in osdmap_apply_incremental()
  2014-03-27 18:17 ` [PATCH 11/33] libceph: nuke bogus encoding version check " Ilya Dryomov
@ 2014-03-27 19:50   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:50 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Only version 6 of osdmap encoding is supported, anything other than
> version 6 results in an error and halts the decoding process.  Checking
> if version is >= 5 is therefore bogus.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |    9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 0fc29a930c06..75e192e99173 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -946,11 +946,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  		if (err < 0)
>  			goto bad;
>  	}
> -	if (version >= 5) {
> -		err = __decode_pool_names(p, end, map);
> -		if (err)
> -			goto bad;
> -	}
> +
> +	err = __decode_pool_names(p, end, map);
> +	if (err)
> +		goto bad;
>  
>  	/* old_pool */
>  	ceph_decode_32_safe(p, end, len, e_inval);
> 



* Re: [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes
  2014-03-27 18:17 ` [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes Ilya Dryomov
@ 2014-03-27 19:53   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:53 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> Sum up sizeof(...) results instead of (incorrectly) hard-coding the
> number of bytes, expressed in ints and longs.

Yay!!!

Looks good.
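
A small userspace sketch of why summing sizeof() per field is less
error-prone.  The struct names and sizes below (a 16-byte fsid, a
timestamp of two 32-bit fields) are assumptions made for the example,
not taken from the kernel headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field layouts, for illustration only. */
struct fsid { uint8_t bytes[16]; };
struct timespec32 { uint32_t sec, nsec; };

/* Bytes needed for: fsid, epoch, modified, new_pool_max, new_flags. */
static size_t inc_header_size(void)
{
	return sizeof(struct fsid) + sizeof(uint32_t) +
	       sizeof(struct timespec32) + sizeof(uint64_t) +
	       sizeof(uint32_t);
}

/* Nonzero if [p, end) is long enough to hold the whole header. */
static int have_inc_header(const uint8_t *p, const uint8_t *end)
{
	assert(p <= end);
	return (size_t)(end - p) >= inc_header_size();
}
```

Under these assumed sizes the sum is 40 bytes, one term per field in
decode order, so a field that is added or dropped shows up as a
mismatch between the comment, the sum and the decode calls.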

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 75e192e99173..6dd083906a1e 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -706,7 +706,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		goto e_inval;
>  	}
>  
> -	ceph_decode_need(p, end, 2*sizeof(u64)+6*sizeof(u32), e_inval);
> +	/* fsid, epoch, created, modified */
> +	ceph_decode_need(p, end, sizeof(map->fsid) + sizeof(u32) +
> +			 sizeof(map->created) + sizeof(map->modified), e_inval);
>  	ceph_decode_copy(p, &map->fsid, sizeof(map->fsid));
>  	epoch = map->epoch = ceph_decode_32(p);
>  	ceph_decode_copy(p, &map->created, sizeof(map->created));
> @@ -872,8 +874,9 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  		goto e_inval;
>  	}
>  
> -	ceph_decode_need(p, end, sizeof(fsid)+sizeof(modified)+2*sizeof(u32),
> -			 e_inval);
> +	/* fsid, epoch, modified, new_pool_max, new_flags */
> +	ceph_decode_need(p, end, sizeof(fsid) + sizeof(u32) + sizeof(modified) +
> +			 sizeof(u64) + sizeof(u32), e_inval);
>  	ceph_decode_copy(p, &fsid, sizeof(fsid));
>  	epoch = ceph_decode_32(p);
>  	BUG_ON(epoch != map->epoch+1);
> @@ -907,10 +910,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	if (new_pool_max >= 0)
>  		map->pool_max = new_pool_max;
>  
> -	ceph_decode_need(p, end, 5*sizeof(u32), e_inval);
> -
>  	/* new max? */
> -	max = ceph_decode_32(p);
> +	ceph_decode_32_safe(p, end, max, e_inval);
>  	if (max >= 0) {
>  		err = osdmap_set_max_osd(map, max);
>  		if (err)
> 



* Re: [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}()
  2014-03-27 18:17 ` [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() Ilya Dryomov
@ 2014-03-27 19:54   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:54 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
> To be in line with all the other osdmap decode helpers.

I wouldn't object to folding this into another patch; it
doesn't change anything functionally.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>


> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 6dd083906a1e..cd8f34abe7b7 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -506,7 +506,7 @@ static void __remove_pg_pool(struct rb_root *root, struct ceph_pg_pool_info *pi)
>  	kfree(pi);
>  }
>  
> -static int __decode_pool(void **p, void *end, struct ceph_pg_pool_info *pi)
> +static int decode_pool(void **p, void *end, struct ceph_pg_pool_info *pi)
>  {
>  	u8 ev, cv;
>  	unsigned len, num;
> @@ -587,7 +587,7 @@ bad:
>  	return -EINVAL;
>  }
>  
> -static int __decode_pool_names(void **p, void *end, struct ceph_osdmap *map)
> +static int decode_pool_names(void **p, void *end, struct ceph_osdmap *map)
>  {
>  	struct ceph_pg_pool_info *pi;
>  	u32 num, len;
> @@ -723,7 +723,7 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  			goto bad;
>  		}
>  		pi->id = ceph_decode_64(p);
> -		err = __decode_pool(p, end, pi);
> +		err = decode_pool(p, end, pi);
>  		if (err < 0) {
>  			kfree(pi);
>  			goto bad;
> @@ -731,7 +731,8 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		__insert_pg_pool(&map->pg_pools, pi);
>  	}
>  
> -	err = __decode_pool_names(p, end, map);
> +	/* pool_name */
> +	err = decode_pool_names(p, end, map);
>  	if (err)
>  		goto bad;
>  
> @@ -943,12 +944,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  			pi->id = pool;
>  			__insert_pg_pool(&map->pg_pools, pi);
>  		}
> -		err = __decode_pool(p, end, pi);
> +		err = decode_pool(p, end, pi);
>  		if (err < 0)
>  			goto bad;
>  	}
>  
> -	err = __decode_pool_names(p, end, map);
> +	/* new_pool_names */
> +	err = decode_pool_names(p, end, map);
>  	if (err)
>  		goto bad;
>  
> 



* Re: [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them
  2014-03-27 18:18 ` [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them Ilya Dryomov
@ 2014-03-27 19:56   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:56 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc
> map, same) decoding logic into a common helper and switch to it.

Nice refactoring.

Reviewed-by: Alex Elder <elder@linaro.org>


> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   94 ++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 57 insertions(+), 37 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index cd8f34abe7b7..d6a569c5508f 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -681,6 +681,55 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  	return 0;
>  }
>  
> +static int __decode_pools(void **p, void *end, struct ceph_osdmap *map,
> +			  bool incremental)
> +{
> +	u32 n;
> +
> +	ceph_decode_32_safe(p, end, n, e_inval);
> +	while (n--) {
> +		struct ceph_pg_pool_info *pi;
> +		u64 pool;
> +		int ret;
> +
> +		ceph_decode_64_safe(p, end, pool, e_inval);
> +
> +		pi = __lookup_pg_pool(&map->pg_pools, pool);
> +		if (!incremental || !pi) {
> +			pi = kzalloc(sizeof(*pi), GFP_NOFS);
> +			if (!pi)
> +				return -ENOMEM;
> +
> +			pi->id = pool;
> +
> +			ret = __insert_pg_pool(&map->pg_pools, pi);
> +			if (ret) {
> +				kfree(pi);
> +				return ret;
> +			}
> +		}
> +
> +		ret = decode_pool(p, end, pi);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
> +static int decode_pools(void **p, void *end, struct ceph_osdmap *map)
> +{
> +	return __decode_pools(p, end, map, false);
> +}
> +
> +static int decode_new_pools(void **p, void *end, struct ceph_osdmap *map)
> +{
> +	return __decode_pools(p, end, map, true);
> +}
> +
>  /*
>   * decode a full map.
>   */
> @@ -692,7 +741,6 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	u32 max;
>  	u32 len, i;
>  	int err;
> -	struct ceph_pg_pool_info *pi;
>  
>  	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
> @@ -714,22 +762,10 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	ceph_decode_copy(p, &map->created, sizeof(map->created));
>  	ceph_decode_copy(p, &map->modified, sizeof(map->modified));
>  
> -	ceph_decode_32_safe(p, end, max, e_inval);
> -	while (max--) {
> -		ceph_decode_need(p, end, 8 + 2, e_inval);
> -		pi = kzalloc(sizeof(*pi), GFP_NOFS);
> -		if (!pi) {
> -			err = -ENOMEM;
> -			goto bad;
> -		}
> -		pi->id = ceph_decode_64(p);
> -		err = decode_pool(p, end, pi);
> -		if (err < 0) {
> -			kfree(pi);
> -			goto bad;
> -		}
> -		__insert_pg_pool(&map->pg_pools, pi);
> -	}
> +	/* pools */
> +	err = decode_pools(p, end, map);
> +	if (err)
> +		goto bad;
>  
>  	/* pool_name */
>  	err = decode_pool_names(p, end, map);
> @@ -928,26 +964,10 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  		newcrush = NULL;
>  	}
>  
> -	/* new_pool */
> -	ceph_decode_32_safe(p, end, len, e_inval);
> -	while (len--) {
> -		struct ceph_pg_pool_info *pi;
> -
> -		ceph_decode_64_safe(p, end, pool, e_inval);
> -		pi = __lookup_pg_pool(&map->pg_pools, pool);
> -		if (!pi) {
> -			pi = kzalloc(sizeof(*pi), GFP_NOFS);
> -			if (!pi) {
> -				err = -ENOMEM;
> -				goto bad;
> -			}
> -			pi->id = pool;
> -			__insert_pg_pool(&map->pg_pools, pi);
> -		}
> -		err = decode_pool(p, end, pi);
> -		if (err < 0)
> -			goto bad;
> -	}
> +	/* new_pools */
> +	err = decode_new_pools(p, end, map);
> +	if (err)
> +		goto bad;
>  
>  	/* new_pool_names */
>  	err = decode_pool_names(p, end, map);
> 



* Re: [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc()
  2014-03-27 18:18 ` [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc() Ilya Dryomov
@ 2014-03-27 19:59   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 19:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Use krealloc() instead of rolling our own.  (krealloc() with a NULL
> first argument acts as a kmalloc()).  Properly initialize the new array
> elements.  This is needed to make future additions to osdmap easier.

Looks good.
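
The grow-and-initialize idea can be sketched in userspace with
realloc(), which also behaves like malloc() when given a NULL pointer.
struct osd_arrays, set_max() and OSD_OUT are hypothetical names for
this example.  The sketch only grows the arrays, and it stores each
pointer back as soon as its realloc() succeeds, which is a choice made
here and not something the patch does:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define OSD_OUT 0u	/* hypothetical stand-in for CEPH_OSD_OUT */

struct osd_arrays {
	uint8_t *state;
	uint32_t *weight;
	int max;
};

/* Grow both arrays to max entries; the new entries are initialized. */
static int set_max(struct osd_arrays *m, int max)
{
	uint8_t *state;
	uint32_t *weight;
	int i;

	assert(max > 0 && max >= m->max);

	state = realloc(m->state, max * sizeof(*state));
	if (!state)
		return -ENOMEM;
	m->state = state;	/* old pointer may be invalid now; commit */

	weight = realloc(m->weight, max * sizeof(*weight));
	if (!weight)
		return -ENOMEM;
	m->weight = weight;

	/* realloc() leaves the added tail uninitialized */
	for (i = m->max; i < max; i++) {
		state[i] = 0;
		weight[i] = OSD_OUT;
	}

	m->max = max;
	return 0;
}
```

If the second realloc() fails, m->max is left unchanged, so the arrays
are at least as long as the map claims and stay consistent.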

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index d6a569c5508f..4565c72fec5c 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -646,38 +646,40 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
>  }
>  
>  /*
> - * adjust max osd value.  reallocate arrays.
> + * Adjust max_osd value, (re)allocate arrays.
> + *
> + * The new elements are properly initialized.
>   */
>  static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  {
>  	u8 *state;
> -	struct ceph_entity_addr *addr;
>  	u32 *weight;
> +	struct ceph_entity_addr *addr;
> +	int i;
>  
> -	state = kcalloc(max, sizeof(*state), GFP_NOFS);
> -	addr = kcalloc(max, sizeof(*addr), GFP_NOFS);
> -	weight = kcalloc(max, sizeof(*weight), GFP_NOFS);
> -	if (state == NULL || addr == NULL || weight == NULL) {
> +	state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
> +	weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
> +	addr = krealloc(map->osd_addr, max*sizeof(*addr), GFP_NOFS);
> +	if (!state || !weight || !addr) {
>  		kfree(state);
> -		kfree(addr);
>  		kfree(weight);
> +		kfree(addr);
> +
>  		return -ENOMEM;
>  	}
>  
> -	/* copy old? */
> -	if (map->osd_state) {
> -		memcpy(state, map->osd_state, map->max_osd*sizeof(*state));
> -		memcpy(addr, map->osd_addr, map->max_osd*sizeof(*addr));
> -		memcpy(weight, map->osd_weight, map->max_osd*sizeof(*weight));
> -		kfree(map->osd_state);
> -		kfree(map->osd_addr);
> -		kfree(map->osd_weight);
> +	for (i = map->max_osd; i < max; i++) {
> +		state[i] = 0;
> +		weight[i] = CEPH_OSD_OUT;
> +		memset(addr + i, 0, sizeof(*addr));
>  	}
>  
>  	map->osd_state = state;
>  	map->osd_weight = weight;
>  	map->osd_addr = addr;
> +
>  	map->max_osd = max;
> +
>  	return 0;
>  }
>  
> 



* Re: [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them
  2014-03-27 18:18 ` [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them Ilya Dryomov
@ 2014-03-27 20:05   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:05 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp
> (inc map, same) decoding logic into a common helper and switch to it.

Again, it's nice to see this kind of refactoring being done.

Looks good.
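
The length check that guards the allocation in the helper can be shown
in isolation.  This is a userspace sketch with a hypothetical struct
pg_map and helper names.  It rejects any len for which
sizeof(*pg) + len * sizeof(u32) would not fit in an unsigned int,
before the size is ever computed:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapping with a flexible array of osd ids. */
struct pg_map {
	uint32_t len;
	int32_t osds[];
};

/* Allocate a zeroed mapping for len osds, or NULL if len is too big. */
static struct pg_map *pg_map_alloc(uint32_t len)
{
	struct pg_map *pg;

	/* same shape as the check in the patch: divide, don't multiply */
	if (len > (UINT_MAX - sizeof(*pg)) / sizeof(uint32_t))
		return NULL;

	pg = calloc(1, sizeof(*pg) + len * sizeof(uint32_t));
	if (pg)
		pg->len = len;
	return pg;
}

/* Store one osd id; the index must be within the allocated length. */
static void pg_map_set(struct pg_map *pg, uint32_t i, int32_t osd)
{
	assert(i < pg->len);
	pg->osds[i] = osd;
}
```

Dividing the limit by the element size keeps the comparison itself
free of overflow, whatever value len has.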

Reviewed-by: Alex Elder <elder@linaro.org>


> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |  139 ++++++++++++++++++++++++++---------------------------
>  1 file changed, 67 insertions(+), 72 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 4565c72fec5c..0134df3639d2 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -732,6 +732,67 @@ static int decode_new_pools(void **p, void *end, struct ceph_osdmap *map)
>  	return __decode_pools(p, end, map, true);
>  }
>  
> +static int __decode_pg_temp(void **p, void *end, struct ceph_osdmap *map,
> +			    bool incremental)
> +{
> +	u32 n;
> +
> +	ceph_decode_32_safe(p, end, n, e_inval);
> +	while (n--) {
> +		struct ceph_pg pgid;
> +		u32 len, i;
> +		int ret;
> +
> +		ret = ceph_decode_pgid(p, end, &pgid);
> +		if (ret)
> +			return ret;
> +
> +		ceph_decode_32_safe(p, end, len, e_inval);
> +
> +		ret = __remove_pg_mapping(&map->pg_temp, pgid);
> +		BUG_ON(!incremental && ret != -ENOENT);
> +
> +		if (!incremental || len > 0) {
> +			struct ceph_pg_mapping *pg;
> +
> +			ceph_decode_need(p, end, len*sizeof(u32), e_inval);
> +
> +			if (len > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
> +				return -EINVAL;
> +
> +			pg = kzalloc(sizeof(*pg) + len*sizeof(u32), GFP_NOFS);
> +			if (!pg)
> +				return -ENOMEM;
> +
> +			pg->pgid = pgid;
> +			pg->len = len;
> +			for (i = 0; i < len; i++)
> +				pg->osds[i] = ceph_decode_32(p);
> +
> +			ret = __insert_pg_mapping(pg, &map->pg_temp);
> +			if (ret) {
> +				kfree(pg);
> +				return ret;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
> +static int decode_pg_temp(void **p, void *end, struct ceph_osdmap *map)
> +{
> +	return __decode_pg_temp(p, end, map, false);
> +}
> +
> +static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
> +{
> +	return __decode_pg_temp(p, end, map, true);
> +}
> +
>  /*
>   * decode a full map.
>   */
> @@ -804,36 +865,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		ceph_decode_addr(&map->osd_addr[i]);
>  
>  	/* pg_temp */
> -	ceph_decode_32_safe(p, end, len, e_inval);
> -	for (i = 0; i < len; i++) {
> -		int n, j;
> -		struct ceph_pg pgid;
> -		struct ceph_pg_mapping *pg;
> -
> -		err = ceph_decode_pgid(p, end, &pgid);
> -		if (err)
> -			goto bad;
> -		ceph_decode_need(p, end, sizeof(u32), e_inval);
> -		n = ceph_decode_32(p);
> -		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
> -			goto e_inval;
> -		ceph_decode_need(p, end, n * sizeof(u32), e_inval);
> -		pg = kmalloc(sizeof(*pg) + n*sizeof(u32), GFP_NOFS);
> -		if (!pg) {
> -			err = -ENOMEM;
> -			goto bad;
> -		}
> -		pg->pgid = pgid;
> -		pg->len = n;
> -		for (j = 0; j < n; j++)
> -			pg->osds[j] = ceph_decode_32(p);
> -
> -		err = __insert_pg_mapping(pg, &map->pg_temp);
> -		if (err)
> -			goto bad;
> -		dout(" added pg_temp %lld.%x len %d\n", pgid.pool, pgid.seed,
> -		     len);
> -	}
> +	err = decode_pg_temp(p, end, map);
> +	if (err)
> +		goto bad;
>  
>  	/* crush */
>  	ceph_decode_32_safe(p, end, len, e_inval);
> @@ -1032,48 +1066,9 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	}
>  
>  	/* new_pg_temp */
> -	ceph_decode_32_safe(p, end, len, e_inval);
> -	while (len--) {
> -		struct ceph_pg_mapping *pg;
> -		int j;
> -		struct ceph_pg pgid;
> -		u32 pglen;
> -
> -		err = ceph_decode_pgid(p, end, &pgid);
> -		if (err)
> -			goto bad;
> -		ceph_decode_need(p, end, sizeof(u32), e_inval);
> -		pglen = ceph_decode_32(p);
> -		if (pglen) {
> -			ceph_decode_need(p, end, pglen*sizeof(u32), e_inval);
> -
> -			/* removing existing (if any) */
> -			(void) __remove_pg_mapping(&map->pg_temp, pgid);
> -
> -			/* insert */
> -			if (pglen > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
> -				goto e_inval;
> -			pg = kmalloc(sizeof(*pg) + sizeof(u32)*pglen, GFP_NOFS);
> -			if (!pg) {
> -				err = -ENOMEM;
> -				goto bad;
> -			}
> -			pg->pgid = pgid;
> -			pg->len = pglen;
> -			for (j = 0; j < pglen; j++)
> -				pg->osds[j] = ceph_decode_32(p);
> -			err = __insert_pg_mapping(pg, &map->pg_temp);
> -			if (err) {
> -				kfree(pg);
> -				goto bad;
> -			}
> -			dout(" added pg_temp %lld.%x len %d\n", pgid.pool,
> -			     pgid.seed, pglen);
> -		} else {
> -			/* remove */
> -			__remove_pg_mapping(&map->pg_temp, pgid);
> -		}
> -	}
> +	err = decode_new_pg_temp(p, end, map);
> +	if (err)
> +		goto bad;
>  
>  	/* ignore the rest */
>  	*p = end;
> 



* Re: [PATCH 17/33] libceph: introduce get_osdmap_client_data_v()
  2014-03-27 18:18 ` [PATCH 17/33] libceph: introduce get_osdmap_client_data_v() Ilya Dryomov
@ 2014-03-27 20:17   ` Alex Elder
  2014-03-28 14:59     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:17 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Full and incremental osdmaps are structured identically and have
> identical headers.  Add a helper to decode both "old" (16-bit version,
> v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
> encoding headers and switch to it.

It wasn't clear to me at first that this was adding a
new bit of functionality--support for v7 OSD map encodings.
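
How one helper can tell the two header styles apart can be sketched in
userspace.  peek_header_version() is a hypothetical, simplified
function: it only classifies the header and reports the wrapper's
first byte, whereas the helper in the patch goes on to read the
client data struct_v.  A first byte of 7 or more is taken as a
new-style struct_v; anything lower is re-read as the low byte of a
16-bit little-endian version:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 0 on success and sets *v to 0 for an old header (16-bit
 * version, which must be at least 6) or to the first byte for a
 * new-style header.  Returns -1 on short or too-old input.
 */
static int peek_header_version(const uint8_t *p, const uint8_t *end,
			       uint8_t *v)
{
	uint16_t version;

	assert(p <= end);
	if (end - p < 1)
		return -1;

	if (p[0] >= 7) {
		*v = p[0];		/* new encoding: 8-bit struct_v */
		return 0;
	}

	if (end - p < 2)
		return -1;
	version = (uint16_t)(p[0] | (p[1] << 8));	/* little-endian */
	if (version < 6)
		return -1;

	*v = 0;				/* old encoding */
	return 0;
}
```

The trick relies on old maps starting with the bytes 06 00, so the
first byte alone is enough to pick the branch.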

A couple comments below but this looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   81 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 0134df3639d2..ae96c73aff71 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -683,6 +683,63 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  	return 0;
>  }
>  
> +#define OSDMAP_WRAPPER_COMPAT_VER	7
> +#define OSDMAP_CLIENT_DATA_COMPAT_VER	1

Don't these definitions belong in a common header?

> +/*
> + * Return 0 or error.  On success, *v is set to 0 for old (v6) osdmaps,
> + * to struct_v of the client_data section for new (v7 and above)
> + * osdmaps.
> + */
> +static int get_osdmap_client_data_v(void **p, void *end,
> +				    const char *s, u8 *v)

I like to avoid one-character names (in this case, "s"),
mainly because they are hard to search for.

You could pass a Boolean "full" and use that to select
what's printed in the warning messages.
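
A minimal userspace sketch of that alternative, with a hypothetical
format_version_warning() that writes into a buffer instead of calling
pr_warning().  The Boolean selects the word used in the message, so
callers cannot pass an arbitrary string:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Format the "too old" warning for a full or incremental osdmap. */
static int format_version_warning(char *buf, size_t size, bool full,
				  int version)
{
	assert(buf && size > 0);
	return snprintf(buf, size, "got v %d < 6 of %s ceph_osdmap",
			version, full ? "full" : "inc");
}
```

The trade-off is that the string parameter extends to a third kind of
map without touching the helper, while the Boolean keeps the set of
messages fixed in one place.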

> +{
> +	u8 struct_v;
> +
> +	ceph_decode_8_safe(p, end, struct_v, e_inval);
> +	if (struct_v >= 7) {
> +		u8 struct_compat;
> +
> +		ceph_decode_8_safe(p, end, struct_compat, e_inval);
> +		if (struct_compat > OSDMAP_WRAPPER_COMPAT_VER) {
> +			pr_warning("got v %d cv %d > %d of %s ceph_osdmap\n",
> +				   struct_v, struct_compat,
> +				   OSDMAP_WRAPPER_COMPAT_VER, s);
> +			return -EINVAL;
> +		}
> +		*p += 4; /* ignore wrapper struct_len */
> +
> +		ceph_decode_8_safe(p, end, struct_v, e_inval);
> +		ceph_decode_8_safe(p, end, struct_compat, e_inval);
> +		if (struct_compat > OSDMAP_CLIENT_DATA_COMPAT_VER) {
> +			pr_warning("got v %d cv %d > %d of %s ceph_osdmap client data\n",
> +				   struct_v, struct_compat,
> +				   OSDMAP_CLIENT_DATA_COMPAT_VER, s);
> +			return -EINVAL;
> +		}
> +		*p += 4; /* ignore client data struct_len */
> +	} else {
> +		u16 version;
> +
> +		*p -= 1;
> +		ceph_decode_16_safe(p, end, version, e_inval);
> +		if (version < 6) {
> +			pr_warning("got v %d < 6 of %s ceph_osdmap\n", version,
> +				   s);
> +			return -EINVAL;
> +		}
> +
> +		/* old osdmap encoding */
> +		struct_v = 0;
> +	}
> +
> +	*v = struct_v;
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
>  static int __decode_pools(void **p, void *end, struct ceph_osdmap *map,
>  			  bool incremental)
>  {
> @@ -798,7 +855,7 @@ static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
>   */
>  static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  {
> -	u16 version;
> +	u8 struct_v;
>  	u32 epoch = 0;
>  	void *start = *p;
>  	u32 max;
> @@ -807,15 +864,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  
>  	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
> -	ceph_decode_16_safe(p, end, version, e_inval);
> -	if (version > 6) {
> -		pr_warning("got unknown v %d > 6 of osdmap\n", version);
> -		goto e_inval;
> -	}
> -	if (version < 6) {
> -		pr_warning("got old v %d < 6 of osdmap\n", version);
> -		goto e_inval;
> -	}
> +	err = get_osdmap_client_data_v(p, end, "full", &struct_v);
> +	if (err)
> +		goto bad;
>  
>  	/* fsid, epoch, created, modified */
>  	ceph_decode_need(p, end, sizeof(map->fsid) + sizeof(u32) +
> @@ -937,15 +988,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	__s32 new_flags, max;
>  	void *start = *p;
>  	int err;
> -	u16 version;
> +	u8 struct_v;
>  
>  	dout("%s %p to %p len %d\n", __func__, *p, end, (int)(end - *p));
>  
> -	ceph_decode_16_safe(p, end, version, e_inval);
> -	if (version != 6) {
> -		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
> -		goto e_inval;
> -	}
> +	err = get_osdmap_client_data_v(p, end, "inc", &struct_v);
> +	if (err)
> +		goto bad;
>  
>  	/* fsid, epoch, modified, new_pool_max, new_flags */
>  	ceph_decode_need(p, end, sizeof(fsid) + sizeof(u32) + sizeof(modified) +
> 



* Re: [PATCH 19/33] libceph: primary_temp infrastructure
  2014-03-27 18:18 ` [PATCH 19/33] libceph: primary_temp infrastructure Ilya Dryomov
@ 2014-03-27 20:21   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:21 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Add primary_temp mappings infrastructure.  struct ceph_pg_mapping is
> overloaded, primary_temp mappings are stored in an rb-tree, rooted at
> ceph_osdmap, in a manner similar to pg_temp mappings.
> 
> Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
> one 'primary_temp <pgid> <osd>' per line, e.g.:
> 
>     primary_temp 2.6 4

So this just sets up the infrastructure, but doesn't
use it yet.  OK.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |    5 +++++
>  net/ceph/debugfs.c          |    7 +++++++
>  net/ceph/osdmap.c           |   10 +++++++++-
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 4837e58e3203..db4fb6322aae 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -66,6 +66,9 @@ struct ceph_pg_mapping {
>  			int len;
>  			int osds[];
>  		} pg_temp;
> +		struct {
> +			int osd;
> +		} primary_temp;
>  	};
>  };
>  
> @@ -83,6 +86,8 @@ struct ceph_osdmap {
>  	struct ceph_entity_addr *osd_addr;
>  
>  	struct rb_root pg_temp;
> +	struct rb_root primary_temp;
> +
>  	struct rb_root pg_pools;
>  	u32 pool_max;
>  
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 5865f2c9580a..612bf55e6a8b 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -93,6 +93,13 @@ static int osdmap_show(struct seq_file *s, void *p)
>  				   pg->pg_temp.osds[i]);
>  		seq_printf(s, "]\n");
>  	}
> +	for (n = rb_first(&map->primary_temp); n; n = rb_next(n)) {
> +		struct ceph_pg_mapping *pg =
> +			rb_entry(n, struct ceph_pg_mapping, node);
> +
> +		seq_printf(s, "primary_temp %llu.%x %d\n", pg->pgid.pool,
> +			   pg->pgid.seed, pg->primary_temp.osd);
> +	}
>  
>  	return 0;
>  }
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 401af78ad741..d78c3e5d60f7 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -343,7 +343,7 @@ bad:
>  
>  /*
>   * rbtree of pg_mapping for handling pg_temp (explicit mapping of pgid
> - * to a set of osds)
> + * to a set of osds) and primary_temp (explicit primary setting)
>   */
>  static int pgid_cmp(struct ceph_pg l, struct ceph_pg r)
>  {
> @@ -633,6 +633,13 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
>  		rb_erase(&pg->node, &map->pg_temp);
>  		kfree(pg);
>  	}
> +	while (!RB_EMPTY_ROOT(&map->primary_temp)) {
> +		struct ceph_pg_mapping *pg =
> +			rb_entry(rb_first(&map->primary_temp),
> +				 struct ceph_pg_mapping, node);
> +		rb_erase(&pg->node, &map->primary_temp);
> +		kfree(pg);
> +	}
>  	while (!RB_EMPTY_ROOT(&map->pg_pools)) {
>  		struct ceph_pg_pool_info *pi =
>  			rb_entry(rb_first(&map->pg_pools),
> @@ -960,6 +967,7 @@ struct ceph_osdmap *ceph_osdmap_decode(void **p, void *end)
>  		return ERR_PTR(-ENOMEM);
>  
>  	map->pg_temp = RB_ROOT;
> +	map->primary_temp = RB_ROOT;
>  	mutex_init(&map->crush_scratch_mutex);
>  
>  	ret = osdmap_decode(p, end, map);
> 



* Re: [PATCH 20/33] libceph: primary_temp decode bits
  2014-03-27 18:18 ` [PATCH 20/33] libceph: primary_temp decode bits Ilya Dryomov
@ 2014-03-27 20:23   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:23 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Add a common helper to decode both primary_temp (full map, map<pg_t,
> u32>) and new_primary_temp (inc map, same) and switch to it.

The code looks reasonable.  I'll have to assume
it's doing the decoding properly.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index d78c3e5d60f7..0ca7f36e88b4 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -857,6 +857,61 @@ static int decode_new_pg_temp(void **p, void *end, struct ceph_osdmap *map)
>  	return __decode_pg_temp(p, end, map, true);
>  }
>  
> +static int __decode_primary_temp(void **p, void *end, struct ceph_osdmap *map,
> +				 bool incremental)
> +{
> +	u32 n;
> +
> +	ceph_decode_32_safe(p, end, n, e_inval);
> +	while (n--) {
> +		struct ceph_pg pgid;
> +		u32 osd;
> +		int ret;
> +
> +		ret = ceph_decode_pgid(p, end, &pgid);
> +		if (ret)
> +			return ret;
> +
> +		ceph_decode_32_safe(p, end, osd, e_inval);
> +
> +		ret = __remove_pg_mapping(&map->primary_temp, pgid);
> +		BUG_ON(!incremental && ret != -ENOENT);
> +
> +		if (!incremental || osd != (u32)-1) {
> +			struct ceph_pg_mapping *pg;
> +
> +			pg = kzalloc(sizeof(*pg), GFP_NOFS);
> +			if (!pg)
> +				return -ENOMEM;
> +
> +			pg->pgid = pgid;
> +			pg->primary_temp.osd = osd;
> +
> +			ret = __insert_pg_mapping(pg, &map->primary_temp);
> +			if (ret) {
> +				kfree(pg);
> +				return ret;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
> +static int decode_primary_temp(void **p, void *end, struct ceph_osdmap *map)
> +{
> +	return __decode_primary_temp(p, end, map, false);
> +}
> +
> +static int decode_new_primary_temp(void **p, void *end,
> +				   struct ceph_osdmap *map)
> +{
> +	return __decode_primary_temp(p, end, map, true);
> +}
> +
>  /*
>   * decode a full map.
>   */
> @@ -927,6 +982,13 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  	if (err)
>  		goto bad;
>  
> +	/* primary_temp */
> +	if (struct_v >= 1) {
> +		err = decode_primary_temp(p, end, map);
> +		if (err)
> +			goto bad;
> +	}
> +
>  	/* crush */
>  	ceph_decode_32_safe(p, end, len, e_inval);
>  	map->crush = crush_decode(*p, min(*p + len, end));
> @@ -1127,6 +1189,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  	if (err)
>  		goto bad;
>  
> +	/* new_primary_temp */
> +	if (struct_v >= 1) {
> +		err = decode_new_primary_temp(p, end, map);
> +		if (err)
> +			goto bad;
> +	}
> +
>  	/* ignore the rest */
>  	*p = end;
>  
> 
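
A userspace sketch of the full vs. incremental semantics in
__decode_primary_temp() above, for readers who want to try the logic
outside the kernel.  It works on already-decoded (pgid, osd) entries
rather than the wire format, and it replaces the rb-tree with a small
fixed-size table.  All names here (primary_temp_table,
apply_primary_temp(), NO_OSD, ...) are hypothetical and are not part of
libceph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 16
#define NO_OSD ((uint32_t)-1)	/* "remove this mapping" in an inc map */

struct pgid {
	uint64_t pool;
	uint32_t seed;
};

struct primary_temp_entry {
	struct pgid pgid;
	uint32_t osd;
};

/* stand-in for the rb-tree rooted at ceph_osdmap */
struct primary_temp_table {
	struct primary_temp_entry e[MAX_MAPPINGS];
	size_t n;
};

static int table_find(const struct primary_temp_table *t, struct pgid pgid)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->e[i].pgid.pool == pgid.pool &&
		    t->e[i].pgid.seed == pgid.seed)
			return (int)i;
	return -1;
}

static void table_remove(struct primary_temp_table *t, struct pgid pgid)
{
	int i = table_find(t, pgid);

	if (i >= 0) {
		t->n--;
		t->e[i] = t->e[t->n];	/* move last entry into the hole */
	}
}

/*
 * Apply decoded entries.  Any existing mapping for the pgid is dropped
 * first; a new one is stored unless this is an incremental map and the
 * osd is NO_OSD.  Returns 0, or -1 if the table is full.
 */
static int apply_primary_temp(struct primary_temp_table *t,
			      const struct primary_temp_entry *in, size_t n,
			      bool incremental)
{
	for (size_t i = 0; i < n; i++) {
		table_remove(t, in[i].pgid);

		if (!incremental || in[i].osd != NO_OSD) {
			if (t->n == MAX_MAPPINGS)
				return -1;
			t->e[t->n++] = in[i];
		}
	}
	return 0;
}

/* returns the mapped osd id, or -1 if there is no mapping */
static int lookup_primary_temp(const struct primary_temp_table *t,
			       struct pgid pgid)
{
	int i = table_find(t, pgid);

	return i < 0 ? -1 : (int)t->e[i].osd;
}
```

Feeding it a full map with "2.6 -> 4" and then an incremental with
"2.6 -> NO_OSD" leaves no mapping for 2.6, which is the removal path
the (u32)-1 check in the patch is there for.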



* Re: [PATCH 21/33] libceph: primary_affinity infrastructure
  2014-03-27 18:18 ` [PATCH 21/33] libceph: primary_affinity infrastructure Ilya Dryomov
@ 2014-03-27 20:26   ` Alex Elder
  2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:26 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Add primary_affinity infrastructure.  primary_affinity values are
> stored in a max_osd-sized array, hanging off ceph_osdmap, similar to
> the osd_weight array.
> 
> Introduce {get,set}_primary_affinity() helpers, primarily to return
> CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
> abstract out osd_primary_affinity array allocation and initialization.

One comment about some constant definitions, but
this looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |    3 +++
>  include/linux/ceph/rados.h  |    4 ++++
>  net/ceph/debugfs.c          |    5 +++--
>  net/ceph/osdmap.c           |   47 +++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index db4fb6322aae..6e030cb3c9ca 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -88,6 +88,8 @@ struct ceph_osdmap {
>  	struct rb_root pg_temp;
>  	struct rb_root primary_temp;
>  
> +	u32 *osd_primary_affinity;
> +
>  	struct rb_root pg_pools;
>  	u32 pool_max;
>  
> @@ -134,6 +136,7 @@ static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
>  }
>  
>  extern char *ceph_osdmap_state_str(char *str, int len, int state);
> +extern u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd);
>  
>  static inline struct ceph_entity_addr *ceph_osd_addr(struct ceph_osdmap *map,
>  						     int osd)
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index 2caabef8d369..bb6f40c9cb0f 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -133,6 +133,10 @@ extern const char *ceph_osd_state_name(int s);
>  #define CEPH_OSD_IN  0x10000
>  #define CEPH_OSD_OUT 0
>  
> +/* osd primary-affinity.  fixed point value: 0x10000 == baseline */
> +#define CEPH_OSD_MAX_PRIMARY_AFFINITY 0x10000
> +#define CEPH_OSD_DEFAULT_PRIMARY_AFFINITY 0x10000
> +

It seems like these definitions may also belong in a
common header file.  However I know that in some cases
it's necessary to impose limits in the kernel where
none is enforced in user space.

>  /*
>   * osd map flag bits
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 612bf55e6a8b..34453a2b4b4d 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -77,10 +77,11 @@ static int osdmap_show(struct seq_file *s, void *p)
>  		int state = map->osd_state[i];
>  		char sb[64];
>  
> -		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\n",
> +		seq_printf(s, "osd%d\t%s\t%3d%%\t(%s)\t%3d%%\n",
>  			   i, ceph_pr_addr(&addr->in_addr),
>  			   ((map->osd_weight[i]*100) >> 16),
> -			   ceph_osdmap_state_str(sb, sizeof(sb), state));
> +			   ceph_osdmap_state_str(sb, sizeof(sb), state),
> +			   ((ceph_get_primary_affinity(map, i)*100) >> 16));
>  	}
>  	for (n = rb_first(&map->pg_temp); n; n = rb_next(n)) {
>  		struct ceph_pg_mapping *pg =
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 0ca7f36e88b4..538b8dd341e8 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -649,6 +649,7 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
>  	kfree(map->osd_state);
>  	kfree(map->osd_weight);
>  	kfree(map->osd_addr);
> +	kfree(map->osd_primary_affinity);
>  	kfree(map);
>  }
>  
> @@ -685,6 +686,20 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>  	map->osd_weight = weight;
>  	map->osd_addr = addr;
>  
> +	if (map->osd_primary_affinity) {
> +		u32 *affinity;
> +
> +		affinity = krealloc(map->osd_primary_affinity,
> +				    max*sizeof(*affinity), GFP_NOFS);
> +		if (!affinity)
> +			return -ENOMEM;
> +
> +		for (i = map->max_osd; i < max; i++)
> +			affinity[i] = CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
> +
> +		map->osd_primary_affinity = affinity;
> +	}
> +
>  	map->max_osd = max;
>  
>  	return 0;
> @@ -912,6 +927,38 @@ static int decode_new_primary_temp(void **p, void *end,
>  	return __decode_primary_temp(p, end, map, true);
>  }
>  
> +u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd)
> +{
> +	BUG_ON(osd >= map->max_osd);
> +
> +	if (!map->osd_primary_affinity)
> +		return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
> +
> +	return map->osd_primary_affinity[osd];
> +}
> +
> +static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
> +{
> +	BUG_ON(osd >= map->max_osd);
> +
> +	if (!map->osd_primary_affinity) {
> +		int i;
> +
> +		map->osd_primary_affinity = kmalloc(map->max_osd*sizeof(u32),
> +						    GFP_NOFS);
> +		if (!map->osd_primary_affinity)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < map->max_osd; i++)
> +			map->osd_primary_affinity[i] =
> +			    CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;
> +	}
> +
> +	map->osd_primary_affinity[osd] = aff;
> +
> +	return 0;
> +}
> +
>  /*
>   * decode a full map.
>   */
> 
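
For illustration, a userspace sketch of the lazy-allocation pattern
behind {get,set}_primary_affinity() and of the 16.16 fixed-point to
percent conversion used in the debugfs output.  malloc() and assert()
stand in for kmalloc() and BUG_ON(); struct osdmap_sketch and
affinity_percent() are made-up names, and the percent helper widens to
64 bits, which the kernel code does not do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_PRIMARY_AFFINITY 0x10000u	/* fixed point 1.0 */

struct osdmap_sketch {
	int max_osd;
	uint32_t *osd_primary_affinity;	/* NULL until the first set */
};

static uint32_t get_primary_affinity(const struct osdmap_sketch *map, int osd)
{
	assert(osd >= 0 && osd < map->max_osd);

	/* nothing was ever set: every osd has the default */
	if (!map->osd_primary_affinity)
		return DEFAULT_PRIMARY_AFFINITY;

	return map->osd_primary_affinity[osd];
}

static int set_primary_affinity(struct osdmap_sketch *map, int osd,
				uint32_t aff)
{
	assert(osd >= 0 && osd < map->max_osd);

	if (!map->osd_primary_affinity) {
		/* first set: allocate and fill with the default */
		map->osd_primary_affinity =
			malloc((size_t)map->max_osd * sizeof(uint32_t));
		if (!map->osd_primary_affinity)
			return -1;

		for (int i = 0; i < map->max_osd; i++)
			map->osd_primary_affinity[i] =
				DEFAULT_PRIMARY_AFFINITY;
	}

	map->osd_primary_affinity[osd] = aff;
	return 0;
}

/* 0x10000 -> 100, 0x8000 -> 50, 0 -> 0 */
static unsigned int affinity_percent(uint32_t aff)
{
	return (unsigned int)(((uint64_t)aff * 100) >> 16);
}
```

The point of the NULL check is that a map with no affinities set costs
no memory, while readers still see the default for every osd.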



* Re: [PATCH 22/33] libceph: primary_affinity decode bits
  2014-03-27 18:18 ` [PATCH 22/33] libceph: primary_affinity decode bits Ilya Dryomov
@ 2014-03-27 20:31   ` Alex Elder
  2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:31 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Add two helpers to decode primary_affinity (full map, vector<u32>) and
> new_primary_affinity (inc map, map<u32, u32>) and switch to them.

One comment below, but otherwise looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 538b8dd341e8..3ac2098972ea 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -959,6 +959,59 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
>  	return 0;
>  }
>  
> +static int decode_primary_affinity(void **p, void *end,
> +				   struct ceph_osdmap *map)
> +{
> +	u32 len, i;
> +
> +	ceph_decode_32_safe(p, end, len, e_inval);
> +	if (len == 0) {
> +		kfree(map->osd_primary_affinity);
> +		map->osd_primary_affinity = NULL;
> +		return 0;
> +	}
> +
> +	ceph_decode_need(p, end, map->max_osd*sizeof(u32), e_inval);
> +
> +	BUG_ON(len != map->max_osd);

BUG() here is too much; I think it should just return an error instead.
The test could be done earlier too, prior to ceph_decode_need().

> +	for (i = 0; i < map->max_osd; i++) {
> +		int ret;
> +
> +		ret = set_primary_affinity(map, i, ceph_decode_32(p));
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
> +static int decode_new_primary_affinity(void **p, void *end,
> +				       struct ceph_osdmap *map)
> +{
> +	u32 n;
> +
> +	ceph_decode_32_safe(p, end, n, e_inval);
> +	while (n--) {
> +		u32 osd, aff;
> +		int ret;
> +
> +		ceph_decode_32_safe(p, end, osd, e_inval);
> +		ceph_decode_32_safe(p, end, aff, e_inval);
> +
> +		ret = set_primary_affinity(map, osd, aff);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +
> +e_inval:
> +	return -EINVAL;
> +}
> +
>  /*
>   * decode a full map.
>   */
> @@ -1036,6 +1089,17 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  			goto bad;
>  	}
>  
> +	/* primary_affinity */
> +	if (struct_v >= 2) {
> +		err = decode_primary_affinity(p, end, map);
> +		if (err)
> +			goto bad;
> +	} else {
> +		/* XXX can this happen? */
> +		kfree(map->osd_primary_affinity);
> +		map->osd_primary_affinity = NULL;
> +	}
> +
>  	/* crush */
>  	ceph_decode_32_safe(p, end, len, e_inval);
>  	map->crush = crush_decode(*p, min(*p + len, end));
> @@ -1243,6 +1307,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
>  			goto bad;
>  	}
>  
> +	/* new_primary_affinity */
> +	if (struct_v >= 2) {
> +		err = decode_new_primary_affinity(p, end, map);
> +		if (err)
> +			goto bad;
> +	}
> +
>  	/* ignore the rest */
>  	*p = end;
>  
> 
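
One possible shape of the review suggestion above (check the length
first and return an error rather than BUG), written as a standalone
userspace sketch.  It decodes a length-prefixed vector of little-endian
u32 values from a byte buffer; decode_affinity_vector() and get_le32()
are hypothetical helpers, not the ceph_decode_*() macros, and this is
not the code that was eventually merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Decode "u32 len, then len u32 values" into out[].
 *
 * Returns the number of values decoded (0 means "no affinities set"),
 * or -1 if the buffer is short or len does not match max_osd.
 */
static int decode_affinity_vector(const uint8_t **p, const uint8_t *end,
				  uint32_t max_osd, uint32_t *out)
{
	uint32_t len, i;

	if ((size_t)(end - *p) < 4)
		return -1;
	len = get_le32(*p);
	*p += 4;

	if (len == 0)
		return 0;

	/* length mismatch is a malformed map, not a reason to crash */
	if (len != max_osd)
		return -1;

	if ((size_t)(end - *p) < (size_t)len * 4)
		return -1;

	for (i = 0; i < len; i++) {
		out[i] = get_le32(*p);
		*p += 4;
	}

	return (int)len;
}
```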



* Re: [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit
  2014-03-27 18:18 ` [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit Ilya Dryomov
@ 2014-03-27 20:32   ` Alex Elder
  2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:32 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Announce our support for "new" osdmap encoding.

Looks OK to me.  Isn't there a version of this OSD
map encoding?  Maybe there'll be a "newer" one someday?

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/ceph_features.h |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
> index 77c097fe9ea9..7a4cab50b2cd 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -90,6 +90,7 @@ static inline u64 ceph_sanitize_features(u64 features)
>  	 CEPH_FEATURE_OSD_CACHEPOOL |		\
>  	 CEPH_FEATURE_CRUSH_V2 |		\
>  	 CEPH_FEATURE_EXPORT_PEER |		\
> +	 CEPH_FEATURE_OSDMAP_ENC |		\
>  	 CEPH_FEATURE_CRUSH_TUNABLES3)
>  
>  #define CEPH_FEATURES_REQUIRED_DEFAULT   \
> 



* Re: [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
  2014-03-27 18:18 ` [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions Ilya Dryomov
@ 2014-03-27 20:33   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:33 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Sync up with ceph.git definitions.  Bring in ceph_osd_is_down().

Looks good.  (Though I didn't verify it matches Ceph's definitions...)

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |   14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 6e030cb3c9ca..0895797b9e28 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -125,9 +125,21 @@ static inline void ceph_oid_copy(struct ceph_object_id *dest,
>  	dest->name_len = src->name_len;
>  }
>  
> +static inline int ceph_osd_exists(struct ceph_osdmap *map, int osd)
> +{
> +	return osd >= 0 && osd < map->max_osd &&
> +	       (map->osd_state[osd] & CEPH_OSD_EXISTS);
> +}
> +
>  static inline int ceph_osd_is_up(struct ceph_osdmap *map, int osd)
>  {
> -	return (osd < map->max_osd) && (map->osd_state[osd] & CEPH_OSD_UP);
> +	return ceph_osd_exists(map, osd) &&
> +	       (map->osd_state[osd] & CEPH_OSD_UP);
> +}
> +
> +static inline int ceph_osd_is_down(struct ceph_osdmap *map, int osd)
> +{
> +	return !ceph_osd_is_up(map, osd);
>  }
>  
>  static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
> 
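
A small userspace sketch of what the new predicates buy: an osd id
that is negative, out of range, or lacks the "exists" bit is treated as
down instead of indexing the state array.  The flag values and the
(state, max_osd) parameters are assumptions made for this example; the
kernel versions take a struct ceph_osdmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* assumed flag values, for the sketch only */
#define OSD_EXISTS (1u << 0)
#define OSD_UP     (1u << 1)

static bool osd_exists(const uint8_t *state, int max_osd, int osd)
{
	/* range check comes first so state[] is never indexed blindly */
	return osd >= 0 && osd < max_osd && (state[osd] & OSD_EXISTS);
}

static bool osd_is_up(const uint8_t *state, int max_osd, int osd)
{
	return osd_exists(state, max_osd, osd) && (state[osd] & OSD_UP);
}

static bool osd_is_down(const uint8_t *state, int max_osd, int osd)
{
	return !osd_is_up(state, max_osd, osd);
}
```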



* Re: [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines
  2014-03-27 18:18 ` [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines Ilya Dryomov
@ 2014-03-27 20:34   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:34 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Bring in pg_pool_t::can_shift_osds() counterpart along with pool type
> defines.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |   12 ++++++++++++
>  include/linux/ceph/rados.h  |    5 +++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 0895797b9e28..4e28c1e5d62f 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -41,6 +41,18 @@ struct ceph_pg_pool_info {
>  	char *name;
>  };
>  
> +static inline bool ceph_can_shift_osds(struct ceph_pg_pool_info *pool)
> +{
> +	switch (pool->type) {
> +	case CEPH_POOL_TYPE_REP:
> +		return true;
> +	case CEPH_POOL_TYPE_EC:
> +		return false;
> +	default:
> +		BUG_ON(1);
> +	}
> +}
> +
>  struct ceph_object_locator {
>  	s64 pool;
>  };
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index bb6f40c9cb0f..f20e0d8a2155 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -81,8 +81,9 @@ struct ceph_pg_v1 {
>   */
>  #define CEPH_NOPOOL  ((__u64) (-1))  /* pool id not defined */
>  
> -#define CEPH_PG_TYPE_REP     1
> -#define CEPH_PG_TYPE_RAID4   2
> +#define CEPH_POOL_TYPE_REP     1
> +#define CEPH_POOL_TYPE_RAID4   2 /* never implemented */
> +#define CEPH_POOL_TYPE_EC      3
>  
>  /*
>   * stable_mod func is used to control number of placement groups.
> 



* Re: [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
  2014-03-27 18:18 ` [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers Ilya Dryomov
@ 2014-03-27 20:36   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:36 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> pg_to_raw_osds() helper for computing a raw (crush) set, which can
> contain non-existent and down osds.
> 
> raw_to_up_osds() helper for pruning non-existent and down osds from the
> raw set, therefore transforming it into an up set, and determining up
> primary.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 3ac2098972ea..ee095e07cf98 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1514,6 +1514,82 @@ static int *calc_pg_raw(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
>  }
>  
>  /*
> + * Calculate raw (crush) set for given pgid.
> + *
> + * Return raw set length, or error.
> + */
> +static int pg_to_raw_osds(struct ceph_osdmap *osdmap,
> +			  struct ceph_pg_pool_info *pool,
> +			  struct ceph_pg pgid, u32 pps, int *osds)
> +{
> +	int ruleno;
> +	int len;
> +
> +	/* crush */
> +	ruleno = crush_find_rule(osdmap->crush, pool->crush_ruleset,
> +				 pool->type, pool->size);
> +	if (ruleno < 0) {
> +		pr_err("no crush rule: pool %lld ruleset %d type %d size %d\n",
> +		       pgid.pool, pool->crush_ruleset, pool->type,
> +		       pool->size);
> +		return -ENOENT;
> +	}
> +
> +	len = do_crush(osdmap, ruleno, pps, osds,
> +		       min_t(int, pool->size, CEPH_PG_MAX_SIZE),
> +		       osdmap->osd_weight, osdmap->max_osd);
> +	if (len < 0) {
> +		pr_err("error %d from crush rule %d: pool %lld ruleset %d type %d size %d\n",
> +		       len, ruleno, pgid.pool, pool->crush_ruleset,
> +		       pool->type, pool->size);
> +		return len;
> +	}
> +
> +	return len;
> +}
> +
> +/*
> + * Given raw set, calculate up set and up primary.
> + *
> + * Return up set length.  *primary is set to up primary osd id, or -1
> + * if up set is empty.
> + */
> +static int raw_to_up_osds(struct ceph_osdmap *osdmap,
> +			  struct ceph_pg_pool_info *pool,
> +			  int *osds, int len, int *primary)
> +{
> +	int up_primary = -1;
> +	int i;
> +
> +	if (ceph_can_shift_osds(pool)) {
> +		int removed = 0;
> +
> +		for (i = 0; i < len; i++) {
> +			if (ceph_osd_is_down(osdmap, osds[i])) {
> +				removed++;
> +				continue;
> +			}
> +			if (removed)
> +				osds[i - removed] = osds[i];
> +		}
> +
> +		len -= removed;
> +		if (len > 0)
> +			up_primary = osds[0];
> +	} else {
> +		for (i = len - 1; i >= 0; i--) {
> +			if (ceph_osd_is_down(osdmap, osds[i]))
> +				osds[i] = CRUSH_ITEM_NONE;
> +			else
> +				up_primary = osds[i];
> +		}
> +	}
> +
> +	*primary = up_primary;
> +	return len;
> +}
> +
> +/*
>   * Return acting set for given pgid.
>   */
>  int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
> 
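
To make the two pruning modes in raw_to_up_osds() concrete, here is a
userspace sketch under simplifying assumptions: osd liveness is a plain
bool array, and can_shift is passed in directly instead of being
derived from the pool type.  ITEM_NONE is a stand-in for
CRUSH_ITEM_NONE; raw_to_up() and osd_is_down() are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

#define ITEM_NONE 0x7fffffff	/* stand-in for CRUSH_ITEM_NONE */

static bool osd_is_down(const bool *up, int max_osd, int osd)
{
	return osd < 0 || osd >= max_osd || !up[osd];
}

/*
 * Prune down osds from osds[0..len) in place.
 *
 * can_shift (replicated pools): down osds are squeezed out and the set
 * gets shorter.  !can_shift (erasure-coded pools): positions matter, so
 * down osds are replaced with ITEM_NONE and the length is unchanged.
 *
 * Returns the new length; *primary is the first up osd, or -1.
 */
static int raw_to_up(const bool *up, int max_osd, bool can_shift,
		     int *osds, int len, int *primary)
{
	int up_primary = -1;
	int i;

	assert(len >= 0);

	if (can_shift) {
		int removed = 0;

		for (i = 0; i < len; i++) {
			if (osd_is_down(up, max_osd, osds[i])) {
				removed++;
				continue;
			}
			if (removed)
				osds[i - removed] = osds[i];
		}

		len -= removed;
		if (len > 0)
			up_primary = osds[0];
	} else {
		/* walk backwards so the lowest up index wins */
		for (i = len - 1; i >= 0; i--) {
			if (osd_is_down(up, max_osd, osds[i]))
				osds[i] = ITEM_NONE;
			else
				up_primary = osds[i];
		}
	}

	*primary = up_primary;
	return len;
}
```

With osd1 down and a raw set of [1, 0, 3], the shifting mode yields
[0, 3] and the non-shifting mode yields [NONE, 0, 3]; both report osd0
as the up primary.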



* Re: [PATCH 27/33] libceph: introduce apply_temps() helper
  2014-03-27 18:18 ` [PATCH 27/33] libceph: introduce apply_temps() helper Ilya Dryomov
@ 2014-03-27 20:41   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:41 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> apply_temps() helper for applying various temporary mappings (at this
> point only pg_temp mappings) to the up set, therefore transforming it
> into an acting set.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index ee095e07cf98..6d418433d80d 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1590,6 +1590,58 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
>  }
>  
>  /*
> + * Given up set, apply pg_temp mapping.
> + *
> + * Return acting set length.  *primary is set to acting primary osd id,
> + * or -1 if acting set is empty.
> + */
> +static int apply_temps(struct ceph_osdmap *osdmap,
> +		       struct ceph_pg_pool_info *pool, struct ceph_pg pgid,
> +		       int *osds, int len, int *primary)
> +{
> +	struct ceph_pg_mapping *pg;
> +	int temp_len;
> +	int temp_primary;
> +	int i;
> +
> +	/* raw_pg -> pg */
> +	pgid.seed = ceph_stable_mod(pgid.seed, pool->pg_num,
> +				    pool->pg_num_mask);
> +
> +	/* pg_temp? */
> +	pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);
> +	if (pg) {
> +		temp_len = 0;
> +		temp_primary = -1;
> +
> +		for (i = 0; i < pg->pg_temp.len; i++) {
> +			if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
> +				if (ceph_can_shift_osds(pool))
> +					continue;
> +				else
> +					osds[temp_len++] = CRUSH_ITEM_NONE;
> +			} else {
> +				osds[temp_len++] = pg->pg_temp.osds[i];
> +			}
> +		}
> +
> +		/* apply pg_temp's primary */
> +		for (i = 0; i < temp_len; i++) {
> +			if (osds[i] != CRUSH_ITEM_NONE) {
> +				temp_primary = osds[i];
> +				break;
> +			}
> +		}
> +	} else {
> +		temp_len = len;
> +		temp_primary = *primary;
> +	}
> +
> +	*primary = temp_primary;
> +	return temp_len;
> +}
> +
> +/*
>   * Return acting set for given pgid.
>   */
>  int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
> 



* Re: [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers
  2014-03-27 18:18 ` [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers Ilya Dryomov
@ 2014-03-27 20:49   ` Alex Elder
  2014-03-28 15:02     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:49 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
> raw_to_up_osds() and apply_temps().

So that's why you have a temp map in each osdmap.
The result is pretty clean and you eliminate the
local rawosds array.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |    2 +-
>  net/ceph/osdmap.c           |   51 ++++++++++++++++++++++++++++++++-----------
>  2 files changed, 39 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 4e28c1e5d62f..b0c8f8490663 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -212,7 +212,7 @@ extern int ceph_oloc_oid_to_pg(struct ceph_osdmap *osdmap,
>  
>  extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap,
>  			       struct ceph_pg pgid,
> -			       int *acting);
> +			       int *osds);
>  extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
>  				struct ceph_pg pgid);
>  
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 6d418433d80d..1963623bd488 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1642,24 +1642,49 @@ static int apply_temps(struct ceph_osdmap *osdmap,
>  }
>  
>  /*
> - * Return acting set for given pgid.
> + * Calculate acting set for given pgid.
> + *
> + * Return acting set length, or error.
>   */
>  int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
> -			int *acting)
> +			int *osds)
>  {
> -	int rawosds[CEPH_PG_MAX_SIZE], *osds;
> -	int i, o, num = CEPH_PG_MAX_SIZE;
> +	struct ceph_pg_pool_info *pool;
> +	u32 pps;
> +	int len;
> +	int primary;
>  
> -	osds = calc_pg_raw(osdmap, pgid, rawosds, &num);
> -	if (!osds)
> -		return -1;
> +	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
> +	if (!pool)
> +		return 0;
>  
> -	/* primary is first up osd */
> -	o = 0;
> -	for (i = 0; i < num; i++)
> -		if (ceph_osd_is_up(osdmap, osds[i]))
> -			acting[o++] = osds[i];
> -	return o;
> +	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
> +		/* hash pool id and seed so that pool PGs do not overlap */
> +		pps = crush_hash32_2(CRUSH_HASH_RJENKINS1,
> +				     ceph_stable_mod(pgid.seed, pool->pgp_num,
> +						     pool->pgp_num_mask),
> +				     pgid.pool);
> +	} else {
> +		/*
> +		 * legacy ehavior: add ps and pool together.  this is

Typo "behavior"

> +		 * not a great approach because the PGs from each pool
> +		 * will overlap on top of each other: 0.5 == 1.4 ==
> +		 * 2.3 == ...
> +		 */
> +		pps = ceph_stable_mod(pgid.seed, pool->pgp_num,
> +				      pool->pgp_num_mask) +
> +			(unsigned)pgid.pool;
> +	}
> +
> +	len = pg_to_raw_osds(osdmap, pool, pgid, pps, osds);
> +	if (len < 0)
> +		return len;
> +
> +	len = raw_to_up_osds(osdmap, pool, osds, len, &primary);
> +
> +	len = apply_temps(osdmap, pool, pgid, osds, len, &primary);
> +
> +	return len;
>  }
>  
>  /*
> 
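
The overlap that the "legacy" comment in the patch describes
(0.5 == 1.4 == 2.3) can be reproduced with a few lines of userspace C.
This sketch covers only the non-HASHPSPOOL branch; the hashed branch is
left out because it depends on the CRUSH rjenkins hash.  stable_mod()
below is written from memory of ceph_stable_mod() and should be read as
an assumption, and legacy_pps() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map x into [0, b) where bmask is the smallest 2^n - 1 >= b - 1.
 * Unlike a plain modulo, most values keep their slot when b grows.
 */
static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
{
	if ((x & bmask) < b)
		return x & bmask;
	return x & (bmask >> 1);
}

/* legacy placement seed: folded seed plus pool id */
static uint32_t legacy_pps(uint64_t pool, uint32_t seed,
			   uint32_t pgp_num, uint32_t pgp_num_mask)
{
	return stable_mod(seed, pgp_num, pgp_num_mask) + (uint32_t)pool;
}
```

With pgp_num = 8 (mask 7), pgs 0.5, 1.4 and 2.3 all get a placement
seed of 5, so CRUSH would pick the same osds for all three.  Hashing
the pool id together with the seed is what avoids that.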



* Re: [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting()
  2014-03-27 18:18 ` [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting() Ilya Dryomov
@ 2014-03-27 20:50   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:50 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> In preparation for adding support for primary_temp, stop assuming
> primaryness: add a primary out parameter to ceph_calc_pg_acting() and
> change call sites accordingly.  Primary is now specified separately
> from the order of osds in the set.

And the primary is no longer going to be assumed to
be the first.  This is good to hear.

Reviewed-by: Alex Elder <elder@linaro.org>


> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osdmap.h |    2 +-
>  net/ceph/osd_client.c       |   10 ++++------
>  net/ceph/osdmap.c           |   20 ++++++++++++--------
>  3 files changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index b0c8f8490663..561ea896c657 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -212,7 +212,7 @@ extern int ceph_oloc_oid_to_pg(struct ceph_osdmap *osdmap,
>  
>  extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap,
>  			       struct ceph_pg pgid,
> -			       int *osds);
> +			       int *osds, int *primary);
>  extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
>  				struct ceph_pg pgid);
>  
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 6f64eec18851..b4157dc22199 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1333,7 +1333,7 @@ static int __map_request(struct ceph_osd_client *osdc,
>  {
>  	struct ceph_pg pgid;
>  	int acting[CEPH_PG_MAX_SIZE];
> -	int o = -1, num = 0;
> +	int num, o;
>  	int err;
>  	bool was_paused;
>  
> @@ -1346,11 +1346,9 @@ static int __map_request(struct ceph_osd_client *osdc,
>  	}
>  	req->r_pgid = pgid;
>  
> -	err = ceph_calc_pg_acting(osdc->osdmap, pgid, acting);
> -	if (err > 0) {
> -		o = acting[0];
> -		num = err;
> -	}
> +	num = ceph_calc_pg_acting(osdc->osdmap, pgid, acting, &o);
> +	if (num < 0)
> +		num = 0;
>  
>  	was_paused = req->r_paused;
>  	req->r_paused = __req_should_be_paused(osdc, req);
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 1963623bd488..7193b012ee02 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1644,19 +1644,21 @@ static int apply_temps(struct ceph_osdmap *osdmap,
>  /*
>   * Calculate acting set for given pgid.
>   *
> - * Return acting set length, or error.
> + * Return acting set length, or error.  *primary is set to acting
> + * primary osd id, or -1 if acting set is empty or on error.
>   */
>  int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
> -			int *osds)
> +			int *osds, int *primary)
>  {
>  	struct ceph_pg_pool_info *pool;
>  	u32 pps;
>  	int len;
> -	int primary;
>  
>  	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
> -	if (!pool)
> -		return 0;
> +	if (!pool) {
> +		*primary = -1;
> +		return -ENOENT;
> +	}
>  
>  	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
>  		/* hash pool id and seed so that pool PGs do not overlap */
> @@ -1677,12 +1679,14 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
>  	}
>  
>  	len = pg_to_raw_osds(osdmap, pool, pgid, pps, osds);
> -	if (len < 0)
> +	if (len < 0) {
> +		*primary = -1;
>  		return len;
> +	}
>  
> -	len = raw_to_up_osds(osdmap, pool, osds, len, &primary);
> +	len = raw_to_up_osds(osdmap, pool, osds, len, primary);
>  
> -	len = apply_temps(osdmap, pool, pgid, osds, len, &primary);
> +	len = apply_temps(osdmap, pool, pgid, osds, len, primary);
>  
>  	return len;
>  }
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 30/33] libceph: add support for primary_temp mappings
  2014-03-27 18:18 ` [PATCH 30/33] libceph: add support for primary_temp mappings Ilya Dryomov
@ 2014-03-27 20:51   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:51 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Change apply_temps() to override primary in the same way pg_temp
> overrides osd set.  primary_temp overrides pg_temp primary too.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 7193b012ee02..ed52b47d0ddb 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1590,7 +1590,7 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
>  }
>  
>  /*
> - * Given up set, apply pg_temp mapping.
> + * Given up set, apply pg_temp and primary_temp mappings.
>   *
>   * Return acting set length.  *primary is set to acting primary osd id,
>   * or -1 if acting set is empty.
> @@ -1637,6 +1637,11 @@ static int apply_temps(struct ceph_osdmap *osdmap,
>  		temp_primary = *primary;
>  	}
>  
> +	/* primary_temp? */
> +	pg = __lookup_pg_mapping(&osdmap->primary_temp, pgid);
> +	if (pg)
> +		temp_primary = pg->primary_temp.osd;
> +
>  	*primary = temp_primary;
>  	return temp_len;
>  }
> 
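
A minimal userspace sketch of the precedence described in the commit
message, not the kernel code: pg_temp replaces the whole osd set (and
with it the primary), then primary_temp overrides the primary only.
struct temp_set, MAX_ACTING and apply_temps_sketch() are hypothetical
names, and the filtering of down osds that a real implementation would
need is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTING 16	/* hypothetical cap on set size */

/* Hypothetical stand-in for a pg_temp entry. */
struct temp_set {
	const int *osds;
	int len;
};

/*
 * Apply overrides to an up set in place and return the new length.
 * pg_temp (if any) replaces the set, its first osd becoming primary;
 * primary_temp (if any) is applied last and so wins over both.
 */
static int apply_temps_sketch(int *osds, int len, int *primary,
			      const struct temp_set *pg_temp,
			      const int *primary_temp)
{
	int i;

	if (pg_temp && pg_temp->len > 0) {
		assert(pg_temp->len <= MAX_ACTING);
		for (i = 0; i < pg_temp->len; i++)
			osds[i] = pg_temp->osds[i];
		len = pg_temp->len;
		*primary = osds[0];
	}

	if (primary_temp)
		*primary = *primary_temp;

	return len;
}
```

With neither mapping present the up set and its primary pass through
unchanged.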


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 31/33] libceph: add support for osd primary affinity
  2014-03-27 18:18 ` [PATCH 31/33] libceph: add support for osd primary affinity Ilya Dryomov
@ 2014-03-27 20:59   ` Alex Elder
  2014-03-28 15:03     ` Ilya Dryomov
  0 siblings, 1 reply; 77+ messages in thread
From: Alex Elder @ 2014-03-27 20:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Respond to non-default primary_affinity values accordingly.  (Primary
> affinity allows the admin to shift 'primary responsibility' away from
> specific osds, effectively shifting around the read side of the
> workload and whatever overhead is incurred by peering and writes by
> virtue of being the primary).

The code looks good, I presume it matches the algorithm.
I have a few questions below but nothing serious.

Reviewed-by: Alex Elder <elder@linaro.org>

> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index ed52b47d0ddb..8c596a13c60f 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1589,6 +1589,72 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
>  	return len;
>  }
>  
> +static void apply_primary_affinity(struct ceph_osdmap *osdmap, u32 pps,
> +				   struct ceph_pg_pool_info *pool,
> +				   int *osds, int len, int *primary)
> +{
> +	int i;
> +	int pos = -1;
> +
> +	/*
> +	 * Do we have any non-default primary_affinity values for these
> +	 * osds?
> +	 */
> +	if (!osdmap->osd_primary_affinity)
> +		return;
> +
> +	for (i = 0; i < len; i++) {
> +		if (osds[i] != CRUSH_ITEM_NONE &&
> +		    osdmap->osd_primary_affinity[i] !=
> +					CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {
> +			break;
> +		}
> +	}
> +	if (i == len)
> +		return;

So if they're all DEFAULT_AFFINITY then you don't bother.

I'm trying to understand what happens if at least one is
DEFAULT and at least one is not DEFAULT.

> +
> +	/*
> +	 * Pick the primary.  Feed both the seed (for the pg) and the
> +	 * osd into the hash/rng so that a proportional fraction of an
> +	 * osd's pgs get rejected as primary.
> +	 */
> +	for (i = 0; i < len; i++) {
> +		int o;
> +		u32 a;

Maybe "osd" and "aff" for osd number and affinity values?

> +
> +		o = osds[i];
> +		if (o == CRUSH_ITEM_NONE)
> +			continue;
> +
> +		a = osdmap->osd_primary_affinity[o];
> +		if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&

So CEPH_OSD_MAX_PRIMARY_AFFINITY is actually one more than
the maximum allowed value, right?

> +		    (crush_hash32_2(CRUSH_HASH_RJENKINS1,
> +				    pps, o) >> 16) >= a) {
> +			/*
> +			 * We chose not to use this primary.  Note it
> +			 * anyway as a fallback in case we don't pick
> +			 * anyone else, but keep looking.
> +			 */
> +			if (pos < 0)
> +				pos = i;
> +		} else {
> +			pos = i;
> +			break;
> +		}
> +	}
> +	if (pos < 0)
> +		return;
> +
> +	*primary = osds[pos];
> +
> +	if (ceph_can_shift_osds(pool) && pos > 0) {
> +		/* move the new primary to the front */
> +		for (i = pos; i > 0; i--)
> +			osds[i] = osds[i - 1];
> +		osds[0] = *primary;
> +	}

So the first one *is* the primary, you just renumber them.
I see.

> +}
> +
>  /*
>   * Given up set, apply pg_temp and primary_temp mappings.
>   *
> @@ -1691,6 +1757,8 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
>  
>  	len = raw_to_up_osds(osdmap, pool, osds, len, primary);
>  
> +	apply_primary_affinity(osdmap, pps, pool, osds, len, primary);
> +
>  	len = apply_temps(osdmap, pool, pgid, osds, len, primary);
>  
>  	return len;
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
  2014-03-27 18:18 ` [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() Ilya Dryomov
@ 2014-03-27 21:04   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 21:04 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
> and get rid of the now unused calc_pg_raw().

I'll be honest, my review of this one isn't very
solid but it looks OK to me.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  net/ceph/osdmap.c |   79 +++--------------------------------------------------
>  1 file changed, 4 insertions(+), 75 deletions(-)
> 
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 8c596a13c60f..f0567d8ca683 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1449,71 +1449,6 @@ static int do_crush(struct ceph_osdmap *map, int ruleno, int x,
>  }
>  
>  /*
> - * Calculate raw osd vector for the given pgid.  Return pointer to osd
> - * array, or NULL on failure.
> - */
> -static int *calc_pg_raw(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
> -			int *osds, int *num)
> -{
> -	struct ceph_pg_mapping *pg;
> -	struct ceph_pg_pool_info *pool;
> -	int ruleno;
> -	int r;
> -	u32 pps;
> -
> -	pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
> -	if (!pool)
> -		return NULL;
> -
> -	/* pg_temp? */
> -	pgid.seed = ceph_stable_mod(pgid.seed, pool->pg_num,
> -				    pool->pg_num_mask);
> -	pg = __lookup_pg_mapping(&osdmap->pg_temp, pgid);
> -	if (pg) {
> -		*num = pg->pg_temp.len;
> -		return pg->pg_temp.osds;
> -	}
> -
> -	/* crush */
> -	ruleno = crush_find_rule(osdmap->crush, pool->crush_ruleset,
> -				 pool->type, pool->size);
> -	if (ruleno < 0) {
> -		pr_err("no crush rule pool %lld ruleset %d type %d size %d\n",
> -		       pgid.pool, pool->crush_ruleset, pool->type,
> -		       pool->size);
> -		return NULL;
> -	}
> -
> -	if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
> -		/* hash pool id and seed sothat pool PGs do not overlap */
> -		pps = crush_hash32_2(CRUSH_HASH_RJENKINS1,
> -				     ceph_stable_mod(pgid.seed, pool->pgp_num,
> -						     pool->pgp_num_mask),
> -				     pgid.pool);
> -	} else {
> -		/*
> -		 * legacy ehavior: add ps and pool together.  this is
> -		 * not a great approach because the PGs from each pool
> -		 * will overlap on top of each other: 0.5 == 1.4 ==
> -		 * 2.3 == ...
> -		 */
> -		pps = ceph_stable_mod(pgid.seed, pool->pgp_num,
> -				      pool->pgp_num_mask) +
> -			(unsigned)pgid.pool;
> -	}
> -	r = do_crush(osdmap, ruleno, pps, osds, min_t(int, pool->size, *num),
> -		     osdmap->osd_weight, osdmap->max_osd);
> -	if (r < 0) {
> -		pr_err("error %d from crush rule: pool %lld ruleset %d type %d"
> -		       " size %d\n", r, pgid.pool, pool->crush_ruleset,
> -		       pool->type, pool->size);
> -		return NULL;
> -	}
> -	*num = r;
> -	return osds;
> -}
> -
> -/*
>   * Calculate raw (crush) set for given pgid.
>   *
>   * Return raw set length, or error.
> @@ -1769,17 +1704,11 @@ int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
>   */
>  int ceph_calc_pg_primary(struct ceph_osdmap *osdmap, struct ceph_pg pgid)
>  {
> -	int rawosds[CEPH_PG_MAX_SIZE], *osds;
> -	int i, num = CEPH_PG_MAX_SIZE;
> +	int osds[CEPH_PG_MAX_SIZE];
> +	int primary;
>  
> -	osds = calc_pg_raw(osdmap, pgid, rawosds, &num);
> -	if (!osds)
> -		return -1;
> +	ceph_calc_pg_acting(osdmap, pgid, osds, &primary);
>  
> -	/* primary is first up osd */
> -	for (i = 0; i < num; i++)
> -		if (ceph_osd_is_up(osdmap, osds[i]))
> -			return osds[i];
> -	return -1;
> +	return primary;
>  }
>  EXPORT_SYMBOL(ceph_calc_pg_primary);
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit
  2014-03-27 18:18 ` [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit Ilya Dryomov
@ 2014-03-27 21:04   ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-27 21:04 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
> Announce our support for osdmaps with non-default primary affinity
> values.

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/ceph_features.h |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
> index 7a4cab50b2cd..d12659ce550d 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -91,7 +91,8 @@ static inline u64 ceph_sanitize_features(u64 features)
>  	 CEPH_FEATURE_CRUSH_V2 |		\
>  	 CEPH_FEATURE_EXPORT_PEER |		\
>  	 CEPH_FEATURE_OSDMAP_ENC |		\
> -	 CEPH_FEATURE_CRUSH_TUNABLES3)
> +	 CEPH_FEATURE_CRUSH_TUNABLES3 |		\
> +	 CEPH_FEATURE_OSD_PRIMARY_AFFINITY)
>  
>  #define CEPH_FEATURES_REQUIRED_DEFAULT   \
>  	(CEPH_FEATURE_NOSRCADDR |	 \
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 06/33] libceph: fixup error handling in osdmap_decode()
  2014-03-27 19:25   ` Alex Elder
@ 2014-03-28 14:56     ` Ilya Dryomov
  2014-03-28 16:22       ` Alex Elder
  0 siblings, 1 reply; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 14:56 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 9:25 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
>> The existing error handling scheme requires resetting err to -EINVAL
>> prior to calling any ceph_decode_* macro.  This is ugly and fragile,
>> and there already are a few places where we would return 0 on error,
>> due to a missing reset.  Fix this by adding a special e_inval label to
>> be used by all ceph_decode_* macros.
>
> I don't see where it's returning 0 on error, but I think this
> is a good change anyway.

Here:

        err = __decode_pool_names(p, end, map); <--
        if (err < 0) {
                dout("fail to decode pool names");
                goto bad;
        }

        ceph_decode_32_safe(p, end, map->pool_max, bad); <--

        ceph_decode_32_safe(p, end, map->flags, bad); <--

Or here (if at least one pg_temp mapping is present):

                err = __insert_pg_mapping(pg, &map->pg_temp); <--
                if (err)
                        goto bad;
                dout(" added pg_temp %lld.%x len %d\n", pgid.pool, pgid.seed,
                     len);
        }

        /* crush */
        ceph_decode_32_safe(p, end, len, bad); <--
        dout("osdmap_decode crush len %d from off 0x%x\n", len,
             (int)(*p - start));
        ceph_decode_need(p, end, len, bad); <--

And a lot more in osdmap_apply_incremental().  There are three ways out:

(1) a separate variable for helper retvals;
(2) resetting err to -EINVAL prior to each ceph_decode_* (if it's
    not already -EINVAL, of course);
(3) a separate e_inval label.

(3) is the only reasonable way to do this.  (1) leads to things like

    ret = foo_helper();
    if (ret) {
            err = ret;
            ...
    }

and (2) is error-prone and hardly maintainable.
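
Option (3) can be sketched in userspace as follows.  This is only an
illustration of the label layout, not the osdmap code: decode_32_safe(),
check_count() and decode_two() are hypothetical, and the reader copies
in host byte order for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bounds-checked reader, in the spirit of ceph_decode_32_safe(). */
#define decode_32_safe(p, end, v, bad)					\
	do {								\
		if (*(p) > (end) ||					\
		    (size_t)((end) - *(p)) < sizeof(uint32_t))		\
			goto bad;					\
		memcpy(&(v), *(p), sizeof(uint32_t));			\
		*(p) += sizeof(uint32_t);				\
	} while (0)

/* Hypothetical helper that reports its own error code. */
static int check_count(uint32_t n)
{
	return n > 1024 ? -ERANGE : 0;
}

static int decode_two(const unsigned char **p, const unsigned char *end,
		      uint32_t *count, uint32_t *flags)
{
	int err;

	assert(p && *p && end);

	decode_32_safe(p, end, *count, e_inval);

	err = check_count(*count);	/* err is 0 from here on success */
	if (err)
		goto bad;

	/* A short buffer here must not return the stale err == 0. */
	decode_32_safe(p, end, *flags, e_inval);

	return 0;

e_inval:
	err = -EINVAL;
bad:
	return err;
}
```

Decode macros jump to e_inval, helpers that already set err jump to
bad, and nothing in between has to remember to reset err.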

>
> I'd use "einval" or "err_inval" instead of "e_inval".  But
> no matter.

We already use "e_inval" in osd_client.c (OK, that was me), so I'll
keep it for consistency.  I could rename them all to "Einval" to make
them stand out though (I see some "Efoo" labels in fs/).

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 08/33] libceph: assert length of osdmap osd arrays
  2014-03-27 19:30   ` Alex Elder
@ 2014-03-28 14:57     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 14:57 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 9:30 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
>> Assert length of osd_state, osd_weight and osd_addr arrays.  They
>> should all have exactly max_osd elements after the call to
>> osdmap_set_max_osd().
>
> Since this function is allowed to fail, could these
> conditions lead to returning an error code rather than
> killing the machine?
>
> You're testing incoming data (which you can't necessarily
> trust), not a fundamental assumption of the code, so
> a BUG() seems harsh.
>
> Checking is absolutely the right thing to do.
>
> Switch it to return an error if you can.  If you feel
> BUG() is right, so be it.  Either way:

Changed to returning -EINVAL.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode()
  2014-03-27 19:45   ` Alex Elder
@ 2014-03-28 14:57     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 14:57 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 9:45 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
>> The size of the memory area fed to crush_decode() should be limited
>> not only by osdmap end, but also by the crush map length.  Also, drop
>
> You're also letting crush_decode() verify it has the buffer space
> it needs internally, rather than checking it before making the call,
> which is good.  (Though I guess you don't have to mention it.)

Yes.

>
>> unnecessary dout() (dout() in crush_decode() conveys the same info) and
>> step past crush map only if it is decoded successfully.
>
> I actually think crush_decode() should take a (void **)
> instead, as its first argument and advance the pointer
> by as much as it uses (like most of the other routines do).
> That's a suggestion, but I don't really care, this is fine.

Me too, and I considered it, but it's the only decode helper that takes
a (void *) and it even names it "pbyval", which suggests that it was
intentional, so I kept it the way it is.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental()
  2014-03-27 19:49   ` Alex Elder
@ 2014-03-28 14:58     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 14:58 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 9:49 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
>> The existing error handling scheme requires resetting err to -EINVAL
>> prior to calling any ceph_decode_* macro.  This is ugly and fragile,
>> and there already are a few places where we would return 0 on error,
>> due to a missing reset.  Follow osdmap_decode() and fix this by adding
>> a special e_inval label to be used by all ceph_decode_* macros.
>
> Same comments as last time.  Otherwise, looks good.

Replied to the osdmap_decode() comment.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 17/33] libceph: introduce get_osdmap_client_data_v()
  2014-03-27 20:17   ` Alex Elder
@ 2014-03-28 14:59     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 14:59 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:17 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Full and incremental osdmaps are structured identically and have
>> identical headers.  Add a helper to decode both "old" (16-bit version,
>> v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap
>> encoding headers and switch to it.
>
> It wasn't clear to me at first that this was adding a
> new bit of functionality--support for v7 OSD map encodings.
>
> A couple comments below but this looks good.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  net/ceph/osdmap.c |   81 ++++++++++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 65 insertions(+), 16 deletions(-)
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index 0134df3639d2..ae96c73aff71 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -683,6 +683,63 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
>>       return 0;
>>  }
>>
>> +#define OSDMAP_WRAPPER_COMPAT_VER    7
>> +#define OSDMAP_CLIENT_DATA_COMPAT_VER        1
>
> Don't these definitions belong in a common header?

I think it would be out of context in a common header.  It's just so
that the error string is updated whenever the actual integer changes,
a poor substitute for a piece of ceph.git DECODE magic.

>
>> +/*
>> + * Return 0 or error.  On success, *v is set to 0 for old (v6) osdmaps,
>> + * to struct_v of the client_data section for new (v7 and above)
>> + * osdmaps.
>> + */
>> +static int get_osdmap_client_data_v(void **p, void *end,
>> +                                 const char *s, u8 *v)
>
> I like to avoid one-character names (in this case, "s").
> Mainly because it's hard to search for them.
>
> You could pass a Boolean "full" and use that to select
> what's printed in the warning messages.

I was passing "incremental" bool at one point, similar to the other
helpers, but then changed it because it made it look like this function
does different things for full and incremental maps, when the whole
point is that their headers are the same.  I'll rename "s" to "prefix"
though.
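
A rough sketch of what such a shared header helper could look like in
userspace.  Several things here are assumptions rather than something
shown in the quoted hunk: that the first byte is enough to tell the two
encodings apart, that both struct_len fields can simply be skipped, and
that old maps below version 6 are rejected.  header_version_sketch() is
a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define WRAPPER_COMPAT_VER	7
#define CLIENT_DATA_COMPAT_VER	1

/*
 * Return 0 or -EINVAL.  On success *v is 0 for an old (16-bit version)
 * header, or the client data struct_v for a new one, and *p points
 * past the header.  Input is little-endian.
 */
static int header_version_sketch(const unsigned char **p,
				 const unsigned char *end, uint8_t *v)
{
	const unsigned char *q = *p;
	size_t left;

	assert(q <= end);
	left = (size_t)(end - q);
	if (left < 1)
		return -EINVAL;

	if (q[0] >= 7) {
		/*
		 * Assumed layout: wrapper struct_v(1) struct_compat(1)
		 * struct_len(4), then the client data section with the
		 * same three fields.
		 */
		if (left < 12)
			return -EINVAL;
		if (q[1] > WRAPPER_COMPAT_VER)
			return -EINVAL;
		if (q[7] > CLIENT_DATA_COMPAT_VER)
			return -EINVAL;
		*v = q[6];
		*p = q + 12;
	} else {
		uint16_t version;

		if (left < 2)
			return -EINVAL;
		version = (uint16_t)(q[0] | (q[1] << 8));
		if (version < 6)
			return -EINVAL;
		*v = 0;		/* old encoding */
		*p = q + 2;
	}

	return 0;
}
```

Nothing in it depends on whether the map is full or incremental, which
is the point made above about not passing a bool.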

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 21/33] libceph: primary_affinity infrastructure
  2014-03-27 20:26   ` Alex Elder
@ 2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 15:01 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:26 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Add primary_affinity infrastructure.  primary_affinity values are
>> stored in a max_osd-sized array, hanging off ceph_osdmap, similar to
>> an osd_weight array.
>>
>> Introduce {get,set}_primary_affinity() helpers, primarily to return
>> CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
>> abstract out osd_primary_affinity array allocation and initialization.
>
> One comment about some constant definitions, but
> this looks good.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  include/linux/ceph/osdmap.h |    3 +++
>>  include/linux/ceph/rados.h  |    4 ++++
>>  net/ceph/debugfs.c          |    5 +++--
>>  net/ceph/osdmap.c           |   47 +++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 57 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
>> index db4fb6322aae..6e030cb3c9ca 100644
>> --- a/include/linux/ceph/osdmap.h
>> +++ b/include/linux/ceph/osdmap.h
>> @@ -88,6 +88,8 @@ struct ceph_osdmap {
>>       struct rb_root pg_temp;
>>       struct rb_root primary_temp;
>>
>> +     u32 *osd_primary_affinity;
>> +
>>       struct rb_root pg_pools;
>>       u32 pool_max;
>>
>> @@ -134,6 +136,7 @@ static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
>>  }
>>
>>  extern char *ceph_osdmap_state_str(char *str, int len, int state);
>> +extern u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd);
>>
>>  static inline struct ceph_entity_addr *ceph_osd_addr(struct ceph_osdmap *map,
>>                                                    int osd)
>> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
>> index 2caabef8d369..bb6f40c9cb0f 100644
>> --- a/include/linux/ceph/rados.h
>> +++ b/include/linux/ceph/rados.h
>> @@ -133,6 +133,10 @@ extern const char *ceph_osd_state_name(int s);
>>  #define CEPH_OSD_IN  0x10000
>>  #define CEPH_OSD_OUT 0
>>
>> +/* osd primary-affinity.  fixed point value: 0x10000 == baseline */
>> +#define CEPH_OSD_MAX_PRIMARY_AFFINITY 0x10000
>> +#define CEPH_OSD_DEFAULT_PRIMARY_AFFINITY 0x10000
>> +
>
> It seems like these definitions may also belong in a
> common header file.  However I know that in some cases
> it's necessary to impose limits in the kernel where
> none is enforced in user space.

They are in a common header - linux/ceph/rados.h - and come from
userspace.  Primary affinity is somewhat similar to osd_weight values.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 22/33] libceph: primary_affinity decode bits
  2014-03-27 20:31   ` Alex Elder
@ 2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 15:01 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:31 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Add two helpers to decode primary_affinity (full map, vector<u32>) and
>> new_primary_affinity (inc map, map<u32, u32>) and switch to them.
>
> One comment below, but otherwise looks good.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  net/ceph/osdmap.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 71 insertions(+)
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index 538b8dd341e8..3ac2098972ea 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -959,6 +959,59 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
>>       return 0;
>>  }
>>
>> +static int decode_primary_affinity(void **p, void *end,
>> +                                struct ceph_osdmap *map)
>> +{
>> +     u32 len, i;
>> +
>> +     ceph_decode_32_safe(p, end, len, e_inval);
>> +     if (len == 0) {
>> +             kfree(map->osd_primary_affinity);
>> +             map->osd_primary_affinity = NULL;
>> +             return 0;
>> +     }
>> +
>> +     ceph_decode_need(p, end, map->max_osd*sizeof(u32), e_inval);
>> +
>> +     BUG_ON(len != map->max_osd);
>
> BUG() here is too much; I think it should just return an error instead.
> The test could be done earlier too, prior to ceph_decode_need().

Moved the test, changed to returning -EINVAL.
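
Roughly, the reordered checks could look like the userspace sketch
below.  It is an illustration under assumptions, not the revised patch:
struct map_sketch and decode_affinity_sketch() are hypothetical,
malloc()/free() stand in for the kernel allocators, and values are
copied in host byte order.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct map_sketch {
	uint32_t max_osd;
	uint32_t *aff;	/* NULL means every osd has the default affinity */
};

static int decode_affinity_sketch(const unsigned char **p,
				  const unsigned char *end,
				  struct map_sketch *map)
{
	uint32_t len, i;

	assert(*p <= end);
	if ((size_t)(end - *p) < sizeof(len))
		return -EINVAL;
	memcpy(&len, *p, sizeof(len));
	*p += sizeof(len);

	if (len == 0) {
		free(map->aff);
		map->aff = NULL;
		return 0;
	}

	/* Untrusted length: reject it instead of asserting. */
	if (len != map->max_osd)
		return -EINVAL;

	/* Only then make sure the payload is really there. */
	if ((size_t)(end - *p) / sizeof(uint32_t) < len)
		return -EINVAL;

	if (!map->aff) {
		map->aff = malloc(len * sizeof(uint32_t));
		if (!map->aff)
			return -ENOMEM;
	}
	for (i = 0; i < len; i++) {
		memcpy(&map->aff[i], *p, sizeof(uint32_t));
		*p += sizeof(uint32_t);
	}

	return 0;
}
```

A length that disagrees with max_osd is reported to the caller as
-EINVAL, so a malformed map fails to decode without taking the machine
down.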

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit
  2014-03-27 20:32   ` Alex Elder
@ 2014-03-28 15:01     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 15:01 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:32 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Announce our support for "new" osdmap encoding.
>
> Looks OK to me.  Isn't there a version of this OSD
> map encoding?  Maybe there'll be a "newer" one someday?

Reworded to "Announce our support for "new" (v7 - split and separately
versioned client and osd sections) osdmap encoding."

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers
  2014-03-27 20:49   ` Alex Elder
@ 2014-03-28 15:02     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 15:02 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:49 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
>> raw_to_up_osds() and apply_temps().
>
> So that's why you have a temp map in each osdmap.
> The result is pretty clean and you eliminate the
> local rawosds array.
>
> Looks good.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  include/linux/ceph/osdmap.h |    2 +-
>>  net/ceph/osdmap.c           |   51 ++++++++++++++++++++++++++++++++-----------
>>  2 files changed, 39 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
>> index 4e28c1e5d62f..b0c8f8490663 100644
>> --- a/include/linux/ceph/osdmap.h
>> +++ b/include/linux/ceph/osdmap.h
>> @@ -212,7 +212,7 @@ extern int ceph_oloc_oid_to_pg(struct ceph_osdmap *osdmap,
>>
>>  extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap,
>>                              struct ceph_pg pgid,
>> -                            int *acting);
>> +                            int *osds);
>>  extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
>>                               struct ceph_pg pgid);
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index 6d418433d80d..1963623bd488 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -1642,24 +1642,49 @@ static int apply_temps(struct ceph_osdmap *osdmap,
>>  }
>>
>>  /*
>> - * Return acting set for given pgid.
>> + * Calculate acting set for given pgid.
>> + *
>> + * Return acting set length, or error.
>>   */
>>  int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
>> -                     int *acting)
>> +                     int *osds)
>>  {
>> -     int rawosds[CEPH_PG_MAX_SIZE], *osds;
>> -     int i, o, num = CEPH_PG_MAX_SIZE;
>> +     struct ceph_pg_pool_info *pool;
>> +     u32 pps;
>> +     int len;
>> +     int primary;
>>
>> -     osds = calc_pg_raw(osdmap, pgid, rawosds, &num);
>> -     if (!osds)
>> -             return -1;
>> +     pool = __lookup_pg_pool(&osdmap->pg_pools, pgid.pool);
>> +     if (!pool)
>> +             return 0;
>>
>> -     /* primary is first up osd */
>> -     o = 0;
>> -     for (i = 0; i < num; i++)
>> -             if (ceph_osd_is_up(osdmap, osds[i]))
>> -                     acting[o++] = osds[i];
>> -     return o;
>> +     if (pool->flags & CEPH_POOL_FLAG_HASHPSPOOL) {
>> +             /* hash pool id and seed so that pool PGs do not overlap */
>> +             pps = crush_hash32_2(CRUSH_HASH_RJENKINS1,
>> +                                  ceph_stable_mod(pgid.seed, pool->pgp_num,
>> +                                                  pool->pgp_num_mask),
>> +                                  pgid.pool);
>> +     } else {
>> +             /*
>> +              * legacy ehavior: add ps and pool together.  this is
>
> Typo "behavior"

Fixed.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 31/33] libceph: add support for osd primary affinity
  2014-03-27 20:59   ` Alex Elder
@ 2014-03-28 15:03     ` Ilya Dryomov
  0 siblings, 0 replies; 77+ messages in thread
From: Ilya Dryomov @ 2014-03-28 15:03 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Thu, Mar 27, 2014 at 10:59 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/27/2014 01:18 PM, Ilya Dryomov wrote:
>> Respond to non-default primary_affinity values accordingly.  (Primary
>> affinity allows the admin to shift 'primary responsibility' away from
>> specific osds, effectively shifting around the read side of the
>> workload and whatever overhead is incurred by peering and writes by
>> virtue of being the primary).
>
> The code looks good, I presume it matches the algorithm.
> I have a few questions below but nothing serious.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  net/ceph/osdmap.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 68 insertions(+)
>>
>> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
>> index ed52b47d0ddb..8c596a13c60f 100644
>> --- a/net/ceph/osdmap.c
>> +++ b/net/ceph/osdmap.c
>> @@ -1589,6 +1589,72 @@ static int raw_to_up_osds(struct ceph_osdmap *osdmap,
>>       return len;
>>  }
>>
>> +static void apply_primary_affinity(struct ceph_osdmap *osdmap, u32 pps,
>> +                                struct ceph_pg_pool_info *pool,
>> +                                int *osds, int len, int *primary)
>> +{
>> +     int i;
>> +     int pos = -1;
>> +
>> +     /*
>> +      * Do we have any non-default primary_affinity values for these
>> +      * osds?
>> +      */
>> +     if (!osdmap->osd_primary_affinity)
>> +             return;
>> +
>> +     for (i = 0; i < len; i++) {
>> +             if (osds[i] != CRUSH_ITEM_NONE &&
>> +                 osdmap->osd_primary_affinity[i] !=
>> +                                     CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) {
>> +                     break;
>> +             }
>> +     }
>> +     if (i == len)
>> +             return;
>
> So if they're all DEFAULT_AFFINITY then you don't bother.

Exactly.

>
> I'm trying to understand what happens if at least one is
> DEFAULT and at least one is not DEFAULT.
>
>> +
>> +     /*
>> +      * Pick the primary.  Feed both the seed (for the pg) and the
>> +      * osd into the hash/rng so that a proportional fraction of an
>> +      * osd's pgs get rejected as primary.
>> +      */
>> +     for (i = 0; i < len; i++) {
>> +             int o;
>> +             u32 a;
>
> Maybe "osd" and "aff" for osd number and affinity values?

Done.

>
>> +
>> +             o = osds[i];
>> +             if (o == CRUSH_ITEM_NONE)
>> +                     continue;
>> +
>> +             a = osdmap->osd_primary_affinity[o];
>> +             if (a < CEPH_OSD_MAX_PRIMARY_AFFINITY &&
>
> So CEPH_OSD_MAX_PRIMARY_AFFINITY is actually one more than
> the maximum allowed value, right?

No, like I mentioned in my reply to another patch, primary affinity is
very similar to osd weights.  Conceptually, it's a floating point
value, [0..1].  If it's 1 (DEFAULT, and also MAX) crush output is left
intact and the first osd in the up set is primary.  If it's less than
1, a different osd in the up set is "preferred" for the primary role,
with appropriate probability.  If it's 0, that osd will never be
primary, not for a single pg, if possible of course.

And, similar to osd weights, primary affinity is serialized to a fixed
point value, [0..0x10000].  0x10000 === 1, hence the if (a < MAX).
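
A minimal sketch of the fixed-point comparison described above, for
readers following along.  The macro and function names are hypothetical
stand-ins rather than the kernel's definitions, and hash16 stands for
the top 16 bits of a 32-bit hash (the ">> 16" in the patch), so it
always lies in [0..0xffff].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the constants discussed in the thread. */
#define MAX_PRIMARY_AFFINITY     0x10000u	/* fixed-point 1.0 */
#define DEFAULT_PRIMARY_AFFINITY 0x10000u	/* same value as MAX */

/* Convert a conceptual affinity in [0.0, 1.0] to the fixed-point form. */
static uint32_t affinity_to_fixed(double aff)
{
	assert(aff >= 0.0 && aff <= 1.0);
	return (uint32_t)(aff * MAX_PRIMARY_AFFINITY);
}

/*
 * Is this osd rejected as primary for one pg?
 * An affinity of 0x10000 short-circuits the test and is never rejected;
 * an affinity of 0 is always rejected, because hash16 >= 0 always holds.
 */
static int rejected_as_primary(uint32_t aff, uint32_t hash16)
{
	assert(hash16 <= 0xffffu);
	return aff < MAX_PRIMARY_AFFINITY && hash16 >= aff;
}
```

Because hash16 never reaches 0x10000, the explicit (a < MAX) check is
not strictly needed for correctness in this sketch; it does let the
hash computation be skipped for osds with the default affinity.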

>
>> +                 (crush_hash32_2(CRUSH_HASH_RJENKINS1,
>> +                                 pps, o) >> 16) >= a) {
>> +                     /*
>> +                      * We chose not to use this primary.  Note it
>> +                      * anyway as a fallback in case we don't pick
>> +                      * anyone else, but keep looking.
>> +                      */
>> +                     if (pos < 0)
>> +                             pos = i;
>> +             } else {
>> +                     pos = i;
>> +                     break;
>> +             }
>> +     }
>> +     if (pos < 0)
>> +             return;
>> +
>> +     *primary = osds[pos];
>> +
>> +     if (ceph_can_shift_osds(pool) && pos > 0) {
>> +             /* move the new primary to the front */
>> +             for (i = pos; i > 0; i--)
>> +                     osds[i] = osds[i - 1];
>> +             osds[0] = *primary;
>> +     }
>
> So the first one *is* the primary, you just renumber them.
> I see.

Yeah, we still move it to the front, for replicated pgs.  However, if
primary_temp mapping for that pg exists, the primary will be whatever
that mapping says it is, and at that point osds won't be reshuffled no
matter what.
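
The selection loop and the move-to-front step can be sketched in
userspace roughly as follows.  This is an illustration under stated
assumptions, not the patch itself: toy_hash32_2() is a placeholder
mixer and not crush_hash32_2(), ITEM_NONE and MAX_AFFINITY are
hypothetical stand-ins for the kernel constants, and the function
returns the chosen primary (or -1) instead of writing through a
pointer.

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_NONE    0x7fffffff	/* stand-in for CRUSH_ITEM_NONE */
#define MAX_AFFINITY 0x10000u	/* fixed-point 1.0 */

/* Toy 32-bit mixer; a placeholder only, NOT crush_hash32_2(). */
static uint32_t toy_hash32_2(uint32_t a, uint32_t b)
{
	uint32_t h = (a * 0x9e3779b1u) ^ (b + 0x7f4a7c15u);

	h ^= h >> 15;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	return h;
}

/* affinity[] is indexed by osd id; osds[] is the up set for one pg. */
static int pick_primary(const uint32_t *affinity, uint32_t pps,
			int *osds, int len, int can_shift)
{
	int i, pos = -1, primary;

	assert(len >= 0);
	for (i = 0; i < len; i++) {
		int osd = osds[i];
		uint32_t aff;

		if (osd == ITEM_NONE)
			continue;

		aff = affinity[osd];
		if (aff < MAX_AFFINITY &&
		    (toy_hash32_2(pps, (uint32_t)osd) >> 16) >= aff) {
			/* rejected: keep the first such osd as a fallback */
			if (pos < 0)
				pos = i;
		} else {
			/* accepted: stop looking */
			pos = i;
			break;
		}
	}
	if (pos < 0)
		return -1;

	primary = osds[pos];
	if (can_shift && pos > 0) {
		/* move the new primary to the front, keeping the rest in order */
		for (i = pos; i > 0; i--)
			osds[i] = osds[i - 1];
		osds[0] = primary;
	}
	return primary;
}
```

With the up set {3, 1, 2} and osd 3 at affinity 0, osd 3 is noted as the
fallback, osd 1 is accepted, and a shiftable pool ends up with
{1, 3, 2}.  With can_shift == 0 the array is left untouched and only
the returned primary differs from osds[0].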

Thanks,

                Ilya


* Re: [PATCH 06/33] libceph: fixup error handling in osdmap_decode()
  2014-03-28 14:56     ` Ilya Dryomov
@ 2014-03-28 16:22       ` Alex Elder
  0 siblings, 0 replies; 77+ messages in thread
From: Alex Elder @ 2014-03-28 16:22 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

On 03/28/2014 09:56 AM, Ilya Dryomov wrote:
> On Thu, Mar 27, 2014 at 9:25 PM, Alex Elder <elder@ieee.org> wrote:
>> On 03/27/2014 01:17 PM, Ilya Dryomov wrote:
>>> The existing error handling scheme requires resetting err to -EINVAL
>>> prior to calling any ceph_decode_* macro.  This is ugly and fragile,
>>> and there already are a few places where we would return 0 on error,
>>> due to a missing reset.  Fix this by adding a special e_inval label to
>>> be used by all ceph_decode_* macros.
>>
>> I don't see where it's returning 0 on error, but I think this
>> is a good change anyway.
> 
> Here:
> 
>         err = __decode_pool_names(p, end, map); <--
>         if (err < 0) {
>                 dout("fail to decode pool names");
>                 goto bad;
>         }
> 
>         ceph_decode_32_safe(p, end, map->pool_max, bad); <--
> 
>         ceph_decode_32_safe(p, end, map->flags, bad); <--

Fragile indeed.

I don't particularly like encoding an assumed label in a
macro the way these *_safe() macros do either, but they do
the job and they're all over the place.  The biggest reason
is that it assumes something about context, but this is
another one: it makes things less obvious.  Oh well.
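
For illustration, the failure mode can be reproduced in a few lines of
userspace C.  DECODE_U8_SAFE() and helper_ok() are hypothetical
stand-ins for the ceph_decode_*_safe() macros and the helpers quoted in
this message; the only point is that a successful helper call leaves
err == 0, so a later bounds failure jumps to "bad" and reports success.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a *_safe() decode macro: jumps to a label on short input. */
#define DECODE_U8_SAFE(p, end, v, bad)		\
	do {					\
		if ((p) >= (end))		\
			goto bad;		\
		(v) = *(p)++;			\
	} while (0)

/* Stand-in for a helper that succeeds. */
static int helper_ok(void)
{
	return 0;
}

static int decode_fragile(const unsigned char *p, const unsigned char *end)
{
	int err = -EINVAL;
	unsigned char v;

	assert(p <= end);

	DECODE_U8_SAFE(p, end, v, bad);	/* err is still -EINVAL here */

	err = helper_ok();		/* success overwrites err with 0 */
	if (err < 0)
		goto bad;

	DECODE_U8_SAFE(p, end, v, bad);	/* short input: goto bad, err == 0 */
	(void)v;
	return 0;

bad:
	return err;
}
```

Given a one-byte buffer, decode_fragile() returns 0 even though the
second decode ran out of input, which is the "return 0 on error" case.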

> Or here (if at least one pg_temp mapping is present):
> 
>                 err = __insert_pg_mapping(pg, &map->pg_temp); <--
>                 if (err)
>                         goto bad;
>                 dout(" added pg_temp %lld.%x len %d\n", pgid.pool, pgid.seed,
>                      len);
>         }
> 
>         /* crush */
>         ceph_decode_32_safe(p, end, len, bad); <--
>         dout("osdmap_decode crush len %d from off 0x%x\n", len,
>              (int)(*p - start));
>         ceph_decode_need(p, end, len, bad); <--
> 
> And a lot more in osdmap_apply_incremental().  There are three ways out:
> 
> (1) a separate variable for helper retvals;
> (2) resetting err to -EINVAL prior to each ceph_decode_* (if it's
>     not already -EINVAL, of course);
> (3) a separate e_inval label.
> 
> (3) is the only reasonable way to do this.  (1) leads to things like
> 
>     ret = foo_helper();
>     if (ret) {
>             err = ret;
>             ...
>     }
> 
> and (2) is error-prone and hardly maintainable.

I agree, (3) is the right fix.
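
A sketch of what option (3) might look like in isolation, assuming the
same hypothetical DECODE_U8_SAFE() stand-in for the ceph_decode_*
macros and a made-up helper() that can fail with -ENOMEM.  Decode
macros always jump to e_inval, which sets err itself, while helper
failures jump to bad with their own error code intact.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a *_safe() decode macro: jumps to a label on short input. */
#define DECODE_U8_SAFE(p, end, v, label)	\
	do {					\
		if ((p) >= (end))		\
			goto label;		\
		(v) = *(p)++;			\
	} while (0)

/* Made-up helper: fails with -ENOMEM when asked to. */
static int helper(int fail)
{
	return fail ? -ENOMEM : 0;
}

static int decode_fixed(const unsigned char *p, const unsigned char *end,
			int helper_fails)
{
	int err;
	unsigned char v;

	assert(p <= end);

	DECODE_U8_SAFE(p, end, v, e_inval);

	err = helper(helper_fails);
	if (err)
		goto bad;		/* keep the helper's error code */

	DECODE_U8_SAFE(p, end, v, e_inval);
	(void)v;
	return 0;

e_inval:
	err = -EINVAL;			/* set in one place for all decode failures */
bad:
	return err;
}
```

No decode call site depends on the current value of err any more, so a
helper call added between two decodes cannot turn a truncated buffer
into a success.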

>> I'd use "einval" or "err_inval" instead of "e_inval".  But
>> no matter.
> 
> We already use "e_inval" in osd_client.c (OK, that was me), so I'll
> keep it for consistency.  I could rename them all to "Einval" to make
> them stand out though (I see some "Efoo" labels in fs/).

Not a big deal.  Beauty is in the eye of the beholder.

					-Alex
> Thanks,
> 
>                 Ilya
> 



Thread overview: 77+ messages
2014-03-27 18:17 [PATCH 00/33] OSDMAP_ENC, primary_temp, PRIMARY_AFFINITY Ilya Dryomov
2014-03-27 18:17 ` [PATCH 01/33] libceph: refer to osdmap directly in osdmap_show() Ilya Dryomov
2014-03-27 19:09   ` Alex Elder
2014-03-27 18:17 ` [PATCH 02/33] libceph: do not prefix osd lines with \t in debugfs output Ilya Dryomov
2014-03-27 19:10   ` Alex Elder
2014-03-27 18:17 ` [PATCH 03/33] libceph: dump pg_temp mappings to debugfs Ilya Dryomov
2014-03-27 19:11   ` Alex Elder
2014-03-27 18:17 ` [PATCH 04/33] libceph: dump osdmap and enhance output on decode errors Ilya Dryomov
2014-03-27 19:15   ` Alex Elder
2014-03-27 18:17 ` [PATCH 05/33] libceph: split osdmap allocation and decode steps Ilya Dryomov
2014-03-27 19:18   ` Alex Elder
2014-03-27 18:17 ` [PATCH 06/33] libceph: fixup error handling in osdmap_decode() Ilya Dryomov
2014-03-27 19:25   ` Alex Elder
2014-03-28 14:56     ` Ilya Dryomov
2014-03-28 16:22       ` Alex Elder
2014-03-27 18:17 ` [PATCH 07/33] libceph: safely decode max_osd value " Ilya Dryomov
2014-03-27 19:27   ` Alex Elder
2014-03-27 18:17 ` [PATCH 08/33] libceph: assert length of osdmap osd arrays Ilya Dryomov
2014-03-27 19:30   ` Alex Elder
2014-03-28 14:57     ` Ilya Dryomov
2014-03-27 18:17 ` [PATCH 09/33] libceph: fix crush_decode() call site in osdmap_decode() Ilya Dryomov
2014-03-27 19:45   ` Alex Elder
2014-03-28 14:57     ` Ilya Dryomov
2014-03-27 18:17 ` [PATCH 10/33] libceph: fixup error handling in osdmap_apply_incremental() Ilya Dryomov
2014-03-27 19:49   ` Alex Elder
2014-03-28 14:58     ` Ilya Dryomov
2014-03-27 18:17 ` [PATCH 11/33] libceph: nuke bogus encoding version check " Ilya Dryomov
2014-03-27 19:50   ` Alex Elder
2014-03-27 18:17 ` [PATCH 12/33] libceph: fix and clarify ceph_decode_need() sizes Ilya Dryomov
2014-03-27 19:53   ` Alex Elder
2014-03-27 18:17 ` [PATCH 13/33] libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() Ilya Dryomov
2014-03-27 19:54   ` Alex Elder
2014-03-27 18:18 ` [PATCH 14/33] libceph: introduce decode{,_new}_pools() and switch to them Ilya Dryomov
2014-03-27 19:56   ` Alex Elder
2014-03-27 18:18 ` [PATCH 15/33] libceph: switch osdmap_set_max_osd() to krealloc() Ilya Dryomov
2014-03-27 19:59   ` Alex Elder
2014-03-27 18:18 ` [PATCH 16/33] libceph: introduce decode{,_new}_pg_temp() and switch to them Ilya Dryomov
2014-03-27 20:05   ` Alex Elder
2014-03-27 18:18 ` [PATCH 17/33] libceph: introduce get_osdmap_client_data_v() Ilya Dryomov
2014-03-27 20:17   ` Alex Elder
2014-03-28 14:59     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 18/33] libceph: generalize ceph_pg_mapping Ilya Dryomov
2014-03-27 18:18 ` [PATCH 19/33] libceph: primary_temp infrastructure Ilya Dryomov
2014-03-27 20:21   ` Alex Elder
2014-03-27 18:18 ` [PATCH 20/33] libceph: primary_temp decode bits Ilya Dryomov
2014-03-27 20:23   ` Alex Elder
2014-03-27 18:18 ` [PATCH 21/33] libceph: primary_affinity infrastructure Ilya Dryomov
2014-03-27 20:26   ` Alex Elder
2014-03-28 15:01     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 22/33] libceph: primary_affinity decode bits Ilya Dryomov
2014-03-27 20:31   ` Alex Elder
2014-03-28 15:01     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 23/33] libceph: enable OSDMAP_ENC feature bit Ilya Dryomov
2014-03-27 20:32   ` Alex Elder
2014-03-28 15:01     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 24/33] libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions Ilya Dryomov
2014-03-27 20:33   ` Alex Elder
2014-03-27 18:18 ` [PATCH 25/33] libceph: ceph_can_shift_osds(pool) and pool type defines Ilya Dryomov
2014-03-27 20:34   ` Alex Elder
2014-03-27 18:18 ` [PATCH 26/33] libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers Ilya Dryomov
2014-03-27 20:36   ` Alex Elder
2014-03-27 18:18 ` [PATCH 27/33] libceph: introduce apply_temps() helper Ilya Dryomov
2014-03-27 20:41   ` Alex Elder
2014-03-27 18:18 ` [PATCH 28/33] libceph: switch ceph_calc_pg_acting() to new helpers Ilya Dryomov
2014-03-27 20:49   ` Alex Elder
2014-03-28 15:02     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 29/33] libceph: return primary from ceph_calc_pg_acting() Ilya Dryomov
2014-03-27 20:50   ` Alex Elder
2014-03-27 18:18 ` [PATCH 30/33] libceph: add support for primary_temp mappings Ilya Dryomov
2014-03-27 20:51   ` Alex Elder
2014-03-27 18:18 ` [PATCH 31/33] libceph: add support for osd primary affinity Ilya Dryomov
2014-03-27 20:59   ` Alex Elder
2014-03-28 15:03     ` Ilya Dryomov
2014-03-27 18:18 ` [PATCH 32/33] libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() Ilya Dryomov
2014-03-27 21:04   ` Alex Elder
2014-03-27 18:18 ` [PATCH 33/33] libceph: enable PRIMARY_AFFINITY feature bit Ilya Dryomov
2014-03-27 21:04   ` Alex Elder
