Crushmap Design Question

* Crushmap Design Question
@ 2013-01-08 20:20 Moore, Shawn M
  2013-01-09  0:53 ` Chen, Xiaoxi
  2013-01-10 21:34 ` Gregory Farnum
From: Moore, Shawn M @ 2013-01-08 20:20 UTC
  To: ceph-devel@vger.kernel.org

I have been testing Ceph for a little over a month now.  Our design goal is to have 3 datacenters in different buildings, all tied together over 10GbE.  Currently, 2 of the datacenters each have 10 servers, each serving 1 OSD.  The third has one large server with 16 SAS disks serving 8 OSDs.  Eventually we will add one more identical large server to the third datacenter.  I have told Ceph to keep 3 copies and tried to write the crushmap so that, as long as a majority of mons stay up, we could run off one datacenter's worth of OSDs.  In my testing, it doesn't quite work out this way...

Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)

I will put hopefully relevant files at the end of this email.

When all 28 osds are up, I get:
2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

When I fail a datacenter (including 1 of 3 mons) I eventually get:
2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)

At this point everything is still ok.  But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

Most VMs quit working.  "rbd ls" works, but "rados -p rbd ls" does not return a single line and the command hangs.  After a while (you can see from the timestamps) I end up here, and it stays this way:
2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)

I'm hoping I've done something wrong, so please advise.  Below are my configs.  If you need something more to help, just ask.

Normal output with all datacenters up.
# ceph osd tree
# id	weight	type name	up/down	reweight
-1	80	root default
-3	36		datacenter hok
-2	1			host blade151
0	1				osd.0	up	1	
-4	1			host blade152
1	1				osd.1	up	1	
-15	1			host blade153
2	1				osd.2	up	1	
-17	1			host blade154
3	1				osd.3	up	1	
-18	1			host blade155
4	1				osd.4	up	1	
-19	1			host blade159
5	1				osd.5	up	1	
-20	1			host blade160
6	1				osd.6	up	1	
-21	1			host blade161
7	1				osd.7	up	1	
-22	1			host blade162
8	1				osd.8	up	1	
-23	1			host blade163
9	1				osd.9	up	1	
-24	36		datacenter csc
-5	1			host admbc0-01
10	1				osd.10	up	1	
-6	1			host admbc0-02
11	1				osd.11	up	1	
-7	1			host admbc0-03
12	1				osd.12	up	1	
-8	1			host admbc0-04
13	1				osd.13	up	1	
-9	1			host admbc0-05
14	1				osd.14	up	1	
-10	1			host admbc0-06
15	1				osd.15	up	1	
-11	1			host admbc0-09
16	1				osd.16	up	1	
-12	1			host admbc0-10
17	1				osd.17	up	1	
-13	1			host admbc0-11
18	1				osd.18	up	1	
-14	1			host admbc0-12
19	1				osd.19	up	1	
-25	8		datacenter adm
-16	8			host admdisk0
20	1				osd.20	up	1	
21	1				osd.21	up	1	
22	1				osd.22	up	1	
23	1				osd.23	up	1	
24	1				osd.24	up	1	
25	1				osd.25	up	1	
26	1				osd.26	up	1	
27	1				osd.27	up	1



Showing copies set to 3.
# ceph osd dump | grep " size "
pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 63 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 65 owner 0
pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 6061 owner 0
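
The figures in the status lines and the pool dump above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python that only re-uses numbers quoted in this message and does not query a cluster. Reading 49086 as "object copies" (objects times rep size) is an assumption based on the 33.333% figure, not something the output states.

```python
# Cross-check of numbers copied from the pgmap lines and "ceph osd dump" output.
# Assumption: 49086 counts object copies, i.e. objects * rep size.

POOLS = {"data": 2368, "metadata": 2368, "rbd": 2368}  # pg_num per pool
REP_SIZE = 3                                            # "rep size 3"

# Total placement groups across the three pools.
total_pgs = sum(POOLS.values())

# Number of distinct objects implied by the copy count.
object_copies = 49086
objects = object_copies // REP_SIZE


def degraded_percent(degraded: int, total: int) -> float:
    """Return the degraded share as a percentage, rounded to 3 decimals."""
    return round(100.0 * degraded / total, 3)


if __name__ == "__main__":
    print(total_pgs)                        # 7104, matches "7104 pgs"
    print(objects)                          # 16362
    print(degraded_percent(16362, 49086))   # 33.333, one datacenter down
    print(degraded_percent(7696, 49086))    # 15.679, the later steady state
```

Under that assumption, losing one datacenter removes exactly one copy of every object, which agrees with the 16362/49086 line.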




Crushmap
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host blade151 {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host blade152 {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
host blade153 {
	id -15		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host blade154 {
	id -17		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
host blade155 {
	id -18		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 1.000
}
host blade159 {
	id -19		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 1.000
}
host blade160 {
	id -20		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 1.000
}
host blade161 {
	id -21		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.7 weight 1.000
}
host blade162 {
	id -22		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 1.000
}
host blade163 {
	id -23		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.9 weight 1.000
}
datacenter hok {
	id -3		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item blade151 weight 1.000
	item blade152 weight 1.000
	item blade153 weight 1.000
	item blade154 weight 1.000
	item blade155 weight 1.000
	item blade159 weight 1.000
	item blade160 weight 1.000
	item blade161 weight 1.000
	item blade162 weight 1.000
	item blade163 weight 1.000
}
host admbc0-01 {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
}
host admbc0-02 {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.11 weight 1.000
}
host admbc0-03 {
	id -7		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.12 weight 1.000
}
host admbc0-04 {
	id -8		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.13 weight 1.000
}
host admbc0-05 {
	id -9		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.14 weight 1.000
}
host admbc0-06 {
	id -10		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.15 weight 1.000
}
host admbc0-09 {
	id -11		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.16 weight 1.000
}
host admbc0-10 {
	id -12		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.17 weight 1.000
}
host admbc0-11 {
	id -13		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.18 weight 1.000
}
host admbc0-12 {
	id -14		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.19 weight 1.000
}
datacenter csc {
	id -24		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item admbc0-01 weight 1.000
	item admbc0-02 weight 1.000
	item admbc0-03 weight 1.000
	item admbc0-04 weight 1.000
	item admbc0-05 weight 1.000
	item admbc0-06 weight 1.000
	item admbc0-09 weight 1.000
	item admbc0-10 weight 1.000
	item admbc0-11 weight 1.000
	item admbc0-12 weight 1.000
}
host admdisk0 {
	id -16		# do not change unnecessarily
	# weight 8.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 1.000
	item osd.24 weight 1.000
	item osd.25 weight 1.000
	item osd.26 weight 1.000
	item osd.27 weight 1.000
}
datacenter adm {
	id -25		# do not change unnecessarily
	# weight 8.000
	alg straw
	hash 0	# rjenkins1
	item admdisk0 weight 8.000
}
root default {
	id -1		# do not change unnecessarily
	# weight 80.000
	alg straw
	hash 0	# rjenkins1
	item hok weight 36.000
	item csc weight 36.000
	item adm weight 8.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}

# end crush map
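
One thing that can be checked mechanically in the map above is whether the weight a parent bucket lists for a child equals the sum of that child's own item weights. The sketch below is a hedged illustration in Python: the weights are copied by hand from the crushmap, the helper names `bucket_weight` and `mismatches` are made up for this example, and the single-OSD host buckets are left out for brevity. Whether such a mismatch has anything to do with the behaviour described in this message is not established here.

```python
# Item weights copied from the decompiled CRUSH map above.
# Each bucket maps its items to the weight listed for them inside that bucket.
BUCKETS = {
    "hok": {f"blade{n}": 1.0
            for n in (151, 152, 153, 154, 155, 159, 160, 161, 162, 163)},
    "csc": {f"admbc0-{n:02d}": 1.0
            for n in (1, 2, 3, 4, 5, 6, 9, 10, 11, 12)},
    "admdisk0": {f"osd.{n}": 1.0 for n in range(20, 28)},
    "adm": {"admdisk0": 8.0},
    "default": {"hok": 36.0, "csc": 36.0, "adm": 8.0},
}


def bucket_weight(name: str) -> float:
    """Sum of the item weights listed inside one bucket."""
    return sum(BUCKETS[name].values())


def mismatches(parent: str) -> dict:
    """Map child -> (weight listed in parent, sum of the child's own items)
    for every child bucket where the two numbers differ."""
    found = {}
    for child, listed in BUCKETS[parent].items():
        if child not in BUCKETS:
            continue  # leaf item or a bucket not modelled in this sketch
        actual = bucket_weight(child)
        if abs(actual - listed) > 1e-9:
            found[child] = (listed, actual)
    return found


if __name__ == "__main__":
    print(bucket_weight("default"))
    print(mismatches("default"))
```

For this map the check reports `hok` and `csc`: the root lists each at 36.000 while their ten hosts sum to 10.000, which is also what the `# weight 10.000` comments inside those buckets say. The `adm` branch is consistent at 8.000.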



* RE: Crushmap Design Question
  2013-01-08 20:20 Crushmap Design Question Moore, Shawn M
@ 2013-01-09  0:53 ` Chen, Xiaoxi
  2013-01-09  8:59   ` Wido den Hollander
  2013-01-10 21:34 ` Gregory Farnum
From: Chen, Xiaoxi @ 2013-01-09  0:53 UTC
  To: Moore, Shawn M; +Cc: ceph-devel@vger.kernel.org

Hi,

Setting rep size to 3 only makes the data triple-replicated, which means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.

But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph cluster failed.

It looks to me like this rule makes it hard to design a deployment that is robust against a DC outage. I am hoping for input from the community on how to make the monitor cluster reliable.

Xiaoxi
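
The majority rule described above (2N+1 monitors need at least N+1 alive) can be written down as a small sketch. This is plain arithmetic in Python, not a call into any Ceph API, and the function names are illustrative only.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest strict majority of total_mons monitors (N+1 out of 2N+1)."""
    return total_mons // 2 + 1


def has_quorum(total_mons: int, alive: int) -> bool:
    """True when the surviving monitors still form a strict majority."""
    return alive >= quorum_size(total_mons)


if __name__ == "__main__":
    # The original report describes 3 monitors with 2 left running.
    print(quorum_size(3))
    print(has_quorum(3, 2))
    print(has_quorum(3, 1))
```

By this rule, the 2 of 3 monitors that the original report says stayed up would still be a majority.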



* Re: Crushmap Design Question
  2013-01-09  0:53 ` Chen, Xiaoxi
@ 2013-01-09  8:59   ` Wido den Hollander
  2013-01-09 14:41     ` Moore, Shawn M
  2013-01-09 15:00     ` Joao Eduardo Luis
From: Wido den Hollander @ 2013-01-09  8:59 UTC
  To: Chen, Xiaoxi; +Cc: Moore, Shawn M, ceph-devel@vger.kernel.org


Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
> Setting rep size to 3 only makes the data triple-replicated, which means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
> But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph cluster failed.
> It looks to me like this rule makes it hard to design a deployment that is robust against a DC outage. I am hoping for input from the community on how to make the monitor cluster reliable.
> 

From what I understand he didn't kill the second mon, still leaving 2
out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over
the 3 DCs?

"ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the
datacenters; I don't know how that affects the situation.
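
The PG spread check suggested above can be sketched as follows. This is a hedged example in Python: the OSD-to-datacenter ranges are taken from the "ceph osd tree" output in the original message, while the `sample` mapping is hypothetical data invented for illustration. Turning real "ceph pg dump" output into such a mapping is left to the reader, since the exact output format is not shown in this thread.

```python
from typing import Dict, List


def osd_datacenter(osd_id: int) -> str:
    """Datacenter of an OSD, following the osd tree in the original message."""
    if 0 <= osd_id <= 9:
        return "hok"
    if 10 <= osd_id <= 19:
        return "csc"
    if 20 <= osd_id <= 27:
        return "adm"
    raise ValueError(f"unknown OSD id: {osd_id}")


def pgs_not_spanning(pg_to_osds: Dict[str, List[int]],
                     wanted: int = 3) -> Dict[str, List[str]]:
    """Return the PGs whose OSDs cover fewer than `wanted` distinct
    datacenters, mapped to the sorted list of datacenters they do cover."""
    bad = {}
    for pg, osds in pg_to_osds.items():
        dcs = sorted({osd_datacenter(o) for o in osds})
        if len(dcs) < wanted:
            bad[pg] = dcs
    return bad


# Hypothetical sample, NOT real cluster output: PG id -> OSD ids.
sample = {
    "2.1f": [3, 14, 22],   # one OSD in each of hok, csc, adm
    "2.20": [3, 7, 22],    # two OSDs in hok, none in csc
}

if __name__ == "__main__":
    print(pgs_not_spanning(sample))
```

An empty result would mean every PG in the input has a copy in each of the three datacenters.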

Wido

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Crushmap Design Question
  2013-01-09  8:59   ` Wido den Hollander
@ 2013-01-09 14:41     ` Moore, Shawn M
  2013-01-09 15:00     ` Joao Eduardo Luis
From: Moore, Shawn M @ 2013-01-09 14:41 UTC
  To: ceph-devel@vger.kernel.org

Correct, it never went below N+1 (3 total mons and 2 of them never went down).

Several times in the past I used that command to verify that a PG was actually mapped to valid DCs.  I just wrote a quick script that does this on the fly, and after recovering the cluster last night, every PG is mapped to an OSD in each DC.  I will fail the cluster again later today and see what it looks like after 1 DC fails and then again after the 2nd fails.

As far as the weighting goes, I'm not sure how I ended up this way.  Should I change the "adm" tree as follows?
FROM
-25	8		datacenter adm
-16	8			host admdisk0
TO
-25	36		datacenter adm
-16	1			host admdisk0

Regards


-----Original Message-----
From: Wido den Hollander [mailto:wido@widodh.nl] 
Sent: Wednesday, January 09, 2013 4:00 AM
To: Chen, Xiaoxi
Cc: Moore, Shawn M; ceph-devel@vger.kernel.org
Subject: Re: Crushmap Design Question

Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
> 	Setting rep size to 3 only makes the data triple-replicated, which means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
> 	But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
> 	It looks to me like this rule makes it hard to design a deployment that is robust against a DC outage. I am hoping for input from the community on how to make the monitor cluster reliable.
> 

From what I understand he didn't kill the second mon, still leaving 2
out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over
the 3 DCs?

"ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the
datacenters, and I don't know how that affects the situation.
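
To make the weight mismatch concrete, here is a small Python sketch. It assumes the usual convention that a bucket's weight equals the sum of its items' weights (the crushmap's own "# weight" comments follow it); the values are transcribed by hand from the crushmap quoted in this thread, and the function name is hypothetical.

```python
# Sketch: compare each datacenter's weight as listed in the root bucket with
# the sum of the host weights below it. The "sum of children" convention is
# an assumption, not something stated in this thread.
host_weights = {
    "hok": [1.0] * 10,   # blade151..blade163, one osd each
    "csc": [1.0] * 10,   # admbc0-01..admbc0-12, one osd each
    "adm": [8.0],        # admdisk0 with eight osds
}
root_items = {"hok": 36.0, "csc": 36.0, "adm": 8.0}


def weight_mismatches(children, listed):
    """Return {bucket: (listed, summed)} where the two weights differ."""
    out = {}
    for name, weights in children.items():
        summed = sum(weights)
        if summed != listed[name]:
            out[name] = (listed[name], summed)
    return out


print(weight_mismatches(host_weights, root_items))
# {'hok': (36.0, 10.0), 'csc': (36.0, 10.0)}
```

Under that assumption, hok and csc are each listed at 36 in the root while their hosts sum to 10, and adm's 8 is already consistent. Whether the imbalance matters for this failure is a separate question.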

Wido

> Xiaoxi
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Moore, Shawn M
> Sent: January 9, 2013 4:21
> To: ceph-devel@vger.kernel.org
> Subject: Crushmap Design Question
> 
> I have been testing ceph for a little over a month now.  Our design goal is to have 3 datacenters in different buildings all tied together over 10GbE.  Currently there are 10 servers each serving 1 osd in 2 of the datacenters.  In the third is one large server with 16 SAS disks serving 8 osds.  Eventually we will add one more identical large server into the third datacenter.  I have told ceph to keep 3 copies and tried to do the crushmap in such a way that as long as a majority of mon's can stay up, we could run off of one datacenter's worth of osds.   So in my testing, it doesn't work out quite this way...
> 
> Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 
> I will put hopefully relevant files at the end of this email.
> 
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
> 
> At this point everything is still ok.  But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
> 
> Most VMs quit working. "rbd ls" still works, but "rados -p rbd ls" prints nothing and hangs.  After a while (you can see from the timestamps) the cluster ends up in the following state and stays that way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)

> 
> I'm hoping I've done something wrong, so please advise.  Below are my configs.  If you need something more to help, just ask.
> 
> Normal output with all datacenters up.
> # ceph osd tree
> # id	weight	type name	up/down	reweight
> -1	80	root default
> -3	36		datacenter hok
> -2	1			host blade151
> 0	1				osd.0	up	1	
> -4	1			host blade152
> 1	1				osd.1	up	1	
> -15	1			host blade153
> 2	1				osd.2	up	1	
> -17	1			host blade154
> 3	1				osd.3	up	1	
> -18	1			host blade155
> 4	1				osd.4	up	1	
> -19	1			host blade159
> 5	1				osd.5	up	1	
> -20	1			host blade160
> 6	1				osd.6	up	1	
> -21	1			host blade161
> 7	1				osd.7	up	1	
> -22	1			host blade162
> 8	1				osd.8	up	1	
> -23	1			host blade163
> 9	1				osd.9	up	1	
> -24	36		datacenter csc
> -5	1			host admbc0-01
> 10	1				osd.10	up	1	
> -6	1			host admbc0-02
> 11	1				osd.11	up	1	
> -7	1			host admbc0-03
> 12	1				osd.12	up	1	
> -8	1			host admbc0-04
> 13	1				osd.13	up	1	
> -9	1			host admbc0-05
> 14	1				osd.14	up	1	
> -10	1			host admbc0-06
> 15	1				osd.15	up	1	
> -11	1			host admbc0-09
> 16	1				osd.16	up	1	
> -12	1			host admbc0-10
> 17	1				osd.17	up	1	
> -13	1			host admbc0-11
> 18	1				osd.18	up	1	
> -14	1			host admbc0-12
> 19	1				osd.19	up	1	
> -25	8		datacenter adm
> -16	8			host admdisk0
> 20	1				osd.20	up	1	
> 21	1				osd.21	up	1	
> 22	1				osd.22	up	1	
> 23	1				osd.23	up	1	
> 24	1				osd.24	up	1	
> 25	1				osd.25	up	1	
> 26	1				osd.26	up	1	
> 27	1				osd.27	up	1
> 
> 
> 
> Showing copies set to 3.
> # ceph osd dump | grep " size "
> pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 63 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 65 owner 0
> pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 6061 owner 0
> 
> 
> 
> 
> Crushmap
> # begin crush map
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
> 
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
> 
> # buckets
> host blade151 {
> 	id -2		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.0 weight 1.000
> }
> host blade152 {
> 	id -4		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.1 weight 1.000
> }
> host blade153 {
> 	id -15		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.2 weight 1.000
> }
> host blade154 {
> 	id -17		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.3 weight 1.000
> }
> host blade155 {
> 	id -18		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.4 weight 1.000
> }
> host blade159 {
> 	id -19		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.5 weight 1.000
> }
> host blade160 {
> 	id -20		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.6 weight 1.000
> }
> host blade161 {
> 	id -21		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.7 weight 1.000
> }
> host blade162 {
> 	id -22		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.8 weight 1.000
> }
> host blade163 {
> 	id -23		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.9 weight 1.000
> }
> datacenter hok {
> 	id -3		# do not change unnecessarily
> 	# weight 10.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item blade151 weight 1.000
> 	item blade152 weight 1.000
> 	item blade153 weight 1.000
> 	item blade154 weight 1.000
> 	item blade155 weight 1.000
> 	item blade159 weight 1.000
> 	item blade160 weight 1.000
> 	item blade161 weight 1.000
> 	item blade162 weight 1.000
> 	item blade163 weight 1.000
> }
> host admbc0-01 {
> 	id -5		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.10 weight 1.000
> }
> host admbc0-02 {
> 	id -6		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.11 weight 1.000
> }
> host admbc0-03 {
> 	id -7		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.12 weight 1.000
> }
> host admbc0-04 {
> 	id -8		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.13 weight 1.000
> }
> host admbc0-05 {
> 	id -9		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.14 weight 1.000
> }
> host admbc0-06 {
> 	id -10		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.15 weight 1.000
> }
> host admbc0-09 {
> 	id -11		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.16 weight 1.000
> }
> host admbc0-10 {
> 	id -12		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.17 weight 1.000
> }
> host admbc0-11 {
> 	id -13		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.18 weight 1.000
> }
> host admbc0-12 {
> 	id -14		# do not change unnecessarily
> 	# weight 1.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.19 weight 1.000
> }
> datacenter csc {
> 	id -24		# do not change unnecessarily
> 	# weight 10.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item admbc0-01 weight 1.000
> 	item admbc0-02 weight 1.000
> 	item admbc0-03 weight 1.000
> 	item admbc0-04 weight 1.000
> 	item admbc0-05 weight 1.000
> 	item admbc0-06 weight 1.000
> 	item admbc0-09 weight 1.000
> 	item admbc0-10 weight 1.000
> 	item admbc0-11 weight 1.000
> 	item admbc0-12 weight 1.000
> }
> host admdisk0 {
> 	id -16		# do not change unnecessarily
> 	# weight 8.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.20 weight 1.000
> 	item osd.21 weight 1.000
> 	item osd.22 weight 1.000
> 	item osd.23 weight 1.000
> 	item osd.24 weight 1.000
> 	item osd.25 weight 1.000
> 	item osd.26 weight 1.000
> 	item osd.27 weight 1.000
> }
> datacenter adm {
> 	id -25		# do not change unnecessarily
> 	# weight 8.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item admdisk0 weight 8.000
> }
> root default {
> 	id -1		# do not change unnecessarily
> 	# weight 80.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item hok weight 36.000
> 	item csc weight 36.000
> 	item adm weight 8.000
> }
> 
> # rules
> rule data {
> 	ruleset 0
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step chooseleaf firstn 0 type datacenter
> 	step emit
> }
> rule metadata {
> 	ruleset 1
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step chooseleaf firstn 0 type datacenter
> 	step emit
> }
> rule rbd {
> 	ruleset 2
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step chooseleaf firstn 0 type datacenter
> 	step emit
> }
> 
> # end crush map
> 
> 



* Re: Crushmap Design Question
  2013-01-09  8:59   ` Wido den Hollander
  2013-01-09 14:41     ` Moore, Shawn M
@ 2013-01-09 15:00     ` Joao Eduardo Luis
  1 sibling, 0 replies; 6+ messages in thread
From: Joao Eduardo Luis @ 2013-01-09 15:00 UTC (permalink / raw)
  To: Wido den Hollander
  Cc: Chen, Xiaoxi, Moore, Shawn M, ceph-devel@vger.kernel.org

On 01/09/2013 08:59 AM, Wido den Hollander wrote:
> Hi,
> 
> On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
>> Hi,
>> 	Setting rep size to 3 only makes the data triple-replicated, which means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
>> 	But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
>> 	It looks to me like this rule makes it hard to design a deployment that is robust against a DC outage. I am hoping for input from the community on how to make the monitor cluster reliable.
>>
> 
>  From what I understand he didn't kill the second mon, still leaving 2
> out of 3 mons running.

Indeed. A good hint that this is the case is this bit of Shawn's message:

>> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
>> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
>>
>> At this point everything is still ok.  But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
>> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

If you still manage to get these messages, it means your monitors are
still handling and answering requests, and that only happens when you
have a quorum :)
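
The majority rule discussed here (at least N+1 of 2N+1 monitors alive) can be written as a short Python sketch. The function names are hypothetical and this only illustrates the arithmetic, not how the monitors implement it.

```python
def quorum_size(total_mons):
    """Smallest strict majority of total_mons monitors."""
    return total_mons // 2 + 1


def has_quorum(total_mons, alive_mons):
    """True if the surviving monitors still form a majority."""
    return alive_mons >= quorum_size(total_mons)


# 3 monitors (N = 1): 2 alive is a quorum, 1 alive is not.
print(has_quorum(3, 2), has_quorum(3, 1))  # True False
```

With 2 of 3 monitors up throughout the test, the quorum condition holds, which is consistent with the pgmap messages still being logged.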

  -Joao


* Re: Crushmap Design Question
  2013-01-08 20:20 Crushmap Design Question Moore, Shawn M
  2013-01-09  0:53 ` Chen, Xiaoxi
@ 2013-01-10 21:34 ` Gregory Farnum
  1 sibling, 0 replies; 6+ messages in thread
From: Gregory Farnum @ 2013-01-10 21:34 UTC (permalink / raw)
  To: Moore, Shawn M; +Cc: ceph-devel@vger.kernel.org

On Tue, Jan 8, 2013 at 12:20 PM, Moore, Shawn M <smmoore@catawba.edu> wrote:
> I have been testing ceph for a little over a month now.  Our design goal is to have 3 datacenters in different buildings all tied together over 10GbE.  Currently there are 10 servers each serving 1 osd in 2 of the datacenters.  In the third is one large server with 16 SAS disks serving 8 osds.  Eventually we will add one more identical large server into the third datacenter.  I have told ceph to keep 3 copies and tried to do the crushmap in such a way that as long as a majority of mon's can stay up, we could run off of one datacenter's worth of osds.   So in my testing, it doesn't work out quite this way...
>
> Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>
> I will put hopefully relevant files at the end of this email.
>
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
>
> At this point everything is still ok.  But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> Most VMs quit working. "rbd ls" still works, but "rados -p rbd ls" prints nothing and hangs.  After a while (you can see from the timestamps) the cluster ends up in the following state and stays that way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)

This took me a bit to work out as well, but you've run afoul of a new
post-argonaut feature intended to prevent people from writing with
insufficient durability. Pools now have a "min size", and PGs in that
pool won't go active if they don't have that many OSDs to write on.
The clue here is the "incomplete" state. You can change it with "ceph
osd pool set foo min_size 1", where "foo" is the name of the pool
whose min_size you wish to change (and this command sets the min size
to 1, obviously). The default for new pools is controlled by the "osd
pool default min size" config value (which you should put in the
global section). By default it'll be half of your default pool size,
rounded up.

In your case your pools have a size of 3, so the min size is 2
(3/2 = 1.5, rounded up), and the PGs are refusing to go active
because of the dramatically reduced redundancy. You can set the min
size down, though, and they will go active.
-Greg
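
The arithmetic above can be sketched in Python. This is a minimal illustration of the "half the pool size, rounded up" rule as described in this message, assuming one replica per datacenter as the crushmap intends; the function names are hypothetical and this is not Ceph's own code.

```python
def default_min_size(size):
    """Half of the pool size, rounded up (integer arithmetic)."""
    return (size + 1) // 2


def pg_can_go_active(replicas_up, min_size):
    """A PG goes active only if at least min_size replicas are available."""
    return replicas_up >= min_size


size = 3
min_size = default_min_size(size)  # 2 for a size-3 pool

# One replica per datacenter: 3 DCs up, 1 DC failed, 2 DCs failed.
for replicas_up in (3, 2, 1):
    print(replicas_up, pg_can_go_active(replicas_up, min_size))

# After lowering min_size to 1, a single surviving replica is enough.
print(pg_can_go_active(1, 1))
```

This lines up with the reported behaviour: with one datacenter down the PGs stay active (degraded), and with two down they drop below the min size and report "incomplete".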

