From: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Unexpected pg placement in degraded mode with custom crush rule
Date: Fri, 05 Jul 2013 15:36:52 +1200
Message-ID: <51D63F54.70900@catalyst.net.nz>
I have a 4 osd system (4 hosts, 1 osd per host), in two (imagined) racks
(osd 0 and 1 in rack 0, osd 2 and 3 in rack 1). All pools have number of
replicas = 2. I have a crush rule that puts one pg copy in each rack
(see notes); in essence it is:
step take root
step chooseleaf firstn 0 type rack
step emit
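A rough sketch of what this rule is intended to do, written as a toy model in Python. This is not CRUSH itself: the real algorithm uses straw buckets, weights, the rjenkins hash and retries, none of which are modelled here. The sha256-based score() and the function name chooseleaf_by_rack() are illustrative assumptions. The "firstn 0" part means "as many as the pool's replica count", which is passed in as replicas.

```python
import hashlib

# Topology from the post: rack -> OSD ids (one OSD per host, so hosts are omitted).
RACKS = {"rack0": [0, 1], "rack1": [2, 3]}


def score(pg, item):
    """Deterministic pseudo-random score; a stand-in for CRUSH's own hash."""
    digest = hashlib.sha256(f"{pg}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def chooseleaf_by_rack(pg, replicas, up_osds):
    """Toy model: rank the racks for this pg, then take one up OSD per rack."""
    chosen = []
    for rack in sorted(RACKS, key=lambda r: score(pg, r), reverse=True):
        candidates = [o for o in RACKS[rack] if o in up_osds]
        if candidates:
            chosen.append(max(candidates, key=lambda o: score(pg, o)))
        if len(chosen) == replicas:
            break
    return chosen


if __name__ == "__main__":
    for pg in range(4):
        healthy = chooseleaf_by_rack(pg, 2, {0, 1, 2, 3})
        one_rack = chooseleaf_by_rack(pg, 2, {0, 1})
        print(pg, healthy, one_rack)
```

Under this model every pg gets one OSD from each rack while all four OSDs are up, and only a single OSD once a whole rack is down, because the rule never takes two leaves from the same rack.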
I created a pool (called obj) with 200 pgs, and created 100 objects each
of size 20MB.
I simulate a rack failure by stopping ceph on the hosts in one rack. I
*expected* that the system would continue to run in 50% degraded mode,
as the rule should not place both replicas in the same rack. Indeed,
initially I see:
2013-07-04 16:38:36.403975 mon.0 [INF] pgmap v760: 1160 pgs: 3 peering,
1157 active+degraded; 2000 MB data, 6488 MB used, 3075 MB / 10075 MB
avail; 100/200 degraded (50.000%)
However, what I see after a while is:
2013-07-04 16:39:07.071987 mon.0 [INF] pgmap v775: 1160 pgs: 308
active+remapped, 852 active+degraded; 2000 MB data, 6934 MB used, 2629
MB / 10075 MB avail; 78/200 degraded (39.000%)
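The figures in the two pgmap lines can be checked with a little arithmetic. The sketch below assumes the "degraded" fraction counts missing object copies over the total number of expected copies (objects times replicas), which matches the denominator of 200 in the log. The variable and function names are illustrative.

```python
objects = 100        # objects written to the pool
replicas = 2         # pool replica count
object_size_mb = 20  # size of each object

total_copies = objects * replicas    # denominator in "N/200 degraded"
data_mb = objects * object_size_mb   # matches "2000 MB data" in the log


def degraded_pct(missing_copies, expected_copies):
    """Percentage of expected object copies that are currently missing."""
    return 100.0 * missing_copies / expected_copies


if __name__ == "__main__":
    print(total_copies, data_mb)            # 200 2000
    print(degraded_pct(100, total_copies))  # 50.0
    print(degraded_pct(78, total_copies))   # 39.0
    print(100 - 78)                         # 22
    print(3 + 1157, 308 + 852)              # 1160 1160
```

Going from 100 to 78 missing copies means 22 copies are no longer counted as degraded. That would be consistent with second copies having been created somewhere in the surviving rack, although the pgmap line alone does not say where.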
Hmm - sure enough if I dump the pg map for each object in the pool, most
look like:
osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4
(3.64) -> up [1] acting [1]
but some are like:
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18
(3.18) -> up [1] acting [1,0]
Clearly I have misunderstood something here! How am I getting replicas
on osd.0 and osd.1, when they are in the same rack?
Notes:
The version
ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
on Ubuntu 12.04 (KVM guest).
The osd tree
# id    weight  type name                       up/down reweight
-7      4       root root
-5      2               rack rack0
-1      1                       host ceph1
0       1                               osd.0   up      1
-2      1                       host ceph2
1       1                               osd.1   up      1
-6      2               rack rack1
-3      1                       host ceph3
2       1                               osd.2   down    0
-4      1                       host ceph4
3       1                               osd.3   down    0
The crushmap
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
# types
type 0 device
type 1 host
type 2 rack
type 3 root
# buckets
host ceph1 {
        id -1           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}
host ceph2 {
        id -2           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
}
host ceph3 {
        id -3           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
host ceph4 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
}
rack rack0 {
        id -5           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph1 weight 1.000
        item ceph2 weight 1.000
}
rack rack1 {
        id -6           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph3 weight 1.000
        item ceph4 weight 1.000
}
root root {
        id -7           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item rack0 weight 2.000
        item rack1 weight 2.000
}
# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
# end crush map
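As a sanity check on the hierarchy, the bucket definitions can be read back to confirm which OSDs sit under which rack and that the weights add up. The following is a minimal sketch that parses a trimmed copy of the bucket section (id, alg and hash lines dropped). The regular expressions and helper names are assumptions made for illustration; they are not part of any Ceph tool.

```python
import re

# Trimmed copy of the bucket definitions from the crush map above.
BUCKETS_TEXT = """\
host ceph1 {
        item osd.0 weight 1.000
}
host ceph2 {
        item osd.1 weight 1.000
}
host ceph3 {
        item osd.2 weight 1.000
}
host ceph4 {
        item osd.3 weight 1.000
}
rack rack0 {
        item ceph1 weight 1.000
        item ceph2 weight 1.000
}
rack rack1 {
        item ceph3 weight 1.000
        item ceph4 weight 1.000
}
root root {
        item rack0 weight 2.000
        item rack1 weight 2.000
}
"""

BUCKET = re.compile(r"^(\w+) (\S+) \{\n(.*?)^\}", re.M | re.S)
ITEM = re.compile(r"^\s*item (\S+) weight ([\d.]+)", re.M)


def parse_buckets(text):
    """Return {bucket name: (bucket type, {child name: weight})}."""
    buckets = {}
    for btype, name, body in BUCKET.findall(text):
        items = {child: float(w) for child, w in ITEM.findall(body)}
        buckets[name] = (btype, items)
    return buckets


def leaves(buckets, name):
    """All devices (names that are not buckets) reachable from `name`."""
    if name not in buckets:
        return [name]
    found = []
    for child in buckets[name][1]:
        found.extend(leaves(buckets, child))
    return found


def bucket_weight(buckets, name):
    """Sum of the item weights listed directly inside a bucket."""
    return sum(buckets[name][1].values())


if __name__ == "__main__":
    buckets = parse_buckets(BUCKETS_TEXT)
    for name, (btype, _) in buckets.items():
        if btype == "rack":
            print(name, leaves(buckets, name), bucket_weight(buckets, name))
```

For the map above this reports osd.0 and osd.1 under rack0 and osd.2 and osd.3 under rack1, each rack with weight 2, which agrees with the osd tree output.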
The pg map with all 4 osd up:
osdmap e39 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4
(3.64) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd
(3.7d) -> up [2,1] acting [2,1]
osdmap e39 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d
(3.9d) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8
(3.48) -> up [1,3] acting [1,3]
osdmap e39 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42
(3.42) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18
(3.18) -> up [3,0] acting [3,0]
osdmap e39 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0
(3.c0) -> up [0,3] acting [0,3]
osdmap e39 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675
(3.75) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892
(3.92) -> up [3,0] acting [3,0]
...
The pg map with 2 osd (1 rack) up:
osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4
(3.64) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd
(3.7d) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d
(3.9d) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8
(3.48) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42
(3.42) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18
(3.18) -> up [1] acting [1,0]
osdmap e55 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0
(3.c0) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675
(3.75) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892
(3.92) -> up [1] acting [1,0]
...
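To pick out the surprising entries from a listing like the one above without reading it by eye, something along these lines could be used. It is a minimal sketch: the sample text is two entries copied from the listing (each wrapped onto two lines by the mail client), the rack membership table is taken from the description at the top of this mail, and the regular expression and function names are illustrative assumptions.

```python
import re

# Rack membership as described in the post (osd id -> rack name).
OSD_RACK = {0: "rack0", 1: "rack0", 2: "rack1", 3: "rack1"}

# Two entries copied from the degraded-mode listing above.
SAMPLE = """\
osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4
(3.64) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18
(3.18) -> up [1] acting [1,0]
"""

ENTRY = re.compile(
    r"object '(?P<obj>[^']+)' -> pg \S+\s+\((?P<pgid>[^)]+)\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_ids(text):
    """Turn '1,0' into [1, 0]; an empty set gives []."""
    return [int(x) for x in text.split(",") if x]


def same_rack_pgs(listing):
    """Return (object, pgid, acting) for entries whose acting set repeats a rack."""
    flagged = []
    for m in ENTRY.finditer(listing):
        acting = parse_ids(m.group("acting"))
        racks = [OSD_RACK[o] for o in acting]
        if len(racks) != len(set(racks)):
            flagged.append((m.group("obj"), m.group("pgid"), acting))
    return flagged


if __name__ == "__main__":
    for obj, pgid, acting in same_rack_pgs(SAMPLE):
        print(obj, pgid, acting)
```

On the two sample entries this flags only smallnode5.dat (pg 3.18), whose acting set [1,0] has both OSDs in rack0 while its up set is just [1].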
Cheers
Mark