* Unexpected pg placement in degraded mode with custom crush rule
From: Mark Kirkwood @ 2013-07-05 3:36 UTC (permalink / raw)
To: ceph-devel
I have a 4 osd system (4 hosts, 1 osd per host) in two (imagined) racks
(osd 0 and 1 in rack0, osd 2 and 3 in rack1). All pools have number of
replicas = 2. I have a crush rule that puts one pg copy in each rack
(see notes); it is essentially:
step take root
step chooseleaf firstn 0 type rack
step emit
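As a rough illustration of what the rule above is expected to do, here is a toy model in Python. It is only a sketch under stated assumptions: the names `RACKS`, `_score` and `place` are made up for this example, and the hash-based choice is a stand-in, not the straw algorithm that CRUSH really uses. It picks at most one up OSD from each rack, so a pg should never end up with two copies in the same rack.

```python
import hashlib

# Toy topology mirroring the osd tree in the notes: rack -> OSD ids.
RACKS = {"rack0": [0, 1], "rack1": [2, 3]}


def _score(pg, name):
    """Deterministic pseudo-random score for a (pg, item) pair (not CRUSH's hash)."""
    digest = hashlib.sha256(f"{pg}:{name}".encode()).hexdigest()
    return int(digest, 16)


def place(pg, replicas, up_osds):
    """Pick at most one up OSD from each rack, up to `replicas` OSDs in total."""
    chosen = []
    # Visit the racks in a per-pg pseudo-random order.
    for rack in sorted(RACKS, key=lambda r: _score(pg, r)):
        candidates = [o for o in RACKS[rack] if o in up_osds]
        if candidates and len(chosen) < replicas:
            # Take the highest-scoring up OSD inside this rack.
            chosen.append(max(candidates, key=lambda o: _score(pg, f"osd.{o}")))
    return chosen


if __name__ == "__main__":
    print(place("3.64", 2, {0, 1, 2, 3}))  # two OSDs, one from each rack
    print(place("3.64", 2, {0, 1}))        # rack1 down: a single OSD from rack0
```

Under this model a whole-rack failure leaves every pg with exactly one copy, which is the 50% degraded state expected in the test described below.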
I created a pool (called obj) with 200 pgs, and created 100 objects each
of size 20MB.
I simulate a rack failure by stopping ceph on the hosts in one rack. I
*expected* that the system would continue to run in 50% degraded mode,
as we would not place replicas on/in the same rack. Indeed initially I see:
2013-07-04 16:38:36.403975 mon.0 [INF] pgmap v760: 1160 pgs: 3 peering,
1157 active+degraded; 2000 MB data, 6488 MB used, 3075 MB / 10075 MB
avail; 100/200 degraded (50.000%)
However, what I see after a while is:
2013-07-04 16:39:07.071987 mon.0 [INF] pgmap v775: 1160 pgs: 308
active+remapped, 852 active+degraded; 2000 MB data, 6934 MB used, 2629
MB / 10075 MB avail; 78/200 degraded (39.000%)
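The degraded figures in the two log lines can be checked with a little arithmetic. This sketch assumes that "100/200" means missing object copies over expected object copies (100 objects x 2 replicas); that reading is inferred from the numbers, and the function name `degraded_percent` is made up for the example.

```python
def degraded_percent(objects, replicas, missing_copies):
    """Share of expected object copies that are currently missing, in percent."""
    total_copies = objects * replicas
    return 100.0 * missing_copies / total_copies


if __name__ == "__main__":
    print(degraded_percent(100, 2, 100))  # 50.0, right after the rack goes down
    print(degraded_percent(100, 2, 78))   # 39.0, a little later
    print(100 - 78)                       # 22 copies apparently regained a second replica
```

If that reading is right, 22 object copies were re-replicated onto the surviving rack, which is not what the rule was meant to allow.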
Hmm - sure enough if I dump the pg map for each object in the pool, most
look like:
osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4
(3.64) -> up [1] acting [1]
but some are like:
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18
(3.18) -> up [1] acting [1,0]
Clearly I have misunderstood something here! How am I getting replicas
on osd.0 and osd.1, when they are in the same rack?
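One way to find the affected pgs is to parse mapping lines like the ones above and flag any acting set that holds two OSDs from the same rack. This is a sketch only: `OSD_RACK`, `LINE_RE`, `parse` and `shares_rack` are names invented here, the rack table is copied by hand from the osd tree in the notes, and each mapping is assumed to sit on one unwrapped line.

```python
import re

# OSD id -> rack, copied by hand from the osd tree in the notes.
OSD_RACK = {0: "rack0", 1: "rack0", 2: "rack1", 3: "rack1"}

# Matches the object name plus the up and acting sets of one mapping line.
LINE_RE = re.compile(r"object '([^']+)'.*?up \[([\d,]*)\] acting \[([\d,]*)\]")


def _to_ids(text):
    """Turn '1,0' into [1, 0]; an empty string gives an empty list."""
    return [int(x) for x in text.split(",") if x]


def parse(line):
    """Return (object name, up set, acting set), or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    name, up, acting = match.groups()
    return name, _to_ids(up), _to_ids(acting)


def shares_rack(osds):
    """True if at least two of the given OSDs live in the same rack."""
    racks = [OSD_RACK[o] for o in osds]
    return len(racks) != len(set(racks))


if __name__ == "__main__":
    line = ("osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> "
            "pg 3.9b37ca18 (3.18) -> up [1] acting [1,0]")
    name, up, acting = parse(line)
    print(name, up, acting, shares_rack(acting))
```

For the smallnode5.dat line the acting set [1,0] is flagged, since osd.0 and osd.1 are both in rack0.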
Notes:
The version
ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
on Ubuntu 12.04 (KVM guest).
The osd tree
# id    weight  type name                       up/down reweight
-7      4       root root
-5      2               rack rack0
-1      1                       host ceph1
0       1                               osd.0   up      1
-2      1                       host ceph2
1       1                               osd.1   up      1
-6      2               rack rack1
-3      1                       host ceph3
2       1                               osd.2   down    0
-4      1                       host ceph4
3       1                               osd.3   down    0
The crushmap
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 device
type 1 host
type 2 rack
type 3 root

# buckets
host ceph1 {
        id -1           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}
host ceph2 {
        id -2           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
}
host ceph3 {
        id -3           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
host ceph4 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
}
rack rack0 {
        id -5           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph1 weight 1.000
        item ceph2 weight 1.000
}
rack rack1 {
        id -6           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item ceph3 weight 1.000
        item ceph4 weight 1.000
}
root root {
        id -7           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item rack0 weight 2.000
        item rack1 weight 2.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take root
        step chooseleaf firstn 0 type rack
        step emit
}

# end crush map
The pg map with all 4 osd up:
osdmap e39 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) -> up [2,1] acting [2,1]
osdmap e39 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) -> up [1,3] acting [1,3]
osdmap e39 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) -> up [1,2] acting [1,2]
osdmap e39 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) -> up [3,0] acting [3,0]
osdmap e39 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) -> up [0,3] acting [0,3]
osdmap e39 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) -> up [0,2] acting [0,2]
osdmap e39 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) -> up [3,0] acting [3,0]
...
The pg map with 2 osd (1 rack) up:
osdmap e55 pool 'obj' (3) object 'smallnode0.dat' -> pg 3.c85a49e4 (3.64) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode1.dat' -> pg 3.da72a8fd (3.7d) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode2.dat' -> pg 3.5309389d (3.9d) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode3.dat' -> pg 3.2fa9d2c8 (3.48) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode4.dat' -> pg 3.e29c0d42 (3.42) -> up [1] acting [1]
osdmap e55 pool 'obj' (3) object 'smallnode5.dat' -> pg 3.9b37ca18 (3.18) -> up [1] acting [1,0]
osdmap e55 pool 'obj' (3) object 'smallnode6.dat' -> pg 3.4c1a3bc0 (3.c0) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode7.dat' -> pg 3.e260b675 (3.75) -> up [0] acting [0]
osdmap e55 pool 'obj' (3) object 'smallnode8.dat' -> pg 3.b05e3892 (3.92) -> up [1] acting [1,0]
...
Cheers
Mark
* Re: Unexpected pg placement in degraded mode with custom crush rule
From: Sage Weil @ 2013-07-05 4:01 UTC (permalink / raw)
To: Mark Kirkwood; +Cc: ceph-devel
Hi Mark,
If you're not using a kernel cephfs or rbd client older than ~3.9, or
ceph-fuse/librbd/librados older than bobtail, then you should
ceph osd crush tunables optimal
and I suspect that this will suddenly work perfectly. The defaults are
still using semi-broken legacy values because client support is pretty
new. Trees like yours, with sparsely populated leaves, tend to be most
affected.
(I bet you're seeing the rack separation rule violated because the
previous copy of the PG was already there and ceph won't throw out old
copies before creating new ones.)
sage
* Re: Unexpected pg placement in degraded mode with custom crush rule
From: Mark Kirkwood @ 2013-07-05 4:32 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi Sage,
I don't believe so. I'm loading the objects directly from another host
(which is running 0.64 built from src) with:
$ rados -m 192.168.122.21 -p obj put smallnode$n.dat smallnode.dat # $n=0->99
and the osds are all running 0.56.6, so I don't think there is any
kernel rbd or librbd involved.
I did try:
$ ceph osd crush tunables optimal
In one run - no difference.
I have updated to 0.61.4 and am running the test again, will update with
the results!
Cheers
Mark
* Re: Unexpected pg placement in degraded mode with custom crush rule
From: Mark Kirkwood @ 2013-07-05 4:53 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Retesting with 0.61.4:
Immediately after stopping 2 osd in rack1:
2013-07-05 16:23:02.852386 mon.0 [INF] pgmap v450: 1160 pgs: 1160
active+degraded; 2000 MB data, 12991 MB used, 6135 MB / 20150 MB avail;
100/200 degraded (50.000%)
... time passes:
2013-07-05 16:51:03.248198 mon.0 [INF] pgmap v465: 1160 pgs: 1160
active+degraded; 2000 MB data, 12993 MB used, 6133 MB / 20150 MB avail;
100/200 degraded (50.000%)
So looks like Cuttlefish is behaving as expected. Is this due to tweaks
in the 'choose' algorithm in the later code?
Cheers
Mark
* Re: Unexpected pg placement in degraded mode with custom crush rule
From: Sage Weil @ 2013-07-05 15:46 UTC (permalink / raw)
To: Mark Kirkwood; +Cc: ceph-devel
On Fri, 5 Jul 2013, Mark Kirkwood wrote:
> Retesting with 0.61.4:
>
> Immediately after stopping 2 osd in rack1:
>
> 2013-07-05 16:23:02.852386 mon.0 [INF] pgmap v450: 1160 pgs: 1160
> active+degraded; 2000 MB data, 12991 MB used, 6135 MB / 20150 MB avail;
> 100/200 degraded (50.000%)
>
> ... time passes:
>
> 2013-07-05 16:51:03.248198 mon.0 [INF] pgmap v465: 1160 pgs: 1160
> active+degraded; 2000 MB data, 12993 MB used, 6133 MB / 20150 MB avail;
> 100/200 degraded (50.000%)
>
> So looks like Cuttlefish is behaving as expected. Is this due to tweaks in the
> 'choose' algorithm in the later code?
Yes. Glad to hear it's working!
Just keep in mind that when moving from one map/distribution to another, if we
find that the old distribution provided more locations than the new one
(e.g., because a rack is down), rados will keep the old copy around. I
didn't follow your procedure closely, but that may explain part of what
you saw.
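The "keep the old copy around" behaviour can be sketched with a small model. This is a deliberate simplification written for illustration, not how the OSD code computes acting sets: the function `model_acting` and its parameters are hypothetical. It starts from the new up set and, while the pg is short of replicas, keeps any OSD that held an old copy and is still alive.

```python
def model_acting(old_acting, new_up, live_osds, pool_size=2):
    """Simplified model: the new up set plus surviving holders of old copies."""
    acting = list(new_up)
    for osd in old_acting:
        if len(acting) >= pool_size:
            break
        # Keep an old copy if its OSD is still alive and not already chosen.
        if osd in live_osds and osd not in acting:
            acting.append(osd)
    return acting


if __name__ == "__main__":
    live = {0, 1}  # rack1 (osd.2 and osd.3) is down
    # smallnode5.dat: was [3,0], CRUSH now maps it to osd.1 only.
    print(model_acting([3, 0], [1], live))  # [1, 0]
    # smallnode0.dat: was [1,2], the only surviving copy is already on osd.1.
    print(model_acting([1, 2], [1], live))  # [1]
```

With the nine mappings listed in the first message, this model gives the same acting sets as the e55 output: only the pgs whose new up OSD differs from the surviving old holder (smallnode5.dat and smallnode8.dat) end up with both osd.0 and osd.1.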
Cheers-
sage