* Crushmap ruleset for rack aware PG placement
@ 2014-09-15  7:47 Amit Vijairania
       [not found] ` <CADgBPFCVUVQZtHNkSsuMhskL2X-FTBTYy6zcav3CRpNCBGHJgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Amit Vijairania @ 2014-09-15  7:47 UTC (permalink / raw)
  To: ceph-users@lists.ceph.com, William Bloom (wibloom),
	Cecil Lee -X (cecille - TEKSYSTEMS INC at Cisco),
	ceph-devel@vger.kernel.org

Hello!

In a two (2) rack Ceph cluster, with 15 hosts per rack (10 OSDs per
host / 150 OSDs per rack), is it possible to create a ruleset for a
pool such that the Primary and Secondary PGs/replicas are placed in
one rack and the Tertiary PG/replica is placed in the other rack?

root standard {
  id -1 # do not change unnecessarily
  # weight 734.400
  alg straw
  hash 0 # rjenkins1
  item rack1 weight 367.200
  item rack2 weight 367.200
}

Given there are only two (2) buckets, but three (3) replicas, is it
even possible?

I think the following Giant blueprint is trying to address the scenario I
described above. Is this blueprint targeted for the Giant
release?
http://wiki.ceph.com/Planning/Blueprints/Giant/crush_extension_for_more_flexible_object_placement


Regards,
Amit Vijairania  |  Cisco Systems, Inc.
--*--


* Re: Crushmap ruleset for rack aware PG placement
       [not found] ` <CADgBPFCVUVQZtHNkSsuMhskL2X-FTBTYy6zcav3CRpNCBGHJgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-15 15:28   ` Sage Weil
  2014-09-15 18:21     ` Amit Vijairania
       [not found]     ` <alpine.DEB.2.00.1409150826060.513-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 2 replies; 10+ messages in thread
From: Sage Weil @ 2014-09-15 15:28 UTC (permalink / raw)
  To: Amit Vijairania
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org,
	Cecil Lee -X (cecille - TEKSYSTEMS INC at Cisco),
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi Amit,

On Mon, 15 Sep 2014, Amit Vijairania wrote:
> Hello!
> 
> In a two (2) rack Ceph cluster, with 15 hosts per rack (10 OSD per
> host / 150 OSDs per rack), is it possible to create a ruleset for a
> pool such that the Primary and Secondary PGs/replicas are placed in
> one rack and Tertiary PG/replica is placed in the other rack?
> 
> root standard {
>   id -1 # do not change unnecessarily
>   # weight 734.400
>   alg straw
>   hash 0 # rjenkins1
>   item rack1 weight 367.200
>   item rack2 weight 367.200
> }
> 
> Given there are only two (2) buckets, but three (3) replica, is it
> even possible?

Yes:

rule myrule {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 2 type rack
	step chooseleaf firstn 2 type host
	step emit
}

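That will give you 4 OSDs: one on each of 2 hosts in each of the 2 racks.
The pool size (replication factor) is 3, so RADOS will just use the first
three (2 hosts in the first rack, 1 host in the second rack).  Your root
bucket is named "standard", so use "step take standard" rather than
"step take default".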
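
To make the selection order concrete, here is a minimal Python sketch of what the rule does. It is not the CRUSH algorithm: the hash-based pick() helper, the build_cluster() topology and all bucket names are assumptions made up for illustration, and bucket weights are ignored. It only shows why taking the first three of the four emitted OSDs yields two replicas in one rack and one in the other.

```python
import hashlib


def pick(items, x, salt, n):
    """Pick n distinct items, ranked by a hash of (x, salt, item).

    Stand-in for a CRUSH bucket choice; real CRUSH uses rjenkins hashing
    and bucket weights, which this sketch ignores.
    """
    def score(item):
        digest = hashlib.sha256(f"{x}:{salt}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(items, key=score)[:n]


def build_cluster(racks=2, hosts_per_rack=15, osds_per_host=10):
    """Hypothetical topology from the thread: rack -> host -> OSD ids."""
    cluster, osd = {}, 0
    for r in range(1, racks + 1):
        hosts = {}
        for h in range(1, hosts_per_rack + 1):
            hosts[f"rack{r}-host{h}"] = list(range(osd, osd + osds_per_host))
            osd += osds_per_host
        cluster[f"rack{r}"] = hosts
    return cluster


def map_pg(cluster, x, pool_size=3):
    """Mimic: choose firstn 2 type rack; chooseleaf firstn 2 type host; emit."""
    selected = []
    for rack in pick(list(cluster), x, "rack", 2):
        for host in pick(list(cluster[rack]), x, rack, 2):
            osd = pick(cluster[rack][host], x, host, 1)[0]
            selected.append((rack, host, osd))
    # The rule emits 4 OSDs; the pool only uses the first pool_size of them.
    return selected[:pool_size]


cluster = build_cluster()
for x in range(1, 4):
    print(x, map_pg(cluster, x))
```

Which rack comes first varies with the input x, so in this sketch the rack holding two replicas differs from one placement group to the next.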

sage





* Re: Crushmap ruleset for rack aware PG placement
  2014-09-15 15:28   ` Sage Weil
@ 2014-09-15 18:21     ` Amit Vijairania
       [not found]     ` <alpine.DEB.2.00.1409150826060.513-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  1 sibling, 0 replies; 10+ messages in thread
From: Amit Vijairania @ 2014-09-15 18:21 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-users@lists.ceph.com, William Bloom (wibloom),
	Cecil Lee -X (cecille - TEKSYSTEMS INC at Cisco),
	ceph-devel@vger.kernel.org

Thanks Sage!  We will test this and share our observations.

Regards,
Amit

Amit Vijairania  |  415.610.9908
--*--



* Re: Crushmap ruleset for rack aware PG placement
       [not found]     ` <alpine.DEB.2.00.1409150826060.513-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-09-16 10:02       ` Daniel Swarbrick
       [not found]         ` <54184711.3080002@dachary.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Swarbrick @ 2014-09-16 10:02 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA

On 15/09/14 17:28, Sage Weil wrote:
>
> rule myrule {
> 	ruleset 1
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step choose firstn 2 type rack
> 	step chooseleaf firstn 2 type host
> 	step emit
> }
>
> That will give you 4 osds, spread across 2 hosts in each rack.  The pool 
> size (replication factor) is 3, so RADOS will just use the first three (2 
> hosts in first rack, 1 host in second rack).

I have a similar requirement, where we currently have four nodes, two in
each fire zone, with pool size 3. At the moment, due to the number of
nodes, we are guaranteed at least one replica in each fire zone (which
we represent with bucket type "room"). If we add more nodes in future,
the current ruleset may cause all three replicas of a PG to land in a
single zone.

I tried the ruleset suggested above (replacing "rack" with "room"), but
when testing it with crushtool --test --show-utilization, I simply get
segfaults. No amount of fiddling around seems to make it work - even
adding two new hypothetical nodes to the crushmap doesn't help.

What could I perhaps be doing wrong?


* Re: [ceph-users] Crushmap ruleset for rack aware PG placement
       [not found]               ` <541945F8.60001@dachary.org>
@ 2014-09-17 14:42                 ` Johnu George (johnugeo)
  2014-09-17 16:10                   ` Loic Dachary
  0 siblings, 1 reply; 10+ messages in thread
From: Johnu George (johnugeo) @ 2014-09-17 14:42 UTC (permalink / raw)
  To: ceph-devel; +Cc: Loic Dachary

Adding ceph-devel 

On 9/17/14, 1:27 AM, "Loic Dachary" <loic@dachary.org> wrote:

>
>Could you resend with ceph-devel in cc ? It's better for archive purposes
>;-)
>
>On 17/09/2014 09:37, Johnu George (johnugeo) wrote:
>> Hi Sage,
>> I was looking at the crash that was reported in this mail chain. The
>> crash happens when the number of replicas configured is less than the
>> total number of OSDs the rule selects. This is because the CRUSH
>> temporary buffers are allocated according to num_rep (the scratch array
>> has size num_rep * 3), so when the rule selects more OSDs than that, a
>> buffer overflow occurs and causes the error/crash. I saw your earlier
>> comment in this thread where you suggested a rule that selects two OSDs
>> per rack (2 racks) with num_rep=3. I think the buffer overflow should
>> happen in that situation too, and can cause an out-of-array access. Am I
>> wrong somewhere, or am I missing something?
>> 
>> Johnu
>> 
>> On 9/16/14, 9:39 AM, "Daniel Swarbrick"
>> <daniel.swarbrick@profitbricks.com> wrote:
>> 
>>> Hi Loic,
>>>
>>> Thanks for providing a detailed example. I'm able to run the example
>>> that you provide, and also got my own live crushmap to produce some
>>> results, when I appended the "--num-rep 3" option to the command.
>>> Without that option, even your example is throwing segfaults - maybe a
>>> bug in crushtool?
>>>
>>> One other area I wasn't sure about - can the final "chooseleaf" step
>>> specify "firstn 0" for simplicity's sake (and to automatically handle a
>>> larger pool size in future) ? Would there be any downside to this?
>>>
>>> Cheers
>>>
>>> On 16/09/14 16:20, Loic Dachary wrote:
>>>> Hi Daniel,
>>>>
>>>> When I run
>>>>
>>>> crushtool --outfn crushmap --build --num_osds 100 \
>>>>     host straw 2 rack straw 10 default straw 0
>>>> crushtool -d crushmap -o crushmap.txt
>>>> cat >> crushmap.txt <<EOF
>>>> rule myrule {
>>>> 	ruleset 1
>>>> 	type replicated
>>>> 	min_size 1
>>>> 	max_size 10
>>>> 	step take default
>>>> 	step choose firstn 2 type rack
>>>> 	step chooseleaf firstn 2 type host
>>>> 	step emit
>>>> }
>>>> EOF
>>>> crushtool -c crushmap.txt -o crushmap
>>>> crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 \
>>>>     --max-x 10 --num-rep 3
>>>>
>>>> I get
>>>>
>>>> rule 1 (myrule), x = 1..10, numrep = 3..3
>>>> CRUSH rule 1 x 1 [79,69,10]
>>>> CRUSH rule 1 x 2 [56,58,60]
>>>> CRUSH rule 1 x 3 [30,26,19]
>>>> CRUSH rule 1 x 4 [14,8,69]
>>>> CRUSH rule 1 x 5 [7,4,88]
>>>> CRUSH rule 1 x 6 [54,52,37]
>>>> CRUSH rule 1 x 7 [69,67,19]
>>>> CRUSH rule 1 x 8 [51,46,83]
>>>> CRUSH rule 1 x 9 [55,56,35]
>>>> CRUSH rule 1 x 10 [54,51,95]
>>>> rule 1 (myrule) num_rep 3 result size == 3:	10/10
>>>>
>>>> What command are you running to get a core dump ?
>>>>
>>>> Cheers
>>>>

* Re: [ceph-users] Crushmap ruleset for rack aware PG placement
  2014-09-17 14:42                 ` [ceph-users] " Johnu George (johnugeo)
@ 2014-09-17 16:10                   ` Loic Dachary
  2014-09-17 20:03                     ` Johnu George (johnugeo)
  0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-09-17 16:10 UTC (permalink / raw)
  To: Johnu George (johnugeo), ceph-devel


Hi,

If the number of replicas desired is 1, then

https://github.com/ceph/ceph/blob/firefly/src/crush/CrushWrapper.h#L915

will be called with maxout = 1, and scratch will hold maxout * 3 entries. But if the rule always selects 4 items, then it overflows. Is that what you also read?
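
The sizing problem can be illustrated with a toy model. The Python sketch below is not Ceph's code: the Scratch class, the three-equal-regions layout and run_rule() are assumptions built only from the "maxout * 3" figure quoted in this thread. Python cannot overrun a list the way C overruns an array, so the model raises an error at the point where an unchecked C write would land outside the region it was given.

```python
class Scratch:
    """Toy model of a working area of maxout * 3 slots in three regions.

    Assumption: each region holds maxout entries, as suggested by
    "scratch will be maxout * 3". This is not Ceph's actual code.
    """

    def __init__(self, maxout):
        self.region_size = maxout
        self.cells = [None] * (maxout * 3)

    def write(self, region, slot, value):
        # C would not check this bound; the model raises instead.
        if not 0 <= slot < self.region_size:
            raise IndexError(
                f"slot {slot} is outside a region of {self.region_size} entries"
            )
        self.cells[region * self.region_size + slot] = value


def run_rule(items_selected, maxout):
    """Write items_selected results into region 0 and report whether they fit."""
    scratch = Scratch(maxout)
    try:
        for slot in range(items_selected):
            scratch.write(0, slot, f"osd-{slot}")
    except IndexError as err:
        return f"overflow: {err}"
    return "ok"


# A rule that always selects 4 items, tried with three values of maxout.
for maxout in (1, 3, 4):
    print(maxout, run_rule(4, maxout))
```

In this model a rule that selects 4 items only fits once maxout is at least 4, which matches the reading that the buffers are sized from the replica count rather than from what the rule can produce.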

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: [ceph-users] Crushmap ruleset for rack aware PG placement
  2014-09-17 16:10                   ` Loic Dachary
@ 2014-09-17 20:03                     ` Johnu George (johnugeo)
  2014-09-17 20:11                       ` Loic Dachary
  0 siblings, 1 reply; 10+ messages in thread
From: Johnu George (johnugeo) @ 2014-09-17 20:03 UTC (permalink / raw)
  To: Loic Dachary, ceph-devel

Loic,
      You are right.  Are we planning to support configurations where
the replica count differs from the number of OSDs selected by a rule?
If not, one solution is to add a validation check when a rule is activated
for a pool with a specific replica count.

Johnu


* Re: [ceph-users] Crushmap ruleset for rack aware PG placement
  2014-09-17 20:03                     ` Johnu George (johnugeo)
@ 2014-09-17 20:11                       ` Loic Dachary
  2014-09-17 22:40                         ` Johnu George (johnugeo)
  0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-09-17 20:11 UTC (permalink / raw)
  To: Johnu George (johnugeo), ceph-devel




On 17/09/2014 22:03, Johnu George (johnugeo) wrote:
> Loic,
>       You are right.  Are we planning to support configurations where
> replica number is different from the number of osds selected from a  rule?

I think crush should support it, yes. If a rule can provide 10 OSDs there is no reason for it to fail to provide just one.

Cheers

> If not, One solution is to add a validation check when a rule is activated
> for a pool of a specific replica.
> 
> Johnu
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: [ceph-users] Crushmap ruleset for rack aware PG placement
  2014-09-17 20:11                       ` Loic Dachary
@ 2014-09-17 22:40                         ` Johnu George (johnugeo)
  2014-09-18  2:03                           ` Chen, Xiaoxi
  0 siblings, 1 reply; 10+ messages in thread
From: Johnu George (johnugeo) @ 2014-09-17 22:40 UTC (permalink / raw)
  To: Loic Dachary, ceph-devel

In that case, we can size the scratch array in
crush/CrushWrapper.h#L919 for the maximum number of osds that the rule can
select. Since we know the rule number, it should be possible to calculate
that maximum.
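
As a rough illustration of that calculation (a sketch only, not the Ceph
code): the snippet below assumes a rule can be reduced to a list of
(op, n) pairs, and the helper name max_items_selected is made up for this
example. It follows the documented firstn convention, where n > 0 selects
n items and n <= 0 selects num_rep + n, and it treats each choose step as
multiplying the size of the working set.

```python
# Hypothetical, simplified model of a CRUSH rule: a list of (op, n) steps.
# This is an illustration of the idea discussed here, not the Ceph code.

def max_items_selected(steps, num_rep):
    """Upper bound on how many items the rule's working set can hold."""
    width = 0    # size of the current working set
    widest = 0   # largest working set seen so far
    for op, n in steps:
        if op == "take":
            width = 1                     # start from a single bucket
        elif op in ("choose", "chooseleaf"):
            # firstn n: n > 0 picks n items, n <= 0 picks num_rep + n
            count = n if n > 0 else num_rep + n
            width *= max(count, 0)        # each input yields up to `count`
        elif op == "emit":
            width = 0                     # working set is flushed
        widest = max(widest, width)
    return widest


# The rule quoted in this thread:
#   step take default
#   step choose firstn 2 type rack
#   step chooseleaf firstn 2 type host
#   step emit
MYRULE = [("take", 0), ("choose", 2), ("chooseleaf", 2), ("emit", 0)]

for num_rep in (1, 3):
    print(num_rep, max_items_selected(MYRULE, num_rep))
```

For the rule quoted in this thread the bound is 4 whatever num_rep is,
which is the case where the rule selects more items than the replica
count.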

Johnu

On 9/17/14, 1:11 PM, "Loic Dachary" <loic@dachary.org> wrote:

>
>
>On 17/09/2014 22:03, Johnu George (johnugeo) wrote:
>> Loic,
>>       You are right.  Are we planning to support configurations where
>> replica number is different from the number of osds selected from a rule?
>
>I think crush should support it, yes. If a rule can provide 10 OSDs there
>is no reason for it to fail to provide just one.
>
>Cheers
>
>> If not, one solution is to add a validation check when a rule is activated
>> for a pool with a specific replica count.
>> 
>> Johnu
>> 
>> On 9/17/14, 9:10 AM, "Loic Dachary" <loic@dachary.org> wrote:
>> 
>>> Hi,
>>>
>>> If the number of replica desired is 1, then
>>>
>>> https://github.com/ceph/ceph/blob/firefly/src/crush/CrushWrapper.h#L915
>>>
>>> will be called with maxout = 1 and scratch will be maxout * 3. But if the
>>> rule always selects 4 items, then it overflows. Is it what you also read ?
>>>
>>> Cheers
>>>
>>> On 17/09/2014 16:42, Johnu George (johnugeo) wrote:
>>>> Adding ceph-devel
>>>>
>>>> On 9/17/14, 1:27 AM, "Loic Dachary" <loic@dachary.org> wrote:
>>>>
>>>>>
>>>>> Could you resend with ceph-devel in cc ? It's better for archive
>>>>> purposes
>>>>> ;-)
>>>>>
>>>>> On 17/09/2014 09:37, Johnu George (johnugeo) wrote:
>>>>>> Hi Sage,
>>>>>>          I was looking at the crash that was reported in this mail chain.
>>>>>> I am seeing that the crash happens when number of replicas configured is
>>>>>> less than total number of osds to be selected as per rule. This is
>>>>>> because, the crush temporary buffers are allocated as per num_rep size.
>>>>>> (scratch array has size num_rep * 3) So, when number of osds to be
>>>>>> selected is more, buffer overflow happens and it causes error/crash. I saw
>>>>>> your earlier comment in this mail where you asked to create a rule that
>>>>>> selects two osds per rack (2 racks) with num_rep=3. I feel that buffer
>>>>>> overflow issue should happen in this situation too, that can cause 'out of
>>>>>> array' access. Am I wrong somewhere or am I missing something?
>>>>>>
>>>>>> Johnu
>>>>>>
>>>>>> On 9/16/14, 9:39 AM, "Daniel Swarbrick"
>>>>>> <daniel.swarbrick@profitbricks.com> wrote:
>>>>>>
>>>>>>> Hi Loic,
>>>>>>>
>>>>>>> Thanks for providing a detailed example. I'm able to run the example
>>>>>>> that you provide, and also got my own live crushmap to produce some
>>>>>>> results, when I appended the "--num-rep 3" option to the command.
>>>>>>> Without that option, even your example is throwing segfaults - maybe a
>>>>>>> bug in crushtool?
>>>>>>>
>>>>>>> One other area I wasn't sure about - can the final "chooseleaf" step
>>>>>>> specify "firstn 0" for simplicity's sake (and to automatically handle a
>>>>>>> larger pool size in future) ? Would there be any downside to this?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 16/09/14 16:20, Loic Dachary wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> When I run
>>>>>>>>
>>>>>>>> crushtool --outfn crushmap --build --num_osds 100 host straw 2 rack straw 10 default straw 0
>>>>>>>> crushtool -d crushmap -o crushmap.txt
>>>>>>>> cat >> crushmap.txt <<EOF
>>>>>>>> rule myrule {
>>>>>>>> 	ruleset 1
>>>>>>>> 	type replicated
>>>>>>>> 	min_size 1
>>>>>>>> 	max_size 10
>>>>>>>> 	step take default
>>>>>>>> 	step choose firstn 2 type rack
>>>>>>>> 	step chooseleaf firstn 2 type host
>>>>>>>> 	step emit
>>>>>>>> }
>>>>>>>> EOF
>>>>>>>> crushtool -c crushmap.txt -o crushmap
>>>>>>>> crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 3
>>>>>>>>
>>>>>>>> I get
>>>>>>>>
>>>>>>>> rule 1 (myrule), x = 1..10, numrep = 3..3
>>>>>>>> CRUSH rule 1 x 1 [79,69,10]
>>>>>>>> CRUSH rule 1 x 2 [56,58,60]
>>>>>>>> CRUSH rule 1 x 3 [30,26,19]
>>>>>>>> CRUSH rule 1 x 4 [14,8,69]
>>>>>>>> CRUSH rule 1 x 5 [7,4,88]
>>>>>>>> CRUSH rule 1 x 6 [54,52,37]
>>>>>>>> CRUSH rule 1 x 7 [69,67,19]
>>>>>>>> CRUSH rule 1 x 8 [51,46,83]
>>>>>>>> CRUSH rule 1 x 9 [55,56,35]
>>>>>>>> CRUSH rule 1 x 10 [54,51,95]
>>>>>>>> rule 1 (myrule) num_rep 3 result size == 3:	10/10
>>>>>>>>
>>>>>>>> What command are you running to get a core dump ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 16/09/2014 12:02, Daniel Swarbrick wrote:
>>>>>>>>> On 15/09/14 17:28, Sage Weil wrote:
>>>>>>>>>> rule myrule {
>>>>>>>>>> 	ruleset 1
>>>>>>>>>> 	type replicated
>>>>>>>>>> 	min_size 1
>>>>>>>>>> 	max_size 10
>>>>>>>>>> 	step take default
>>>>>>>>>> 	step choose firstn 2 type rack
>>>>>>>>>> 	step chooseleaf firstn 2 type host
>>>>>>>>>> 	step emit
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> That will give you 4 osds, spread across 2 hosts in each rack. The pool
>>>>>>>>>> size (replication factor) is 3, so RADOS will just use the first three
>>>>>>>>>> (2 hosts in first rack, 1 host in second rack).
>>>>>>>>> I have a similar requirement, where we currently have four nodes, two in
>>>>>>>>> each fire zone, with pool size 3. At the moment, due to the number of
>>>>>>>>> nodes, we are guaranteed at least one replica in each fire zone (which
>>>>>>>>> we represent with bucket type "room"). If we add more nodes in future,
>>>>>>>>> the current ruleset may cause all three replicas of a PG to land in a
>>>>>>>>> single zone.
>>>>>>>>>
>>>>>>>>> I tried the ruleset suggested above (replacing "rack" with "room"), but
>>>>>>>>> when testing it with crushtool --test --show-utilization, I simply get
>>>>>>>>> segfaults. No amount of fiddling around seems to make it work - even
>>>>>>>>> adding two new hypothetical nodes to the crushmap doesn't help.
>>>>>>>>>
>>>>>>>>> What could I perhaps be doing wrong?
>>>>>>>>>
>>>>>>
>>>>>
>>>>> -- 
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>> 
>
>-- 
>Loïc Dachary, Artisan Logiciel Libre
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: [ceph-users] Crushmap ruleset for rack aware PG placement
  2014-09-17 22:40                         ` Johnu George (johnugeo)
@ 2014-09-18  2:03                           ` Chen, Xiaoxi
  0 siblings, 0 replies; 10+ messages in thread
From: Chen, Xiaoxi @ 2014-09-18  2:03 UTC (permalink / raw)
  To: Johnu George (johnugeo), Loic Dachary, ceph-devel

The rule has a max_size; can we just use that value?
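
A small sketch of what that would change, under stated assumptions: the
"maxout * 3" sizing is the one described further down this thread, and
the function name scratch_entries is hypothetical. Whether max_size is
always at least as large as what a rule can select is not established in
this thread, so treat this as an illustration only.

```python
# Illustration only: compares the scratch size described in this thread
# (maxout * 3, with maxout = num_rep) against sizing from the rule's
# max_size instead. scratch_entries is a made-up helper name.

def scratch_entries(num_rep, max_size=None):
    """Number of scratch entries allocated for one mapping."""
    # Current behaviour per the thread: maxout is the replica count.
    # Proposed here: take the rule's max_size when it is supplied.
    maxout = num_rep if max_size is None else max_size
    return maxout * 3


# The rule in this thread has max_size 10 and selects 4 items.
print(scratch_entries(1))               # sized from num_rep = 1
print(scratch_entries(1, max_size=10))  # sized from max_size = 10
```

With max_size 10 the allocation no longer depends on num_rep, so the
num_rep = 1 case is sized the same as any other for this rule.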

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Johnu George (johnugeo)
Sent: Thursday, September 18, 2014 6:41 AM
To: Loic Dachary; ceph-devel
Subject: Re: [ceph-users] Crushmap ruleset for rack aware PG placement

In such a case, we can initialize scratch array in
crush/CrushWrapper.h#L919 with maximum number of osds that can be selected. Since we know the rule no, it should be possible to calculate the maximum osds that can be selected.

Johnu

On 9/17/14, 1:11 PM, "Loic Dachary" <loic@dachary.org> wrote:

>
>
>On 17/09/2014 22:03, Johnu George (johnugeo) wrote:
>> Loic,
>>       You are right.  Are we planning to support configurations where
>> replica number is different from the number of osds selected from a rule?
>
>I think crush should support it, yes. If a rule can provide 10 OSDs 
>there is no reason for it to fail to provide just one.
>
>Cheers
>
>> If not, one solution is to add a validation check when a rule is activated
>> for a pool with a specific replica count.
>> 
>> Johnu
>> 
>> On 9/17/14, 9:10 AM, "Loic Dachary" <loic@dachary.org> wrote:
>> 
>>> Hi,
>>>
>>> If the number of replica desired is 1, then
>>>
>>> https://github.com/ceph/ceph/blob/firefly/src/crush/CrushWrapper.h#L915
>>>
>>> will be called with maxout = 1 and scratch will be maxout * 3. But if the
>>> rule always selects 4 items, then it overflows. Is it what you also read ?
>>>
>>> Cheers
>>>
>>> On 17/09/2014 16:42, Johnu George (johnugeo) wrote:
>>>> Adding ceph-devel
>>>>
>>>> On 9/17/14, 1:27 AM, "Loic Dachary" <loic@dachary.org> wrote:
>>>>
>>>>>
>>>>> Could you resend with ceph-devel in cc ? It's better for archive 
>>>>> purposes
>>>>> ;-)
>>>>>
>>>>> On 17/09/2014 09:37, Johnu George (johnugeo) wrote:
>>>>>> Hi Sage,
>>>>>>          I was looking at the crash that was reported in this mail chain.
>>>>>> I am seeing that the crash happens when number of replicas configured is
>>>>>> less than total number of osds to be selected as per rule. This is
>>>>>> because, the crush temporary buffers are allocated as per num_rep size.
>>>>>> (scratch array has size num_rep * 3) So, when number of osds to be
>>>>>> selected is more, buffer overflow happens and it causes error/crash. I saw
>>>>>> your earlier comment in this mail where you asked to create a rule that
>>>>>> selects two osds per rack (2 racks) with num_rep=3. I feel that buffer
>>>>>> overflow issue should happen in this situation too, that can cause 'out of
>>>>>> array' access. Am I wrong somewhere or am I missing something?
>>>>>>
>>>>>> Johnu
>>>>>>
>>>>>> On 9/16/14, 9:39 AM, "Daniel Swarbrick"
>>>>>> <daniel.swarbrick@profitbricks.com> wrote:
>>>>>>
>>>>>>> Hi Loic,
>>>>>>>
>>>>>>> Thanks for providing a detailed example. I'm able to run the example
>>>>>>> that you provide, and also got my own live crushmap to produce some
>>>>>>> results, when I appended the "--num-rep 3" option to the command.
>>>>>>> Without that option, even your example is throwing segfaults - maybe a
>>>>>>> bug in crushtool?
>>>>>>>
>>>>>>> One other area I wasn't sure about - can the final "chooseleaf" step
>>>>>>> specify "firstn 0" for simplicity's sake (and to automatically handle a
>>>>>>> larger pool size in future) ? Would there be any downside to this?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 16/09/14 16:20, Loic Dachary wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> When I run
>>>>>>>>
>>>>>>>> crushtool --outfn crushmap --build --num_osds 100 host straw 2 rack straw 10 default straw 0
>>>>>>>> crushtool -d crushmap -o crushmap.txt
>>>>>>>> cat >> crushmap.txt <<EOF
>>>>>>>> rule myrule {
>>>>>>>> 	ruleset 1
>>>>>>>> 	type replicated
>>>>>>>> 	min_size 1
>>>>>>>> 	max_size 10
>>>>>>>> 	step take default
>>>>>>>> 	step choose firstn 2 type rack
>>>>>>>> 	step chooseleaf firstn 2 type host
>>>>>>>> 	step emit
>>>>>>>> }
>>>>>>>> EOF
>>>>>>>> crushtool -c crushmap.txt -o crushmap
>>>>>>>> crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 3
>>>>>>>>
>>>>>>>> I get
>>>>>>>>
>>>>>>>> rule 1 (myrule), x = 1..10, numrep = 3..3
>>>>>>>> CRUSH rule 1 x 1 [79,69,10]
>>>>>>>> CRUSH rule 1 x 2 [56,58,60]
>>>>>>>> CRUSH rule 1 x 3 [30,26,19]
>>>>>>>> CRUSH rule 1 x 4 [14,8,69]
>>>>>>>> CRUSH rule 1 x 5 [7,4,88]
>>>>>>>> CRUSH rule 1 x 6 [54,52,37]
>>>>>>>> CRUSH rule 1 x 7 [69,67,19]
>>>>>>>> CRUSH rule 1 x 8 [51,46,83]
>>>>>>>> CRUSH rule 1 x 9 [55,56,35]
>>>>>>>> CRUSH rule 1 x 10 [54,51,95]
>>>>>>>> rule 1 (myrule) num_rep 3 result size == 3:	10/10
>>>>>>>>
>>>>>>>> What command are you running to get a core dump ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 16/09/2014 12:02, Daniel Swarbrick wrote:
>>>>>>>>> On 15/09/14 17:28, Sage Weil wrote:
>>>>>>>>>> rule myrule {
>>>>>>>>>> 	ruleset 1
>>>>>>>>>> 	type replicated
>>>>>>>>>> 	min_size 1
>>>>>>>>>> 	max_size 10
>>>>>>>>>> 	step take default
>>>>>>>>>> 	step choose firstn 2 type rack
>>>>>>>>>> 	step chooseleaf firstn 2 type host
>>>>>>>>>> 	step emit
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> That will give you 4 osds, spread across 2 hosts in each rack. The pool
>>>>>>>>>> size (replication factor) is 3, so RADOS will just use the first three
>>>>>>>>>> (2 hosts in first rack, 1 host in second rack).
>>>>>>>>> I have a similar requirement, where we currently have four nodes, two in
>>>>>>>>> each fire zone, with pool size 3. At the moment, due to the number of
>>>>>>>>> nodes, we are guaranteed at least one replica in each fire zone (which
>>>>>>>>> we represent with bucket type "room"). If we add more nodes in future,
>>>>>>>>> the current ruleset may cause all three replicas of a PG to land in a
>>>>>>>>> single zone.
>>>>>>>>>
>>>>>>>>> I tried the ruleset suggested above (replacing "rack" with "room"), but
>>>>>>>>> when testing it with crushtool --test --show-utilization, I simply get
>>>>>>>>> segfaults. No amount of fiddling around seems to make it work - even
>>>>>>>>> adding two new hypothetical nodes to the crushmap doesn't help.
>>>>>>>>>
>>>>>>>>> What could I perhaps be doing wrong?
>>>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>> 
>
>--
>Loïc Dachary, Artisan Logiciel Libre
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


end of thread, other threads:[~2014-09-18  2:03 UTC | newest]

Thread overview: 10+ messages
2014-09-15  7:47 Crushmap ruleset for rack aware PG placement Amit Vijairania
     [not found] ` <CADgBPFCVUVQZtHNkSsuMhskL2X-FTBTYy6zcav3CRpNCBGHJgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-15 15:28   ` Sage Weil
2014-09-15 18:21     ` Amit Vijairania
     [not found]     ` <alpine.DEB.2.00.1409150826060.513-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-09-16 10:02       ` Daniel Swarbrick
     [not found]         ` <54184711.3080002@dachary.org>
     [not found]           ` <lv9p5b$1h1$1@ger.gmane.org>
     [not found]             ` <D03E84B7.B8D%johnugeo@cisco.com>
     [not found]               ` <541945F8.60001@dachary.org>
2014-09-17 14:42                 ` [ceph-users] " Johnu George (johnugeo)
2014-09-17 16:10                   ` Loic Dachary
2014-09-17 20:03                     ` Johnu George (johnugeo)
2014-09-17 20:11                       ` Loic Dachary
2014-09-17 22:40                         ` Johnu George (johnugeo)
2014-09-18  2:03                           ` Chen, Xiaoxi
