* crush rule definitions
@ 2011-05-04 18:16 Zenon Panoussis
2011-05-04 18:21 ` Sage Weil
From: Zenon Panoussis @ 2011-05-04 18:16 UTC
To: ceph-devel
Hi
I thought that min_size in crush is obvious, but the more I read the more
I tend to doubt. So I might as well ask:
With a crushmap like this
===
type 0 device
type 1 host
type 2 root
[definitions for the above; hosts contain devices, root contains hosts]
rule data {
ruleset 0
type replicated
min_size 2
max_size 2
step take root
step chooseleaf firstn 0 type host
step emit
}
===
does "min_size 2, max_size 2" mean that I want "2 copies of the data on each
host" or "2 copies of the data in total in the entire cluster"?
Z
* Re: crush rule definitions
2011-05-04 18:16 crush rule definitions Zenon Panoussis
@ 2011-05-04 18:21 ` Sage Weil
2011-05-04 19:20 ` Zenon Panoussis
From: Sage Weil @ 2011-05-04 18:21 UTC
To: Zenon Panoussis; +Cc: ceph-devel
On Wed, 4 May 2011, Zenon Panoussis wrote:
> I thought that min_size in crush is obvious, but the more I read the more
> I tend to doubt. So I might as well ask:
>
> With a crushmap like this
>
> ===
> type 0 device
> type 1 host
> type 2 root
>
> [definitions for the above; hosts contain devices, root contains hosts]
>
> rule data {
> ruleset 0
> type replicated
> min_size 2
> max_size 2
> step take root
> step chooseleaf firstn 0 type host
> step emit
> }
> ===
>
> does "min_size 2, max_size 2" mean that I want "2 copies of the data on each
> host" or "2 copies of the data in total in the entire cluster"?
Neither, actually. It means that this rule will be used when we ask crush
for ruleset 0 and 2 replicas. If you change a pg to have 3x replication,
ceph will ask for ruleset 0 and 3 replicas, and this rule won't be used.
You probably want min_size 1 and max_size 10.
The motivation is that you might have different placement rules depending
on how many replicas there are. That's why it's "ruleset 0" and not
"rule 0".
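A minimal sketch of the selection described above, written in Python rather than taken from the ceph source. The names `Rule` and `find_rule` are hypothetical; the only behaviour assumed is what this message states: a rule is used when the requested ruleset matches and the requested replica count falls within min_size..max_size.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical stand-in for one 'rule { ... }' block of a crushmap."""
    name: str
    ruleset: int
    rule_type: str
    min_size: int
    max_size: int


def find_rule(rules: List[Rule], ruleset: int, rule_type: str,
              num_replicas: int) -> Optional[Rule]:
    """Return the first rule in the ruleset whose size range covers num_replicas."""
    for rule in rules:
        if (rule.ruleset == ruleset
                and rule.rule_type == rule_type
                and rule.min_size <= num_replicas <= rule.max_size):
            return rule
    return None  # no rule applies to this replica count


# The rule from the question only covers exactly 2 replicas.
narrow = [Rule("data", 0, "replicated", 2, 2)]
# The suggested range covers anything from 1 to 10 replicas.
wide = [Rule("data", 0, "replicated", 1, 10)]

print(find_rule(narrow, 0, "replicated", 3))     # None: 3 is outside 2..2
print(find_rule(wide, 0, "replicated", 3).name)  # data
```

Read this way, min_size and max_size are a filter on when the rule applies, not a statement about how many copies get stored.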
sage
* Re: crush rule definitions
2011-05-04 18:21 ` Sage Weil
@ 2011-05-04 19:20 ` Zenon Panoussis
2011-05-04 19:28 ` Sage Weil
From: Zenon Panoussis @ 2011-05-04 19:20 UTC
To: ceph-devel
On 05/04/2011 08:21 PM, Sage Weil wrote:
>> does "min_size 2, max_size 2" mean that I want "2 copies of the data on each
>> host" or "2 copies of the data in total in the entire cluster"?
> Neither, actually. It means that this rule will be used when we ask crush
> for ruleset 0 and 2 replicas. If you change a pg to have 3x replication,
> ceph will ask for ruleset 0 and 3 replicas, and this rule won't be used.
In other words, the total number of replicas in the cluster is determined on
the PG level? But then how do I control which PGs are physically stored where?
> You probably want min_size 1 and max_size 10.
Taking what you just wrote together with a re-reading of the wiki, I must admit
that I still don't quite grasp it. The wiki says
That is, when placing object replicas, we start at the root hierarchy, and
choose N items of type 'device'. ('0' means to grab however many replicas.
The rules are written to be general for some range of N, 1-10 in this case.)
What I make out of all this is that
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take root
step choose firstn 0 type device
step emit
}
means that IF the PGs are set to create anything between 1 and 10 replicas, then
the replicas should be placed on devices, using an unlimited number of devices.
Is that correct?
My problem really is how to configure ceph to put exactly 1 replica of the data
(and metadata) on each and every target of some kind. For example, if I have
10 racks, I want exactly 1 copy of the data in each rack, no more, no less (and
I don't care which host in that rack the data lands on). If I have 10 hosts,
I want exactly 1 copy of the data on each host (and I don't care which OSD on
that host the data lands on). If I only have 10 OSDs, I want exactly 1 copy of
the data on each and every OSD.
Assuming that the number of targets is fixed and known, what is the way to do
this?
And going back to PGs, if "ceph osd dump -o -|grep pg_size" says
pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 66 owner 0)
and "ceph -w" says
pg v319405: 528 pgs: 528 active+clean; 22702 MB data, 77093 MB used, 346 GB / 446 GB avail
how do the 128 PGs of "ceph osd dump" relate to the 528 PGs of "ceph -w"?
*
As an aside, I think that, to a certain extent, improving the documentation could
contribute more to the code base than improving the actual code. You guys spend a
lot of time answering the kind of questions that I've been posing (and thank you
for doing so), while at the same time missing out on the debugging help you could
be getting instead if your user base could move past its trivial problems. If I were
your scrum master, I'd dedicate an entire sprint to the wiki alone.
Z
* Re: crush rule definitions
2011-05-04 19:20 ` Zenon Panoussis
@ 2011-05-04 19:28 ` Sage Weil
2011-05-04 19:31 ` Sage Weil
2011-05-04 20:34 ` Zenon Panoussis
From: Sage Weil @ 2011-05-04 19:28 UTC
To: Zenon Panoussis; +Cc: ceph-devel
On Wed, 4 May 2011, Zenon Panoussis wrote:
> On 05/04/2011 08:21 PM, Sage Weil wrote:
>
> >> does "min_size 2, max_size 2" mean that I want "2 copies of the data on each
> >> host" or "2 copies of the data in total in the entire cluster"?
>
> > Neither, actually. It means that this rule will be used when we ask crush
> > for ruleset 0 and 2 replicas. If you change a pg to have 3x replication,
> > ceph will ask for ruleset 0 and 3 replicas, and this rule won't be used.
>
> In other words, the total number of replicas in the cluster is determined on
> the PG level? But then how do I control which PGs are physically stored where?
>
> > You probably want min_size 1 and max_size 10.
>
> Taking what you just wrote together with a re-reading of the wiki, I must admit
> that I still don't quite grasp it. The wiki says
>
> That is, when placing object replicas, we start at the root hierarchy, and
> choose N items of type 'device'. ('0' means to grab however many replicas.
> The rules are written to be general for some range of N, 1-10 in this case.)
>
> What I make out of all this is that
>
> rule data {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take root
> step choose firstn 0 type device
> step emit
> }
>
> means that IF the PGs are set to create anything between 1 and 10 replicas, then
> the replicas should be placed on devices, using an unlimited number of devices.
>
> Is that correct?
>
> My problem really is how to configure ceph to put exactly 1 replica of the data
> (and metadata) on each and every target of some kind. For example, if I have
> 10 racks, I want exactly 1 copy of the data in each rack, no more, no less (and
> I don't care which host in that rack the data lands on). If I have 10 hosts,
> I want exactly 1 copy of the data on each host (and I don't care which OSD on
> that host the data lands on). If I only have 10 OSDs, I want exactly 1 copy of
> the data on each and every OSD.
>
> Assuming that the number of targets is fixed and known, what is the way to do
> this?
Yes. So the rule you have is right (at least up to 10 nodes). Then you
need to set the pg_size (aka replication level) for the pools you care
about. For 5x, that's
ceph osd pool set data size 4
You can see the current sizes with
ceph osd dump -o - | grep pool
and look at the pg_size attribute.
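To illustrate the division of labour here (the pool's size says how many replicas, the rule says where they go), below is a toy Python model of a "chooseleaf firstn 0 type host" step like the one in the first message of this thread. It is not the CRUSH algorithm: it substitutes a simple highest-hash-wins selection, and `chooseleaf_firstn`, `_score` and the host/OSD names are all hypothetical. It only demonstrates the property being asked about, namely that each replica lands on a different host and on one device within that host.

```python
import hashlib
from typing import Dict, List, Tuple


def _score(pg_id: int, item: str) -> int:
    """Deterministic pseudo-random score for (pg, item); a stand-in for CRUSH hashing."""
    digest = hashlib.sha256(f"{pg_id}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def chooseleaf_firstn(cluster: Dict[str, List[str]], pg_id: int,
                      num_replicas: int) -> List[Tuple[str, str]]:
    """Pick num_replicas distinct hosts, then one device inside each chosen host."""
    ranked_hosts = sorted(cluster, key=lambda h: _score(pg_id, h), reverse=True)
    placement = []
    for host in ranked_hosts[:num_replicas]:
        device = max(cluster[host], key=lambda d: _score(pg_id, d))
        placement.append((host, device))
    return placement


# Hypothetical cluster: three hosts with two devices each.
cluster = {
    "host-a": ["osd0", "osd1"],
    "host-b": ["osd2", "osd3"],
    "host-c": ["osd4", "osd5"],
}

# With the pool size equal to the number of hosts, every host holds one copy.
for host, device in chooseleaf_firstn(cluster, pg_id=7, num_replicas=3):
    print(host, device)
```

In this model, asking for more replicas than there are hosts simply returns fewer placements, so the size of the pool has to match the number of targets for "one copy on each" to hold.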
> And going back to PGs, if "ceph osd dump -o -|grep pg_size" says
>
> pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 66 owner 0)
>
> and "ceph -w" says
>
> pg v319405: 528 pgs: 528 active+clean; 22702 MB data, 77093 MB used, 346 GB / 446 GB avail
>
> how do the 128 PGs of "ceph osd dump" relate to the 528 PGs of "ceph -w"?
There are several different pools, each sliced into many pgs.
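As a small sketch of reading these per-pool numbers, the Python below parses one pg_pool line of the form quoted above. The helper name `parse_pool_line` is hypothetical, and the regular expressions assume the line layout shown in this thread, which may differ in other ceph versions.

```python
import re

# The pool line quoted in this thread.
POOL_LINE = ("pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 "
             "object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 "
             "lpgp_num 2 last_change 66 owner 0)")


def parse_pool_line(line: str) -> dict:
    """Extract the pool id, name and every 'key <integer>' pair from a pg_pool line."""
    header = re.match(r"pg_pool (\d+) '([^']+)' pg_pool\((.*)\)\s*$", line)
    if header is None:
        raise ValueError("not a pg_pool line: %r" % line)
    pool_id, name, body = header.groups()
    fields = {"id": int(pool_id), "name": name}
    # Non-numeric attributes such as 'object_hash rjenkins' are skipped.
    for key, value in re.findall(r"\b(\w+) (\d+)\b", body):
        fields[key] = int(value)
    return fields


pool = parse_pool_line(POOL_LINE)
print(pool["name"], "size:", pool["pg_size"], "pgs:", pool["pg_num"])
```

Running the same parse over every pg_pool line gives one PG count per pool, which can then be compared with the cluster-wide total that "ceph -w" prints. How the lpg_num PGs enter that total is not stated in this thread.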
> As an aside, I think that, to a certain extent, improving the
> documentation could contribute more to the code base than improving the
> actual code. You guys spend a lot of time answering the kind of
> questions that I've been posing (and thank you for doing so), while at
> the same time missing out on the debugging help you could be getting
> instead if your user base could move past its trivial problems. If I
> were your scrum master, I'd dedicate an entire sprint to the wiki alone.
The replication is covered by
http://ceph.newdream.net/wiki/Adjusting_replication_level
Any specific suggestions on how that should be improved?
sage
* Re: crush rule definitions
2011-05-04 19:28 ` Sage Weil
@ 2011-05-04 19:31 ` Sage Weil
2011-05-04 20:34 ` Zenon Panoussis
From: Sage Weil @ 2011-05-04 19:31 UTC
To: Zenon Panoussis; +Cc: ceph-devel
On Wed, 4 May 2011, Sage Weil wrote:
> about. For 5x, that's
>
> ceph osd pool set data size 4
er, for 5x, that's
ceph osd pool set data size 5
sage
* Re: crush rule definitions
2011-05-04 19:28 ` Sage Weil
2011-05-04 19:31 ` Sage Weil
@ 2011-05-04 20:34 ` Zenon Panoussis
From: Zenon Panoussis @ 2011-05-04 20:34 UTC
To: ceph-devel
On 05/04/2011 09:28 PM, Sage Weil wrote:
>> As an aside, I think that, to a certain extent, improving the
>> documentation could contribute more to the code base...
> The replication is covered by
> http://ceph.newdream.net/wiki/Adjusting_replication_level
> Any specific suggestions on how that should be improved?
The wiki is a reasonably good guide to getting ceph up and running,
but it could do a lot better in explaining the hows and whys of the system.
http://ceph.newdream.net/wiki/Adjusting_replication_level is a good example
of this. It tells you exactly how to adjust the overall replication level,
but it doesn't tell you how to control where the replicas are put. So you
go searching and you find http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH
and it explains pretty well how to implement crush rules, but it doesn't
tell you what each rule parameter actually does, nor does it (appear to) give
an exhaustive list of all the parameters that are available.
So, all in all, I think that what is mostly needed in the wiki is a general
introduction to ceph, an explanation and diagram of how it works and how
its internals relate to each other. All of this is in your thesis, I know,
but it is very difficult to connect an abstract academic paper to a concrete
configuration problem and configuration parameter.
I happily contribute to the wiki when I'm 100% sure that what I'm writing is
correct, but most of the time when I see a potential improvement to the wiki,
I lack the solid understanding that's needed for me to make that improvement
myself.
Z