placement group sizing

* placement group sizing
From: Anders Saaby @ 2013-04-25 12:39 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi,

We are working on prototype infrastructure for RADOS clusters, and are now ready to deploy the first production-size storage pool. One question remains: how many placement groups will we need to balance memory footprint against the ability to level data placement and data reads, while still keeping things within sane limits?

Our initial plan is to deploy 4PB pools, based on 4TB drives with 3 replicas (one OSD per disk). So, 3,000 disks per pool.

According to the documentation 1), we should have: 3,000 OSDs * 100 / 3 replicas == 100,000 placement groups.
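
The arithmetic above can be sketched in Python as follows. This is only an illustration of the rule of thumb as quoted here; the function names are hypothetical, and decimal units (1 PB = 1000 TB) are an assumption.

```python
def disks_needed(usable_pb: int, drive_tb: int, replicas: int) -> int:
    """Raw disk count for a pool, assuming decimal units (1 PB = 1000 TB)."""
    return usable_pb * 1000 * replicas // drive_tb


def pg_count(num_osds: int, pgs_per_osd: int, replicas: int) -> int:
    """Rule-of-thumb PG total: OSDs * PGs-per-OSD / replicas."""
    return num_osds * pgs_per_osd // replicas


# One OSD per disk, as stated in the message.
osds = disks_needed(usable_pb=4, drive_tb=4, replicas=3)
total_pgs = pg_count(num_osds=osds, pgs_per_osd=100, replicas=3)
print(osds, total_pgs)  # 3000 100000
```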

Judging from the mailing list, 100,000 PGs is far more than I have seen anyone use, so do you have any insights or advice on pg_num for a RADOS pool with these characteristics? Also, will a pg_num this big be a problem if the pool starts out with only ~100 OSDs and is then grown to 3,000?
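
To put the growth question in numbers, here is a rough Python sketch of how many PG replicas each OSD would carry at different cluster sizes if pg_num were fixed at 100,000 from the start. It assumes PGs spread perfectly evenly across OSDs, which a real cluster only approximates; the function name and the intermediate cluster sizes are hypothetical.

```python
def pg_replicas_per_osd(pg_num: int, replicas: int, num_osds: int) -> float:
    """Average PG replicas hosted per OSD, assuming a perfectly even spread."""
    return pg_num * replicas / num_osds


PG_NUM = 100_000
REPLICAS = 3

# 100 and 3000 OSDs come from the message; 500 and 1000 are illustrative.
for num_osds in (100, 500, 1000, 3000):
    per_osd = pg_replicas_per_osd(PG_NUM, REPLICAS, num_osds)
    print(f"{num_osds:>5} OSDs -> {per_osd:.0f} PG replicas per OSD")
```

With only 100 OSDs, each OSD would carry about 3,000 PG replicas, thirty times the per-OSD load at the full 3,000 OSDs.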

Thanks in advance,
Anders

1: http://ceph.com/docs/master/rados/operations/placement-groups/

* Re: placement group sizing
From: Wido den Hollander @ 2013-04-26 12:22 UTC (permalink / raw)
  To: Anders Saaby; +Cc: ceph-devel@vger.kernel.org

Hello,

On 04/25/2013 02:39 PM, Anders Saaby wrote:
> Hi,
>
> We are working on prototype infrastructure for RADOS clusters, and are now ready to deploy the first production-size storage pool. One question remains: how many placement groups will we need to balance memory footprint against the ability to level data placement and data reads, while still keeping things within sane limits?
>
> Our initial plan is to deploy 4PB pools, based on 4TB drives with 3 replicas (one OSD per disk). So, 3,000 disks per pool.
>
> According to the documentation 1), we should have: 3,000 OSDs * 100 / 3 replicas == 100,000 placement groups.
>
> Judging from the mailing list, 100,000 PGs is far more than I have seen anyone use, so do you have any insights or advice on pg_num for a RADOS pool with these characteristics? Also, will a pg_num this big be a problem if the pool starts out with only ~100 OSDs and is then grown to 3,000?
>

While the example says 100, the text above it says:

"We recommend approximately 50-100 placement groups per OSD to balance 
out memory and CPU requirements and per-OSD load"

So the question is, what is the workload going to be? What kind of data 
are you going to store? Will this be something with RBD or will it be a 
plain RADOS store?

How many OSDs per machine do you have and how much memory do you have 
per machine?

The more PGs you have, the more peering PGs you will have when an OSD 
boots again, so that could be heavy for the CPU in the machines.

Another question is how many pools you are expecting. If you start
creating 10 pools with 100,000 PGs each, you'd get an insane number of PGs.
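
As a rough Python sketch of these two points: the first function turns the 50-100 PGs per OSD recommendation into a pg_num range, and the second totals the load when several pools share the same OSDs. The function names are hypothetical, and the per-OSD figure assumes a perfectly even spread.

```python
def pg_range(num_osds: int, replicas: int, low: int = 50, high: int = 100):
    """pg_num range implied by 50-100 PGs per OSD, divided by the replica count."""
    return (num_osds * low // replicas, num_osds * high // replicas)


def replicas_per_osd(pools: int, pgs_per_pool: int, replicas: int, num_osds: int) -> float:
    """Average PG replicas per OSD when all pools share the same OSDs."""
    return pools * pgs_per_pool * replicas / num_osds


print(pg_range(3000, 3))                       # (50000, 100000)
print(10 * 100_000)                            # 1000000 PGs across 10 pools
print(replicas_per_osd(10, 100_000, 3, 3000))  # 1000.0
```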

Could you shed some light on this?

Wido

>
> Thanks in advance,
> Anders
>
> 1: http://ceph.com/docs/master/rados/operations/placement-groups/
>


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on

* Re: placement group sizing
From: Mark Nelson @ 2013-04-26 13:17 UTC (permalink / raw)
  To: Anders Saaby; +Cc: ceph-devel@vger.kernel.org

On 04/25/2013 07:39 AM, Anders Saaby wrote:
> Hi,
>
> We are working on prototype infrastructure for RADOS clusters, and are now ready to deploy the first production-size storage pool. One question remains: how many placement groups will we need to balance memory footprint against the ability to level data placement and data reads, while still keeping things within sane limits?
>
> Our initial plan is to deploy 4PB pools, based on 4TB drives with 3 replicas (one OSD per disk). So, 3,000 disks per pool.
>
> According to the documentation 1), we should have: 3,000 OSDs * 100 / 3 replicas == 100,000 placement groups.
>
> Judging from the mailing list, 100,000 PGs is far more than I have seen anyone use, so do you have any insights or advice on pg_num for a RADOS pool with these characteristics? Also, will a pg_num this big be a problem if the pool starts out with only ~100 OSDs and is then grown to 3,000?

I pretty regularly test single-mon configurations with 64k PGs.  ~100k
PGs starts to get a bit intense, but with a larger mon cluster and some
tweaking it should be doable.

I don't mean to push anything on you, but if you guys are really 
thinking about deploying multiple 4PB pools, you might want to talk to 
us about some kind of support/consulting agreement.  That's a lot of 
storage!

Mark

>
>
> Thanks in advance,
> Anders
>
> 1: http://ceph.com/docs/master/rados/operations/placement-groups/
>


* Re: placement group sizing
From: Anders Saaby @ 2013-04-26 17:07 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel@vger.kernel.org

On 26/04/2013, at 14.22, Wido den Hollander <wido@42on.com> wrote:
> Hello,
> 
> On 04/25/2013 02:39 PM, Anders Saaby wrote:
>> Hi,
>> 
>> We are working on prototype infrastructure for RADOS clusters, and are now ready to deploy the first production-size storage pool. One question remains: how many placement groups will we need to balance memory footprint against the ability to level data placement and data reads, while still keeping things within sane limits?
>> 
>> Our initial plan is to deploy 4PB pools, based on 4TB drives with 3 replicas (one OSD per disk). So, 3,000 disks per pool.
>> 
>> According to the documentation 1), we should have: 3,000 OSDs * 100 / 3 replicas == 100,000 placement groups.
>> 
>> Judging from the mailing list, 100,000 PGs is far more than I have seen anyone use, so do you have any insights or advice on pg_num for a RADOS pool with these characteristics? Also, will a pg_num this big be a problem if the pool starts out with only ~100 OSDs and is then grown to 3,000?
>> 
> 
> While the example says 100, the text above it says:
> 
> "We recommend approximately 50-100 placement groups per OSD to balance out memory and CPU requirements and per-OSD load"
> 
> So the question is, what is the workload going to be? What kind of data are you going to store? Will this be something with RBD or will it be a plain RADOS store?

Right. Here goes:

The workload will come from one of our applications using librados directly, so no RBD, no FS and no gateways. The I/O rate is low and should be well within SATA limits.

> How many OSDs per machine do you have and how much memory do you have per machine?

12 OSDs per machine, with a bit over 1GB of memory per OSD (16GB per machine).

> The more PGs you have, the more peering PGs you will have when an OSD boots again, so that could be heavy for the CPU in the machines.

Right.

> Another question is how many pools you are expecting. If you start creating 10 pools with 100,000 PGs each, you'd get an insane number of PGs.

The plan, for now, is to have only one pool (4PB) per cluster, and then scale by adding the appropriate number of clusters.

Our initial guess for pg_num is in the 40K-64K range. That should give us the balancing we need, but we are not sure it stays within sane memory consumption ranges.
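
For reference, the figures above work out as in the Python sketch below. It reads "K" as 1,000 and assumes PGs spread evenly over all 3,000 OSDs; both are assumptions for illustration, and the names are hypothetical. It does not estimate how much memory a PG actually costs.

```python
NUM_OSDS = 3000
REPLICAS = 3
OSDS_PER_HOST = 12
HOST_MEM_GB = 16

# Memory available per OSD if the host's RAM is split evenly.
mem_per_osd_gb = HOST_MEM_GB / OSDS_PER_HOST
print(f"{mem_per_osd_gb:.2f} GB per OSD")  # 1.33 GB per OSD


def pg_replicas_per_osd(pg_num: int) -> float:
    """Average PG replicas per OSD, assuming a perfectly even spread."""
    return pg_num * REPLICAS / NUM_OSDS


for pg_num in (40_000, 64_000):
    print(pg_num, pg_replicas_per_osd(pg_num))  # 40.0 and 64.0
```

So the 40K-64K range corresponds to roughly 40-64 PG replicas per OSD, close to the 50-100 per OSD recommendation quoted earlier in the thread.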


best regards,
Anders

* Re: placement group sizing
From: Anders Saaby @ 2013-04-26 17:09 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

On 26/04/2013, at 15.17, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 04/25/2013 07:39 AM, Anders Saaby wrote:
>> Hi,
>> 
>> We are working on prototype infrastructure for RADOS clusters, and are now ready to deploy the first production-size storage pool. One question remains: how many placement groups will we need to balance memory footprint against the ability to level data placement and data reads, while still keeping things within sane limits?
>> 
>> Our initial plan is to deploy 4PB pools, based on 4TB drives with 3 replicas (one OSD per disk). So, 3,000 disks per pool.
>> 
>> According to the documentation 1), we should have: 3,000 OSDs * 100 / 3 replicas == 100,000 placement groups.
>> 
>> Judging from the mailing list, 100,000 PGs is far more than I have seen anyone use, so do you have any insights or advice on pg_num for a RADOS pool with these characteristics? Also, will a pg_num this big be a problem if the pool starts out with only ~100 OSDs and is then grown to 3,000?
> 
> I pretty regularly test single-mon configurations with 64k PGs.  ~100k PGs starts to get a bit intense, but with a larger mon cluster and some tweaking it should be doable.

OK, that sounds encouraging. So, 40K-64K PGs should be doable with ~1GB of memory per OSD?

> I don't mean to push anything on you, but if you guys are really thinking about deploying multiple 4PB pools, you might want to talk to us about some kind of support/consulting agreement.  That's a lot of storage!


I agree.


best regards,
Anders



* Re: placement group sizing
From: Xiaopong Tran @ 2013-04-27  4:45 UTC (permalink / raw)
  To: Anders Saaby; +Cc: Wido den Hollander, ceph-devel@vger.kernel.org

On 04/27/2013 01:07 AM, Anders Saaby wrote:
>
>> How many OSDs per machine do you have and how much memory do you have per machine?
>
> 12 OSDs per machine, with a bit over 1GB of memory per OSD (16GB per machine).
>

Try more. If you have a large-ish cluster with many OSDs and large PGs,
then when one or more OSDs go down (for whatever reason: a crash, a disk
failure, etc.), Ceph will start to remap and rebalance, and memory usage
can easily balloon to several GB per OSD.

When this happens, and if you don't have enough memory, the
OSD processes might get OOM-killed, and you'll get into a
vicious cycle.
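
A back-of-the-envelope check of this risk can be sketched in Python as below. The 2 GB recovery figure is purely hypothetical, chosen only to illustrate "GBs per OSD"; nothing in this thread gives a measured value, and the function name is made up for the example.

```python
def fits_in_memory(host_mem_gb: float, osds_per_host: int, per_osd_gb: float) -> bool:
    """True if all OSDs on a host fit in RAM at the given per-OSD usage."""
    return osds_per_host * per_osd_gb <= host_mem_gb


HOST_MEM_GB = 16
OSDS_PER_HOST = 12

# Steady state at about 1 GB per OSD: 12 GB needed, fits in 16 GB.
print(fits_in_memory(HOST_MEM_GB, OSDS_PER_HOST, per_osd_gb=1.0))  # True

# Hypothetical recovery spike at 2 GB per OSD: 24 GB needed, does not fit,
# which is the situation where the OOM killer may start taking out OSDs.
print(fits_in_memory(HOST_MEM_GB, OSDS_PER_HOST, per_osd_gb=2.0))  # False
```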

>> The more PGs you have, the more peering PGs you will have when an OSD boots again, so that could be heavy for the CPU in the machines.
>
> Right.
>

* Re: placement group sizing
From: Anders Saaby @ 2013-04-29  7:17 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: Wido den Hollander, ceph-devel@vger.kernel.org

On 27/04/2013, at 06.45, Xiaopong Tran <xiaopong.tran@gmail.com> wrote:
> On 04/27/2013 01:07 AM, Anders Saaby wrote:
>> 
>>> How many OSDs per machine do you have and how much memory do you have per machine?
>> 
>> 12 OSDs per machine, with a bit over 1GB of memory per OSD (16GB per machine).
> 
> Try more. If you have a large-ish cluster with many OSDs and large PGs,
> then when one or more OSDs go down (for whatever reason: a crash, a disk
> failure, etc.), Ceph will start to remap and rebalance, and memory usage
> can easily balloon to several GB per OSD.

I take it you mean more memory?
Is there a way to calculate OSD memory requirements based on pg_num and/or the number of OSDs?

> When this happens, and if you don't have enough memory, the
> OSD processes might get OOM-killed, and you'll get into a
> vicious cycle.


Right.


regards,
Anders

