Is it safe to increase pg number in a production environment

* Is it safe to increase pg number in a production environment
@ 2015-08-03  7:57 乔建峰
  2015-08-04 16:46 ` [ceph-users] " Samuel Just
  0 siblings, 1 reply; 15+ messages in thread
From: 乔建峰 @ 2015-08-03  7:57 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ,
	cbt-Qp0mS5GaXlQ

Hi Cephers,

This is Jevon. I'm currently dealing with an issue that is causing me a lot
of trouble, so I'm writing to ask for your comments/help/suggestions.
More details are provided below.

Issue:
I set up a cluster with 24 OSDs and created one pool with 1024 placement
groups on it for a small startup company. The number 1024 was calculated
per the equation 'OSDs * 100'/pool size. The cluster has been running
quite well for a long time. But recently, our monitoring system keeps
complaining that some disks' usage exceeds 85%. When I log into the system,
I find that usage on some disks really is very high, while on others it is
not (less than 60%). Each time the issue happens, I have to manually
re-balance the distribution. This is only a short-term solution, and I'm
not willing to do it all the time.
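
As a rough illustration of the sizing rule quoted above, the sketch below
computes 'OSDs * 100'/pool size and rounds the result up to a power of two.
Two things here are assumptions rather than facts from the message: the
replica count of 3 (the pool size is not stated) and the rounding step, which
is the usual convention and is what would turn 800 into 1024. The function
name suggest_pg_num is made up for this example.

```shell
#!/bin/sh
# Sketch only: suggest_pg_num is a hypothetical helper, not a Ceph command.
# It applies (OSDs * 100) / pool size and rounds up to a power of two.
suggest_pg_num() {
    osds=$1
    pool_size=$2                       # replica count; 3 is assumed below
    raw=$(( osds * 100 / pool_size ))  # integer division
    pg=1
    while [ "$pg" -lt "$raw" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

suggest_pg_num 24 3   # raw value is 800, rounded up to 1024
```

With a replica count of 2 the same rule would give a raw value of 1200 and
round up to 2048, so the 1024 in the message is consistent with three
replicas.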

Two long-term solutions come to mind:
1) Ask the customers to expand their clusters by adding more OSDs. But I
think they will ask me to explain the reason for the imbalanced data
distribution. We've already done some analysis on the environment, and we
learned that the most imbalanced part of CRUSH is the mapping between
objects and PGs. The biggest PG has 613 objects, while the smallest PG has
only 226 objects.

2) Increase the number of placement groups. It can be of great help for
statistically uniform data distribution, but it can also incur significant
data movement as PGs are effectively being split. I just cannot do it in our
customers' environment before we 100% understand the consequences. Has
anyone done this in a production environment? How much does this operation
affect the performance of clients?
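
To put a number on the per-PG imbalance mentioned in point 1), a small sketch
like the following summarises a list of object counts. It assumes the counts
have already been extracted to one integer per line (how to get them out of
the cluster is not shown here), and pg_spread is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: reads one object count per line on stdin and prints
# "min max mean" (mean as an integer, rounded down).
pg_spread() {
    awk '
        { v = $1 + 0 }                      # force numeric comparison
        NR == 1 { min = v; max = v }
        { if (v < min) min = v
          if (v > max) max = v
          sum += v }
        END { if (NR > 0) printf "%d %d %d\n", min, max, int(sum / NR) }
    '
}

# Example with made-up counts that include the two extremes from the message.
printf '%s\n' 613 226 400 | pg_spread
```

Comparing the summary before and after a change in PG count is one simple way
to check whether the split actually narrowed the spread.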

Any comments/help/suggestions will be highly appreciated.

-- 
Best Regards
Jevon

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-03  7:57 Is it safe to increase pg number in a production environment 乔建峰
@ 2015-08-04 16:46 ` Samuel Just
  2015-08-04 16:51   ` Stefan Priebe
  2015-08-04 16:52   ` Marek Dohojda
  0 siblings, 2 replies; 15+ messages in thread
From: Samuel Just @ 2015-08-04 16:46 UTC (permalink / raw)
  To: 乔建峰; +Cc: ceph-devel@vger.kernel.org, ceph-users, cbt

It will cause a large amount of data movement.  Each new pg after the
split will relocate.  It might be ok if you do it slowly.  Experiment
on a test cluster.
-Sam

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:46 ` [ceph-users] " Samuel Just
@ 2015-08-04 16:51   ` Stefan Priebe
  2015-08-04 19:16     ` Ketor D
  2015-08-05  1:50     ` Jevon Qiao
  2015-08-04 16:52   ` Marek Dohojda
  1 sibling, 2 replies; 15+ messages in thread
From: Stefan Priebe @ 2015-08-04 16:51 UTC (permalink / raw)
  To: Samuel Just, 乔建峰
  Cc: ceph-devel@vger.kernel.org, ceph-users, cbt

We've done the splitting several times. The most important thing is to
run a Ceph version which does not have the linger ops bug.

That means the latest dumpling release, giant, or hammer. The latest
firefly release still has this bug, which results in wrong watchers and
non-working snapshots.

Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:51   ` Stefan Priebe
@ 2015-08-04 19:16     ` Ketor D
       [not found]       ` <CAM9_UU8Mxycvk91NSrFSMQ5=jDxaXcajzB7CTGDZ2sJJ0YW7-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-08-05  1:50     ` Jevon Qiao
  1 sibling, 1 reply; 15+ messages in thread
From: Ketor D @ 2015-08-04 19:16 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Samuel Just, 乔建峰, ceph-devel@vger.kernel.org,
	ceph-users, cbt

Hi Stefan,
      Could you describe the linger ops bug in more detail?
      I'm running Firefly, which, as you say, still has this bug.

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

[parent not found: <CAM9_UU8Mxycvk91NSrFSMQ5=jDxaXcajzB7CTGDZ2sJJ0YW7-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Is it safe to increase pg number in a production environment
       [not found]       ` <CAM9_UU8Mxycvk91NSrFSMQ5=jDxaXcajzB7CTGDZ2sJJ0YW7-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-08-04 19:48         ` Stefan Priebe
  2015-08-11 15:31           ` [ceph-users] " Dan van der Ster
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Priebe @ 2015-08-04 19:48 UTC (permalink / raw)
  To: Ketor D
  Cc: 乔建峰,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users,
	cbt-Qp0mS5GaXlQ

Hi,

On 04.08.2015 at 21:16, Ketor D wrote:
> Hi Stefan,
>        Could you describe more about the linger ops bug?
>        I'm runing Firefly as you say still has this bug.

It will be fixed in the next Firefly release.

This one:
http://tracker.ceph.com/issues/9806

Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 19:48         ` Stefan Priebe
@ 2015-08-11 15:31           ` Dan van der Ster
  2015-08-11 16:02             ` Jan Schermer
  0 siblings, 1 reply; 15+ messages in thread
From: Dan van der Ster @ 2015-08-11 15:31 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Ketor D, 乔建峰, ceph-devel@vger.kernel.org,
	ceph-users, cbt

On Tue, Aug 4, 2015 at 9:48 PM, Stefan Priebe <s.priebe@profihost.ag> wrote:
> Hi,
>
> On 04.08.2015 at 21:16, Ketor D wrote:
>>
>> Hi Stefan,
>>        Could you describe more about the linger ops bug?
>>        I'm runing Firefly as you say still has this bug.
>
>
> It will be fixed in next ff release.
>
> This on:
> http://tracker.ceph.com/issues/9806
>

Just to clarify one point: it appears that the fix needs to be applied
on both the OSDs _and_ all clients, right? So all our kvm clients
would need to be restarted to get firefly 0.80.11 prior to any
attempted splits. :-(

Cheers, Dan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-11 15:31           ` [ceph-users] " Dan van der Ster
@ 2015-08-11 16:02             ` Jan Schermer
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Schermer @ 2015-08-11 16:02 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: Stefan Priebe, 乔建峰,
	ceph-devel@vger.kernel.org, cbt, ceph-users

Could someone clarify what the impact of this bug is?
We did increase pg_num/pgp_num and we are on dumpling (0.67.12 unofficial snapshot).
Most of our clients are likely restarted already, but not all. Should we be worried?

Thanks
Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:51   ` Stefan Priebe
  2015-08-04 19:16     ` Ketor D
@ 2015-08-05  1:50     ` Jevon Qiao
  1 sibling, 0 replies; 15+ messages in thread
From: Jevon Qiao @ 2015-08-05  1:50 UTC (permalink / raw)
  To: Stefan Priebe, Samuel Just, 乔建峰
  Cc: ceph-devel@vger.kernel.org, ceph-users, cbt

Got it, thank you for the suggestion.

Regards,
Jevon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:46 ` [ceph-users] " Samuel Just
  2015-08-04 16:51   ` Stefan Priebe
@ 2015-08-04 16:52   ` Marek Dohojda
  2015-08-04 17:23     ` Jan Schermer
  2015-08-05  1:43     ` Jevon Qiao
  1 sibling, 2 replies; 15+ messages in thread
From: Marek Dohojda @ 2015-08-04 16:52 UTC (permalink / raw)
  To: Samuel Just
  Cc: 乔建峰, ceph-devel@vger.kernel.org, ceph-users,
	cbt

I did this not that long ago. My original PG estimates were wrong and I had to increase them.

After I increased the PG numbers, Ceph rebalanced, and that took a while. To be honest, in my case the slowdown wasn’t really visible, but it took a while.

My strong suggestion would be to do it during a period of low IO, and to be prepared for it to take quite a long time. Do it slowly, and do not increase multiple pools at once.

It isn’t recommended practice, but it is doable.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:52   ` Marek Dohojda
@ 2015-08-04 17:23     ` Jan Schermer
  2015-08-05  3:45       ` Jevon Qiao
  2015-08-05  1:43     ` Jevon Qiao
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Schermer @ 2015-08-04 17:23 UTC (permalink / raw)
  To: Marek Dohojda
  Cc: Samuel Just, 乔建峰, ceph-devel@vger.kernel.org,
	ceph-users, cbt

I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production.

Basically we had to
1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs
2) increase pgp_num in small increments and then go higher

We went from 4096 placement groups up to 16384

pg_num (the number of on-disk created placement groups) was increased like this:
# for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done
This ran overnight (and the step was raised to 128 during the night).
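
A slightly more defensive variant of the loop above might wait for placement
group creation to finish before each further step, instead of sleeping for a
fixed time. This is only a sketch under stated assumptions: the helper names
pg_steps, wait_for_creation and grow_pg_num are made up, and the gate (polling
`ceph pg stat` until its summary no longer mentions "creating" or "peering")
is my guess at a reasonable condition, not something described in this thread.
The only cluster commands used are `ceph pg stat` and
`ceph osd pool set <pool> pg_num <n>`.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# Print the pg_num values to apply, from start+step up to end inclusive.
# The final value is always exactly "end", even if step does not divide evenly.
pg_steps() {
    cur=$1; step=$2; end=$3
    while [ "$cur" -lt "$end" ]; do
        cur=$(( cur + step ))
        if [ "$cur" -gt "$end" ]; then cur=$end; fi
        echo "$cur"
    done
}

# Block while the PG summary still mentions creating or peering PGs,
# polling every $1 seconds. (Assumed gate; adjust to your own criteria.)
wait_for_creation() {
    while ceph pg stat | grep -Eq 'creating|peering'; do
        sleep "$1"
    done
}

# grow_pg_num POOL START STEP END: raise pg_num one step at a time.
grow_pg_num() {
    pool=$1
    for n in $(pg_steps "$2" "$3" "$4"); do
        ceph osd pool set "$pool" pg_num "$n"
        wait_for_creation 30
    done
}
```

For example, `grow_pg_num mypool 4096 64 16384` would walk the same range as
the loop above; mypool and the 30-second polling interval are placeholders.
Starting from start+step also avoids re-setting the value the pool already
has.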

Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact and second because of wildly differing free space on the OSDs.
We did it again in steps and waited for the cluster to settle before continuing.
Each step upped pgp_num by about 2% and as we got higher (>8192) we increased this to much more - the last step was 15360->16384 with the same impact the initial 4096->4160 had.
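
The step sizes mentioned above can be checked with plain arithmetic. In the
sketch below, step_pct gives the size of a step relative to the current value
and next_pgp_num proposes the next value for a given percentage, capped at the
target. Both helper names are hypothetical, and the 2% default simply mirrors
the "about 2%" figure in the text; it is not a Ceph setting.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# step_pct FROM TO: size of the step as a percentage of FROM, one decimal.
step_pct() {
    awk -v from="$1" -v to="$2" \
        'BEGIN { printf "%.1f\n", (to - from) * 100 / from }'
}

# next_pgp_num CURRENT TARGET [PCT]: CURRENT plus PCT percent (at least 1),
# never exceeding TARGET. Integer arithmetic, rounded down.
next_pgp_num() {
    cur=$1; target=$2; pct=${3:-2}
    inc=$(( cur * pct / 100 ))
    if [ "$inc" -lt 1 ]; then inc=1; fi
    next=$(( cur + inc ))
    if [ "$next" -gt "$target" ]; then next=$target; fi
    echo "$next"
}

step_pct 4096 4160      # first step from the text
step_pct 15360 16384    # last step from the text
```

By this measure the first step (4096 to 4160) is about 1.6% of the current
value and the last one (15360 to 16384) about 6.7%.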

The end result is much better but still nowhere near optimal. A bigger improvement would come from upgrading to a newer Ceph release and setting the new tunables, because we’re running Dumpling.

Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts...

I haven’t calculated the number of _objects_ per PG, but we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300), and this seems to be the cause of the poor data balancing.

Jan



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 17:23     ` Jan Schermer
@ 2015-08-05  3:45       ` Jevon Qiao
  2015-08-05 11:36         ` Jan Schermer
  0 siblings, 1 reply; 15+ messages in thread
From: Jevon Qiao @ 2015-08-05  3:45 UTC (permalink / raw)
  To: Jan Schermer, Marek Dohojda
  Cc: Samuel Just, 乔建峰, ceph-devel@vger.kernel.org,
	ceph-users, cbt

Hi Jan,

Thank you for the detailed suggestion. Please see my reply in-line.
On 5/8/15 01:23, Jan Schermer wrote:
> I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production.
>
> Basically we had to
> 1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs
> 2) increase pgp_num in small increments and then go higher
So you fully completed step 1 before moving on to step 2. Have you
ever tried interleaving them? Increase pg_num, increase pgp_num,
increase pg_num...
> We went from 4096 placement groups up to 16384
>
> pg_num (the number of on-disk created placement groups) was increased like this:
> # for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done
> this ran overnight (and was upped to 128 step during the night)
>
> Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact and second because of wildly differing free space on the OSDs.
> We did it again in steps and waited for the cluster to settle before continuing.
> Each step upped pgp_num by about 2% and as we got higher (>8192) we increased this to much more - the last step was 15360->16384 with the same impact the initial 4096->4160 had.
The strategy you adopted looks great. I'll do some experiments on a test 
cluster to evaluate the real impact in each step.
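
For experimenting on a test cluster, the percentage-based pgp_num stepping described above could be sketched roughly as follows. This is only an illustration, not the script Jan used: the helper name pgp_steps and the pool name "rbd" are made up, the 2% figure is taken from the description above, and the loop only prints the commands instead of running them. On a real cluster each step would also have to wait for the cluster to settle before the next one.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print a pgp_num schedule that
# grows by roughly PCT percent of the current value per step, ending at TARGET.
pgp_steps() {
    cur=$1; target=$2; pct=$3
    while [ "$cur" -lt "$target" ]; do
        inc=$((cur * pct / 100))
        # Always advance by at least one PG so the loop terminates.
        if [ "$inc" -lt 1 ]; then inc=1; fi
        cur=$((cur + inc))
        # Never overshoot the final value.
        if [ "$cur" -gt "$target" ]; then cur=$target; fi
        echo "$cur"
    done
}

# Dry run: only print the commands an operator would issue, one per step.
# "rbd" is a placeholder pool name.
for n in $(pgp_steps 4096 4300 2); do
    echo "ceph osd pool set rbd pgp_num $n"
done
```

With the sample arguments the schedule is 4177, 4260, 4300. Because the increment is a fraction of the current value, the absolute step grows as pgp_num grows, which matches the observation that later, larger steps had about the same impact as the first small one.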
> The end result is much better but still nowhere near optimal - bigger impact would be upgrading to a newer Ceph release and setting the new tunables because we’re running Dumpling.
>
> Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts...
This is a good point. So before increasing the number of PGs, we also
need to take the current state of the cluster (the available disk space
and memory on each OSD) into account and evaluate whether more
resources need to be added.
> And while I haven’t calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of poor data balancing.
In our environment, we also encountered an imbalanced mapping between
PGs and OSDs. What kind of bucket algorithm was used in your environment?
Any idea on how to minimize it?
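
One way to quantify the PG-to-OSD imbalance mentioned above is to count how many PGs list each OSD in their set. The following is a minimal sketch under stated assumptions: the function name pgs_per_osd is hypothetical, and it expects text on stdin in which one whitespace-separated column holds the OSD set in the form "[3,7,12]". Which column that is depends on the Ceph release, so check the header of "ceph pg dump pgs_brief" before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: count how many PGs map to each OSD.
# Reads lines on stdin; $1 is the number of the column that holds the
# bracketed OSD set, for example "[3,7,12]". Other lines are ignored.
pgs_per_osd() {
    awk -v col="$1" '
        $col ~ /^\[[0-9,]+\]$/ {
            set = $col
            gsub(/\[/, "", set)            # strip the opening bracket
            gsub(/\]/, "", set)            # strip the closing bracket
            n = split(set, osds, ",")
            for (i = 1; i <= n; i++) count[osds[i]]++
        }
        END { for (o in count) print "osd." o, count[o] }
    ' | sort -t. -k2 -n
}

# Possible use on a cluster node (commented out; assumes the "up" set is in
# column 3 of the pgs_brief output, which should be verified first):
# ceph pg dump pgs_brief 2>/dev/null | pgs_per_osd 3
```

The output is one "osd.N count" line per OSD, sorted by OSD id, so a spread such as 500 versus 1300 PGs per OSD is visible at a glance.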

Thanks,
Jevon
> Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-05  3:45       ` Jevon Qiao
@ 2015-08-05 11:36         ` Jan Schermer
  2015-08-07  1:39           ` Jevon Qiao
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Schermer @ 2015-08-05 11:36 UTC (permalink / raw)
  To: Jevon Qiao
  Cc: Marek Dohojda, Samuel Just, 乔建峰,
	ceph-devel@vger.kernel.org, ceph-users, cbt

Hi,
comments inline.

> On 05 Aug 2015, at 05:45, Jevon Qiao <qiaojianfeng@unitedstack.com> wrote:
> 
> Hi Jan,
> 
> Thank you for the detailed suggestion. Please see my reply in-line.
> On 5/8/15 01:23, Jan Schermer wrote:
>> I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production.
>> 
>> Basically we had to
>> 1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs
>> 2) increase pgp_num in small increments and then go higher
> So you fully completed step 1 before moving on to step 2. Have you ever tried interleaving them? Increase pg_num, increase pgp_num, increase pg_num...

Actually we first increased both to 8192 and then decided to go higher, but that doesn’t matter.
The only reason for this was that the first step could run unattended at night without disturbing the workload.*
The second step had to be attended.

* In other words, we didn’t see “slow requests” because of our threshold settings, but while PGs were being created the cluster paused IO for non-trivial amounts of time. I suggest you do this in steps as small as possible, depending on your SLAs.
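
A small-step loop that waits for the cluster to settle between changes could look roughly like the sketch below. The function names are hypothetical and the step size and polling interval are placeholders. It assumes that "ceph health" printing HEALTH_OK is an acceptable definition of "settled"; on a cluster that already carries an unrelated warning (for example a near-full OSD) that condition would never become true, so the check would need to be adapted.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the thread).

# Block until "ceph health" reports HEALTH_OK, polling every $1 seconds.
wait_until_settled() {
    interval=${1:-30}
    until ceph health | grep -q '^HEALTH_OK'; do
        sleep "$interval"
    done
}

# Raise a pool setting (pg_num or pgp_num) from START to TARGET in STEP-sized
# increments, waiting for the cluster to settle after every change.
step_pool_setting() {
    pool=$1; key=$2; cur=$3; target=$4; step=$5
    while [ "$cur" -lt "$target" ]; do
        cur=$((cur + step))
        # Never overshoot the final value.
        if [ "$cur" -gt "$target" ]; then cur=$target; fi
        ceph osd pool set "$pool" "$key" "$cur"
        wait_until_settled 30
    done
}

# Example invocation (commented out so that sourcing this file changes nothing):
# step_pool_setting rbd pg_num 4096 8192 64
```

The same function can be used for pgp_num afterwards, keeping in mind that pgp_num cannot be set higher than the pool's current pg_num.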

>> We went from 4096 placement groups up to 16384
>> 
>> pg_num (the number of on-disk created placement groups) was increased like this:
>> # for i in `seq 4096 64 16384` ; do ceph osd pool set $pool pg_num $i ; sleep 60 ; done
>> this ran overnight (and was upped to 128 step during the night)
>> 
>> Increasing pgp_num was trickier in our case, first because it was heavy production and we wanted to minimize the visible impact and second because of wildly differing free space on the OSDs.
>> We did it again in steps and waited for the cluster to settle before continuing.
>> Each step upped pgp_num by about 2% and as we got higher (>8192) we increased this to much more - the last step was 15360->16384 with the same impact the initial 4096->4160 had.
> The strategy you adopted looks great. I'll do some experiments on a test cluster to evaluate the real impact in each step
>> The end result is much better but still nowhere near optimal - bigger impact would be upgrading to a newer Ceph release and setting the new tunables because we’re running Dumpling.
>> 
>> Be aware that PGs cost some space (rough estimate is 5GB per OSD in our case), and also quite a bit of memory - each OSD has 1.7-2.0GB RSS right now while it only had about 1GB before. That’s a lot of memory and space with higher OSD counts...
> This is a good point. So before increasing the number of PGs, we also need to take the current state of the cluster (the available disk space and memory on each OSD) into account and evaluate whether more resources need to be added.

Depends on how much free space you have. We had some OSDs at close to 85% capacity before we started (and other OSDs at only 30%). When we increased the number of PGs, a great deal of data was shuffled around, but this depends on what CRUSH rules you have (and what version you are running). Newer versions with newer tunables will make this a lot easier, I guess.

>> And while I haven’t calculated the number of _objects_ per PG, we have differing numbers of _placement_groups_ per OSD (one OSD hosts 500, another hosts 1300) and this seems to be the cause of poor data balancing.
> In our environment, we also encountered an imbalanced mapping between PGs and OSDs. What kind of bucket algorithm was used in your environment? Any idea on how to minimize it?

We are using straw because we are on Dumpling. Straw2 should make everything better :-)
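
To see which bucket algorithm a cluster actually uses, the CRUSH map can be exported and decompiled, and the "alg" lines counted. The helper below is a sketch with a made-up name (bucket_algs) and placeholder file paths; it only parses the text form and does not change anything. As far as I know, straw2 is only available from Hammer onward and also requires clients that understand it, so this is mainly useful for checking the current state before planning an upgrade.

```shell
#!/bin/sh
# Hypothetical helper: summarize which bucket algorithms a decompiled CRUSH map
# uses. $1 is the path to the text produced by "crushtool -d".
bucket_algs() {
    awk '$1 == "alg" { count[$2]++ }
         END { for (a in count) print a, count[a] }' "$1"
}

# On a cluster node one would typically obtain the text form like this
# (commented out so the sketch does not require a running cluster;
# the /tmp paths are placeholders):
# ceph osd getcrushmap -o /tmp/crushmap.bin
# crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
# bucket_algs /tmp/crushmap.txt
```

Each output line is an algorithm name followed by the number of buckets that use it.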

Jan

> 
> Thanks,
> Jevon
>> Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-05 11:36         ` Jan Schermer
@ 2015-08-07  1:39           ` Jevon Qiao
  0 siblings, 0 replies; 15+ messages in thread
From: Jevon Qiao @ 2015-08-07  1:39 UTC (permalink / raw)
  To: Jan Schermer
  Cc: Marek Dohojda, Samuel Just, 乔建峰,
	ceph-devel@vger.kernel.org, ceph-users, cbt

Hi Jan,

Thank you very much for the suggestion.

Regards,
Jevon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [ceph-users] Is it safe to increase pg number in a production environment
  2015-08-04 16:52   ` Marek Dohojda
  2015-08-04 17:23     ` Jan Schermer
@ 2015-08-05  1:43     ` Jevon Qiao
       [not found]       ` <55C16A52.4040403-OsJI6HhKm/eMe3Hu20U6GA@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Jevon Qiao @ 2015-08-05  1:43 UTC (permalink / raw)
  To: Marek Dohojda, Samuel Just
  Cc: 乔建峰, ceph-devel@vger.kernel.org, ceph-users,
	cbt

Thank you and Samuel for the prompt response.
On 5/8/15 00:52, Marek Dohojda wrote:
> I have done this not that long ago.  My original PG estimates were wrong and I had to increase them.
>
> After increasing the PG numbers the Ceph rebalanced, and that took a while.  To be honest in my case the slowdown wasn’t really visible, but it took a while.
How many OSDs do you have in your cluster? By how much did you increase
the PG count?
> My strong suggestion to you would be to do it in a long IO time, and be prepared that this will take quite a long time to accomplish.  Do it slowly and do not increase multiple pools at once.
Both you and Samuel said to do it slowly. Do you mean adjusting the PG
count step by step rather than in one step? Also, could you please
explain 'a long IO time' in detail?

Thanks,
Jevon
> It isn’t recommended practice but doable.

^ permalink raw reply	[flat|nested] 15+ messages in thread

[parent not found: <55C16A52.4040403-OsJI6HhKm/eMe3Hu20U6GA@public.gmane.org>]

* Re: Is it safe to increase pg number in a production environment
       [not found]       ` <55C16A52.4040403-OsJI6HhKm/eMe3Hu20U6GA@public.gmane.org>
@ 2015-08-05 16:04         ` Marek Dohojda
  0 siblings, 0 replies; 15+ messages in thread
From: Marek Dohojda @ 2015-08-05 16:04 UTC (permalink / raw)
  To: Jevon Qiao
  Cc: 乔建峰,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users,
	cbt-Qp0mS5GaXlQ


[-- Attachment #1.1: Type: text/plain, Size: 4783 bytes --]

I started with 7 OSDs and expanded to 14, and went from an initial PG count of 512 to 4096, as recommended.

Unfortunately I can’t tell you the exact IO impact, as I made my changes in the off hours when the impact wasn’t important. I could see a reduction in performance, but since it had no impact on me I didn’t measure it exactly.

Since I had the luxury of leaving it going overnight I didn’t step the PG count.  I would, however, highly recommend doing this in stages under normal circumstances, to reduce the impact you will see.

You can see very significant IO load and CPU time during the operation.  The reallocation in my case took over an hour to complete.


> On Aug 4, 2015, at 7:43 PM, Jevon Qiao <qiaojianfeng-OsJI6HhKm/eMe3Hu20U6GA@public.gmane.org> wrote:
> 
> Thank you and Samuel for the prompt response.
> On 5/8/15 00:52, Marek Dohojda wrote:
>> I have done this not that long ago.  My original PG estimates were wrong and I had to increase them.
>> 
>> After increasing the PG numbers the Ceph rebalanced, and that took a while.  To be honest in my case the slowdown wasn’t really visible, but it took a while.
> How many OSDs do you have in your cluster? By how much did you increase the PG count?
>> My strong suggestion to you would be to do it in a long IO time, and be prepared that this will take quite a long time to accomplish.  Do it slowly and do not increase multiple pools at once.
> Both you and Samuel said to do it slowly. Do you mean adjusting the PG count step by step rather than in one step? Also, could you please explain 'a long IO time' in detail?
> 
> Thanks,
> Jevon
>> It isn’t recommended practice but doable.
>> 
>> 
>>> On Aug 4, 2015, at 10:46 AM, Samuel Just <sjust-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> 
>>> It will cause a large amount of data movement.  Each new pg after the
>>> split will relocate.  It might be ok if you do it slowly.  Experiment
>>> on a test cluster.
>>> -Sam
>>> 
>>> On Mon, Aug 3, 2015 at 12:57 AM, 乔建峰 <scaleqiao-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> Hi Cephers,
>>>> 
>>>> This is a greeting from Jevon. Currently, I'm experiencing an issue which
>>>> suffers me a lot, so I'm writing to ask for your comments/help/suggestions.
>>>> More details are provided bellow.
>>>> 
>>>> Issue:
>>>> I set up a cluster having 24 OSDs and created one pool with 1024 placement
>>>> groups on it for a small startup company. The number 1024 was calculated per
>>>> the equation 'OSDs * 100'/pool size. The cluster have been running quite
>>>> well for a long time. But recently, our monitoring system always complains
>>>> that some disks' usage exceed 85%. I log into the system and find out that
>>>> some disks' usage are really very high, but some are not(less than 60%).
>>>> Each time when the issue happens, I have to manually re-balance the
>>>> distribution. This is a short-term solution, I'm not willing to do it all
>>>> the time.
>>>> 
>>>> Two long-term solutions come to mind:
>>>> 1) Ask the customers to expand their clusters by adding more OSDs. But I
>>>> think they will ask me to explain the reason for the imbalanced data
>>>> distribution. We've already done some analysis on the environment, and we
>>>> learned that the most imbalanced part of CRUSH is the mapping between
>>>> objects and PGs. The biggest PG has 613 objects, while the smallest PG only
>>>> has 226 objects.
>>>> 
>>>> 2) Increase the number of placement groups. It can be of great help for
>>>> statistically uniform data distribution, but it can also incur significant
>>>> data movement as PGs are effectively being split. I just cannot do it in our
>>>> customers' environment before we 100% understand the consequences. Has anyone
>>>> done this in a production environment? How much does this operation affect
>>>> the performance of clients?
>>>> 
>>>> Any comments/help/suggestions will be highly appreciated.
>>>> 
>>>> --
>>>> Best Regards
>>>> Jevon
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> 
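As a rough illustration of the sizing rule quoted above ('OSDs * 100'/pool size), the sketch below computes a PG count for a pool. The replica count of 3 and the rounding up to the next power of two are assumptions made for the example; the thread does not state the pool size, although 24 OSDs with 3 replicas gives 800, which rounds up to the 1024 mentioned. The function name is hypothetical.

```python
import math


def suggested_pg_count(num_osds, pool_size, target_pgs_per_osd=100):
    """Apply 'OSDs * 100 / pool size' and round up to a power of two.

    pool_size is the replica count of the pool (an assumption here,
    since the thread does not give it).
    """
    if num_osds <= 0 or pool_size <= 0:
        raise ValueError("num_osds and pool_size must be positive")
    raw = num_osds * target_pgs_per_osd / pool_size
    # Round up to the next power of two (assumed convention).
    return 1 << (math.ceil(raw) - 1).bit_length()


if __name__ == "__main__":
    # 24 OSDs, assumed 3 replicas: 24 * 100 / 3 = 800, rounded up to 1024.
    print(suggested_pg_count(24, 3))
```

Note that this only sizes the pool; it says nothing about how evenly objects end up distributed across the resulting PGs.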
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

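One way to read the "do it slowly" advice in this thread is to raise pg_num in several small increments instead of a single jump. The sketch below only generates the commands for such a schedule and does not run anything. The pool name "mypool", the target of 2048 and the step of 256 are hypothetical values picked for illustration; the helper names are also made up. It assumes the usual `ceph osd pool set <pool> pg_num <n>` and `pgp_num` commands, and that you would wait for the cluster to settle between steps.

```python
def pg_increase_steps(current, target, step):
    """Return the pg_num values to apply, ending exactly at target."""
    if step <= 0:
        raise ValueError("step must be positive")
    if target <= current:
        raise ValueError("target must be larger than current")
    values = list(range(current + step, target, step))
    values.append(target)
    return values


def commands_for(pool, current, target, step):
    """Build the command lines for each step (nothing is executed)."""
    cmds = []
    for n in pg_increase_steps(current, target, step):
        # Split the PGs first, then let placement follow.
        cmds.append(f"ceph osd pool set {pool} pg_num {n}")
        cmds.append(f"ceph osd pool set {pool} pgp_num {n}")
    return cmds


if __name__ == "__main__":
    # Hypothetical example: 1024 -> 2048 in steps of 256.
    for line in commands_for("mypool", 1024, 2048, 256):
        print(line)
```

As Sam suggests, a schedule like this is best tried on a test cluster first, since each step still triggers data movement.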
end of thread, other threads:[~2015-08-11 16:02 UTC | newest]

Thread overview: 15+ messages
2015-08-03  7:57 Is it safe to increase pg number in a production environment 乔建峰
2015-08-04 16:46 ` [ceph-users] " Samuel Just
2015-08-04 16:51   ` Stefan Priebe
2015-08-04 19:16     ` Ketor D
     [not found]       ` <CAM9_UU8Mxycvk91NSrFSMQ5=jDxaXcajzB7CTGDZ2sJJ0YW7-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-08-04 19:48         ` Stefan Priebe
2015-08-11 15:31           ` [ceph-users] " Dan van der Ster
2015-08-11 16:02             ` Jan Schermer
2015-08-05  1:50     ` Jevon Qiao
2015-08-04 16:52   ` Marek Dohojda
2015-08-04 17:23     ` Jan Schermer
2015-08-05  3:45       ` Jevon Qiao
2015-08-05 11:36         ` Jan Schermer
2015-08-07  1:39           ` Jevon Qiao
2015-08-05  1:43     ` Jevon Qiao
     [not found]       ` <55C16A52.4040403-OsJI6HhKm/eMe3Hu20U6GA@public.gmane.org>
2015-08-05 16:04         ` Marek Dohojda
