pg balancing

* pg balancing
@ 2013-05-14  0:35 Sage Weil
  2013-05-14 15:10 ` Chen, Xiaoxi
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Sage Weil @ 2013-05-14  0:35 UTC (permalink / raw)
  To: jaschut; +Cc: ceph-devel

Hi Jim-

You mentioned the other day your concerns about the uniformity of the PG 
and data distribution.  There are several ways to attack it (including 
increasing the number of PGs), but one that we haven't tested much yet is 
the 'reweight-by-utilization' function in the monitor.

The idea is that there will always be some statistical variance in the
distribution and a non-zero probability of having outlier OSDs with too
many PGs.  We adjust for this automatically by reweighting OSDs that are
substantially above the mean downward by some adjustment factor.

 ceph osd reweight-by-utilization MIN

where MIN is the minimum relative utilization, as a percentage of the
mean, at which we will start adjusting down.  It must be > 100 (100% of
the mean) and defaults to 120.  After it adjusts the reweights, you should
see the result in 'ceph osd tree' output.
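
As a rough illustration of the selection rule described above (not the monitor's actual code), the sketch below flags OSDs whose utilization exceeds MIN percent of the mean and scales their reweight value down. The function name, the input dictionaries, and the adjustment factor (mean utilization divided by the OSD's utilization) are assumptions made for illustration; the message only says "some adjustment factor".

```python
def reweight_by_utilization(utilization, weights, min_pct=120):
    """Return new reweight values for OSDs above min_pct percent of the mean.

    utilization: dict of OSD id -> fraction of capacity used (hypothetical input)
    weights:     dict of OSD id -> current reweight value in (0, 1]
    """
    if min_pct <= 100:
        raise ValueError("MIN must be greater than 100 (100% of the mean)")
    mean = sum(utilization.values()) / len(utilization)
    threshold = mean * min_pct / 100.0
    new_weights = {}
    for osd, used in utilization.items():
        if used > threshold:
            # Assumed adjustment factor: scale toward the mean utilization.
            new_weights[osd] = weights[osd] * mean / used
    return new_weights


if __name__ == "__main__":
    util = {0: 0.50, 1: 0.52, 2: 0.48, 3: 0.70}  # invented utilizations
    current = {osd: 1.0 for osd in util}
    for osd, w in sorted(reweight_by_utilization(util, current).items()):
        print(f"osd.{osd} -> {w:.3f}")
```

With the default of 120, only OSDs more than 20% above the mean are touched; OSDs at or below the threshold keep their current weight and do not appear in the result.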

Have you played with this at all on your cluster?  I'd be very interested 
in how well this does/does not improve things for you.

Thanks!
sage


* Re: pg balancing
  2013-05-14  0:35 pg balancing Sage Weil
@ 2013-05-14 15:10 ` Chen, Xiaoxi
  2013-05-14 15:25   ` Sage Weil
  2013-05-14 15:39 ` Jim Schutt
  2013-06-05 19:49 ` Jim Schutt
  2 siblings, 1 reply; 5+ messages in thread
From: Chen, Xiaoxi @ 2013-05-14 15:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: jaschut@sandia.gov, ceph-devel@vger.kernel.org

From which release can we get this?

Sent from my iPhone



* Re: pg balancing
  2013-05-14 15:10 ` Chen, Xiaoxi
@ 2013-05-14 15:25   ` Sage Weil
  0 siblings, 0 replies; 5+ messages in thread
From: Sage Weil @ 2013-05-14 15:25 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: jaschut@sandia.gov, ceph-devel@vger.kernel.org

On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> from which release can we get this?

That function has been there since 0.30something, I think, although we
fixed a major bug sometime around argonaut.  It is largely unused and
undocumented, so testing is encouraged!  :)

sage



* Re: pg balancing
  2013-05-14  0:35 pg balancing Sage Weil
  2013-05-14 15:10 ` Chen, Xiaoxi
@ 2013-05-14 15:39 ` Jim Schutt
  2013-06-05 19:49 ` Jim Schutt
  2 siblings, 0 replies; 5+ messages in thread
From: Jim Schutt @ 2013-05-14 15:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

[resent to list because I missed that Cc:]

Hi Sage,

On 05/13/2013 06:35 PM, Sage Weil wrote:
> Hi Jim-
> 
> You mentioned the other day your concerns about the uniformity of the PG 
> and data distribution.  There are several ways to attack it (including 
> increasing the number of PGs), but one that we haven't tested much yet is 
> the 'reweight-by-utilization' function in the monitor.
> 
> The idea is that there will always be some statistical variance in the 
> distribution and a non-zero probability of having outlier OSDs with too 
> many PG.  We adjust for this by taking nodes that are substantially above 
> the mean down by some adjustment factor in an automated way.
> 
>  ceph osd reweight-by-utilization MIN
> 
> where MIN is the minimum relative utilization at which we will start 
> adjusting down.  It is always > 100 (100% of the mean), and defaults to 
> 120.  After it adjusts the reweights, you should see the result in 'ceph 
> osd tree' output
> 
> Have you played with this at all on your cluster?  I'd be very interested 
> in how well this does/does not improve things for you.

I haven't yet, but likely it's because I don't understand
what the reweighting does, exactly.  Maybe you can comment
inline below where I go wrong?  Here's my thinking:

I'm only partially motivated by the actual amount of storage
used per OSD, although it is a factor.

My major concern is a performance issue for our parallel
application codes.  Their computation cycle is: compute
furiously, write results, repeat.  The issue is that none
of our codes implement write-behind; each task must finish
writing results before any can resume computing.

So, when some OSDs carry more PGs, they cannot complete
their portion of the write phase as quickly as other OSDs
with fewer PGs.  Thus, the application's ability to resume
computation is delayed by the busiest OSDs.
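
The effect described above can be put in numbers with a toy model. It assumes, purely for illustration, that each OSD's share of a write phase is proportional to its PG count and that all OSDs write at the same rate; the PG counts are invented.

```python
def write_phase_slowdown(pgs_per_osd):
    """Ratio of the actual write-phase time to the ideal, balanced time.

    With a barrier at the end of each write phase, the phase lasts as long
    as the busiest OSD needs, so the time is set by max(PGs), not the mean.
    """
    mean = sum(pgs_per_osd) / len(pgs_per_osd)
    return max(pgs_per_osd) / mean


if __name__ == "__main__":
    counts = [100, 95, 105, 120]  # hypothetical PGs per OSD
    print(f"slowdown: {write_phase_slowdown(counts):.3f}")
```

Under these assumptions a single OSD carrying more than the mean delays every task in the application, no matter how lightly loaded the other OSDs are.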

My concern is that when we rebalance, we just cause some
other subset of the OSDs to be the busiest, in order to
send fewer writes to the overused OSDs and more writes
to underused OSDs.

At least, that's what I was thinking, without actually 
examining the code to see what is really going on in
rebalancing, and without testing.

Another thing I haven't done is actually compute, from the
statistics of uniform distributions, what the expected variance
is for my specific layout (now 256K PGs across 576 OSDs, with
root/host/device hierarchy, 24 OSDs/host).  That's mostly due
to my lack of knowledge of statistics....
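
For a first estimate of the expected spread, one can treat each PG placement as an independent, uniform random choice among the OSDs, so that the per-OSD count follows a binomial distribution. This ignores the CRUSH hierarchy and the replica count (neither is modeled here), so it is only a rough baseline under those assumptions, not a prediction for this cluster.

```python
import math


def expected_pg_spread(num_placements, num_osds):
    """Mean, standard deviation and relative spread of PGs per OSD.

    Assumes each placement lands on an OSD uniformly at random, so the
    count on one OSD is Binomial(num_placements, 1 / num_osds).
    """
    p = 1.0 / num_osds
    mean = num_placements * p
    std = math.sqrt(num_placements * p * (1.0 - p))
    return mean, std, std / mean


if __name__ == "__main__":
    # 256K PGs across 576 OSDs, counting each PG once (replicas not modeled).
    mean, std, rel = expected_pg_spread(256 * 1024, 576)
    print(f"mean={mean:.1f} std={std:.1f} relative={rel:.3%}")
```

In this model the relative spread shrinks roughly as one over the square root of the mean PG count per OSD, which is why adding PGs narrows the distribution only slowly.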

If I'm getting more variance than expected I want to understand
why, in case it can be fixed.

In any event, I think it's past time I tried reweighting.

Suppose I use 'ceph osd reweight-by-utilization 101', on the
theory that I'd cause continuous, small adjustments to
utilization, and I'd learn what the maximum impact can be.
Does that seem like a bad idea to you, and if so could you
help me understand why?

Thanks for taking the time to think about this -
I know you're busy.

PS - FWIW, another reason I keep pushing the number of PGs is
because when we actually deploy Ceph for production, it'll
be at a bigger scale than my testbed.  So, I'm trying to
shake out any scale-related issues now, to make sure our
users' first experience with Ceph is a good one.

-- Jim



* Re: pg balancing
  2013-05-14  0:35 pg balancing Sage Weil
  2013-05-14 15:10 ` Chen, Xiaoxi
  2013-05-14 15:39 ` Jim Schutt
@ 2013-06-05 19:49 ` Jim Schutt
  2 siblings, 0 replies; 5+ messages in thread
From: Jim Schutt @ 2013-06-05 19:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On 05/13/2013 06:35 PM, Sage Weil wrote:
> Hi Jim-
> 
> You mentioned the other day your concerns about the uniformity of the PG 
> and data distribution.  There are several ways to attack it (including 
> increasing the number of PGs), but one that we haven't tested much yet is 
> the 'reweight-by-utilization' function in the monitor.
> 
> The idea is that there will always be some statistical variance in the 
> distribution and a non-zero probability of having outlier OSDs with too 
> many PG.  We adjust for this by taking nodes that are substantially above 
> the mean down by some adjustment factor in an automated way.
> 
>  ceph osd reweight-by-utilization MIN
> 
> where MIN is the minimum relative utilization at which we will start 
> adjusting down.  It is always > 100 (100% of the mean), and defaults to 
> 120.  After it adjusts the reweights, you should see the result in 'ceph 
> osd tree' output
> 
> Have you played with this at all on your cluster?  I'd be very interested 
> in how well this does/does not improve things for you.

I've been experimenting with re-weighting, and have found that
it works well to redistribute data, as you expected.

Here are a few observations:

- when an OSD goes out and comes back in, its weight gets reset
    to 1.  It would be nice if it could remember its old weight.

- in order to reach the data distribution uniformity I'm after, I
    need to run multiple iterations of re-weighting - each iteration
    pushes data off the most highly utilized OSDs, but some ends up
    on average OSDs and pushes them over the limit.

- as you expected, to reach the uniformity I'm after, a _lot_ of data
    needs to move.  I've got some scripts I'm using to generate 
    'ceph osd reweight OSD WEIGHT' commands based on PG distribution,
    and I can use these after I create a new filesystem, to get a
    suitably uniform PG distribution before there is any data to
    move.  Some iteration is required here as well, and this is
    working really well for me.  When I start writing data to such
    a re-weighted filesystem, the data distribution pretty closely
    mirrors the PG distribution (once you write enough data).

- re-weighting to get a more uniform data distribution works better
    if there are more PGs to work with.  At 576 OSDs, I can't quite
    get things as uniform as I'd like with 64K PGs, but I can with
    128K PGs.  FWIW, here's the (max PGs/OSD)/(min PGs/OSD) I've
    measured for various numbers of PGs on 576 OSDs, with no
    re-weighting:

        PGs     (max PGs/OSD) / (min PGs/OSD)
 
        65536               1.478
       131072               1.308
       262144               1.240
       524288               1.155
      1048576               1.105

    (BTW, your recent leveldb work enabled those 512K and 1M
    measurements.  Thanks!)

    With 128K PGs and iterative re-weighting, I can get
    (max PGs/OSD)/(min PGs/OSD) < 1.05, and after writing
    enough data to consume ~33% of available storage, I
    get (max OSD data use)/(min OSD data use) ~ 1.06.
    OSD weights end up in the 0.85 - 1.0 range for such a
    distribution.
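
One round of the iterative procedure described in the list above could be sketched as follows. This is not the script mentioned in the message: the function names, the input format, the tolerance, and the rule of scaling each overloaded OSD's weight by mean/count are all assumptions. Only the 'ceph osd reweight OSD WEIGHT' command form comes from the message.

```python
def max_min_ratio(pgs_per_osd):
    """(max PGs/OSD) / (min PGs/OSD), the uniformity metric used in the table."""
    counts = pgs_per_osd.values()
    return max(counts) / min(counts)


def reweight_commands(pgs_per_osd, weights, tolerance=1.02):
    """Build one round of 'ceph osd reweight' commands.

    pgs_per_osd: dict of OSD id -> PG count (hypothetical input)
    weights:     dict of OSD id -> current reweight value
    tolerance:   OSDs above mean * tolerance are scaled down (assumed rule)
    """
    mean = sum(pgs_per_osd.values()) / len(pgs_per_osd)
    commands = []
    for osd in sorted(pgs_per_osd):
        count = pgs_per_osd[osd]
        if count > mean * tolerance:
            # Assumed rule: shrink the weight in proportion to the overload.
            new_weight = weights[osd] * mean / count
            commands.append(f"ceph osd reweight {osd} {new_weight:.4f}")
    return commands


if __name__ == "__main__":
    pgs = {0: 440, 1: 455, 2: 470, 3: 500}  # invented counts
    current = {osd: 1.0 for osd in pgs}
    print(f"max/min = {max_min_ratio(pgs):.3f}")
    for cmd in reweight_commands(pgs, current):
        print(cmd)
```

Because each round moves PGs onto other OSDs, the counts would need to be collected again and the round repeated until the max/min ratio falls below the desired target.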

So, re-weighting is definitely working for me.

-- Jim
