From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Jim Schutt"
Subject: Re: pg balancing
Date: Wed, 5 Jun 2013 13:49:21 -0600
Message-ID: <51AF9641.60403@sandia.gov>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
Received: from sentry-two.sandia.gov ([132.175.109.14]:48943 "EHLO sentry-two.sandia.gov" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757088Ab3FETtt (ORCPT ); Wed, 5 Jun 2013 15:49:49 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Sage Weil
Cc: ceph-devel@vger.kernel.org

Hi Sage,

On 05/13/2013 06:35 PM, Sage Weil wrote:
> Hi Jim-
>
> You mentioned the other day your concerns about the uniformity of the PG
> and data distribution. There are several ways to attack it (including
> increasing the number of PGs), but one that we haven't tested much yet is
> the 'reweight-by-utilization' function in the monitor.
>
> The idea is that there will always be some statistical variance in the
> distribution and a non-zero probability of having outlier OSDs with too
> many PG. We adjust for this by taking nodes that are substantially above
> the mean down by some adjustment factor in an automated way.
>
>  ceph osd reweight-by-utilization MIN
>
> where MIN is the minimum relative utilization at which we will start
> adjusting down. It is always > 100 (100% of the mean), and defaults to
> 120. After it adjusts the reweights, you should see the result in 'ceph
> osd tree' output
>
> Have you played with this at all on your cluster? I'd be very interested
> in how well this does/does not improve things for you.

I've been experimenting with re-weighting, and have found that it
works well to redistribute data, as you expected.

Here are a few observations:

- when an OSD goes out and comes back in, its weight gets reset
  to 1. It would be nice if it could remember its old weight.

- in order to reach the data distribution uniformity I'm after, I
  need to run multiple iterations of re-weighting - each iteration
  pushes data off the most highly utilized OSDs, but some of it ends
  up on average OSDs and pushes them over the limit.

- as you expected, to reach the uniformity I'm after, a _lot_ of
  data needs to move. I've got some scripts I'm using to generate
  'ceph osd reweight OSD WEIGHT' commands based on PG distribution,
  and I can use these after I create a new filesystem, to get a
  suitably uniform PG distribution before there is any data to move.
  Some iteration is required here as well, and this is working
  really well for me. When I start writing data to such a
  re-weighted filesystem, the data distribution pretty closely
  mirrors the PG distribution (once you write enough data).

- re-weighting to get a more uniform data distribution works better
  if there are more PGs to work with. At 576 OSDs, I can't quite
  get things as uniform as I'd like with 64K PGs, but I can with
  128K PGs.

FWIW, here's the (max PGs/OSD)/(min PGs/OSD) I've measured for
various numbers of PGs on 576 OSDs, with no re-weighting:

      PGs    (max PGs/OSD) / (min PGs/OSD)

    65536        1.478
   131072        1.308
   262144        1.240
   524288        1.155
  1048576        1.105

(BTW, your recent leveldb work enabled those 512K and 1M
measurements. Thanks!)

With 128K PGs and iterative re-weighting, I can get
(max PGs/OSD)/(min PGs/OSD) < 1.05, and after writing enough
data to consume ~33% of available storage, I get
(max OSD data use)/(min OSD data use) ~ 1.06. OSD weights
end up in the 0.85 - 1.0 range for such a distribution.

So, re-weighting is definitely working for me.

-- Jim

>
> Thanks!
> sage
>
>
>
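
For readers who want to try the approach described above, here is a minimal sketch of one re-weighting pass driven by PG counts. It is not the set of scripts mentioned in the message, which were not posted. The function names, the `threshold` and `step` parameters, and the damping rule are all assumptions made for illustration; the only thing taken from the thread is the 'ceph osd reweight OSD WEIGHT' command form and the (max PGs/OSD)/(min PGs/OSD) metric. How the per-OSD PG counts are collected is left out.

```python
from statistics import mean


def max_min_ratio(pgs_per_osd):
    """Return (max PGs/OSD) / (min PGs/OSD) for a {osd_id: pg_count} map."""
    counts = list(pgs_per_osd.values())
    return max(counts) / min(counts)


def reweight_commands(pgs_per_osd, weights, threshold=1.05, step=0.5):
    """Propose 'ceph osd reweight' commands for one re-weighting pass.

    Hypothetical rule (an assumption, not from the thread): any OSD
    holding more than threshold * mean PGs has its weight moved part of
    the way (step) toward weight * mean / pg_count. OSDs missing from
    `weights` are assumed to have weight 1.0.
    """
    avg = mean(pgs_per_osd.values())
    commands = []
    for osd, count in sorted(pgs_per_osd.items()):
        if count <= threshold * avg:
            continue  # close enough to the mean, leave it alone
        current = weights.get(osd, 1.0)
        target = current * avg / count
        # Damp the change so one pass does not overshoot and overload
        # OSDs that were previously average.
        new_weight = current + step * (target - current)
        new_weight = max(0.0, min(1.0, new_weight))
        commands.append(f"ceph osd reweight {osd} {new_weight:.3f}")
    return commands
```

The intended use would be to apply the generated commands, re-measure the PG counts, and repeat until `max_min_ratio` falls below the desired bound, which matches the observation that several iterations are needed.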