From mboxrd@z Thu Jan 1 00:00:00 1970 From: Loic Dachary Subject: Re: FW: CURSH optimization for unbalanced pg distribution Date: Tue, 09 Sep 2014 15:36:18 +0200 Message-ID: <540F0252.7040609@dachary.org> References: <51FC7A40FB29414D88A121A7FFEF9A4710DC2188@SHSMSX104.ccr.corp.intel.com> <53299E33.5060809@inktank.com> <51FC7A40FB29414D88A121A7FFEF9A4710DC3018@SHSMSX104.ccr.corp.intel.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="2TebQ523cGe7e1KTWNl4uuRAvMkVXG2I2" Return-path: Received: from mail2.dachary.org ([91.121.57.175]:48192 "EHLO smtp.dmail.dachary.org" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756614AbaIINg2 (ORCPT ); Tue, 9 Sep 2014 09:36:28 -0400 In-Reply-To: <51FC7A40FB29414D88A121A7FFEF9A4710DC3018@SHSMSX104.ccr.corp.intel.com> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: "Zhang, Jian" , "ceph-devel@vger.kernel.org" This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --2TebQ523cGe7e1KTWNl4uuRAvMkVXG2I2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On 20/03/2014 04:54, Zhang, Jian wrote: > Forwarding per Sage's suggestion.=20 Very interesting discussion :-) For the record the corresponding pull req= uest is https://github.com/ceph/ceph/pull/2402 >=20 >=20 > -----Original Message----- > From: Sage Weil [mailto:sage@inktank.com]=20 > Sent: Wednesday, March 19, 2014 11:29 PM > To: Mark Nelson > Cc: Zhang, Jian; Duan, Jiangang; He, Yujie > Subject: Re: CURSH optimization for unbalanced pg distribution >=20 > On Wed, 19 Mar 2014, Mark Nelson wrote: >> On 03/19/2014 03:24 AM, Zhang, Jian wrote: >>> For more detail data, please refer to the *Testing results* part. >>> >>> *Optimization proposals: * >>> >>> After we dived into the source code of CRUSH and related papers, we=20 >>> proposed two possible optimizations: >>> >>> 1.Add different hash algorithms, as an alternative for the Jenkin's=20 >>> hash, e.g. algorithm that will produce even values when range of=20 >>> input value (pg#) is relatively small. Or add new bucket type at the = >>> same time if necessary. >=20 > This *might* work, but I don't have a strong intuition about it. The m= odeling we've done now has essentially assumed a statistically uniform di= stribution, which has some inherent inbalance for low values of n (num pg= s in our case). I have generally assumed we can't do better than "random= ", and still have the other properties we want (independent, deterministi= c placement), but it may be possible. >=20 >>> >>> 2.Find a better replica placement strategy instead of current retry >>> logic of crush_choose_firstn, which may cause CRUSH to behave badly. >>> >>> We find there are several threshold of retry times by referring to co= de, >>> choose_total_tries, choose_local_tries and choose_local_fallback_trie= s. >>> They are used to decide whether to do a retry_bucket, retry_descent o= r >>> use permutation to do an exhaustive bucket search. We are wondering i= f >>> there is another retry strategy: >>> >>> a)Backtracking retry. Now the logic of crush_choose_firstn can only >>> issue an retry either from the initial bucket(retry_descent) or from = the >>> current bucket (retry_bucket), how about retrying the intervening buc= kets? >>> >>> b)Adjust threshold of retry times by other values. We do noticed that= >>> the 'optimal' crush tunable could be used to make it, but we still >>> encounter unbalanced [g distribution by using the optimal strategy. >>> Please refer to 4 of the Testing results part. >>> >>> c)Add an mechanism that can adjust above mentioned thresholds >>> adaptively. Maybe we can record the retry times of the previous call = for >>> CRUSH, and adjust retry thresholds automatically according to the rec= ord. >=20 > I suggest ignoring all of this retry logic. The original version of CR= USH=20 > has the local retries to try to make data move "less far", but when we = > went back a year ago and did a statistical analysis of the distribution= we=20 > found that *all* of these hacks degraded the quality of the placement,a= nd=20 > by turning them all off (setting the 'optimal' values which zeroes them= =20 > all out excent for total_retries) we got something that was=20 > indistinguishable from a uniform distribution. >=20 >>> 3.Add soft link for pg directories. During pg creation, we can create= >>> soft links for the pgs if pg# on the selected osd is more than some >>> threshold, say 10% more than desired average number, to move objects >>> that will be stored in this pg to another osd. Balanced disk utilizat= ion >>> may be gained in this way. >=20 > I think you need to be careful, but yes, this is an option. There is a= =20 > similar exception mechanism in place that is used for other purposes an= d=20 > something similar could be done here. The main challenge will be in=20 > ensuring that the soft links/exceptions follow the same overall policy = > that CRUSH does after the raw mapping is performed. This is an option,= =20 > but I would put it toward the bottom of the list... >=20 >>> 4.Change placement strategy only for step of selecting devices from >>> hosts. We found in our testing results that pg distribution was balan= ced >>> among hosts, which is reasonable since pg# of each host is above 1K >>> (according to the current BKM that pg# per osd should be about 100). = So >>> how about we apply CRUSH only on the interval buckets and find anothe= r >>> simple but more balanced method to choose osd from host? >=20 > This idea has a lot of potential. For example: >=20 > If you know the chassis can hold 12 disks, you can force the bucket siz= e=20 > to twelve and somehow prevent users from adjusting the structure of the= =20 > tree. Then you can use a simple mapping that is truly flat (like a lin= ear=20 > mapping, disk =3D x % num_disks) for that bucket/subtree. The downside= of=20 > course is that if you remove a disk *everything* reshuffles, hence some= =20 > sort of guardrails to prevent a user from inadvertantly doing that. If= a=20 > disk *does* fail, you just need to make sure the disk is marked "out" b= ut=20 > not removed from the CRUSH hierarchy and the normal retry will kick in.= >=20 > Note that all this is reall doing is increasing the size of the "bucket= s"=20 > that we are (pseudo)randomly distribution over. It is still a=20 > random/uniform distribution, but the N value is 12 times bigger (for a = 12=20 > disk chassis) and as a result the variance is substantially lower. >=20 > I would suggest making a new bucket type that is called 'linear' and do= es=20 > a simple modulo and trying this out. We will need a bunch of additiona= l=20 > safety checks to help users avoid doing silly things (like adjusting th= e=20 > number of items in the linear buckets, which reshuffle everything) but = > that wouldn't be needed for an initial analysis of the performance impa= ct. >=20 > Do you mind if we shift this thread over to ceph-devel? I think there = are=20 > lots of people who would be interested in this discussion. We can of=20 > course leave off your attachment if you prefer. >=20 > Thanks! > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" i= n > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >=20 --=20 Lo=EFc Dachary, Artisan Logiciel Libre --2TebQ523cGe7e1KTWNl4uuRAvMkVXG2I2 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEUEARECAAYFAlQPAlIACgkQ8dLMyEl6F23xRgCgouC0xqGtrYSzoZ9QyYzieNgL eEAAmNicEEQdNW1S51UsSJmL3X6OtjM= =/xuV -----END PGP SIGNATURE----- --2TebQ523cGe7e1KTWNl4uuRAvMkVXG2I2--