From: Loic Dachary
Subject: Re: Pyramid erasure codes and replica hinted recovery
Date: Sun, 12 Jan 2014 02:31:12 +0100
Message-ID: <52D1F060.8050108@dachary.org>
To: Kyle Bader, ceph-devel@vger.kernel.org

On 11/01/2014 00:40, Kyle Bader wrote:
> I've been researching what features might be necessary in Ceph to
> build multi-site RADOS clusters, whether for purposes of scale or to
> meet SLA requirements more stringent than is achievable with a single
> datacenter. According to [1], "typical [datacenter] availability
> estimates used in the industry range from 99.7% for tier II to 99.98
> and 99.995% for tiers III and IV respectively". Combine the possibility
> of border and/or core networking meltdown and it's all but impossible
> to achieve a Ceph service SLA that requires 3-5 nines of availability
> in a single facility.
>
> When we start looking at multi-site network configurations we need to
> make sure there is sufficient cluster level bandwidth for the
> following activities:
>
> 1. Write fan-out from replication on ingest
> 2. Backfills from OSD recovery
> 3. Backfills from OSD remapping
>
> Number 1 can be estimated based on historical usage with some
> additional padding for traffic spikes.
> Recovery backfills can be
> roughly estimated based on the size of the disk population in each
> facility and the OSD annualized failure rate. Number 3 makes
> multi-site configurations extremely challenging unless the
> organization building the cluster is willing to pay 7 zeros for 5
> nines.
>
> Consider the following:
>
> 1x 16x40GbE switch with 8x used for access ports, 8x used for
> inter-site (x4 10GbE breakout per port)
> 32x Ceph OSD nodes with a 10GbE cluster link (working out to ~3PB raw)
>
> Topology:
>
> [A]-----[B]
>   \     /
>    \   /
>     [C]
>
> Since 40GbE is likely only an option if running over dark fiber,
> non-blocking multi-site would require a total of 12 leased 10GbE
> lines, 6 for 2:1, and 3 for 4:1. These lines will be extremely
> stressed each and every time capacity is added to the cluster due to
> the fact that pgs will be remapped and the OSD that is new to the PG
> needing to be backfilled by the primary at another site (for 3x
> replication). Erasure coding with regular MDS codes or even pyramid
> codes will exhibit similar issues, as described in [2] and [3]. It
> would be fantastic to see Ceph have a facility similar to what I
> describe in this bug for replication:
>
> http://tracker.ceph.com/issues/7114
>
> For erasure coding, something similar to Facebook's LRC as described
> in [2] would be advantageous. For example:
>
> RS(8:4:2)
>
> [k][k][k][k][k][k][k][k] -> [k][k][k][k][k][k][k][k][m][m][m][m]
>
> Split over 3 sites
>
> [k][k][k][k] [k][k][k][k] [k][k][k][k]
>
> Generate 2 more parity units
>
> [k][k][k][k][m][m] [k][k][k][k][m][m] [k][k][k][k][m][m]
>
> Now if each *set* of units could be placed such that they share a
> common ancestor in the CRUSH hierarchy then local unit sets from the
> lower level of the pyramid could be remapped/recovered without
> consuming inter-site bandwidth (maybe treat each set as a "replica"
> instead of treating each individual unit as a "replica").
>
> Thoughts?

If we had RS(6:3:3), that is 6 data chunks, 3 coding chunks and 3 local chunks, the following rule could be used to spread it over 3 datacenters:

rule erasure_ruleset {
        ruleset 1
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take root
        step choose indep 3 type datacenter
        step choose indep 4 type device
        step emit
}

crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
crushtool -d /tmp/t.map -o /tmp/t.txt # edit the ruleset as above
crushtool -c /tmp/t.txt -o /tmp/t.map ; crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12
rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
rule 1 (erasure_ruleset) num_rep 12 result size == 12: 1/1

399 is in datacenter 3, node 9, device 9, etc. The output shows that the first four OSDs are in datacenter 3, the next four in datacenter 0 and the last four in datacenter 2.

If the function calculating the erasure code spreads local chunks evenly (321, 12, 213 for instance), they will effectively be located as you suggest. Andreas may have a different view on this question though.

In case 78 goes missing (and assuming all other chunks are good), it can be rebuilt with 51, 9, 12 only. However, if the primary driving the reconstruction is 270, data will need to cross datacenter boundaries. Would it be cheaper to elect the primary closest (in the sense of get_common_ancestor_distance https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487) to the OSD to be recovered? Only Sam or David could give you an authoritative answer.
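
To make the locality argument concrete, here is a rough Python sketch. It is not Ceph code. It assumes that the map built by crushtool above numbers OSDs sequentially (10 devices per node, 10 nodes per datacenter), which is what the "399 is in datacenter 3, node 9, device 9" reading implies. The helpers locate, distance and closest_primary are hypothetical names, and distance is only a simplified stand-in for the idea behind get_common_ancestor_distance, not its actual signature or behaviour.

```python
# Sketch only: assumes the sequential layout produced by
# "crushtool --build node straw 10 datacenter straw 10 root straw 0",
# i.e. 10 OSDs per node and 10 nodes per datacenter.
OSDS_PER_NODE = 10
NODES_PER_DATACENTER = 10


def locate(osd):
    """Return (datacenter, node within datacenter, device within node)."""
    node, device = divmod(osd, OSDS_PER_NODE)
    datacenter, node_in_dc = divmod(node, NODES_PER_DATACENTER)
    return datacenter, node_in_dc, device


def distance(a, b):
    """Hypothetical stand-in for a common ancestor distance:
    0 = same node, 1 = same datacenter, 2 = different datacenters."""
    dc_a, node_a, _ = locate(a)
    dc_b, node_b, _ = locate(b)
    if dc_a != dc_b:
        return 2
    return 0 if node_a == node_b else 1


def closest_primary(lost, candidates):
    """Pick the surviving OSD nearest to the lost one (ties: first listed)."""
    return min(candidates, key=lambda osd: distance(osd, lost))


# Mapping reported by crushtool --test above.
mapping = [399, 344, 343, 321, 51, 78, 9, 12, 274, 263, 270, 213]
lost = 78
survivors = [osd for osd in mapping if osd != lost]

print(locate(399))                            # (3, 9, 9)
print([locate(osd)[0] for osd in mapping])    # datacenter of each chunk
print(closest_primary(lost, survivors))       # 51
```

With this toy distance, any of 51, 9 or 12 would keep the rebuild of 78 inside datacenter 0, whereas a primary such as 270 sits at distance 2 and forces the local chunks across a datacenter boundary.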

Cheers

>
> [1] http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024
> [2] http://arxiv.org/pdf/1301.3791.pdf
> [3] https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36737.pdf
>

-- 
Loïc Dachary, Artisan Logiciel Libre