From: Loic Dachary <loic@dachary.org>
To: Kyle Bader <kyle@inktank.com>, ceph-devel@vger.kernel.org
Subject: Re: Pyramid erasure codes and replica hinted recovery
Date: Sun, 12 Jan 2014 02:31:12 +0100
Message-ID: <52D1F060.8050108@dachary.org>
In-Reply-To: <CAHxYaFMa_peMz8StZpJ=if0W4LmHBDO71vNnYnmgcNMab0-3aQ@mail.gmail.com>

On 11/01/2014 00:40, Kyle Bader wrote:
> I've been researching what features might be necessary in Ceph to
> build multi-site RADOS clusters, whether for purposes of scale or to
> meet SLA requirements more stringent than is achievable with a single
> datacenter. According to [1], "typical [datacenter] availability
> estimates used in the industry range from 99.7% for tier II to 99.98%
> and 99.995% for tiers III and IV respectively". Combine the possibility
> of border and/or core networking meltdown and it's all but impossible
> to achieve a Ceph service SLA that requires 3-5 nines of availability
> in a single facility.
> 
> When we start looking at multi-site network configurations we need to
> make sure there is sufficient cluster level bandwidth for the
> following activities:
> 
> 1. Write fan-out from replication on ingest
> 2. Backfills from OSD recovery
> 3. Backfills from OSD remapping
> 
> Number 1 can be estimated based on historical usage with some
> additional padding for traffic spikes. Recovery backfills can be
> roughly estimated based on the size of the disk population in each
> facility and the OSD annualized failure rate. Number 3 makes
> multi-site configurations extremely challenging unless the
> organization building the cluster is willing to pay 7 zeros for 5
> nines.
> 
> Consider the following:
> 
> 1x 16x40GbE switch with 8x used for access ports, 8x used for
> inter-site (x4 10GbE breakout per port)
> 32x Ceph OSD nodes with a 10GbE cluster link (working out to ~3PB raw)
> 
> Topology:
> 
> [A]-----[B]
>   \       /
>    \     /
>     [C]
> 
> Since 40GbE is likely only an option if running over dark fiber,
> non-blocking multi-site would require a total of 12 leased 10GbE
> lines, 6 for 2:1, and 3 for 4:1. These lines will be extremely
> stressed every time capacity is added to the cluster, because
> PGs will be remapped and the OSD that is new to a PG will
> need to be backfilled by the primary at another site (for 3x
> replication). Erasure coding with regular MDS codes or even pyramid
> codes will exhibit similar issues, as described in [2] and [3]. It
> would be fantastic to see Ceph have a facility similar to what I
> describe in this bug for replication:
> 
> http://tracker.ceph.com/issues/7114
> 
> For erasure coding, something similar to Facebook's LRC as described
> in [2] would be advantageous. For example:
> 
> RS(8:4:2)
> 
> [k][k][k][k][k][k][k][k] -> [k][k][k][k][k][k][k][k][m][m][m][m]
> 
> Split over 3 sites
> 
> [k][k][k][k]      [k][k][k][k]      [k][k][k][k]
> 
> Generate 2 more parity units
> 
> [k][k][k][k][m][m]  [k][k][k][k][m][m]   [k][k][k][k][m][m]
> 
> Now if each *set* of units could be placed such that they share a
> common ancestor in the CRUSH hierarchy then local unit sets from the
> lower level of the pyramid could be remapped/recovered without
> consuming inter-site bandwidth (maybe treat each set as a "replica"
> instead of treating each individual unit as a "replica").
> 
> Thoughts?

If we had RS(6:3:3), that is 6 data chunks, 3 coding chunks and 3 local chunks (12 chunks in total), the following rule could be used to spread them over 3 datacenters, 4 chunks in each:

rule erasure_ruleset {
	ruleset 1
	type erasure
	min_size 3
	max_size 20
	step set_chooseleaf_tries 5
	step take root
	step choose indep 3 type datacenter
	step choose indep 4 type device
	step emit
}

crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
crushtool -d /tmp/t.map -o /tmp/t.txt # edit the ruleset as above
crushtool -c /tmp/t.txt -o /tmp/t.map
crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12
rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
rule 1 (erasure_ruleset) num_rep 12 result size == 12:	1/1

399 is in datacenter 3, node 9, device 9, and so on. The mapping shows that the first four OSDs are in datacenter 3, the next four in datacenter 0 and the last four in datacenter 2.
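
To double check the placement by hand, here is a minimal Python sketch. It assumes that crushtool --build numbers the devices sequentially, 10 per node and 10 nodes per datacenter, which is consistent with 399 landing in datacenter 3, node 9, device 9. The locate helper is only an illustration and is not part of Ceph.

```python
# Assumed layout, taken from the crushtool --build line:
# 10 devices per node, 10 nodes per datacenter, ids assigned sequentially.
DEVICES_PER_NODE = 10
NODES_PER_DATACENTER = 10


def locate(osd):
    """Return (datacenter, node, device); node and device are counted within their parent."""
    per_datacenter = DEVICES_PER_NODE * NODES_PER_DATACENTER
    datacenter, rest = divmod(osd, per_datacenter)
    node, device = divmod(rest, DEVICES_PER_NODE)
    return datacenter, node, device


# Mapping printed by crushtool for rule 1, x = 1.
mapping = [399, 344, 343, 321, 51, 78, 9, 12, 274, 263, 270, 213]
for osd in mapping:
    print(osd, locate(osd))
```

Each group of four consecutive entries should report the same datacenter if the rule behaves as intended.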

If the function calculating the erasure code spreads local chunks evenly (321, 12, 213 for instance), they will effectively be located as you suggest. Andreas may have a different view on this question though.
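
As an illustration of what "evenly" could mean, the sketch below assumes that the local chunk takes the last slot of each datacenter group of four. That is only one possible convention, chosen because it yields 321, 12 and 213; the actual erasure code function may order chunks differently. Both helper names are hypothetical.

```python
# Mapping printed by crushtool for rule 1, x = 1.
mapping = [399, 344, 343, 321, 51, 78, 9, 12, 274, 263, 270, 213]

# From "step choose indep 4 type device": four chunks per datacenter.
CHUNKS_PER_DATACENTER = 4


def split_by_site(mapping, per_site=CHUNKS_PER_DATACENTER):
    """Cut the mapping into consecutive groups, one per datacenter."""
    return [mapping[i:i + per_site] for i in range(0, len(mapping), per_site)]


def local_chunks(mapping, per_site=CHUNKS_PER_DATACENTER):
    """Assumption: the local chunk occupies the last slot of each group."""
    return [group[-1] for group in split_by_site(mapping, per_site)]


print(split_by_site(mapping))
print(local_chunks(mapping))
```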

If 78 goes missing (and assuming all other chunks are good), it can be rebuilt from 51, 9 and 12 only, all of which are in datacenter 0. However, if the primary driving the reconstruction is 270, which is in datacenter 2, data will need to cross datacenter boundaries. Would it be cheaper to elect the primary closest (in the sense of get_common_ancestor_distance https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487) to the OSD to be recovered? Only Sam or David could give you an authoritative answer.
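
A rough Python sketch of the idea, under the same sequential numbering assumption (10 devices per node, 10 nodes per datacenter). The distance function is a hypothetical stand-in with three levels, not the get_common_ancestor_distance from CrushWrapper.h, and closest_primary is not how Ceph elects a primary today.

```python
DEVICES_PER_NODE = 10
NODES_PER_DATACENTER = 10


def locate(osd):
    """Return (datacenter, node) for an OSD id, assuming sequential numbering."""
    datacenter, rest = divmod(osd, DEVICES_PER_NODE * NODES_PER_DATACENTER)
    return datacenter, rest // DEVICES_PER_NODE


def distance(a, b):
    """Hypothetical distance: 0 same node, 1 same datacenter, 2 otherwise."""
    dc_a, node_a = locate(a)
    dc_b, node_b = locate(b)
    if dc_a != dc_b:
        return 2
    return 0 if node_a == node_b else 1


def local_helpers(mapping, lost):
    """Surviving chunks that share a datacenter with the lost one."""
    return [o for o in mapping if o != lost and locate(o)[0] == locate(lost)[0]]


def closest_primary(mapping, lost):
    """Pick the surviving OSD with the smallest distance to the lost one."""
    return min((o for o in mapping if o != lost), key=lambda o: distance(o, lost))


mapping = [399, 344, 343, 321, 51, 78, 9, 12, 274, 263, 270, 213]
print(local_helpers(mapping, 78))    # chunks usable without leaving the datacenter
print(distance(270, 78))             # 270 sits in another datacenter
print(closest_primary(mapping, 78))  # a primary local to the lost chunk
```

With a primary picked this way, the reads needed to rebuild 78 would stay inside its datacenter, provided the local chunk is enough to decode.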

Cheers

> 
> [1] http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024
> [2] http://arxiv.org/pdf/1301.3791.pdf
> [3] https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36737.pdf
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

