From: Loic Dachary <loic@dachary.org>
To: Kyle Bader <kyle@inktank.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Pyramid erasure codes and replica hinted recovery
Date: Sun, 12 Jan 2014 20:37:28 +0100 [thread overview]
Message-ID: <52D2EEF8.8020009@dachary.org> (raw)
In-Reply-To: <CAHxYaFOg6N2ZP1=OuOWajXUqMFtGsq0u94DG=Ub8NcAaqFtj+g@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3717 bytes --]
On 12/01/2014 15:31, Kyle Bader wrote:
>> If we had RS(6:3:3) 6 data chunks, 3 coding chunks, 3 local chunks, the following rule could be used to spread it over 3 datacenters:
>>
>> rule erasure_ruleset {
>> ruleset 1
>> type erasure
>> min_size 3
>> max_size 20
>> step set_chooseleaf_tries 5
>> step take root
>> step choose indep 3 type datacenter
>> step choose indep 4 type device
>> step emit
>> }
>>
>> crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
>> crushtool -d /tmp/t.map -o /tmp/t.txt # edit the ruleset as above
>> crushtool -c /tmp/t.txt -o /tmp/t.map ; crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12
>> rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
>> CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
>> rule 1 (erasure_ruleset) num_rep 12 result size == 12: 1/1
>>
>> 399 is in datacenter 3, node 9, device 9 etc. It shows that the first four are in datacenter 3, the next in datacenter zero and the last four in datacenter 2.
>>
>> If the function calculating erasure code spreads local chunks evenly ( 321, 12, 213 for instance ), they will effectively be located as you suggest. Andreas may have a different view on this question though.
>>
>> In case 78 goes missing ( and assuming all other chunks are good ), it can be rebuilt with 512, 9, 12 only. However, if the primary driving the reconstruction is 270, data will need to cross datacenter boundaries. Would it be cheaper to elect a primary closest ( in the sense of get_common_ancestor_distance https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487 ) to the OSD to be recovered ? Only Sam or David could give you an authoritative answer.
>
There is a typo : "rebuilt with 51, 9, 12"
> That scheme does work out, except I'm not sure how the
> recovering/remapped OSD can makes sure to read the local copies. I'll
> wait to hear what Sam and David have to say. That said, what do you
> think of this diagram of a 12,2,2:
>
> http://storagemojo.com/wp-content/uploads/2012/09/LRC_parity.PNG from [4]
>
> With a 12,2,2 LRC code the stripe is split into 12 units, those 12
> units are used to generate 2 global parity units. The original 12
> units are then split into 2 groups, and each group separately
> generates an additional local parity unit. Each group of 6 original
> units and 1 parity unit are passed through CRUSH for placement with a
> common ancestor (for local recovery). This would be low overhead,
> certainly less than a replicated, and be friendlier on inter-dc links
> during remapping and recovery.
>
> [4] http://storagemojo.com/2012/09/26/more-efficient-erasure-coding-in-windows-azure-storage/
>
How is it different from what is described above ? There must be something I fail to understand.
In the above example
CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
would be
datacenter 3
399 data
344 data
343 global parity
321 local parity
datacenter 0
51 data
78 data
9 global parity
12 local parity
datacenter 2
274 data
263 data
270 global parity
213 local parity
If 9 is missing, recovery looks for the chunks that have the closest common ancestor, finds 51, 78, 12 and since there only is one missing chunk uses local parity chunk 12 to rebuild it. If 9 and 12 are missing, it needs all remaining data + global parity chunks to rebuild. In this example the overhead of erasure coding is high but increasing the number of data chunks reduces it.
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
next prev parent reply other threads:[~2014-01-12 19:37 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-10 23:40 Pyramid erasure codes and replica hinted recovery Kyle Bader
2014-01-12 1:31 ` Loic Dachary
2014-01-12 14:31 ` Kyle Bader
2014-01-12 19:37 ` Loic Dachary [this message]
2014-01-13 2:35 ` Kyle Bader
2014-01-13 8:38 ` Loic Dachary
2014-01-13 11:15 ` Andreas Joachim Peters
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52D2EEF8.8020009@dachary.org \
--to=loic@dachary.org \
--cc=ceph-devel@vger.kernel.org \
--cc=kyle@inktank.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.