ceph-devel.vger.kernel.org archive mirror
* Pyramid erasure code description revisited
@ 2014-05-31 17:10 Loic Dachary
  2014-06-02 12:20 ` Andreas Joachim Peters
  2014-06-05 14:05 ` Locally repairable code description revisited (was Pyramid ...) Loic Dachary
  0 siblings, 2 replies; 10+ messages in thread
From: Loic Dachary @ 2014-05-31 17:10 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

Hi Andreas,

After a few weeks and with a fresh eye, I revisited the way a pyramid erasure code could be described by the system administrator. Here is a proposal that is hopefully more intuitive than the one from the last CDS ( http://pad.ceph.com/p/cdsgiant-pyramid-erasure-code ).

These are the steps to create all coding chunks. The upper case letters are data chunks and the lower case letters are coding chunks.

"__ABC__DE_" data chunk placement

Step 1
"__ABC__DE_"
"_yVWX_zYZ_" K=5, M=2
"_aABC_bDE_"

Step 2
"_aABC_bDE_"
"z_XYZ_____" K=3, M=1
"caABC_bDE_"

Step 3
"caABC_bDE_"
"_____zXYZ_" K=3, M=1
"caABCdbDE_"

Step 4
"caABCdbDE_"
"_____WXYZz" K=4, M=1
"caABCdbDEe"

The interpretation of Step 3 is as follows:

Given the output of the previous step ( "caABC_bDE_" ), the bDE chunks are considered to be data chunks at this stage and they are marked with XYZ. A K=3, M=1 coding chunk is calculated and placed in the chunk marked with z ( "_____zXYZ_" ). The output of this coding step is the previous step plus the coding chunk that was just calculated, named d ( "caABCdbDE_" ). 

This gives the flexibility of deciding whether or not a coding chunk from a previous step is used as data to compute the coding chunk of the next step. It also allows for unbalanced steps such as step 4.
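The step strings can be processed mechanically. As a toy illustration (Python, not the proposed plugin code), each step string directly yields its K data positions (upper case marks) and M coding positions (lower case marks):

```python
# Parse one step of the proposed description language: uppercase
# letters mark the chunks used as data for this step, lowercase
# letters mark the positions where the coding chunks are placed.

def parse_step(step):
    """Return (data_positions, coding_positions) for a step string."""
    data = [i for i, c in enumerate(step) if c.isupper()]
    coding = [i for i, c in enumerate(step) if c.islower()]
    return data, coding

steps = [
    "_yVWX_zYZ_",  # step 1: K=5, M=2
    "z_XYZ_____",  # step 2: K=3, M=1
    "_____zXYZ_",  # step 3: K=3, M=1
    "_____WXYZz",  # step 4: K=4, M=1
]

for n, s in enumerate(steps, 1):
    data, coding = parse_step(s)
    print("step %d: K=%d data at %s, M=%d coding at %s"
          % (n, len(data), data, len(coding), coding))
```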

For decoding, the steps are walked from the bottom up. If E is missing, it can be reconstructed from dbD.e in step 4 and the other steps are skipped because it was the only missing chunk. If A and B are missing, all steps that have not been used to encode them are ignored, up to step 2, which will fail to recover them because M=1 and yield to step 1, which will use a..CbDE successfully because M=2.
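The bottom-up walk can be sketched as a toy Python model of the example layout "caABCdbDEe" (positions and M values read off the steps above; this ignores cascading repairs and is only an illustration, not the actual decoder):

```python
# Toy model of the example layout "caABCdbDEe": for each step, the
# set of chunk positions it involves (data + coding) and its M.
STEPS = [
    ({1, 2, 3, 4, 6, 7, 8}, 2),  # step 1: a ABC b DE, M=2
    ({0, 2, 3, 4}, 1),           # step 2: c ABC, M=1
    ({5, 6, 7, 8}, 1),           # step 3: d bDE, M=1
    ({5, 6, 7, 8, 9}, 1),        # step 4: dbDE e, M=1
]

def decode(missing):
    """Walk the steps bottom-up; a step repairs its lost chunks when
    at most M of its chunks are unavailable. Returns the step numbers
    used and whatever could not be repaired."""
    missing = set(missing)
    used = []
    for i in range(len(STEPS) - 1, -1, -1):
        positions, m = STEPS[i]
        lost = missing & positions
        if lost and len(lost) <= m:
            used.append(i + 1)
            missing -= lost        # repaired by this step
        if not missing:
            break
    return used, missing

# E is at position 8: step 4 alone repairs it.
print(decode([8]))
# A and B (positions 2, 3): step 2 fails (M=1), step 1 succeeds (M=2).
print(decode([2, 3]))
```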

Giving up recursion in favor of iteration seems to simplify how it can be explained. And I suspect the implementation is also simpler. What do you think ?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* RE: Pyramid erasure code description revisited
  2014-05-31 17:10 Pyramid erasure code description revisited Loic Dachary
@ 2014-06-02 12:20 ` Andreas Joachim Peters
  2014-06-02 13:14   ` Loic Dachary
  2014-06-05 14:05 ` Locally repairable code description revisited (was Pyramid ...) Loic Dachary
  1 sibling, 1 reply; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-06-02 12:20 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic, 

I think this gives all the flexibility to define any possible combination for encoding ...

When one constructs the steps, one just has to be aware that the 'most local' encoding should happen at the end, right?

It would be useful to have a tool which then outputs, for each data and parity chunk, the achieved 'redundancy', as well as the overall volume and the maximal reconstruction 'overhead'.

Cheers Andreas.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Pyramid erasure code description revisited
  2014-06-02 12:20 ` Andreas Joachim Peters
@ 2014-06-02 13:14   ` Loic Dachary
       [not found]     ` <1401733713.18379.YahooMailNeo@web165006.mail.bf1.yahoo.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-06-02 13:14 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development, Koleos Fuscus

[-- Attachment #1: Type: text/plain, Size: 3442 bytes --]

Hi Andreas,

On 02/06/2014 14:20, Andreas Joachim Peters wrote:
> Hi Loic,
> 
> I think this gives all the flexibility to define any possible combination for encoding ...
> 
> When one constructs the steps one has just to be aware that the 'most local' encoding should happen in the end, right?

Yes. 

> 
> It would be useful to have a tool which then outputs, for each data and parity chunk, the achieved 'redundancy', as well as the overall volume and the maximal reconstruction 'overhead'.

Right. I'm kind of hoping koleosfuscus (cc'ed) will be able to fit that into the reliability model, but we've not discussed that yet. In any case you are right, a small command line tool would be helpful. Something that would explain: if you lose one of the chunks you need four to recover; if you lose two you need all of them. That's more human-readable and understandable than the full description ;-)
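A first cut of such a tool could be derived directly from the step descriptions. A toy Python sketch for the example layout "caABCdbDEe" (illustrative only: it counts, for each step that covers a chunk, how many other chunks that step would read; the actual decoder walks bottom-up and may pick a different step):

```python
# For each chunk of the example layout "caABCdbDEe", list the steps
# that can repair it and the number of chunks each would read.
STEPS = {
    1: ({1, 2, 3, 4, 6, 7, 8}, 2),  # a ABC b DE, M=2
    2: ({0, 2, 3, 4}, 1),           # c ABC, M=1
    3: ({5, 6, 7, 8}, 1),           # d bDE, M=1
    4: ({5, 6, 7, 8, 9}, 1),        # dbDE e, M=1
}

def repair_options(chunk):
    """(step, chunks read) for every step that covers the chunk;
    repairing one chunk inside a step reads the step's other chunks."""
    return [(s, len(pos) - 1) for s, (pos, m) in STEPS.items()
            if chunk in pos]

for chunk in range(10):
    options = repair_options(chunk)
    cheapest = min(reads for _, reads in options)
    print("chunk %d: repairable by steps %s, cheapest reads %d"
          % (chunk, [s for s, _ in options], cheapest))
```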

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* Re: Pyramid erasure code description revisited
       [not found]     ` <1401733713.18379.YahooMailNeo@web165006.mail.bf1.yahoo.com>
@ 2014-06-02 18:49       ` Loic Dachary
  0 siblings, 0 replies; 10+ messages in thread
From: Loic Dachary @ 2014-06-02 18:49 UTC (permalink / raw)
  To: Koleos Fuskus; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 4768 bytes --]

Hi koleosfuscus,

A simpler proposal was made a few days ago. As you rightly point out, the previous one was a bit complicated to understand ;-)

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/19753

Cheers

On 02/06/2014 20:28, Koleos Fuskus wrote:
> Hi Loic,
> I am trying to understand your proposal on http://pad.ceph.com/p/cdsgiant-pyramid-erasure-code
> Is the mapping specification a new feature on CRUSH to support Pyramid Codes? 
> I don't follow from line 72, when you are talking about "crush multidatacenter mapping". 
> Adding a failure domain typically adds a new level in the pyramid using xor?
> 
> "** if one chunk is missing, will return that all chunks from the local cluster are needed"
> If one chunk is missing, it recovers it using xor instead of jerasure?
> "** if two chunks are missing in the same local cluster, it will defer to the global level"
> In this case the pyramid code doesn't help, does it?
> ** if two chunks are missing, each of them in a different local cluster, it will return that it needs all chunks from both local cluster but will not defer to the upper level
> 
> Best,
> koleos

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* Re: Locally repairable code description revisited (was Pyramid ...)
  2014-05-31 17:10 Pyramid erasure code description revisited Loic Dachary
  2014-06-02 12:20 ` Andreas Joachim Peters
@ 2014-06-05 14:05 ` Loic Dachary
  2014-06-06 11:46   ` Andreas Joachim Peters
  1 sibling, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-06-05 14:05 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 2491 bytes --]

Hi Andreas,

Here is a preliminary implementation https://github.com/ceph/ceph/pull/1921 . It does not have tests yet, nor does it run, but it compiles and I think it makes sense. It is indeed simpler than the previous implementation.

It would be great if you could quickly browse 

   https://github.com/dachary/ceph/commit/5661c72864eca04c5dac9494a3f7e7a13515fca6

and let me know if you have any concerns.

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* RE: Locally repairable code description revisited (was Pyramid ...)
  2014-06-05 14:05 ` Locally repairable code description revisited (was Pyramid ...) Loic Dachary
@ 2014-06-06 11:46   ` Andreas Joachim Peters
  2014-06-06 14:30     ` Loic Dachary
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-06-06 11:46 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic,
the basic implementation looks very clean.

I have a few comments/ideas:

- the reconstruction strategy using the three levels is certainly efficient enough for standard cases, but it does not always guarantee minimal decoding (in cases where one layer is not enough to reconstruct), since your third algorithm is just brute force, reconstructing everything through all layers until we have what we need ...

- the whole LRC configuration actually does not describe the placement - it still looks disconnected from the placement strategy/crush rules ... wouldn't it make sense to have the crush rule implicit in the description, or a function to derive it automatically from the LRC configuration? Maybe you have already done this in another way and I didn't see it ...

-  should the plug-in have the ability to select reconstruction based on proximity, or should this be up to the higher layer to provide chunks in a way that reconstruction would select the 'closest' layer? The relevance of the question will become clearer in the next point ....

- I remember we had this 3 data centre example with (8,4) where you can reconstruct every object if 2 data centres are up. Another appealing example, avoiding remote access when reading an object, is to have 2 data centres each holding a copy of e.g. (4,2) encoded objects. Can your LRC configuration language describe storing the same chunk twice, like    __ABCCBA__ ?

Cheers Andreas.




* Re: Locally repairable code description revisited (was Pyramid ...)
  2014-06-06 11:46   ` Andreas Joachim Peters
@ 2014-06-06 14:30     ` Loic Dachary
  2014-06-09 20:18       ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-06-06 14:30 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 6164 bytes --]

Hi Andreas,

On 06/06/2014 13:46, Andreas Joachim Peters wrote:
> Hi Loic,
> the basic implementation looks very clean.
> 
> I have a few comments/ideas:
> 
> - the reconstruction strategy using the three levels is certainly efficient enough for standard cases, but it does not always guarantee minimal decoding (in cases where one layer is not enough to reconstruct), since your third algorithm is just brute force, reconstructing everything through all layers until we have what we need ...

The third strategy is indeed brute force. Do you think it is worth changing it to be minimal ? It would be nice to quantify the percentage of cases it addresses. Do you know how to do that ? It looks like a very small percentage but there is no proof it is small ;-)
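For the example layout in this thread, a rough number can be obtained by enumerating loss patterns and checking whether a single step covers them; a toy Python sketch (illustrative only, it ignores which strategy the plugin actually picks):

```python
from itertools import combinations

# Chunk positions used by each step of the example layout
# "caABCdbDEe", together with the step's M.
STEPS = [
    ({1, 2, 3, 4, 6, 7, 8}, 2),  # step 1: a ABC b DE, M=2
    ({0, 2, 3, 4}, 1),           # step 2: c ABC, M=1
    ({5, 6, 7, 8}, 1),           # step 3: d bDE, M=1
    ({5, 6, 7, 8, 9}, 1),        # step 4: dbDE e, M=1
]

def single_step(lost):
    """True if one step alone can repair all the lost chunks."""
    return any(lost <= pos and len(lost) <= m for pos, m in STEPS)

def fraction(size):
    """(repairable in one step, total) over loss patterns of a size."""
    patterns = [set(c) for c in combinations(range(10), size)]
    return sum(single_step(p) for p in patterns), len(patterns)

print(fraction(1))  # every single loss is covered by some step
print(fraction(2))  # only step 1 (M=2) can cover a double loss alone
```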

> - the whole LRC configuration actually does not describe the placement - it still looks disconnected from the placement strategy/crush rules ... wouldn't it make sense to have the crush rule implicit in the description or a function to derive it automatically based on the LRC configuration? Maybe you have this already done in another way and I didn't see it ...

Good catch. 

What about this:

      "  [ \"_aAAA_aAA_\", \"set choose datacenter 2\","  
      "    \"_aXXX_aXX_\" ],"
      "  [ \"b_BBB_____\", \"set choose host 5\","
      "    \"baXXX_____\" ],"
      "  [ \"_____cCCC_\", \"\","
      "    \"baXXXcaXX_\" ],"
      "  [ \"_____DDDDd\", \"\","
      "    \"baXXXcaXXd\" ],"

Which translates into

take root
set choose datacenter 2
set choose host 5

In other words, the ruleset is created by concatenating the strings from the description, without any kind of smart computation. It is up to the person who creates the description to add the ruleset near a description that makes sense. There is going to be minimal checking to make sure the ruleset can actually be used to get the required number of chunks.

It probably is very difficult and very confusing to automate the generation of the ruleset. If it is implicit rather than explicit as above, the operator will have to somehow understand and learn how it is computed to make sure it does what is desired. With an explicit set of crush rules loosely coupled to chunk mapping, the operator can read the crush documentation instead of guessing.
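The concatenation itself can be sketched in a few lines; an illustrative Python toy of the idea (not the plugin code), using the entries above:

```python
# Build the crush ruleset by concatenating the rule fragments that
# appear next to each mapping in the description, in order.
description = [
    ("_aAAA_aAA_", "set choose datacenter 2"),
    ("b_BBB_____", "set choose host 5"),
    ("_____cCCC_", ""),
    ("_____DDDDd", ""),
]

def build_ruleset(description):
    """Prepend 'take root', then append each non-empty fragment."""
    steps = ["take root"]
    steps += [rule for _, rule in description if rule]
    return steps

print("\n".join(build_ruleset(description)))
```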

> -  should the plug-in have the ability to select reconstruction based on proximity, or should this be up to the higher layer to provide chunks in a way that reconstruction would select the 'closest' layer? The relevance of the question will become clearer in the next point ....
> 
> - I remember we had this 3 data centre example with (8,4) where you can reconstruct every object if 2 data centres are up. Another appealing example, avoiding remote access when reading an object, is to have 2 data centres each holding a copy of e.g. (4,2) encoded objects. Can your LRC configuration language describe storing the same chunk twice, like    __ABCCBA__ ?

Unless I'm mistaken that would require the caller of the plugin to support duplicate data chunks and provide a kind of proximity check. Since this is not currently supported by the OSD logic, it is difficult to figure out how an erasure code plugin could provide support for this use case.

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* Re: Locally repairable code description revisited (was Pyramid ...)
  2014-06-06 14:30     ` Loic Dachary
@ 2014-06-09 20:18       ` Gregory Farnum
  2014-06-09 20:38         ` Samuel Just
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2014-06-09 20:18 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, Ceph Development

On Fri, Jun 6, 2014 at 7:30 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Andreas,
>
> On 06/06/2014 13:46, Andreas Joachim Peters wrote:
>> Hi Loic,
>> the basic implementation looks very clean.
>>
>> I have a few comments/ideas:
>>
>> - the reconstruction strategy using the three levels is certainly efficient enough for standard cases, but it does not always guarantee minimal decoding (in cases where one layer is not enough to reconstruct), since your third algorithm is just brute force, reconstructing everything through all layers until we have what we need ...
>
> The third strategy is indeed brute force. Do you think it is worth changing it to be minimal ? It would be nice to quantify the percentage of cases it addresses. Do you know how to do that ? It looks like a very small percentage but there is no proof it is small ;-)
>
>> - the whole LRC configuration actually does not describe the placement - it still looks disconnected from the placement strategy/crush rules ... wouldn't it make sense to have the crush rule implicit in the description or a function to derive it automatically based on the LRC configuration? Maybe you have this already done in another way and I didn't see it ...
>
> Good catch.
>
> What about this:
>
>       "  [ \"_aAAA_aAA_\", \"set choose datacenter 2\","
>       "    \"_aXXX_aXX_\" ],"
>       "  [ \"b_BBB_____\", \"set choose host 5\","
>       "    \"baXXX_____\" ],"
>       "  [ \"_____cCCC_\", \"\","
>       "    \"baXXXcaXX_\" ],"
>       "  [ \"_____DDDDd\", \"\","
>       "    \"baXXXcaXXd\" ],"
>
> Which translates into
>
> take root
> set choose datacenter 2
> set choose host 5
>
> In other words, the ruleset is created by concatenating the strings from the description, without any kind of smart computation. It is up to the person who creates the description to add the ruleset near a description that makes sense. There is going to be minimal checking to make sure the ruleset can actually be used to get the required number of chunks.
>
> It probably is very difficult and very confusing to automate the generation of the ruleset. If it is implicit rather than explicit as above, the operator will have to somehow understand and learn how it is computed to make sure it does what is desired. With an explicit set of crush rules loosely coupled to chunk mapping, the operator can read the crush documentation instead of guessing.

I think I'm missing some context for this discussion (maybe I haven't
been reading other threads closely enough); can you discuss this in
more detail?
Matching up CRUSH rulesets and the EC plugin formulas is very
important and has been demonstrated to be difficult, but I don't really
understand what you're suggesting here, which makes me think it's not
quite the right idea. ;)

>
>> -  should the plug-in have the ability to select reconstruction based on proximity, or should this be up to the higher layer to provide chunks in a way that reconstruction would select the 'closest' layer? You will understand the relevance of the question better in the next point ....
>>
>> - I remember we had this 3 data centre example with (8,4) where you can reconstruct every object if 2 data centres are up. Another appealing example, avoiding remote access when reading an object, is to have 2 data centres each holding a replica of, e.g., (4,2) encoded objects. Can your LRC configuration language describe storing the same chunk twice, like __ABCCBA__?
>
> Unless I'm mistaken, that would require the caller of the plugin to support duplicate data chunks and provide a kind of proximity check. Since this is not currently supported by the OSD logic, it is difficult to figure out how an erasure code plugin could support this use case.

I haven't looked at the EC plugin interface at all, but I thought the
OSD told the plugin what chunks it could access, and the plugin tells
it which ones to fetch. So couldn't the plugin simply output duplicate
chunks, and not have the OSD retrieve both of them?
-Greg
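Greg's suggestion can be sketched like this. The chunk-to-position map and the function name are hypothetical, not the actual ErasureCodeInterface; the point is only that a plugin could return a single position per logical chunk and let the OSD skip the duplicates:

```python
# Sketch of a minimum_to_decode-style call for a placement that stores the
# same logical chunk at two positions, as in "__ABCCBA__".  All names here
# are illustrative, not the real Ceph EC plugin interface.

# Map each position to the logical chunk it holds: positions 2..7 hold
# A B C C B A, so position 5 duplicates 4, 6 duplicates 3, 7 duplicates 2.
logical = {2: "A", 3: "B", 4: "C", 5: "C", 6: "B", 7: "A"}

def minimum_to_fetch(want_logical, available_positions):
    """Pick one available position per wanted logical chunk, or None."""
    picked = []
    for chunk in want_logical:
        candidates = [p for p in sorted(available_positions)
                      if logical.get(p) == chunk]
        if not candidates:
            return None               # unrecoverable without this chunk
        picked.append(candidates[0])  # fetch a single copy, skip the rest
    return picked

print(minimum_to_fetch(["A", "B", "C"], {2, 3, 4, 5, 6, 7}))  # [2, 3, 4]
print(minimum_to_fetch(["A", "B", "C"], {5, 6, 7}))           # [7, 6, 5]
```

Even when one data centre (positions 2-4, say) is down, the surviving copies at positions 5-7 are enough, and only one copy per logical chunk is ever fetched.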


* Re: Locally repairable code description revisited (was Pyramid ...)
  2014-06-09 20:18       ` Gregory Farnum
@ 2014-06-09 20:38         ` Samuel Just
  2014-06-09 21:40           ` Loic Dachary
  0 siblings, 1 reply; 10+ messages in thread
From: Samuel Just @ 2014-06-09 20:38 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Loic Dachary, Andreas Joachim Peters, Ceph Development

I'm finding that I don't really understand how the LRC specification
works.  Is there a doc somewhere I can read?
-Sam

On Mon, Jun 9, 2014 at 1:18 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Jun 6, 2014 at 7:30 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Andreas,
>>
>> On 06/06/2014 13:46, Andreas Joachim Peters wrote:
>>> Hi Loic,
>>> the basic implementation looks very clean.
>>>
>>> I have a few comments/ideas:
>>>
>>> - the reconstruction strategy using the three levels is certainly efficient enough for standard cases, but it does not always guarantee minimal decoding (in cases where one layer is not enough to reconstruct), since your third algorithm is just brute force, reconstructing everything through all layers until we have what we need ...
>>
>> The third strategy is indeed brute force. Do you think it is worth changing it to be minimal? It would be nice to quantify the percentage of cases it addresses. Do you know how to do that? It looks like a very small percentage, but there is no proof it is small ;-)
>>
>>> - the whole LRC configuration actually does not describe the placement - it still looks disconnected from the placement strategy/crush rules ... wouldn't it make sense to have the crush rule implicit in the description, or a function to derive it automatically from the LRC configuration? Maybe you have already done this in another way and I didn't see it ...
>>
>> Good catch.
>>
>> What about this:
>>
>>       "  [ \"_aAAA_aAA_\", \"set choose datacenter 2\","
>>       "    \"_aXXX_aXX_\" ],"
>>       "  [ \"b_BBB_____\", \"set choose host 5\","
>>       "    \"baXXX_____\" ],"
>>       "  [ \"_____cCCC_\", \"\","
>>       "    \"baXXXcaXX_\" ],"
>>       "  [ \"_____DDDDd\", \"\","
>>       "    \"baXXXcaXXd\" ],"
>>
>> Which translates into
>>
>> take root
>> set choose datacenter 2
>> set choose host 5
>>
>> In other words, the ruleset is created by concatenating the strings from the description, without any kind of smart computation. It is up to the person who creates the description to place the ruleset next to a description that makes sense. There will be minimal checking to make sure the ruleset can actually be used to get the required number of chunks.
>>
>> It would probably be very difficult and very confusing to automate the generation of the ruleset. If it is implicit rather than explicit, as above, the operator will have to somehow understand and learn how it is computed to make sure it does what is desired. With an explicit set of crush rules loosely coupled to the chunk mapping, the operator can read the crush documentation instead of guessing.
>
> I think I'm missing some context for this discussion (maybe I haven't
> been reading other threads closely enough); can you discuss this in
> more detail?
> Matching up CRUSH rulesets and the EC plugin formulas is very
> important and has been demonstrated to be difficult, but I don't really
> understand what you're suggesting here, which makes me think it's not
> quite the right idea. ;)
>
>>
>>> -  should the plug-in have the ability to select reconstruction based on proximity, or should this be up to the higher layer to provide chunks in a way that reconstruction would select the 'closest' layer? You will understand the relevance of the question better in the next point ....
>>>
>>> - I remember we had this 3 data centre example with (8,4) where you can reconstruct every object if 2 data centres are up. Another appealing example, avoiding remote access when reading an object, is to have 2 data centres each holding a replica of, e.g., (4,2) encoded objects. Can your LRC configuration language describe storing the same chunk twice, like __ABCCBA__?
>>
>> Unless I'm mistaken, that would require the caller of the plugin to support duplicate data chunks and provide a kind of proximity check. Since this is not currently supported by the OSD logic, it is difficult to figure out how an erasure code plugin could support this use case.
>
> I haven't looked at the EC plugin interface at all, but I thought the
> OSD told the plugin what chunks it could access, and the plugin tells
> it which ones to fetch. So couldn't the plugin simply output duplicate
> chunks, and not have the OSD retrieve both of them?
> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Locally repairable code description revisited (was Pyramid ...)
  2014-06-09 20:38         ` Samuel Just
@ 2014-06-09 21:40           ` Loic Dachary
  0 siblings, 0 replies; 10+ messages in thread
From: Loic Dachary @ 2014-06-09 21:40 UTC (permalink / raw)
  To: Samuel Just, Gregory Farnum; +Cc: Ceph Development


Sam, Greg,

A simpler proposal is documented at:

    https://github.com/dachary/ceph/commit/ff11902bdc26aa35c70dd2f4d9de31f4cd207519#diff-5518964bc98a094a784ce2d17a5b0cc1R1

which is part of the proposed implementation for locally repairable code

    https://github.com/ceph/ceph/pull/1921

Hopefully it makes sense ;-)

Cheers

On 09/06/2014 22:38, Samuel Just wrote:
> I'm finding that I don't really understand how the LRC specification
> works.  Is there a doc somewhere I can read?
> -Sam
> 
> On Mon, Jun 9, 2014 at 1:18 PM, Gregory Farnum <greg@inktank.com> wrote:
>> On Fri, Jun 6, 2014 at 7:30 AM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi Andreas,
>>>
>>> On 06/06/2014 13:46, Andreas Joachim Peters wrote:
>>>> Hi Loic,
>>>> the basic implementation looks very clean.
>>>>
>>>> I have a few comments/ideas:
>>>>
>>>> - the reconstruction strategy using the three levels is certainly efficient enough for standard cases, but it does not always guarantee minimal decoding (in cases where one layer is not enough to reconstruct), since your third algorithm is just brute force, reconstructing everything through all layers until we have what we need ...
>>>
>>> The third strategy is indeed brute force. Do you think it is worth changing it to be minimal? It would be nice to quantify the percentage of cases it addresses. Do you know how to do that? It looks like a very small percentage, but there is no proof it is small ;-)
>>>
>>>> - the whole LRC configuration actually does not describe the placement - it still looks disconnected from the placement strategy/crush rules ... wouldn't it make sense to have the crush rule implicit in the description, or a function to derive it automatically from the LRC configuration? Maybe you have already done this in another way and I didn't see it ...
>>>
>>> Good catch.
>>>
>>> What about this:
>>>
>>>       "  [ \"_aAAA_aAA_\", \"set choose datacenter 2\","
>>>       "    \"_aXXX_aXX_\" ],"
>>>       "  [ \"b_BBB_____\", \"set choose host 5\","
>>>       "    \"baXXX_____\" ],"
>>>       "  [ \"_____cCCC_\", \"\","
>>>       "    \"baXXXcaXX_\" ],"
>>>       "  [ \"_____DDDDd\", \"\","
>>>       "    \"baXXXcaXXd\" ],"
>>>
>>> Which translates into
>>>
>>> take root
>>> set choose datacenter 2
>>> set choose host 5
>>>
>>> In other words, the ruleset is created by concatenating the strings from the description, without any kind of smart computation. It is up to the person who creates the description to place the ruleset next to a description that makes sense. There will be minimal checking to make sure the ruleset can actually be used to get the required number of chunks.
>>>
>>> It would probably be very difficult and very confusing to automate the generation of the ruleset. If it is implicit rather than explicit, as above, the operator will have to somehow understand and learn how it is computed to make sure it does what is desired. With an explicit set of crush rules loosely coupled to the chunk mapping, the operator can read the crush documentation instead of guessing.
>>
>> I think I'm missing some context for this discussion (maybe I haven't
>> been reading other threads closely enough); can you discuss this in
>> more detail?
>> Matching up CRUSH rulesets and the EC plugin formulas is very
>> important and has been demonstrated to be difficult, but I don't really
>> understand what you're suggesting here, which makes me think it's not
>> quite the right idea. ;)
>>
>>>
>>>> -  should the plug-in have the ability to select reconstruction based on proximity, or should this be up to the higher layer to provide chunks in a way that reconstruction would select the 'closest' layer? You will understand the relevance of the question better in the next point ....
>>>>
>>>> - I remember we had this 3 data centre example with (8,4) where you can reconstruct every object if 2 data centres are up. Another appealing example, avoiding remote access when reading an object, is to have 2 data centres each holding a replica of, e.g., (4,2) encoded objects. Can your LRC configuration language describe storing the same chunk twice, like __ABCCBA__?
>>>
>>> Unless I'm mistaken, that would require the caller of the plugin to support duplicate data chunks and provide a kind of proximity check. Since this is not currently supported by the OSD logic, it is difficult to figure out how an erasure code plugin could support this use case.
>>
>> I haven't looked at the EC plugin interface at all, but I thought the
>> OSD told the plugin what chunks it could access, and the plugin tells
>> it which ones to fetch. So couldn't the plugin simply output duplicate
>> chunks, and not have the OSD retrieve both of them?
>> -Greg
> 

-- 
Loïc Dachary, Artisan Logiciel Libre




end of thread, other threads:[~2014-06-09 21:40 UTC | newest]

Thread overview: 10+ messages
2014-05-31 17:10 Pyramid erasure code description revisited Loic Dachary
2014-06-02 12:20 ` Andreas Joachim Peters
2014-06-02 13:14   ` Loic Dachary
     [not found]     ` <1401733713.18379.YahooMailNeo@web165006.mail.bf1.yahoo.com>
2014-06-02 18:49       ` Loic Dachary
2014-06-05 14:05 ` Locally repairable code description revisited (was Pyramid ...) Loic Dachary
2014-06-06 11:46   ` Andreas Joachim Peters
2014-06-06 14:30     ` Loic Dachary
2014-06-09 20:18       ` Gregory Farnum
2014-06-09 20:38         ` Samuel Just
2014-06-09 21:40           ` Loic Dachary
