Comments on Ceph distributed parity implementation

All of lore.kernel.org
 help / color / mirror / Atom feed

* Comments on Ceph distributed parity implementation
@ 2013-06-14 20:13 Martin Flyvbjerg
  2013-06-14 20:29 ` Mark Nelson
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Martin Flyvbjerg @ 2013-06-14 20:13 UTC (permalink / raw)
  To: ceph-devel

Dear Community
I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.

I stumped upon the suggestion for adding erasure codes to Ceph, as
described in this article
http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

first I would like to say great initiative to add erasure codes to Ceph.
Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.

Why?
Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.

The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).

The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
Please see
https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions

The solution
Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.

The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic. 

Best Regards
Martin Flyvbjerg

[1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
[2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
[3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
[4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-14 20:13 Comments on Ceph distributed parity implementation Martin Flyvbjerg
@ 2013-06-14 20:29 ` Mark Nelson
  2013-06-14 21:05 ` Joe Buck
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2013-06-14 20:29 UTC (permalink / raw)
  To: Martin Flyvbjerg; +Cc: ceph-devel

On 06/14/2013 03:13 PM, Martin Flyvbjerg wrote:
> Dear Community
> I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.
>
> I stumped upon the suggestion for adding erasure codes to Ceph, as
> described in this article
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>
> first I would like to say great initiative to add erasure codes to Ceph.
> Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.
>
> Why?
> Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.
>
> The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).
>
> The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
> Please see
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
>
>
> The solution
> Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
> James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
>
> The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic.

Thanks for all the info and the links Martin!  It would be useful to sit 
down and figure out how much overhead we'd expect to see with each 
implementation.  I'm definitely in favour of looking at 
Cauchy-Reed-Solomon (and any specific SSE/SIMD implementations) assuming 
there's not a lot of additional complexity or other downsides.  I won't 
be doing the development work though, so it's easy for me to say that. ;)

Mark

>
> Best Regards
> Martin Flyvbjerg
>
> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-14 20:13 Comments on Ceph distributed parity implementation Martin Flyvbjerg
  2013-06-14 20:29 ` Mark Nelson
@ 2013-06-14 21:05 ` Joe Buck
  2013-06-14 22:57 ` Loic Dachary
  2013-06-15  7:30 ` Loic Dachary
  3 siblings, 0 replies; 23+ messages in thread
From: Joe Buck @ 2013-06-14 21:05 UTC (permalink / raw)
  To: Martin Flyvbjerg; +Cc: ceph-devel

A recent paper out of Microsoft had an interesting twist on erasure 
coding (separately encoding each half and then the entire thing once, 
rather than the entire unit multiple times).
http://research.microsoft.com/en-us/um/people/yekhanin/papers/usenixatc_2012.pdf

it doesn't get at the math itself, but is something worth considering.

Best,
-Joe Buck

On 06/14/2013 01:13 PM, Martin Flyvbjerg wrote:
> Dear Community
> I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.
>
> I stumped upon the suggestion for adding erasure codes to Ceph, as
> described in this article
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>
> first I would like to say great initiative to add erasure codes to Ceph.
> Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.
>
> Why?
> Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.
>
> The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).
>
> The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
> Please see
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
>
>
> The solution
> Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
> James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
>
> The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic.
>
> Best Regards
> Martin Flyvbjerg
>
> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-14 20:13 Comments on Ceph distributed parity implementation Martin Flyvbjerg
  2013-06-14 20:29 ` Mark Nelson
  2013-06-14 21:05 ` Joe Buck
@ 2013-06-14 22:57 ` Loic Dachary
  2013-06-15  1:12   ` Paul Von-Stamwitz
  2013-06-15  7:30 ` Loic Dachary
  3 siblings, 1 reply; 23+ messages in thread
From: Loic Dachary @ 2013-06-14 22:57 UTC (permalink / raw)
  To: Martin Flyvbjerg; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5505 bytes --]

Hi Martin,

Your explanations are very helpful to better understand the tradeoffs of the existing implementations. To be honest I was looking forward to your intervention. Not you specifically, of course :-) But someone with a good theoretical background to be a judge of what's best in the context of Ceph. If you say it's the upcoming library to be released in August 2013, I'll take your word for it. 

The work currently being done within Ceph is to architecture to storage backend ( namely placement groups ) to make room for distributed parity. My initial idea was to isolate the low level library under an API that takes a region ( 16KB for instance, as in gf_unit.c found in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ) as input and outputs chunks that can then be written on different hosts. For instance

	encode(char* region, char** chuncks) => encode the region into N chuncks
	decode(char** chunks, char* region) => decode the N chuncks into a region
        repair(char** chunks, int damaged) => repair the damaged chunck

Do you think it is a sensible approach ? And if you do, will I find examples of such higher level functions in http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ? Or elsewhere ?

I'm a little confused about the relation between GF complete ( as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ) which is very recent ( 2013 ) and Jerasure ( as found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar )  which is comparatively older ( 2008 ). Do you know how Jerasure 2.0 relates to GF complete ? 

For completeness, here is a thread with pointers to Mojette Transform that's being used as part of Rozofs.

http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14666.html

I'm not able to compare it with the other libraries because it seems to take a completely different approach. Do you have an opinion about it ?

As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month but I'd love to understand more about this as soon as possible :-)

Cheers

P.S. Updated http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html for the record

On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> Dear Community
> I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.
> 
> I stumped upon the suggestion for adding erasure codes to Ceph, as
> described in this article
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> 
> first I would like to say great initiative to add erasure codes to Ceph.
> Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.
> 
> Why?
> Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.
> 
> The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).
> 
> The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
> Please see
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> 
> 
> The solution
> Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
> James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
> 
> The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic. 
> 
> Best Regards
> Martin Flyvbjerg
> 
> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Comments on Ceph distributed parity implementation
  2013-06-14 22:57 ` Loic Dachary
@ 2013-06-15  1:12   ` Paul Von-Stamwitz
  2013-06-15  6:51     ` Loic Dachary
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Paul Von-Stamwitz @ 2013-06-15  1:12 UTC (permalink / raw)
  To: Loic Dachary, Martin Flyvbjerg; +Cc: ceph-devel@vger.kernel.org

Hi Loic and Martin,

This is a great discussion and I agree that the performance ramifications of erasure coding need to be thought out very carefully. Since chunks are not only encoded, but distributed across the cluster, we need to pay attention to the network overhead as well as the arithmetic involved in the encoding/decoding.

If I understand the proposal correctly, objects begin their life's journey replicated as normal. As it grows cold, it gets transformed to an encode PG in another pool. Subsequent reads will be redirected (ehh). Subsequent writes will first decode the original object and re-replicate it (ouch!). Client writes never are encoded on the fly; they are always replicated (nice).

So...
encode() is run as a low-priority background process probably once a week during deep scrubs.
decode() should be rare (if not, the object shouldn't have been encoded in the first place.) If the cluster is healthy no arithmetic is needed, just concatenation, but a lot of network activity.
repair() operations will be the most prevalent and may occur at any time during normal self-healing/rebalancing operations.

Therefore, in my opinion, the algorithm that we choose must be optimized for repairing damaged and missing chunks. The main problem I have with Reed-Solomon is that it uses MDS codes which maximize network activity for recalculations. Pyramid codes have the same write (encode) overhead, but have better read (repair) overhead.

Loic, I know nothing about Mojette Transforms. From what little I gleaned, it might be good for repair (needing only a subset of chunks within a range to recalculate a missing chunk) but I'm worried about the storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.

All the best,
Paul

On 06/14/2013 3:57 PM, Loic Dachary wrote:
> Hi Martin,
> 
> Your explanations are very helpful to better understand the tradeoffs of
> the existing implementations. To be honest I was looking forward to your
> intervention. Not you specifically, of course :-) But someone with a good
> theoretical background to be a judge of what's best in the context of Ceph.
> If you say it's the upcoming library to be released in August 2013, I'll
> take your word for it.
> 
> The work currently being done within Ceph is to architecture to storage
> backend ( namely placement groups ) to make room for distributed parity.
> My initial idea was to isolate the low level library under an API that
> takes a region ( 16KB for instance, as in gf_unit.c found in
> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> 703/gf_complete_0.1.tar ) as input and outputs chunks that can then be
> written on different hosts. For instance
> 
> 	encode(char* region, char** chuncks) => encode the region into N
> chuncks
> 	decode(char** chunks, char* region) => decode the N chuncks into a
> region
>         repair(char** chunks, int damaged) => repair the damaged chunck
> 
> Do you think it is a sensible approach ? And if you do, will I find
> examples of such higher level functions in
> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> 703/gf_complete_0.1.tar ? Or elsewhere ?
> 
> I'm a little confused about the relation between GF complete ( as found at
> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> 703/gf_complete_0.1.tar ) which is very recent ( 2013 ) and Jerasure ( as
> found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-
> 1.2.tar )  which is comparatively older ( 2008 ). Do you know how Jerasure
> 2.0 relates to GF complete ?
> 
> For completeness, here is a thread with pointers to Mojette Transform
> that's being used as part of Rozofs.
> 
> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14666.html
> 
> I'm not able to compare it with the other libraries because it seems to
> take a completely different approach. Do you have an opinion about it ?
> 
> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month
> but I'd love to understand more about this as soon as possible :-)
> 
> Cheers
> 
> P.S. Updated
> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_
> a_storage_backend#Erasure_Encoded_Storage with a link to
> http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
> 
> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> > Dear Community
> > I am a young engineer (not software or math, please bare with me) with
> some suggestions regarding erasure codes. I never used Ceph before or any
> other distributed file system.
> >
> > I stumped upon the suggestion for adding erasure codes to Ceph, as
> > described in this article
> >
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_
> a_storage_backend
> >
> > first I would like to say great initiative to add erasure codes to Ceph.
> > Ceph needs its own implementation and it have to be done right, I cannot
> stress this enough, suggested software mentioned in that article would
> result in very low performance.
> >
> > Why?
> > Reed-Solomon is normally something regarded as being very slow compared
> to other erasure codes, because the underlying Galois-Field multiplication
> is slow. Please see video at usenix.org forexplanation.
> >
> > The implementations of Zfec library and other suggested software the
> others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon
> erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon
> implementation. Please see [1,2,3]
> > Although there is something even better, by using the Intel SSE2/3 SIMD
> instructions it is possible to do the as fast as any other XOR based
> erasure codes (RaptorQ LT-codes, LDPC etc.).
> >
> > The suggested FECpp lib uses the optimisation but with a relative small
> Galois-field only 2^8, since Ceph aimes at unlimited scalability
> increasing the size of the Galois-Field would improve performance [4]. Of
> course the configured Ceph Object Size and/or Stripe width have to be
> taken into account.
> > Please see
> > https://www.usenix.org/conference/fast13/screaming-fast-galois-field-
> arithmetic-using-sse2-extensions
> >
> >
> > The solution
> > Using the GF-Complete open source library [4] to implement Reed-Solomon
> in Ceph in order to allow Ceph to scale to infinity.
> > James S. Plank the author of GF-complete have developed a library
> implementing various Reed-Solomon codes called Jerasure.
> http://web.eecs.utk.edu/~plank/plank/www/software.html
> > Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD
> instructions, is current in development expected release august 2013. Will
> be released under the new BSD license. Jerasure 2.0 also supports
> arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
> >
> > The limit of this implementation would be the processors L2/L3 cache not
> the underlying arithmetic.
> >
> > Best Regards
> > Martin Flyvbjerg
> >
> > [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> > [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> > [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> > [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do
> nothing.
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-15  1:12   ` Paul Von-Stamwitz
@ 2013-06-15  6:51     ` Loic Dachary
  2013-06-16 19:51     ` Benoît Parrein
       [not found]     ` <C395B77B849187439280E1CF5FE1F2FA8990491B@G9W0337.americas.hpqcorp.net>
  2 siblings, 0 replies; 23+ messages in thread
From: Loic Dachary @ 2013-06-15  6:51 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: Martin Flyvbjerg, ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 8106 bytes --]



On 06/15/2013 03:12 AM, Paul Von-Stamwitz wrote:
> Hi Loic and Martin,
> 
> This is a great discussion and I agree that the performance ramifications of erasure coding need to be thought out very carefully. Since chunks are not only encoded, but distributed across the cluster, we need to pay attention to the network overhead as well as the arithmetic involved in the encoding/decoding.
> 
> If I understand the proposal correctly, objects begin their life's journey replicated as normal. As it grows cold, it gets transformed to an encode PG in another pool. Subsequent reads will be redirected (ehh). Subsequent writes will first decode the original object and re-replicate it (ouch!). Client writes never are encoded on the fly; they are always replicated (nice).

Hi Paul,

The first step is to have an encoded PG : it could be useful on its own. 

> So...
> encode() is run as a low-priority background process probably once a week during deep scrubs.
> decode() should be rare (if not, the object shouldn't have been encoded in the first place.) If the cluster is healthy no arithmetic is needed, just concatenation, but a lot of network activity.
> repair() operations will be the most prevalent and may occur at any time during normal self-healing/rebalancing operations.
> 
> Therefore, in my opinion, the algorithm that we choose must be optimized for repairing damaged and missing chunks. The main problem I have with Reed-Solomon is that it uses MDS codes which maximize network activity for recalculations. Pyramid codes have the same write (encode) overhead, but have better read (repair) overhead.

Which library do you typically use ? I would love to hear about the use cases you worked on.

Cheers

> Loic, I know nothing about Mojette Transforms. From what little I gleaned, it might be good for repair (needing only a subset of chunks within a range to recalculate a missing chunk) but I'm worried about the storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
> 
> All the best,
> Paul
> 
> On 06/14/2013 3:57 PM, Loic Dachary wrote:
>> Hi Martin,
>>
>> Your explanations are very helpful to better understand the tradeoffs of
>> the existing implementations. To be honest I was looking forward to your
>> intervention. Not you specifically, of course :-) But someone with a good
>> theoretical background to be a judge of what's best in the context of Ceph.
>> If you say it's the upcoming library to be released in August 2013, I'll
>> take your word for it.
>>
>> The work currently being done within Ceph is to architecture to storage
>> backend ( namely placement groups ) to make room for distributed parity.
>> My initial idea was to isolate the low level library under an API that
>> takes a region ( 16KB for instance, as in gf_unit.c found in
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>> 703/gf_complete_0.1.tar ) as input and outputs chunks that can then be
>> written on different hosts. For instance
>>
>> 	encode(char* region, char** chuncks) => encode the region into N
>> chuncks
>> 	decode(char** chunks, char* region) => decode the N chuncks into a
>> region
>>         repair(char** chunks, int damaged) => repair the damaged chunck
>>
>> Do you think it is a sensible approach ? And if you do, will I find
>> examples of such higher level functions in
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>> 703/gf_complete_0.1.tar ? Or elsewhere ?
>>
>> I'm a little confused about the relation between GF complete ( as found at
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>> 703/gf_complete_0.1.tar ) which is very recent ( 2013 ) and Jerasure ( as
>> found at http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-
>> 1.2.tar )  which is comparatively older ( 2008 ). Do you know how Jerasure
>> 2.0 relates to GF complete ?
>>
>> For completeness, here is a thread with pointers to Mojette Transform
>> that's being used as part of Rozofs.
>>
>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14666.html
>>
>> I'm not able to compare it with the other libraries because it seems to
>> take a completely different approach. Do you have an opinion about it ?
>>
>> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month
>> but I'd love to understand more about this as soon as possible :-)
>>
>> Cheers
>>
>> P.S. Updated
>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_
>> a_storage_backend#Erasure_Encoded_Storage with a link to
>> http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
>>
>> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
>>> Dear Community
>>> I am a young engineer (not software or math, please bare with me) with
>> some suggestions regarding erasure codes. I never used Ceph before or any
>> other distributed file system.
>>>
>>> I stumped upon the suggestion for adding erasure codes to Ceph, as
>>> described in this article
>>>
>> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_
>> a_storage_backend
>>>
>>> first I would like to say great initiative to add erasure codes to Ceph.
>>> Ceph needs its own implementation and it have to be done right, I cannot
>> stress this enough, suggested software mentioned in that article would
>> result in very low performance.
>>>
>>> Why?
>>> Reed-Solomon is normally something regarded as being very slow compared
>> to other erasure codes, because the underlying Galois-Field multiplication
>> is slow. Please see video at usenix.org forexplanation.
>>>
>>> The implementations of Zfec library and other suggested software the
>> others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon
>> erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon
>> implementation. Please see [1,2,3]
>>> Although there is something even better, by using the Intel SSE2/3 SIMD
>> instructions it is possible to do the as fast as any other XOR based
>> erasure codes (RaptorQ LT-codes, LDPC etc.).
>>>
>>> The suggested FECpp lib uses the optimisation but with a relative small
>> Galois-field only 2^8, since Ceph aimes at unlimited scalability
>> increasing the size of the Galois-Field would improve performance [4]. Of
>> course the configured Ceph Object Size and/or Stripe width have to be
>> taken into account.
>>> Please see
>>> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-
>> arithmetic-using-sse2-extensions
>>>
>>>
>>> The solution
>>> Using the GF-Complete open source library [4] to implement Reed-Solomon
>> in Ceph in order to allow Ceph to scale to infinity.
>>> James S. Plank the author of GF-complete have developed a library
>> implementing various Reed-Solomon codes called Jerasure.
>> http://web.eecs.utk.edu/~plank/plank/www/software.html
>>> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD
>> instructions, is current in development expected release august 2013. Will
>> be released under the new BSD license. Jerasure 2.0 also supports
>> arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
>>>
>>> The limit of this implementation would be the processors L2/L3 cache not
>> the underlying arithmetic.
>>>
>>> Best Regards
>>> Martin Flyvbjerg
>>>
>>> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
>>> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
>>> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
>>> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do
>> nothing.
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-14 20:13 Comments on Ceph distributed parity implementation Martin Flyvbjerg
                   ` (2 preceding siblings ...)
  2013-06-14 22:57 ` Loic Dachary
@ 2013-06-15  7:30 ` Loic Dachary
  2013-06-15  9:40   ` Leen Besselink
  3 siblings, 1 reply; 23+ messages in thread
From: Loic Dachary @ 2013-06-15  7:30 UTC (permalink / raw)
  To: Martin Flyvbjerg; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3441 bytes --]

Hi Martin,

Just listened to

https://2459d6dc103cb5933875-c0245c5c937c5dedcca3f1764ecc9b2f.ssl.cf2.rackcdn.com/fast13/plank_screaming.mp4
from
https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions

and it's very entertaining ;-)

Cheers


On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> Dear Community
> I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.
> 
> I stumped upon the suggestion for adding erasure codes to Ceph, as
> described in this article
> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> 
> first I would like to say great initiative to add erasure codes to Ceph.
> Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.
> 
> Why?
> Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.
> 
> The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).
> 
> The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
> Please see
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> 
> 
> The solution
> Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
> James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
> 
> The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic. 
> 
> Best Regards
> Martin Flyvbjerg
> 
> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-15  7:30 ` Loic Dachary
@ 2013-06-15  9:40   ` Leen Besselink
  0 siblings, 0 replies; 23+ messages in thread
From: Leen Besselink @ 2013-06-15  9:40 UTC (permalink / raw)
  To: ceph-devel

On Sat, Jun 15, 2013 at 09:30:27AM +0200, Loic Dachary wrote:
> Hi Martin,
> 
> Just listened to
> 
> https://2459d6dc103cb5933875-c0245c5c937c5dedcca3f1764ecc9b2f.ssl.cf2.rackcdn.com/fast13/plank_screaming.mp4
> from
> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> 
> and it's very entertaining ;-)
> 

If video is what you like the Microsoft presentation of the previously mentioned PDF is here:

https://www.usenix.org/conference/usenixfederatedconferencesweek/high-performance-vehicular-connectivity-opportunistic

They also talk about why they made their choices. I guess that it's also in the PDF.

> Cheers
> 
> 
> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> > Dear Community
> > I am a young engineer (not software or math, please bare with me) with some suggestions regarding erasure codes. I never used Ceph before or any other distributed file system.
> > 
> > I stumped upon the suggestion for adding erasure codes to Ceph, as
> > described in this article
> > http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
> > 
> > first I would like to say great initiative to add erasure codes to Ceph.
> > Ceph needs its own implementation and it have to be done right, I cannot stress this enough, suggested software mentioned in that article would result in very low performance.
> > 
> > Why?
> > Reed-Solomon is normally something regarded as being very slow compared to other erasure codes, because the underlying Galois-Field multiplication is slow. Please see video at usenix.org forexplanation.
> > 
> > The implementations of Zfec library and other suggested software the others rely on the Vandermonde matrix, a matrix used in in Reed-Solomon erasure codes, a faster approach would be to use the Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> > Although there is something even better, by using the Intel SSE2/3 SIMD instructions it is possible to do the as fast as any other XOR based erasure codes (RaptorQ LT-codes, LDPC etc.).
> > 
> > The suggested FECpp lib uses the optimisation but with a relative small Galois-field only 2^8, since Ceph aimes at unlimited scalability increasing the size of the Galois-Field would improve performance [4]. Of course the configured Ceph Object Size and/or Stripe width have to be taken into account.
> > Please see
> > https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
> > 
> > 
> > The solution
> > Using the GF-Complete open source library [4] to implement Reed-Solomon in Ceph in order to allow Ceph to scale to infinity.
> > James S. Plank the author of GF-complete have developed a library implementing various Reed-Solomon codes called Jerasure. http://web.eecs.utk.edu/~plank/plank/www/software.html
> > Jerasure 2.0 using the GF-complete artimetric based in Intel SSE SIMD instructions, is current in development expected release august 2013. Will be released under the new BSD license. Jerasure 2.0 also supports arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
> > 
> > The limit of this implementation would be the processors L2/L3 cache not the underlying arithmetic. 
> > 
> > Best Regards
> > Martin Flyvbjerg
> > 
> > [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> > [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> > [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> > [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-15  1:12   ` Paul Von-Stamwitz
  2013-06-15  6:51     ` Loic Dachary
@ 2013-06-16 19:51     ` Benoît Parrein
  2013-06-16 21:31       ` Loic Dachary
       [not found]     ` <C395B77B849187439280E1CF5FE1F2FA8990491B@G9W0337.americas.hpqcorp.net>
  2 siblings, 1 reply; 23+ messages in thread
From: Benoît Parrein @ 2013-06-16 19:51 UTC (permalink / raw)
  To: ceph-devel

Paul Von-Stamwitz <PVonStamwitz <at> us.fujitsu.com> writes:

Hi Paul, 

> 
> Loic, I know nothing about Mojette Transforms. From what little I gleaned, 
it might be good for repair
> (needing only a subset of chunks within a range to recalculate a missing 
chunk) but I'm worried about the
> storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
> 
> All the best,
> Paul
> 

If you want to do better than that you will probably lose in availability. 
1.5x give the same availability than 3 replicats and that for any kind of 
erasure coding.
FYI, Mojette transform has no constraint in terms of Galois fields. It is the 
big advantage to use discrete geometry rather than algebra.

best regards,
bp




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-16 19:51     ` Benoît Parrein
@ 2013-06-16 21:31       ` Loic Dachary
  2013-06-17 16:48         ` Benoît Parrein
  2013-06-17 16:55         ` Paul Von-Stamwitz
  0 siblings, 2 replies; 23+ messages in thread
From: Loic Dachary @ 2013-06-16 21:31 UTC (permalink / raw)
  To: Benoît Parrein; +Cc: ceph-devel, James S. Plank

[-- Attachment #1: Type: text/plain, Size: 1752 bytes --]

Hi Benoît,

From the ( naïve ) point of view of engineering, performances are important. The recent works of James Plank ( cc'ed ) greatly improved them
 and I'm looking forward to the next version of jerasure ( http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html ). Rozofs Mojette Transform implementation ( https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.h & https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.cc ) does not seem to make use of SIMD. Is it because the performances are good enough to not require them ?

Cheers

On 06/16/2013 09:51 PM, Benoît Parrein wrote:
> Paul Von-Stamwitz <PVonStamwitz <at> us.fujitsu.com> writes:
> 
> Hi Paul, 
> 
>>
>> Loic, I know nothing about Mojette Transforms. From what little I gleaned, 
> it might be good for repair
>> (needing only a subset of chunks within a range to recalculate a missing 
> chunk) but I'm worried about the
>> storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>>
>> All the best,
>> Paul
>>
> 
> If you want to do better than that you will probably lose in availability. 
> 1.5x give the same availability than 3 replicats and that for any kind of 
> erasure coding.
> FYI, Mojette transform has no constraint in terms of Galois fields. It is the 
> big advantage to use discrete geometry rather than algebra.
> 
> best regards,
> bp
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-16 21:31       ` Loic Dachary
@ 2013-06-17 16:48         ` Benoît Parrein
  2013-06-17 16:55         ` Paul Von-Stamwitz
  1 sibling, 0 replies; 23+ messages in thread
From: Benoît Parrein @ 2013-06-17 16:48 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel, James S. Plank

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

Hi Loïc,

totaly agree about performances. honestly, we don't experiment yet any 
trouble in our erasure coding implementation (which is just software 
optimized for the moment). the bottleneck is more located at the network 
(basically at 1Gbit/s) and at the disks i/o levels. and, I guess it is 
the same conclusion for well optimized libraries like Jerasure.
to continue about Mojette transform, this erasure code is in fact 
(1+epsilon)MDS with a linear complexity both in coding and decoding (we 
just need addition and subtraction in any field). furthermore, it is 
well implemented in RozoFS. th epsilon (the overhead) is very negligible 
for normal use cases.

best,
bp


Le 16/06/2013 23:31, Loic Dachary a écrit :
> Hi Benoît,
>
>  From the ( naïve ) point of view of engineering, performances are important. The recent works of James Plank ( cc'ed ) greatly improved them
>   and I'm looking forward to the next version of jerasure ( http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html ). Rozofs Mojette Transform implementation ( https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.h & https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.cc ) does not seem to make use of SIMD. Is it because the performances are good enough to not require them ?
>
> Cheers
>
> On 06/16/2013 09:51 PM, Benoît Parrein wrote:
>> Paul Von-Stamwitz <PVonStamwitz <at> us.fujitsu.com> writes:
>>
>> Hi Paul,
>>
>>> Loic, I know nothing about Mojette Transforms. From what little I gleaned,
>> it might be good for repair
>>> (needing only a subset of chunks within a range to recalculate a missing
>> chunk) but I'm worried about the
>>> storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>>>
>>> All the best,
>>> Paul
>>>
>> If you want to do better than that you will probably lose in availability.
>> 1.5x give the same availability than 3 replicats and that for any kind of
>> erasure coding.
>> FYI, Mojette transform has no constraint in terms of Galois fields. It is the
>> big advantage to use discrete geometry rather than algebra.
>>
>> best regards,
>> bp
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: benoit_parrein.vcf --]
[-- Type: text/x-vcard, Size: 402 bytes --]

begin:vcard
fn;quoted-printable:Beno=C3=AEt Parrein
n;quoted-printable:Parrein;Beno=C3=AEt
org;quoted-printable:Polytech Nantes, Universit=C3=A9 de Nantes, IRCCyN lab
adr;quoted-printable:;;=C3=A9quipe Image Vid=C3=A9o Communication;;;;France
email;internet:benoit.parrein@polytech.univ-nantes.fr
title;quoted-printable:Ma=C3=AEtre de Conf=C3=A9rences
tel;work:+33 2 40 68 30 50
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Comments on Ceph distributed parity implementation
  2013-06-16 21:31       ` Loic Dachary
  2013-06-17 16:48         ` Benoît Parrein
@ 2013-06-17 16:55         ` Paul Von-Stamwitz
  2013-06-18  7:44           ` Benoît Parrein
  1 sibling, 1 reply; 23+ messages in thread
From: Paul Von-Stamwitz @ 2013-06-17 16:55 UTC (permalink / raw)
  To: Loic Dachary, Benoît Parrein
  Cc: ceph-devel@vger.kernel.org, James S. Plank

Loic,

As Benoit points out, Mojette uses discrete geometry rather than algebra, so simple XOR is all that is needed.

Benoit,

Microsoft's paper states that their [12,2,2] LRC provides better availability than 3x replication with 1.33x efficiency. 1.5x is certainly a good number. I'm just pointing out that better efficiency can be had without losing availibity.

All the best,
Paul

On 6/16/2013 02:31 PM Loic Dachary wrote:
> Hi Benoît,
> 
> From the ( naïve ) point of view of engineering, performances are
> important. The recent works of James Plank ( cc'ed ) greatly improved them
>  and I'm looking forward to the next version of jerasure
> ( http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html ). Rozofs
> Mojette Transform implementation
> ( https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.h &
> https://github.com/rozofs/rozofs/blob/master/rozofs/common/transform.cc )
> does not seem to make use of SIMD. Is it because the performances are good
> enough to not require them ?
> 
> Cheers
> 
> On 06/16/2013 09:51 PM, Benoît Parrein wrote:
> > Paul Von-Stamwitz <PVonStamwitz <at> us.fujitsu.com> writes:
> >
> > Hi Paul,
> >
> >>
> >> Loic, I know nothing about Mojette Transforms. From what little I
> gleaned,
> > it might be good for repair
> >> (needing only a subset of chunks within a range to recalculate a
> missing
> > chunk) but I'm worried about the
> >> storage efficiency. RozoFS claims 1.5x. I'd like to do better than that.
> >>
> >> All the best,
> >> Paul
> >>
> >
> > If you want to do better than that you will probably lose in
> availability.
> > 1.5x give the same availability than 3 replicats and that for any kind
> of
> > erasure coding.
> > FYI, Mojette transform has no constraint in terms of Galois fields. It
> is the
> > big advantage to use discrete geometry rather than algebra.
> >
> > best regards,
> > bp
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do
> nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-17 16:55         ` Paul Von-Stamwitz
@ 2013-06-18  7:44           ` Benoît Parrein
  2013-06-18 14:22             ` James Plank
  0 siblings, 1 reply; 23+ messages in thread
From: Benoît Parrein @ 2013-06-18  7:44 UTC (permalink / raw)
  To: Paul Von-Stamwitz
  Cc: Loic Dachary, ceph-devel@vger.kernel.org, James S. Plank

[-- Attachment #1: Type: text/plain, Size: 988 bytes --]

Hi Paul,

thank you for your message

from my point, LRC focuses on the repairing problem. how to reconstruct 
destroyed node to maintain the same availability by the distributed system?
in this context they can even go below 1x rate by introducing local 
parity on classical Reed Solomon blocks (but they pay a supplementary 
overhead). see excellent Alex Dimakis's papers for that. but, still from 
my point, the same relationship between redundancy and availability 
occurs (if you consider binomial model for your loses).

best
bp


Le 17/06/2013 18:55, Paul Von-Stamwitz a écrit :
> Loic,
>
> As Benoit points out, Mojette uses discrete geometry rather than algebra, so simple XOR is all that is needed.
>
> Benoit,
>
> Microsoft's paper states that their [12,2,2] LRC provides better availability than 3x replication with 1.33x efficiency. 1.5x is certainly a good number. I'm just pointing out that better efficiency can be had without losing availibity.
>
> All the best,
> Paul


[-- Attachment #2: benoit_parrein.vcf --]
[-- Type: text/x-vcard, Size: 402 bytes --]

begin:vcard
fn;quoted-printable:Beno=C3=AEt Parrein
n;quoted-printable:Parrein;Beno=C3=AEt
org;quoted-printable:Polytech Nantes, Universit=C3=A9 de Nantes, IRCCyN lab
adr;quoted-printable:;;=C3=A9quipe Image Vid=C3=A9o Communication;;;;France
email;internet:benoit.parrein@polytech.univ-nantes.fr
title;quoted-printable:Ma=C3=AEtre de Conf=C3=A9rences
tel;work:+33 2 40 68 30 50
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-18  7:44           ` Benoît Parrein
@ 2013-06-18 14:22             ` James Plank
  2013-06-19  1:35               ` Paul Von-Stamwitz
  2013-06-20 18:25               ` Loic Dachary
  0 siblings, 2 replies; 23+ messages in thread
From: James Plank @ 2013-06-18 14:22 UTC (permalink / raw)
  To: Benoît Parrein
  Cc: Paul Von-Stamwitz, Loic Dachary, ceph-devel@vger.kernel.org

Hi all -- thank you for including me on this thread, although I have little substantive to add.  At the moment, my sole focus is finishing a journal paper about GF implementations, with a concomitant GF-complete release to accompany it.  I agree that the CPU burden of the GF arithmetic will not be a bottleneck in your system, regardless of which implementation you use, as long as you stay at or below GF(2^16).  If you want to go higher, GF-complete will help.  When we put out a new release (the code will be ready within two weeks, however, the documentation is lagging), I'll let you know.  I think LRC is a nice coding paradigm, although I imagine that it has IP issues with Microsoft.  I don't have first-hand experience with network/regenerating codes, and I'll be honest -- there have been so many papers in that realm that I am not up to date on them.

Is there a question on which you'd like some help?  It sounds as though you are at two decision points: What code should you use, and at which point on the space-overhead/fault-tolerance curve would you like to be?

Best wishes,

Jim
----------

On Jun 18, 2013, at 3:44 AM, Benoît Parrein wrote:

> Hi Paul,
> 
> thank you for your message
> 
> from my point, LRC focuses on the repairing problem. how to reconstruct destroyed node to maintain the same availability by the distributed system?
> in this context they can even go below 1x rate by introducing local parity on classical Reed Solomon blocks (but they pay a supplementary overhead). see excellent Alex Dimakis's papers for that. but, still from my point, the same relationship between redundancy and availability occurs (if you consider binomial model for your loses).
> 
> best
> bp
> 
> 
> Le 17/06/2013 18:55, Paul Von-Stamwitz a écrit :
>> Loic,
>> 
>> As Benoit points out, Mojette uses discrete geometry rather than algebra, so simple XOR is all that is needed.
>> 
>> Benoit,
>> 
>> Microsoft's paper states that their [12,2,2] LRC provides better availability than 3x replication with 1.33x efficiency. 1.5x is certainly a good number. I'm just pointing out that better efficiency can be had without losing availibity.
>> 
>> All the best,
>> Paul
> 
> <benoit_parrein.vcf>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
       [not found]         ` <CAJOObicNGkweZbVSR-V8NA9YXaZucUpNm0y8Ph3X7EkE=pRG5g@mail.gmail.com>
@ 2013-06-18 14:31           ` Harvey Skinner
  2013-06-18 15:46             ` Loic Dachary
  0 siblings, 1 reply; 23+ messages in thread
From: Harvey Skinner @ 2013-06-18 14:31 UTC (permalink / raw)
  To: ceph-devel

all,


nice discussion on implementing  erasure-encoding as an option in Ceph for
object store.   THis will be a great feature option.   I would recommend
that a good default be implemented  as being discussed here, but also that
an API or plug-in be architected at the same time so the deployment could
also use some other form of parity/erasure-encoding if required/desired.
Governments and other vendors will push/convince (or have hardcoded in an
RFP) that some other algorithm be used and allowing an API/plug-in
capability will allow Ceph to still be applicable. There are some
proprietary object store solutions today which allow other
erasure-encode/parity plug-ins which you will want to be competitive with.

I would see initial deployments of erasure-encoded object store be used for
static data objects; block device usage would come later if proved to be as
performant as replicated objects.   Replication as done now in Ceph may
still be a better block storage solution choice for DR strategies,
configuring replicas to be at different DR site(s).

just my view of course …

Harvey


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Paul Von-Stamwitz
> Sent: Friday, June 14, 2013 6:12 PM
> To: Loic Dachary; Martin Flyvbjerg
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: Comments on Ceph distributed parity implementation
>
> Hi Loic and Martin,
>
> This is a great discussion and I agree that the performance ramifications
> of erasure coding need to be thought out very carefully. Since chunks are
> not only encoded, but distributed across the cluster, we need to pay
> attention to the network overhead as well as the arithmetic involved in the
> encoding/decoding.
>
> If I understand the proposal correctly, objects begin their life's journey
> replicated as normal. As it grows cold, it gets transformed to an encode PG
> in another pool. Subsequent reads will be redirected (ehh). Subsequent
> writes will first decode the original object and re-replicate it (ouch!).
> Client writes never are encoded on the fly; they are always replicated
> (nice).
>
> So...
> encode() is run as a low-priority background process probably once a week
> during deep scrubs.
> decode() should be rare (if not, the object shouldn't have been encoded in
> the first place.) If the cluster is healthy no arithmetic is needed, just
> concatenation, but a lot of network activity.
> repair() operations will be the most prevalent and may occur at any time
> during normal self-healing/rebalancing operations.
>
> Therefore, in my opinion, the algorithm that we choose must be optimized
> for repairing damaged and missing chunks. The main problem I have with
> Reed-Solomon is that it uses MDS codes which maximize network activity for
> recalculations. Pyramid codes have the same write (encode) overhead, but
> have better read (repair) overhead.
>
> Loic, I know nothing about Mojette Transforms. From what little I gleaned,
> it might be good for repair (needing only a subset of chunks within a range
> to recalculate a missing chunk) but I'm worried about the storage
> efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>
> All the best,
> Paul
>
> On 06/14/2013 3:57 PM, Loic Dachary wrote:
> > Hi Martin,
> >
> > Your explanations are very helpful to better understand the tradeoffs
> > of the existing implementations. To be honest I was looking forward to
> > your intervention. Not you specifically, of course :-) But someone
> > with a good theoretical background to be a judge of what's best in the
> > context of Ceph.
> > If you say it's the upcoming library to be released in August 2013,
> > I'll take your word for it.
> >
> > The work currently being done within Ceph is to architecture to
> > storage backend ( namely placement groups ) to make room for distributed
> > parity.
> > My initial idea was to isolate the low level library under an API that
> > takes a region ( 16KB for instance, as in gf_unit.c found in
> > http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> > 703/gf_complete_0.1.tar ) as input and outputs chunks that can then be
> > written on different hosts. For instance
> >
> >       encode(char* region, char** chuncks) => encode the region into N
> > chuncks
> >       decode(char** chunks, char* region) => decode the N chuncks into a
> > region
> >         repair(char** chunks, int damaged) => repair the damaged
> > chunck
> >
> > Do you think it is a sensible approach ? And if you do, will I find
> > examples of such higher level functions in
> > http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> > 703/gf_complete_0.1.tar ? Or elsewhere ?
> >
> > I'm a little confused about the relation between GF complete ( as
> > found at
> > http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
> > 703/gf_complete_0.1.tar ) which is very recent ( 2013 ) and Jerasure (
> > as found at
> > http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-
> > 1.2.tar )  which is comparatively older ( 2008 ). Do you know how
> > Jerasure
> > 2.0 relates to GF complete ?
> >
> > For completeness, here is a thread with pointers to Mojette Transform
> > that's being used as part of Rozofs.
> >
> > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14666.html
> >
> > I'm not able to compare it with the other libraries because it seems
> > to take a completely different approach. Do you have an opinion about it
> > ?
> >
> > As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next
> > month but I'd love to understand more about this as soon as possible
> > :-)
> >
> > Cheers
> >
> > P.S. Updated
> > http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding
> > _as_ a_storage_backend#Erasure_Encoded_Storage with a link to
> > http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
> >
> > On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
> > > Dear Community
> > > I am a young engineer (not software or math, please bare with me)
> > > with
> > some suggestions regarding erasure codes. I never used Ceph before or
> > any other distributed file system.
> > >
> > > I stumped upon the suggestion for adding erasure codes to Ceph, as
> > > described in this article
> > >
> > http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding
> > _as_
> > a_storage_backend
> > >
> > > first I would like to say great initiative to add erasure codes to
> > > Ceph.
> > > Ceph needs its own implementation and it have to be done right, I
> > > cannot
> > stress this enough, suggested software mentioned in that article would
> > result in very low performance.
> > >
> > > Why?
> > > Reed-Solomon is normally something regarded as being very slow
> > > compared
> > to other erasure codes, because the underlying Galois-Field
> > multiplication is slow. Please see video at usenix.org forexplanation.
> > >
> > > The implementations of Zfec library and other suggested software the
> > others rely on the Vandermonde matrix, a matrix used in in
> > Reed-Solomon erasure codes, a faster approach would be to use the
> > Cauchy-Reed-Solomon implementation. Please see [1,2,3]
> > > Although there is something even better, by using the Intel SSE2/3
> > > SIMD
> > instructions it is possible to do the as fast as any other XOR based
> > erasure codes (RaptorQ LT-codes, LDPC etc.).
> > >
> > > The suggested FECpp lib uses the optimisation but with a relative
> > > small
> > Galois-field only 2^8, since Ceph aimes at unlimited scalability
> > increasing the size of the Galois-Field would improve performance [4].
> > Of course the configured Ceph Object Size and/or Stripe width have to
> > be taken into account.
> > > Please see
> > > https://www.usenix.org/conference/fast13/screaming-fast-galois-field
> > > -
> > arithmetic-using-sse2-extensions
> > >
> > >
> > > The solution
> > > Using the GF-Complete open source library [4] to implement
> > > Reed-Solomon
> > in Ceph in order to allow Ceph to scale to infinity.
> > > James S. Plank the author of GF-complete have developed a library
> > implementing various Reed-Solomon codes called Jerasure.
> > http://web.eecs.utk.edu/~plank/plank/www/software.html
> > > Jerasure 2.0 using the GF-complete artimetric based in Intel SSE
> > > SIMD
> > instructions, is current in development expected release august 2013.
> > Will be released under the new BSD license. Jerasure 2.0 also supports
> > arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
> > >
> > > The limit of this implementation would be the processors L2/L3 cache
> > > not
> > the underlying arithmetic.
> > >
> > > Best Regards
> > > Martin Flyvbjerg
> > >
> > > [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
> > > [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
> > > [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
> > > [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do
> > nothing.
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-18 14:31           ` Harvey Skinner
@ 2013-06-18 15:46             ` Loic Dachary
  0 siblings, 0 replies; 23+ messages in thread
From: Loic Dachary @ 2013-06-18 15:46 UTC (permalink / raw)
  To: Harvey Skinner; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 11038 bytes --]

Hi Harvey,

On 06/18/2013 04:31 PM, Harvey Skinner wrote:
> all,
> 
> 
> nice discussion on implementing  erasure-encoding as an option in Ceph for
> object store.   THis will be a great feature option.   I would recommend
> that a good default be implemented  as being discussed here, but also that
> an API or plug-in be architected at the same time so the deployment could
> also use some other form of parity/erasure-encoding if required/desired.
> Governments and other vendors will push/convince (or have hardcoded in an
> RFP) that some other algorithm be used and allowing an API/plug-in
> capability will allow Ceph to still be applicable. There are some
> proprietary object store solutions today which allow other
> erasure-encode/parity plug-ins which you will want to be competitive with.
> 

For most coding libraries it looks like an abstract API such as:

// extra == GF(2^8) for instance, i.e. parameters dependant on the kind of encoding function
context = initialize(int k, int m, void* extra)
// code block into k+m chunks
code(context, char* block, char** chunks)
// decode k+m chunks into block and if chunk[i] is erased[i], repair it
decode(context, char** chunks, char* block, int* erased)

is all we need. However, if hierarchical codes are to be considered, the code() function will need information about the location of the object ( the crushmap maybe ?) and probably output additional information to be used when coding other objects, not just chunks. 

What do you think ? 
                           
> I would see initial deployments of erasure-encoded object store be used for
> static data objects; block device usage would come later if proved to be as
> performant as replicated objects.   Replication as done now in Ceph may
> still be a better block storage solution choice for DR strategies,
> configuring replicas to be at different DR site(s).

I suspect you're correct : using erasure coding in a pool used by rgw is likely to be the first actual use case.

> 
> just my view of course …
> 
> Harvey
> 
> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Paul Von-Stamwitz
>> Sent: Friday, June 14, 2013 6:12 PM
>> To: Loic Dachary; Martin Flyvbjerg
>> Cc: ceph-devel@vger.kernel.org
>> Subject: RE: Comments on Ceph distributed parity implementation
>>
>> Hi Loic and Martin,
>>
>> This is a great discussion and I agree that the performance ramifications
>> of erasure coding need to be thought out very carefully. Since chunks are
>> not only encoded, but distributed across the cluster, we need to pay
>> attention to the network overhead as well as the arithmetic involved in the
>> encoding/decoding.
>>
>> If I understand the proposal correctly, objects begin their life's journey
>> replicated as normal. As it grows cold, it gets transformed to an encode PG
>> in another pool. Subsequent reads will be redirected (ehh). Subsequent
>> writes will first decode the original object and re-replicate it (ouch!).
>> Client writes never are encoded on the fly; they are always replicated
>> (nice).
>>
>> So...
>> encode() is run as a low-priority background process probably once a week
>> during deep scrubs.
>> decode() should be rare (if not, the object shouldn't have been encoded in
>> the first place.) If the cluster is healthy no arithmetic is needed, just
>> concatenation, but a lot of network activity.
>> repair() operations will be the most prevalent and may occur at any time
>> during normal self-healing/rebalancing operations.
>>
>> Therefore, in my opinion, the algorithm that we choose must be optimized
>> for repairing damaged and missing chunks. The main problem I have with
>> Reed-Solomon is that it uses MDS codes which maximize network activity for
>> recalculations. Pyramid codes have the same write (encode) overhead, but
>> have better read (repair) overhead.
>>
>> Loic, I know nothing about Mojette Transforms. From what little I gleaned,
>> it might be good for repair (needing only a subset of chunks within a range
>> to recalculate a missing chunk) but I'm worried about the storage
>> efficiency. RozoFS claims 1.5x. I'd like to do better than that.
>>
>> All the best,
>> Paul
>>
>> On 06/14/2013 3:57 PM, Loic Dachary wrote:
>>> Hi Martin,
>>>
>>> Your explanations are very helpful to better understand the tradeoffs
>>> of the existing implementations. To be honest I was looking forward to
>>> your intervention. Not you specifically, of course :-) But someone
>>> with a good theoretical background to be a judge of what's best in the
>>> context of Ceph.
>>> If you say it's the upcoming library to be released in August 2013,
>>> I'll take your word for it.
>>>
>>> The work currently being done within Ceph is to architecture to
>>> storage backend ( namely placement groups ) to make room for distributed
>>> parity.
>>> My initial idea was to isolate the low level library under an API that
>>> takes a region ( 16KB for instance, as in gf_unit.c found in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>>> 703/gf_complete_0.1.tar ) as input and outputs chunks that can then be
>>> written on different hosts. For instance
>>>
>>>       encode(char* region, char** chuncks) => encode the region into N
>>> chuncks
>>>       decode(char** chunks, char* region) => decode the N chuncks into a
>>> region
>>>         repair(char** chunks, int damaged) => repair the damaged
>>> chunck
>>>
>>> Do you think it is a sensible approach ? And if you do, will I find
>>> examples of such higher level functions in
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>>> 703/gf_complete_0.1.tar ? Or elsewhere ?
>>>
>>> I'm a little confused about the relation between GF complete ( as
>>> found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-
>>> 703/gf_complete_0.1.tar ) which is very recent ( 2013 ) and Jerasure (
>>> as found at
>>> http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-
>>> 1.2.tar )  which is comparatively older ( 2008 ). Do you know how
>>> Jerasure
>>> 2.0 relates to GF complete ?
>>>
>>> For completeness, here is a thread with pointers to Mojette Transform
>>> that's being used as part of Rozofs.
>>>
>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14666.html
>>>
>>> I'm not able to compare it with the other libraries because it seems
>>> to take a completely different approach. Do you have an opinion about it
>>> ?
>>>
>>> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next
>>> month but I'd love to understand more about this as soon as possible
>>> :-)
>>>
>>> Cheers
>>>
>>> P.S. Updated
>>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding
>>> _as_ a_storage_backend#Erasure_Encoded_Storage with a link to
>>> http://web.eecs.utk.edu/~plank/plank/www/software.html for the record
>>>
>>> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
>>>> Dear Community
>>>> I am a young engineer (not software or math, please bare with me)
>>>> with
>>> some suggestions regarding erasure codes. I never used Ceph before or
>>> any other distributed file system.
>>>>
>>>> I stumped upon the suggestion for adding erasure codes to Ceph, as
>>>> described in this article
>>>>
>>> http://wiki.Ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding
>>> _as_
>>> a_storage_backend
>>>>
>>>> first I would like to say great initiative to add erasure codes to
>>>> Ceph.
>>>> Ceph needs its own implementation and it have to be done right, I
>>>> cannot
>>> stress this enough, suggested software mentioned in that article would
>>> result in very low performance.
>>>>
>>>> Why?
>>>> Reed-Solomon is normally something regarded as being very slow
>>>> compared
>>> to other erasure codes, because the underlying Galois-Field
>>> multiplication is slow. Please see video at usenix.org forexplanation.
>>>>
>>>> The implementations of Zfec library and other suggested software the
>>> others rely on the Vandermonde matrix, a matrix used in in
>>> Reed-Solomon erasure codes, a faster approach would be to use the
>>> Cauchy-Reed-Solomon implementation. Please see [1,2,3]
>>>> Although there is something even better, by using the Intel SSE2/3
>>>> SIMD
>>> instructions it is possible to do the as fast as any other XOR based
>>> erasure codes (RaptorQ LT-codes, LDPC etc.).
>>>>
>>>> The suggested FECpp lib uses the optimisation but with a relative
>>>> small
>>> Galois-field only 2^8, since Ceph aimes at unlimited scalability
>>> increasing the size of the Galois-Field would improve performance [4].
>>> Of course the configured Ceph Object Size and/or Stripe width have to
>>> be taken into account.
>>>> Please see
>>>> https://www.usenix.org/conference/fast13/screaming-fast-galois-field
>>>> -
>>> arithmetic-using-sse2-extensions
>>>>
>>>>
>>>> The solution
>>>> Using the GF-Complete open source library [4] to implement
>>>> Reed-Solomon
>>> in Ceph in order to allow Ceph to scale to infinity.
>>>> James S. Plank the author of GF-complete have developed a library
>>> implementing various Reed-Solomon codes called Jerasure.
>>> http://web.eecs.utk.edu/~plank/plank/www/software.html
>>>> Jerasure 2.0 using the GF-complete artimetric based in Intel SSE
>>>> SIMD
>>> instructions, is current in development expected release august 2013.
>>> Will be released under the new BSD license. Jerasure 2.0 also supports
>>> arbitrary Galois-field sizes 8,16,32,64 or 128 bit.
>>>>
>>>> The limit of this implementation would be the processors L2/L3 cache
>>>> not
>>> the underlying arithmetic.
>>>>
>>>> Best Regards
>>>> Martin Flyvbjerg
>>>>
>>>> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
>>>> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
>>>> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
>>>> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do
>>> nothing.
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Comments on Ceph distributed parity implementation
  2013-06-18 14:22             ` James Plank
@ 2013-06-19  1:35               ` Paul Von-Stamwitz
  2013-06-20 18:25               ` Loic Dachary
  1 sibling, 0 replies; 23+ messages in thread
From: Paul Von-Stamwitz @ 2013-06-19  1:35 UTC (permalink / raw)
  To: James Plank, Benoît Parrein; +Cc: Loic Dachary, ceph-devel@vger.kernel.org

Benoit,

I agree with you. Redundancy and availability, as well as performance for both encoding and reconstruction, are related. There are inherent trade-offs between them. The beautiful thing about Ceph is the support for multiple pools which can have different attributes. Today we have pools with differing performance and redundancy characteristics. This may add complexity when adding erasure coding. Ideally, a user could create a pool that favored capacity and another one that favored availability and another... etc.

Jim, I like the image of your "space-overhead/fault-tolerance curve" because I see our issue as multiple points on the curve rather than a single one. From a performance perspective, I agree that the CPU burden is not really an issue, but inter-node communication is.

All the best
Paul

On June 18, 2013 at 7:23 AM, James Plank wrote:
> 
> Hi all -- thank you for including me on this thread, although I have
> little substantive to add.  At the moment, my sole focus is finishing a
> journal paper about GF implementations, with a concomitant GF-complete
> release to accompany it.  I agree that the CPU burden of the GF arithmetic
> will not be a bottleneck in your system, regardless of which
> implementation you use, as long as you stay at or below GF(2^16).  If you
> want to go higher, GF-complete will help.  When we put out a new release
> (the code will be ready within two weeks, however, the documentation is
> lagging), I'll let you know.  I think LRC is a nice coding paradigm,
> although I imagine that it has IP issues with Microsoft.  I don't have
> first-hand experience with network/regenerating codes, and I'll be honest
> -- there have been so many papers in that realm that I am not up to date
> on them.
> 
> Is there a question on which you'd like some help?  It sounds as though
> you are at two decision points: What code should you use, and at which
> point on the space-overhead/fault-tolerance curve would you like to be?
> 
> Best wishes,
> 
> Jim
> ----------
> 
> On Jun 18, 2013, at 3:44 AM, Benoît Parrein wrote:
> 
> > Hi Paul,
> >
> > thank you for your message
> >
> > from my point, LRC focuses on the repairing problem. how to reconstruct
> destroyed node to maintain the same availability by the distributed
> system?
> > in this context they can even go below 1x rate by introducing local
> parity on classical Reed Solomon blocks (but they pay a supplementary
> overhead). see excellent Alex Dimakis's papers for that. but, still from
> my point, the same relationship between redundancy and availability occurs
> (if you consider binomial model for your loses).
> >
> > best
> > bp
> >
> >
> > Le 17/06/2013 18:55, Paul Von-Stamwitz a écrit :
> >> Loic,
> >>
> >> As Benoit points out, Mojette uses discrete geometry rather than
> algebra, so simple XOR is all that is needed.
> >>
> >> Benoit,
> >>
> >> Microsoft's paper states that their [12,2,2] LRC provides better
> availability than 3x replication with 1.33x efficiency. 1.5x is certainly
> a good number. I'm just pointing out that better efficiency can be had
> without losing availibity.
> >>
> >> All the best,
> >> Paul
> >
> > <benoit_parrein.vcf>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-18 14:22             ` James Plank
  2013-06-19  1:35               ` Paul Von-Stamwitz
@ 2013-06-20 18:25               ` Loic Dachary
  2013-06-21  1:23                 ` Paul Von-Stamwitz
  1 sibling, 1 reply; 23+ messages in thread
From: Loic Dachary @ 2013-06-20 18:25 UTC (permalink / raw)
  To: James Plank; +Cc: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3781 bytes --]



On 06/18/2013 04:22 PM, James Plank wrote:
> Hi all -- thank you for including me on this thread, although I have little substantive to add.  At the moment, my sole focus is finishing a journal paper about GF implementations, with a concomitant GF-complete release to accompany it.  I agree that the CPU burden of the GF arithmetic will not be a bottleneck in your system, regardless of which implementation you use, as long as you stay at or below GF(2^16).  If you want to go higher, GF-complete will help.  When we put out a new release (the code will be ready within two weeks, however, the documentation is lagging), I'll let you know.  I think LRC is a nice coding paradigm, although I imagine that it has IP issues with Microsoft.  I don't have first-hand experience with network/regenerating codes, and I'll be honest -- there have been so many papers in that realm that I am not up to date on them.
> 
> Is there a question on which you'd like some help?  It sounds as though you are at two decision points: What code should you use, and at which point on the space-overhead/fault-tolerance curve would you like to be?

Hi James, 

Unless someone objects it looks like Ceph going to use jerasure-1.2 with reed-solomon. I'm glad to hear that GF arithmetic will not be a bottleneck : we're going to stay below GF(2^8). However minimizing the CPU footprint is essential and I'm looking forward to use the next version including the SIMD optimizations that you demonstrated in gf-complete.

I wrote down a short description of the read/write path I plan to implement in ceph : https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst . A quick look at the drawings will hopefully give you an idea. Each OSD is a disk connected to the others over the network. Although I chose K+M = 5 I suspect the most common use case will be around K+M = 7+3 = 10 

I've seen that jerasure-1.2 not only provides classic reed-solomon but also cauchy reed-solomon and liberation / minimal density MDS codes. I assume classic reed-solomon is best suited for the default Ceph use case described above but I'm not sure. What do you think ?

Thanks a lot for your advices :-) It helps me write sensible code. 

Cheers

> 
> Best wishes,
> 
> Jim
> ----------
> 
> On Jun 18, 2013, at 3:44 AM, Benoît Parrein wrote:
> 
>> Hi Paul,
>>
>> thank you for your message
>>
>> from my point, LRC focuses on the repairing problem. how to reconstruct destroyed node to maintain the same availability by the distributed system?
>> in this context they can even go below 1x rate by introducing local parity on classical Reed Solomon blocks (but they pay a supplementary overhead). see excellent Alex Dimakis's papers for that. but, still from my point, the same relationship between redundancy and availability occurs (if you consider binomial model for your loses).
>>
>> best
>> bp
>>
>>
>> Le 17/06/2013 18:55, Paul Von-Stamwitz a écrit :
>>> Loic,
>>>
>>> As Benoit points out, Mojette uses discrete geometry rather than algebra, so simple XOR is all that is needed.
>>>
>>> Benoit,
>>>
>>> Microsoft's paper states that their [12,2,2] LRC provides better availability than 3x replication with 1.33x efficiency. 1.5x is certainly a good number. I'm just pointing out that better efficiency can be had without losing availibity.
>>>
>>> All the best,
>>> Paul
>>
>> <benoit_parrein.vcf>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Comments on Ceph distributed parity implementation
  2013-06-20 18:25               ` Loic Dachary
@ 2013-06-21  1:23                 ` Paul Von-Stamwitz
  2013-06-21  8:29                   ` Loic Dachary
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Von-Stamwitz @ 2013-06-21  1:23 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel@vger.kernel.org


On 06/20/2013 11:26 PM, Loic Dachary wrote:
> 
> I wrote down a short description of the read/write path I plan to
> implement in ceph :
https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst.
> A quick look at the drawings will hopefully give you an idea. Each OSD is > a disk connected to the others over the network. Although I chose K+M = 5 > I suspect the most common use case will be around K+M = 7+3 = 10

Hi Loic,

A couple questions regarding your proposal:

Where would encode/decode occur? Client or OSD? Previous discussions seemed to want it at the OSD level. Your proposal seems that it could be either.

What about partial reads? For example, if only 'H' was requested, would you decode the entire object first? Or, read from OSD1 and reconstruct on error?

Thanks,
Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-21  1:23                 ` Paul Von-Stamwitz
@ 2013-06-21  8:29                   ` Loic Dachary
  2013-06-22  0:08                     ` Paul Von-Stamwitz
  0 siblings, 1 reply; 23+ messages in thread
From: Loic Dachary @ 2013-06-21  8:29 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: ceph-devel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2107 bytes --]

On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
> 
> On 06/20/2013 11:26 PM, Loic Dachary wrote:
>>
>> I wrote down a short description of the read/write path I plan to
>> implement in ceph :
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst.
>> A quick look at the drawings will hopefully give you an idea. Each OSD is > a disk connected to the others over the network. Although I chose K+M = 5 > I suspect the most common use case will be around K+M = 7+3 = 10
> 
> Hi Loic,
> 
> A couple questions regarding your proposal:
> 
> Where would encode/decode occur? Client or OSD? Previous discussions seemed to want it at the OSD level. Your proposal seems that it could be either.

I think it should be in the OSD. Although it would be possible to implement it in the client, it would have to be implemented in the OSD anyway for scrubbing. Therefore I think it is simpler to implement it in the OSD as a first step.

> 
> What about partial reads? For example, if only 'H' was requested, would you decode the entire object first? 

Unless you see a good reason to implement this optimization from the start, I think the entire object would be decoded.

> Or, read from OSD1 and reconstruct on error?

I think that's what we want, ultimately. And also to encode / decode large objects (1GB for instance) as a sequence of independant regions (4MB for instance or smaller). 

I updated https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst to reflect our discussion.

The first, simplest implementation is likely to be fit to use with RGW and probably too slow to use with RBD. Do you think we should try to optimize for RBD right now ? 

Cheers

> 
> Thanks,
> Paul
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Comments on Ceph distributed parity implementation
  2013-06-21  8:29                   ` Loic Dachary
@ 2013-06-22  0:08                     ` Paul Von-Stamwitz
  2013-06-22  8:26                       ` Loic Dachary
  2013-06-24  2:26                       ` Harvey Skinner
  0 siblings, 2 replies; 23+ messages in thread
From: Paul Von-Stamwitz @ 2013-06-22  0:08 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel@vger.kernel.org, Harvey Skinner


On 06/21/2013 1:29 AM, Loic Dachary wrote:
> 
> On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
> >
> > On 06/20/2013 11:26 PM, Loic Dachary wrote:
> >>
> >> I wrote down a short description of the read/write path I plan to
> >> implement in ceph :
> > https://github.com/dachary/ceph/blob/wip-
> 4929/doc/dev/osd_internals/erasure-code.rst.
> >> A quick look at the drawings will hopefully give you an idea. Each OSD
> is > a disk connected to the others over the network. Although I chose K+M
> = 5 > I suspect the most common use case will be around K+M = 7+3 = 10
> >
> > Hi Loic,
> >
> > A couple questions regarding your proposal:
> >
> > Where would encode/decode occur? Client or OSD? Previous discussions
> seemed to want it at the OSD level. Your proposal seems that it could be
> either.
> 
> I think it should be in the OSD. Although it would be possible to
> implement it in the client, it would have to be implemented in the OSD
> anyway for scrubbing. Therefore I think it is simpler to implement it in
> the OSD as a first step.

Agreed. Encode/decode/reconstruction belongs in the OSD layer. There is a possibility that the client and OSDs can work in tandem. For example, the client already does sharding. But, the OSD is currently responsible for replication, so it should be responsible for parity. For now, we should strive for encoding to be transparent to the client.

> 
> >
> > What about partial reads? For example, if only 'H' was requested, would
> you decode the entire object first?
> 
> Unless you see a good reason to implement this optimization from the start,
> I think the entire object would be decoded.
> 
> > Or, read from OSD1 and reconstruct on error?
> 
> I think that's what we want, ultimately. And also to encode / decode large
> objects (1GB for instance) as a sequence of independant regions (4MB for
> instance or smaller).
> 
> I updated https://github.com/dachary/ceph/blob/wip-
> 4929/doc/dev/osd_internals/erasure-code.rst to reflect our discussion.
> 
> The first, simplest implementation is likely to be fit to use with RGW and
> probably too slow to use with RBD. Do you think we should try to optimize
> for RBD right now ?

Yes, RGW is the obvious best candidate for the first implementation. We don't need to implement for RBD and CephFS now, but we should consider how the design would handle other applications in the future. The alternative is to optimize purely for RGW and provide an API/plug-in capability suggested by Harvey Skinner to make way for optimized solutions for other applications.

All the best
pvs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-22  0:08                     ` Paul Von-Stamwitz
@ 2013-06-22  8:26                       ` Loic Dachary
  2013-06-24  2:26                       ` Harvey Skinner
  1 sibling, 0 replies; 23+ messages in thread
From: Loic Dachary @ 2013-06-22  8:26 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: ceph-devel@vger.kernel.org, Harvey Skinner

[-- Attachment #1: Type: text/plain, Size: 2978 bytes --]

>> The first, simplest implementation is likely to be fit to use with RGW and
>> probably too slow to use with RBD. Do you think we should try to optimize
>> for RBD right now ?
> 
> Yes, RGW is the obvious best candidate for the first implementation. We don't need to implement for RBD and CephFS now, but we should consider how the design would handle other applications in the future. The alternative is to optimize purely for RGW and provide an API/plug-in capability suggested by Harvey Skinner to make way for optimized solutions for other applications.
> 

I agree that the design should make room to plug in optimizations in the future. I've tried to figure out where the API/plug-in should fit. 

a) pluggable placement group
b) pluggable erasure code library

The pluggable placement group capability is what I'm working on right now. It requires some re-architecture of the current code and the API is starting to emerge. The implementation should eventually be in a separate shared library ( say ErasureCodePG ) loaded at run time and selected with a configuration option when creating a pool. I suspect that experimenting with new optimization strategies is going to be done by hacking ErasureCodePG and create new pools using it. 

Let say we find a way to optimize for RBD and implement that in the RBDErasureCodePG placement group. And we configure the RBD pool to use this placement group backend while keeping the ErasureCodePG placement group backend for RGW. Later on it may make sense to merge the two or make sure they share similar code for maintainance purposes. But that probably leaves all the room we need to experiment until a general solution is found.

The pluggable erasure code library API will be something like what is described in http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

    context(k, m, reed-solomon|...) => context* c 
    encode(context* c, void* data) => void* chunks[k+m]
    decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
    repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt

It won't be enough for hierarchical codes but they don't seem to be considered attractive at the moment. It should be enough for LRC ( http://anrg.usc.edu/~maheswaran/Xorbas.pdf ) since it only requires an additional argument to the context ( the number of chunks required to do a local repair ).

The need for another API ( in addition to pluggable placement groups and pluggable erasure code library ) may appear in the future. I can't see it right now. I try to refrain from over-engineering while making sure we don't need to re-architecture because something obvious was overlooked. This discussion is helping a lot :-) 

What do you think ?

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Comments on Ceph distributed parity implementation
  2013-06-22  0:08                     ` Paul Von-Stamwitz
  2013-06-22  8:26                       ` Loic Dachary
@ 2013-06-24  2:26                       ` Harvey Skinner
  1 sibling, 0 replies; 23+ messages in thread
From: Harvey Skinner @ 2013-06-24  2:26 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: Loic Dachary, ceph-devel@vger.kernel.org, hpmpec2a

hi all,

I agree that the OSD level (storage nodes) is where the
encoding/reconstruct should be.   From an overall solution
perspective, client nodes are sized for the apps they are deploying.
It would be a real push to get them to also over provision app/client
servers to take on the additional processing needed for encoding, etc.
  Better to have this as part of the storage node sizing criteria; is
the end user is going to use erasure-encoding or replication or both
(different pools) on the same storage node platform.

Harvey

On Fri, Jun 21, 2013 at 5:08 PM, Paul Von-Stamwitz
<PVonStamwitz@us.fujitsu.com> wrote:
>
> On 06/21/2013 1:29 AM, Loic Dachary wrote:
>>
>> On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
>> >
>> > On 06/20/2013 11:26 PM, Loic Dachary wrote:
>> >>
>> >> I wrote down a short description of the read/write path I plan to
>> >> implement in ceph :
>> > https://github.com/dachary/ceph/blob/wip-
>> 4929/doc/dev/osd_internals/erasure-code.rst.
>> >> A quick look at the drawings will hopefully give you an idea. Each OSD
>> is > a disk connected to the others over the network. Although I chose
>> K+M
>> = 5 > I suspect the most common use case will be around K+M = 7+3 = 10
>> >
>> > Hi Loic,
>> >
>> > A couple questions regarding your proposal:
>> >
>> > Where would encode/decode occur? Client or OSD? Previous discussions
>> seemed to want it at the OSD level. Your proposal seems that it could be
>> either.
>>
>> I think it should be in the OSD. Although it would be possible to
>> implement it in the client, it would have to be implemented in the OSD
>> anyway for scrubbing. Therefore I think it is simpler to implement it in
>> the OSD as a first step.
>
> Agreed. Encode/decode/reconstruction belongs in the OSD layer. There is a
> possibility that the client and OSDs can work in tandem. For example, the
> client already does sharding. But, the OSD is currently responsible for
> replication, so it should be responsible for parity. For now, we should
> strive for encoding to be transparent to the client.
>
>>
>> >
>> > What about partial reads? For example, if only 'H' was requested, would
>> you decode the entire object first?
>>
>> Unless you see a good reason to implement this optimization from the
>> start,
>> I think the entire object would be decoded.
>>
>> > Or, read from OSD1 and reconstruct on error?
>>
>> I think that's what we want, ultimately. And also to encode / decode
>> large
>> objects (1GB for instance) as a sequence of independant regions (4MB for
>> instance or smaller).
>>
>> I updated https://github.com/dachary/ceph/blob/wip-
>> 4929/doc/dev/osd_internals/erasure-code.rst to reflect our discussion.
>>
>> The first, simplest implementation is likely to be fit to use with RGW
>> and
>> probably too slow to use with RBD. Do you think we should try to optimize
>> for RBD right now ?
>
> Yes, RGW is the obvious best candidate for the first implementation. We
> don't need to implement for RBD and CephFS now, but we should consider how
> the design would handle other applications in the future. The alternative is
> to optimize purely for RGW and provide an API/plug-in capability suggested
> by Harvey Skinner to make way for optimized solutions for other
> applications.
>
> All the best
> pvs
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-06-24  2:26 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-14 20:13 Comments on Ceph distributed parity implementation Martin Flyvbjerg
2013-06-14 20:29 ` Mark Nelson
2013-06-14 21:05 ` Joe Buck
2013-06-14 22:57 ` Loic Dachary
2013-06-15  1:12   ` Paul Von-Stamwitz
2013-06-15  6:51     ` Loic Dachary
2013-06-16 19:51     ` Benoît Parrein
2013-06-16 21:31       ` Loic Dachary
2013-06-17 16:48         ` Benoît Parrein
2013-06-17 16:55         ` Paul Von-Stamwitz
2013-06-18  7:44           ` Benoît Parrein
2013-06-18 14:22             ` James Plank
2013-06-19  1:35               ` Paul Von-Stamwitz
2013-06-20 18:25               ` Loic Dachary
2013-06-21  1:23                 ` Paul Von-Stamwitz
2013-06-21  8:29                   ` Loic Dachary
2013-06-22  0:08                     ` Paul Von-Stamwitz
2013-06-22  8:26                       ` Loic Dachary
2013-06-24  2:26                       ` Harvey Skinner
     [not found]     ` <C395B77B849187439280E1CF5FE1F2FA8990491B@G9W0337.americas.hpqcorp.net>
     [not found]       ` <CAJOObidVdjtiwk+xk5rwZi4=DBZ9GvTQnAkteCC0OhB_vyg6pg@mail.gmail.com>
     [not found]         ` <CAJOObicNGkweZbVSR-V8NA9YXaZucUpNm0y8Ph3X7EkE=pRG5g@mail.gmail.com>
2013-06-18 14:31           ` Harvey Skinner
2013-06-18 15:46             ` Loic Dachary
2013-06-15  7:30 ` Loic Dachary
2013-06-15  9:40   ` Leen Besselink

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.