Erasure encoding as a storage backend

* Erasure encoding as a storage backend
@ 2013-05-04 17:16 Loic Dachary
  2013-05-04 18:27 ` Noah Watkins
  0 siblings, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2013-05-04 17:16 UTC (permalink / raw)
  To: Ceph Development

Hi,

Here is an updated description of the "Erasure encoding as a storage backend" proposed implementation that will be discussed during the ceph summit ( http://wiki.ceph.com/01Planning/Developer_Summit#Schedule ). The "strip" and "stripe" terms are illustrated at http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Proposed_model . 

I am well aware of the shortcomings of this proposal and it would be great to get feedback before the ceph summit to address the most prominent issues.

Cheers

http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

	* PG and ReplicatedPG are reworked so that PG can be used as a base class for ErasureEncodedPG
		* Tests are written for ReplicatedPG to cover 100% of the LOC and most of the expected functionalities.
		* Code is reworked in PG and ReplicatedPG: code that is not specific to replication moves from ReplicatedPG to PG, and code that is not generic enough to be useful for ErasureEncodedPG moves from PG to ReplicatedPG.
	* To isolate Ceph from the actual library being used ( zfec, fecpp, ... ), a wrapper around the erasure encoding library is implemented. Each block is encoded into k data blocks and m parity blocks
		* encode(void* data, k, m) => void* data[k], void* parity[m]
		* decode(void* data[k], void* parity[m]) => void* data
		* repair(void* data[k], void* parity[m], indices_of_damaged_blocks[]) => void* data
	* The ErasureEncodedPG configuration is set to encode each object into k data objects and m parity objects.
		* It uses the parity ('INDEP') CRUSH mode so that each shard keeps its rank. The indep placement avoids moving shards between ranks: if osd.1 fails, a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something similar), so the shards on 2,3,4 won't need to be copied around.
		* The ErasureEncodedPG uses k + m OSDs, numbered D0 ... Dk-1 and C0 ... Cm-1
		* Each object is a strip
		* Each stripe has a fixed size of B bytes
	* ErasureEncodedPG implementation
		* Write offset, length
			* read the stripes containing offset, length
			* for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
			* modify the bufferlist with the write request
			* encode(void* data, k, m) => void* data[k], void* parity[m]
			* write data[0] to D0, data[1] to D1 ... data[k-1] to Dk-1 and parity[0] to C0 ... parity[m-1] to Cm-1
		* Read offset, length
			* read the stripes containing offset, length
			* for each stripe, decode(void* data[k], void* parity[m]) => void* data and append to a bufferlist
		* Object attributes
			* duplicate the object attributes on each OSD
		* Scrubbing
			* for each object, read each stripe and write back if a repair was necessary
		* Repair
			* when an OSD is decommissioned and another OSD replaces it, for each object contained in an ErasureEncodedPG using this OSD, read the object, repair each strip and write back the strip that resides on the new OSD
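
To make the wrapper interface above concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: it uses a single XOR parity block (m = 1) instead of a real library such as zfec or fecpp, the function names simply mirror the encode / decode / repair signatures listed above, and the choice to let repair() return both the data and parity lists is mine, not part of the proposal.

```python
from functools import reduce
from typing import List, Optional, Sequence, Tuple


def _xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int, m: int = 1) -> Tuple[List[bytes], List[bytes]]:
    """Split data into k data blocks and compute m parity blocks (m == 1 only)."""
    if m != 1:
        raise ValueError("this XOR sketch supports a single parity block only")
    if k < 1 or len(data) % k != 0:
        raise ValueError("data length must be a multiple of k")
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks, [reduce(_xor, blocks)]


def repair(blocks: Sequence[Optional[bytes]],
           parity: Sequence[Optional[bytes]],
           damaged: Sequence[int]) -> Tuple[List[bytes], List[bytes]]:
    """Rebuild one damaged block; indices 0..k-1 are data blocks, k is the parity."""
    if len(damaged) > 1:
        raise ValueError("single XOR parity can rebuild only one lost block")
    blocks, parity = list(blocks), list(parity)
    for index in damaged:
        # XOR of all surviving blocks (data and parity) yields the missing one.
        survivors = [b for i, b in enumerate(blocks + parity) if i != index]
        rebuilt = reduce(_xor, survivors)
        if index < len(blocks):
            blocks[index] = rebuilt
        else:
            parity[0] = rebuilt
    return blocks, parity


def decode(blocks: Sequence[Optional[bytes]],
           parity: Sequence[Optional[bytes]],
           damaged: Sequence[int] = ()) -> bytes:
    """Return the original data, repairing a damaged block first if needed."""
    repaired, _ = repair(blocks, parity, list(damaged))
    return b"".join(repaired)
```

With k = 4, encode(b"abcdefgh", 4) yields four 2-byte data blocks and one parity block, and decode() can still return the original bytes when any single block is marked as damaged. A real erasure code would tolerate up to m lost blocks.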

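The "Write offset, length" steps above amount to a read-modify-write over the stripes that overlap the request. A small Python sketch of that bookkeeping follows. The names stripes_for_range and write are hypothetical, the object is modelled as an in-memory list of already decoded stripes of stripe_size bytes (the B of the proposal), and the decode / encode calls and the per-OSD writes are deliberately left out.

```python
from typing import List


def stripes_for_range(offset: int, length: int, stripe_size: int) -> range:
    """Indices of the fixed-size stripes overlapping [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return range(first, last + 1)


def write(stripes: List[bytes], offset: int, payload: bytes,
          stripe_size: int) -> List[int]:
    """Apply payload at offset and return the indices of the rewritten stripes."""
    touched = stripes_for_range(offset, len(payload), stripe_size)
    if not touched:
        return []
    if touched[-1] >= len(stripes):
        raise ValueError("this sketch does not grow the object")
    # Read: gather the affected stripes into one contiguous buffer.
    buffer = bytearray(b"".join(stripes[i] for i in touched))
    # Modify: overlay the payload at its position relative to the first stripe.
    start = offset - touched[0] * stripe_size
    buffer[start:start + len(payload)] = payload
    # Write: cut the buffer back into stripes (each would be re-encoded here).
    for n, i in enumerate(touched):
        stripes[i] = bytes(buffer[n * stripe_size:(n + 1) * stripe_size])
    return list(touched)
```

The sketch shows why a small write that straddles a stripe boundary forces two full stripes to be read, re-encoded and written back.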

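The benefit of the indep placement described above can be shown with a toy model. This is not the CRUSH algorithm: replace_firstn and replace_indep are hypothetical helpers that only mimic how a failed OSD is replaced when ranks shift (as with replicas) versus when each rank is filled independently.

```python
from typing import List


def replace_firstn(mapping: List[int], failed: int, spare: int) -> List[int]:
    """Replica-style replacement: survivors shift left, the spare is appended."""
    return [osd for osd in mapping if osd != failed] + [spare]


def replace_indep(mapping: List[int], failed: int, spare: int) -> List[int]:
    """Indep-style replacement: only the failed rank gets a new OSD."""
    return [spare if osd == failed else osd for osd in mapping]


def moved_shards(before: List[int], after: List[int]) -> List[int]:
    """Ranks whose OSD changed, i.e. shards that must be copied or rebuilt."""
    return [rank for rank, (a, b) in enumerate(zip(before, after)) if a != b]
```

For the mapping [0,1,2,3,4] with osd.1 failing and osd.6 as the replacement, the indep variant changes rank 1 only, while the shifting variant changes ranks 1 through 4. With erasure coding the rank matters because shard i is not interchangeable with shard j.
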
-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure encoding as a storage backend
  2013-05-04 17:16 Erasure encoding as a storage backend Loic Dachary
@ 2013-05-04 18:27 ` Noah Watkins
  2013-05-04 18:36   ` Loic Dachary
  0 siblings, 1 reply; 7+ messages in thread
From: Noah Watkins @ 2013-05-04 18:27 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On May 4, 2013, at 10:16 AM, Loic Dachary <loic@dachary.org> wrote:

> it would be great to get feedback before the ceph summit to address the most prominent issues.

One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file raid in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap for the (non-client driven) rebuild/recovery component of such an architecture.

-Noah


* Re: Erasure encoding as a storage backend
  2013-05-04 18:27 ` Noah Watkins
@ 2013-05-04 18:36   ` Loic Dachary
  2013-05-04 18:47     ` Noah Watkins
  0 siblings, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2013-05-04 18:36 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Ceph Development



On 05/04/2013 08:27 PM, Noah Watkins wrote:
> 
> On May 4, 2013, at 10:16 AM, Loic Dachary <loic@dachary.org> wrote:
> 
>> it would be great to get feedback before the ceph summit to address the most prominent issues.
> 
> One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file raid in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap for the (non-client driven) rebuild/recovery component of such an architecture.

Hi Noah,

I'm not sure what declustered per-file raid is, which means it had no influence on this proposal ;-) Would you be so kind as to educate me ?

Cheers

> -Noah
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure encoding as a storage backend
  2013-05-04 18:36   ` Loic Dachary
@ 2013-05-04 18:47     ` Noah Watkins
  2013-05-04 19:26       ` Loic Dachary
  2013-05-05  4:51       ` Gregory Farnum
  0 siblings, 2 replies; 7+ messages in thread
From: Noah Watkins @ 2013-05-04 18:47 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On May 4, 2013, at 11:36 AM, Loic Dachary <loic@dachary.org> wrote:

> 
> 
> On 05/04/2013 08:27 PM, Noah Watkins wrote:
>> 
>> On May 4, 2013, at 10:16 AM, Loic Dachary <loic@dachary.org> wrote:
>> 
>>> it would be great to get feedback before the ceph summit to address the most prominent issues.
>> 
>> One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file raid in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap for the (non-client driven) rebuild/recovery component of such an architecture.
> 
> Hi Noah,
> 
> I'm not sure what declustered per-file raid is, which means it had no influence on this proposal ;-) Would you be so kind as to educate me ?

I'm definitely far from an expert on the topic. But briefly the way I think about it is:

Currently CephFS stripes a file byte stream across a set of objects (e.g. first MB in object 0, second MB in object 1, etc.), and each of these objects is in turn replicated. Following a failure, PGs re-replicate objects.
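
As a rough Python illustration of that mapping, assuming the simple layout of the example (1 MB objects filled one after the other, no round-robin striping), a byte range of the file maps to objects as below. file_extents is a hypothetical helper, not a CephFS API.

```python
from typing import List, Tuple

MB = 1 << 20


def file_extents(offset: int, length: int,
                 object_size: int = MB) -> List[Tuple[int, int, int]]:
    """Map a file byte range to (object number, offset in object, length) extents."""
    extents = []
    end = offset + length
    while offset < end:
        obj, within = divmod(offset, object_size)
        # Stop at the end of the current object or at the end of the range.
        chunk = min(object_size - within, end - offset)
        extents.append((obj, within, chunk))
        offset += chunk
    return extents
```

A 20-byte range starting 6 bytes before the 1 MB boundary therefore touches object 0 and object 1, and each of those objects is replicated on its own.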

In client-driven RAID the striping algorithm is changed, and clients calculate and distribute parity. In this case parity, rather than replication, provides redundancy. So one might consider storing objects in a pool with replication size 1. However, the standard PG that does replication wouldn't be able to handle faults correctly (parity rebuild rather than re-replication), and a smart PG like the ErasureCodedPG would be needed.

So it seems like the problems are related, but I'm not sure exactly how much overlap there is :)

-Noah

> Cheers
> 
>> -Noah

* Re: Erasure encoding as a storage backend
  2013-05-04 18:47     ` Noah Watkins
@ 2013-05-04 19:26       ` Loic Dachary
  2013-05-05  4:51       ` Gregory Farnum
  1 sibling, 0 replies; 7+ messages in thread
From: Loic Dachary @ 2013-05-04 19:26 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Ceph Development



On 05/04/2013 08:47 PM, Noah Watkins wrote:
> 
> On May 4, 2013, at 11:36 AM, Loic Dachary <loic@dachary.org> wrote:
> 
>>
>>
>> On 05/04/2013 08:27 PM, Noah Watkins wrote:
>>>
>>> On May 4, 2013, at 10:16 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>> it would be great to get feedback before the ceph summit to address the most prominent issues.
>>>
>>> One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file raid in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap for the (non-client driven) rebuild/recovery component of such an architecture.
>>
>> Hi Noah,
>>
>> I'm not sure what declustered per-file raid is, which means it had no influence on this proposal ;-) Would you be so kind as to educate me ?
> 
> I'm definitely far from an expert on the topic. But briefly the way I think about it is:
> 
> Currently CephFS stripes a file byte stream across a set of objects (e.g. first MB in object 0, second MB in object 1, etc.), and each of these objects is in turn replicated. Following a failure, PGs re-replicate objects.
> 
> In client-driven RAID the striping algorithm is changed, and clients calculate and distribute parity. In this case parity, rather than replication, provides redundancy. So one might consider storing objects in a pool with replication size 1. However, the standard PG that does replication wouldn't be able to handle faults correctly (parity rebuild rather than re-replication), and a smart PG like the ErasureCodedPG would be needed.
> 
> So it seems like the problems are related, but I'm not sure exactly how much overlap there is :)

Do you refer to http://ceph.com/docs/master/architecture/#how-ceph-clients-stripe-data when talking about client-driven RAID? My understanding is that it is designed to maximize throughput. This is done in the client library ( gateway, rbd or cephfs ). Since erasure encoding is about recovering from failures and would be implemented in libosd ( next to ReplicatedPG ), I am under the impression that there is no overlap.

What do you think ?

> 
> -Noah
> 
> 
>> Cheers
>>
>>> -Noah

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.




* Re: Erasure encoding as a storage backend
  2013-05-04 18:47     ` Noah Watkins
  2013-05-04 19:26       ` Loic Dachary
@ 2013-05-05  4:51       ` Gregory Farnum
  2013-05-05 14:51         ` Noah Watkins
  1 sibling, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2013-05-05  4:51 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Loic Dachary, Ceph Development

On Sat, May 4, 2013 at 11:47 AM, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
>
> On May 4, 2013, at 11:36 AM, Loic Dachary <loic@dachary.org> wrote:
>
>>
>>
>> On 05/04/2013 08:27 PM, Noah Watkins wrote:
>>>
>>> On May 4, 2013, at 10:16 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>> it would be great to get feedback before the ceph summit to address the most prominent issues.
>>>
>>> One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file raid in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap for the (non-client driven) rebuild/recovery component of such an architecture.
>>
>> Hi Noah,
>>
>> I'm not sure what declustered per-file raid is, which means it had no influence on this proposal ;-) Would you be so kind as to educate me ?
>
> I'm definitely far from an expert on the topic. But briefly the way I think about it is:
>
> Currently CephFS stripes a file byte stream across a set of objects (e.g. first MB in object 0, second MB in object 1, etc.), and each of these objects is in turn replicated. Following a failure, PGs re-replicate objects.
>
> In client-driven RAID the striping algorithm is changed, and clients calculate and distribute parity. In this case parity, rather than replication, provides redundancy. So one might consider storing objects in a pool with replication size 1. However, the standard PG that does replication wouldn't be able to handle faults correctly (parity rebuild rather than re-replication), and a smart PG like the ErasureCodedPG would be needed.
>
> So it seems like the problems are related, but I'm not sure exactly how much overlap there is :)

I'm pretty sure we'd just want to use erasure-coded RADOS pools,
rather than trying to do any CephFS magic erasure encoding. Doing it
above the RADOS layers would introduce some very odd behaviors in
terms of losing objects, as you've mentioned, and would require the
clients to generate a lot more network traffic for reads and writes.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: Erasure encoding as a storage backend
  2013-05-05  4:51       ` Gregory Farnum
@ 2013-05-05 14:51         ` Noah Watkins
  0 siblings, 0 replies; 7+ messages in thread
From: Noah Watkins @ 2013-05-05 14:51 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Noah Watkins, Loic Dachary, Ceph Development


On May 4, 2013, at 9:51 PM, Gregory Farnum <greg@inktank.com> wrote:

> I'm pretty sure we'd just want to use erasure-coded RADOS pools,
> rather than trying to do any CephFS magic erasure encoding. Doing it
> above the RADOS layers would introduce some very odd behaviors in
> terms of losing objects, as you've mentioned, and would require the
> clients to generate a lot more network traffic for reads and writes.

Cool. I was just thinking of some setups I've heard of in HPC environments where the extra client work was ostensibly worth it in terms of reducing disk heads, or something :)

-Noah
