* Erasure coding implementation : high level description
@ 2013-06-25 17:41 Loic Dachary
2013-06-29 16:56 ` Loic Dachary
0 siblings, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2013-06-25 17:41 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]
Hi Sage,
Paraphrasing what you suggested today :
The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. To avoid ending up with chunks that are partially for one version of the object and other chunks partially for another version of the object and we can't repair any of them.
I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Erasure coding implementation : high level description
2013-06-25 17:41 Erasure coding implementation : high level description Loic Dachary
@ 2013-06-29 16:56 ` Loic Dachary
2013-07-01 21:45 ` Loic Dachary
0 siblings, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2013-06-29 16:56 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 2173 bytes --]
Hi Sage,
The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningfull example is probably the logic associated with
struct AccessMode {
https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
Cheers
On 06/25/2013 07:41 PM, Loic Dachary wrote:
> Hi Sage,
>
> Paraphrasing what you suggested today :
>
> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. To avoid ending up with chunks that are partially for one version of the object and other chunks partially for another version of the object and we can't repair any of them.
>
> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>
> Cheers
>
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Erasure coding implementation : high level description
2013-06-29 16:56 ` Loic Dachary
@ 2013-07-01 21:45 ` Loic Dachary
2013-07-02 3:52 ` Sage Weil
0 siblings, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2013-07-01 21:45 UTC (permalink / raw)
To: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 3656 bytes --]
For the record,
Sam suggested today that the chunks of a stripe ( an object if we limit ourselves to full writes ) are written without deleting the chunks from a previous version of the object. i.e. for instance
object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks "ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
object A1 is updated to "ABCDEF123" => version 2 of the object is written as chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 respectively.
At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and "123" ( chunk 3 object A1 version 2 ).
When the PG receives an update of last_complete ( which should happen when the PG becomes active ) it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overrides the previous one. It could be done together with pg_log trimming but it would waste more disk space because the default log size it by default 3000 meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.
The object name does not currently contain the version number and this would need to be changed to avoid name clashes.
Cheers
On 29/06/2013 18:56, Loic Dachary wrote:
> Hi Sage,
>
> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningfull example is probably the logic associated with
>
> struct AccessMode {
>
> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
>
> I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
>
> The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
>
> Cheers
>
> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>> Hi Sage,
>>
>> Paraphrasing what you suggested today :
>>
>> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. To avoid ending up with chunks that are partially for one version of the object and other chunks partially for another version of the object and we can't repair any of them.
>>
>> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>
>> Cheers
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Erasure coding implementation : high level description
2013-07-01 21:45 ` Loic Dachary
@ 2013-07-02 3:52 ` Sage Weil
2013-07-05 11:56 ` Loic Dachary
0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2013-07-02 3:52 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development
We had a chat about this this afternoon and another idea came up: support
only object append in full-stripe writes. The primary would log the write
(as per usual), along with the write offset and length. Each shard
processes its piece and extends the file. If there is a failure and the pg
log gets rolled back, we simply truncate off the incompletely
written/committed stripe from each shard.
This is more limited (no overwrites yet, append-only) but captures the
most important use-cases, it's super simple, and it's efficient. It's
also simple enough that I don't think it commits us in any particular
direction if/when we later want to do per-stripe overwrites.
One thing it does bring up, though is how the stripe size is determined.
I suggest that it is specified by the writer on object creation (since the
writer is responsible for writing in stripe-aligned chunks) and is
recorded as immutable per-object metadata. Maybe there is a per-pool
property to inform clients, but that is mostly just policy...
sage
On Mon, 1 Jul 2013, Loic Dachary wrote:
> For the record,
>
> Sam suggested today that the chunks of a stripe ( an object if we limit ourselves to full writes ) are written without deleting the chunks from a previous version of the object. i.e. for instance
>
> object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks "ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
> object A1 is updated to "ABCDEF123" => version 2 of the object is written as chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 respectively.
>
> At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and "123" ( chunk 3 object A1 version 2 ).
>
> When the PG receives an update of last_complete ( which should happen when the PG becomes active ) it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overrides the previous one. It could be done together with pg_log trimming but it would waste more disk space because the default log size it by default 3000 meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.
>
> The object name does not currently contain the version number and this would need to be changed to avoid name clashes.
>
> Cheers
>
> On 29/06/2013 18:56, Loic Dachary wrote:
> > Hi Sage,
> >
> > The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningfull example is probably the logic associated with
> >
> > struct AccessMode {
> >
> > https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
> >
> > I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
> >
> > The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
> >
> > Cheers
> >
> > On 06/25/2013 07:41 PM, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> Paraphrasing what you suggested today :
> >>
> >> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. To avoid ending up with chunks that are partially for one version of the object and other chunks partially for another version of the object and we can't repair any of them.
> >>
> >> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
> >>
> >> Cheers
> >>
> >
>
> --
> Lo?c Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Erasure coding implementation : high level description
2013-07-02 3:52 ` Sage Weil
@ 2013-07-05 11:56 ` Loic Dachary
0 siblings, 0 replies; 5+ messages in thread
From: Loic Dachary @ 2013-07-05 11:56 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 5305 bytes --]
Hi Sage,
I wrote down my understanding of what you suggest in the "Interrupted append" part of
https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
did I miss something ?
Cheers
On 02/07/2013 05:52, Sage Weil wrote:
> We had a chat about this this afternoon and another idea came up: support
> only object append in full-stripe writes. The primary would log the write
> (as per usual), along with the write offset and length. Each shard
> processes its piece and extends the file. If there is a failure and the pg
> log gets rolled back, we simply truncate off the incompletely
> written/committed stripe from each shard.
>
> This is more limited (no overwrites yet, append-only) but captures the
> most important use-cases, it's super simple, and it's efficient. It's
> also simple enough that I don't think it commits us in any particular
> direction if/when we later want to do per-stripe overwrites.
>
> One thing it does bring up, though is how the stripe size is determined.
> I suggest that it is specified by the writer on object creation (since the
> writer is responsible for writing in stripe-aligned chunks) and is
> recorded as immutable per-object metadata. Maybe there is a per-pool
> property to inform clients, but that is mostly just policy...
>
> sage
>
>
>
>
> On Mon, 1 Jul 2013, Loic Dachary wrote:
>
>> For the record,
>>
>> Sam suggested today that the chunks of a stripe ( an object if we limit ourselves to full writes ) are written without deleting the chunks from a previous version of the object. i.e. for instance
>>
>> object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks "ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
>> object A1 is updated to "ABCDEF123" => version 2 of the object is written as chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 respectively.
>>
>> At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and "123" ( chunk 3 object A1 version 2 ).
>>
>> When the PG receives an update of last_complete ( which should happen when the PG becomes active ) it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overrides the previous one. It could be done together with pg_log trimming but it would waste more disk space because the default log size it by default 3000 meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.
>>
>> The object name does not currently contain the version number and this would need to be changed to avoid name clashes.
>>
>> Cheers
>>
>> On 29/06/2013 18:56, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningfull example is probably the logic associated with
>>>
>>> struct AccessMode {
>>>
>>> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
>>>
>>> I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
>>>
>>> The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
>>>
>>> Cheers
>>>
>>> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> Paraphrasing what you suggested today :
>>>>
>>>> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. To avoid ending up with chunks that are partially for one version of the object and other chunks partially for another version of the object and we can't repair any of them.
>>>>
>>>> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>>>
>>>> Cheers
>>>>
>>>
>>
>> --
>> Lo?c Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-07-05 11:56 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-25 17:41 Erasure coding implementation : high level description Loic Dachary
2013-06-29 16:56 ` Loic Dachary
2013-07-01 21:45 ` Loic Dachary
2013-07-02 3:52 ` Sage Weil
2013-07-05 11:56 ` Loic Dachary
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.