From: Loic Dachary <loic@dachary.org>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: Erasure coding implementation : high level description
Date: Mon, 01 Jul 2013 23:45:25 +0200
Message-ID: <51D1F875.3080002@dachary.org>
In-Reply-To: <51CF11A8.2070208@dachary.org>
For the record,
Sam suggested today that the chunks of a stripe ( an object if we limit ourselves to full writes ) be written without deleting the chunks from the previous version of the object. For instance:
object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks "ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
object A1 is updated to "ABCDEF123" => version 2 of the object is written as chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 respectively.
At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and "123" ( chunk 3 object A1 version 2 ).
When the PG receives an update of last_complete ( which should happen when the PG becomes active ) it knows that all objects with a version lower than last_complete can be discarded. It can then trim the objects stored on the OSD that have a version older than last_complete. With ReplicatedPG this does not need to be done because the new version of the object overwrites the previous one. The trimming could be done together with pg_log trimming, but that would waste more disk space: the default log size is 3000, meaning a chunk would only be deleted from disk after 3000 pg_log_entry were added to pg_log.
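To make the idea concrete, here is a rough Python sketch of one OSD keeping several versions of a chunk side by side and trimming them when last_complete advances. It is a toy model, not Ceph code: ChunkStore, write and trim are hypothetical names, and I am assuming that the newest version at or below last_complete must be kept, so only versions superseded by it are removed.

```python
from collections import defaultdict


class ChunkStore:
    """Toy model of one OSD holding versioned chunks (hypothetical, not Ceph code)."""

    def __init__(self):
        # (object name, chunk index) -> {version: payload}
        self.chunks = defaultdict(dict)

    def write(self, name, index, version, payload):
        # A new version is stored next to the old one; nothing is overwritten.
        self.chunks[(name, index)][version] = payload

    def trim(self, last_complete):
        """Remove versions superseded by a newer version <= last_complete.

        Assumption: the newest version at or below last_complete is kept,
        and versions above last_complete are left alone.
        """
        removed = []
        for key, versions in self.chunks.items():
            complete = [v for v in versions if v <= last_complete]
            if not complete:
                continue
            keep = max(complete)
            for v in sorted(complete):
                if v < keep:
                    removed.append((key, v, versions.pop(v)))
        return removed


# OSD3 from the example above: chunk 3 of object A1, versions 1 and 2.
osd3 = ChunkStore()
osd3.write("A1", 3, 1, "GHI")
osd3.write("A1", 3, 2, "123")
print(osd3.trim(last_complete=2))
print(dict(osd3.chunks))
```

Until trim is called with a last_complete that covers version 2, both "GHI" and "123" stay on the OSD, which is the situation described for OSD3.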
The object name does not currently contain the version number and this would need to be changed to avoid name clashes.
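A small sketch of why the version has to be part of the name, assuming a flat namespace keyed by name only. The chunk_name helper and its <object>_v<version>_c<chunk> layout are made up for illustration and are not the naming Ceph uses.

```python
def chunk_name(obj, version, chunk):
    # Hypothetical layout, for illustration only: <object>_v<version>_c<chunk>
    return f"{obj}_v{version}_c{chunk}"


# Flat store keyed by name, standing in for the objects on OSD3.
unversioned = {}
unversioned["A1_c3"] = "GHI"
unversioned["A1_c3"] = "123"  # same name: version 1 is lost

versioned = {}
versioned[chunk_name("A1", 1, 3)] = "GHI"
versioned[chunk_name("A1", 2, 3)] = "123"  # distinct names: both versions coexist

print(sorted(unversioned))
print(sorted(versioned))
```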
Cheers
On 29/06/2013 18:56, Loic Dachary wrote:
> Hi Sage,
>
> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path for implementing the erasure coding is beyond me at the moment. A few hours of browsing demonstrated that a number of important areas are still unknown to me. A meaningful example is probably the logic associated with
>
> struct AccessMode {
>
> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
>
> I suspect there are a number of similarities with the erasure code that would be relevant to ensure that a stripe is fully written to disk ( i.e. in relation with the "ondisk" acknowledgment probably ) before removing the previous version of the same stripe from all OSDs supporting it.
>
> The time spent during this exploration was not wasted, I learnt a few things that will be useful :-) But I think it would be more useful for me to work on a more modest task to move in the direction of the erasure coding implementation.
>
> Cheers
>
> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>> Hi Sage,
>>
>> Paraphrasing what you suggested today :
>>
>> The logic for writing a stripe ( i.e. all the chunks created by the erasure encoding function for a given object or part of a given object if it exceeds the maximum size of a stripe ) for a single object is going to be done in a way that is not the same as what we currently have for replicated objects. The object is consistent when all chunks ( or at least K if K+M ) are committed to disk. It may make sense to start writing all the chunks in parallel and when they are acknowledged, send a pg_log event that says : now switch to this new version of the object. This avoids ending up with some chunks belonging to one version of the object and other chunks belonging to another version, in which case we could not repair any of them.
>>
>> I will try to sketch the path for implementing the erasure coding ( including the above ) by adding to https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>
>> Cheers
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.