From: Loic Dachary <loic@dachary.org>
To: Samuel Just <sam.just@inktank.com>
Cc: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: PG Backend Proposal
Date: Fri, 02 Aug 2013 01:54:32 +0200
Message-ID: <51FAF538.1020609@dachary.org>
In-Reply-To: <51FA9783.4000206@dachary.org>

Hi Sam,

I'm under the impression that
https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.

The chunk rank does not need to match the OSD position in the acting set. As long as each object chunk is stored with its rank in an attribute, changing the order of the acting set does not require moving the chunks around.

With M=2, K=1 and an acting set of [0,1,2], chunks M0, M1 and K0 are written to OSDs [0,1,2] respectively, and each of them has its 'erasure_code_rank' attribute set to its rank.

If the acting set changes to [2,1,0], the read would reorder the chunks based on their 'erasure_code_rank' attribute instead of the rank, in the current acting set, of the OSD they originate from. It would then be able to decode them with the erasure code library, which requires that the chunks are provided in a specific order.
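
A minimal Python sketch of that reordering, under stated assumptions: `StoredChunk`, `order_chunks_by_rank` and the in-memory dictionary are hypothetical stand-ins for whatever the OSD would actually read, and only the 'erasure_code_rank' attribute name comes from the proposal above.

```python
from dataclasses import dataclass

RANK_ATTR = "erasure_code_rank"  # attribute name used in the proposal


@dataclass
class StoredChunk:
    """Hypothetical stand-in for a chunk as read back from one OSD."""
    data: bytes
    attrs: dict


def order_chunks_by_rank(chunks_by_osd, acting):
    """Return chunk payloads sorted by their stored rank.

    The position of each OSD in `acting` is deliberately ignored:
    only the rank attribute stored with the chunk decides the order.
    """
    chunks = [chunks_by_osd[osd] for osd in acting]
    chunks.sort(key=lambda c: c.attrs[RANK_ATTR])
    return [c.data for c in chunks]


# Chunks written while the acting set was [0, 1, 2]: M0, M1, K0.
stored = {
    0: StoredChunk(b"M0", {RANK_ATTR: 0}),
    1: StoredChunk(b"M1", {RANK_ATTR: 1}),
    2: StoredChunk(b"K0", {RANK_ATTR: 2}),
}

# The acting set has since changed to [2, 1, 0]; the decode input order
# still follows the stored ranks.
ordered = order_chunks_by_rank(stored, acting=[2, 1, 0])
```

The result is the list a decoder would expect (M0, M1, K0), whatever the order of the acting set.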

When doing a full write, the chunks are written in the same order as the acting set. This implies that the order of the chunks may differ from that of the previous version of the object, but I don't see a problem with that.

When doing an append, the primary must first find out in which order the chunks are stored by retrieving their 'erasure_code_rank' attribute, because the order of the acting set is not necessarily the same as the order of the chunks. It then maps the new chunks to the OSDs holding the matching rank and pushes them to those OSDs.
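
The append mapping could look roughly like the Python sketch below. It assumes the primary has already collected the rank stored on each OSD; `map_append_to_osds` and both of its arguments are hypothetical names, not part of any existing Ceph interface.

```python
def map_append_to_osds(rank_by_osd, new_chunks):
    """Decide which appended chunk goes to which OSD.

    rank_by_osd: {osd_id: rank read from that OSD's chunk attribute}
    new_chunks:  appended chunk payloads, indexed by rank
    Returns {osd_id: payload to append on that OSD}.
    """
    osd_by_rank = {rank: osd for osd, rank in rank_by_osd.items()}
    # Every rank must be held by exactly one OSD, otherwise the append
    # cannot be routed safely.
    if sorted(osd_by_rank) != list(range(len(new_chunks))):
        raise ValueError("stored ranks do not cover every chunk exactly once")
    return {osd_by_rank[rank]: payload
            for rank, payload in enumerate(new_chunks)}


# The acting set is now [2, 1, 0], but the chunks were written while it
# was [0, 1, 2], so OSD 0 still holds rank 0 and OSD 2 holds rank 2.
rank_by_osd = {2: 2, 1: 1, 0: 0}
pushes = map_append_to_osds(rank_by_osd, [b"m0+", b"m1+", b"k0+"])
```

Here the appended piece of M0 goes to OSD 0 even though OSD 0 is now last in the acting set.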

The only downside is that it may complicate optimizations based on the fact that, sometimes, chunks can just be concatenated to recover the content of the object and don't need to be decoded ( when using systematic codes and the M data chunks are available ).
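
For illustration, here is a Python sketch of how such a fast path might still work with rank-tagged chunks. It assumes a systematic code in which ranks 0..M-1 are the data chunks and ignores any padding; `try_fast_read` is a hypothetical helper.

```python
def try_fast_read(chunks_by_rank, m):
    """Concatenate the data chunks if all of them are available.

    chunks_by_rank: {rank: payload} for the chunks that could be read
    m: number of data chunks (ranks 0..m-1 with a systematic code)
    Returns the concatenated data, or None when a data chunk is
    missing and the caller has to fall back to a real decode.
    """
    if all(rank in chunks_by_rank for rank in range(m)):
        return b"".join(chunks_by_rank[rank] for rank in range(m))
    return None


# Both data chunks present: no decoding needed.
fast = try_fast_read({0: b"ab", 1: b"cd", 2: b"xx"}, m=2)

# Data chunk 1 missing: the coding chunk must be used to decode.
slow = try_fast_read({0: b"ab", 2: b"xx"}, m=2)
```

The extra cost compared to a fixed rank-to-position mapping is the lookup of each chunk's rank before concatenating.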

Cheers

On 01/08/2013 19:14, Loic Dachary wrote:
> 
> 
> On 01/08/2013 18:42, Loic Dachary wrote:
>> Hi Sam,
>>
>> When the acting set changes order, two chunks for the same object may co-exist in the same placement group. The key should therefore also contain the chunk number.
>>
>> That's probably the most sensible comment I have so far. This document is immensely useful (even in its current state) because it shows me your perspective on the implementation. 
>>
>> I'm puzzled by:
> 
> I get it ( thanks to yanzheng ). The object is deleted, then created again ... spurious non-versioned chunks would get in the way.
> 
> :-)
> 
>>
>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object.
>>
>> because I don't understand why the version would be necessary. I thought that deleting an erasure coded object could be even easier than deleting a replicated object because it cannot be resurrected if enough chunks are lost; therefore you don't need to wait for an ack from all OSDs in the up set. I'm obviously missing something.
>>
>> I failed to understand how important the pg logs were to maintaining the consistency of the PG. For some reason I thought about them only as a lightweight version of the operation logs. Adding a payload to the pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would never have thought, or dared to think, that the logs could be extended in such a way. Given the recent problems with log writes having a high impact on performance ( I'm referring to what forced you to introduce code that writes only the log entries that have changed instead of the complete logs ) I thought about the pg logs as something immutable.
>>
>> I'm still trying to figure out how PGBackend::perform_write / read / try_rollback would fit in the current backfilling / write / read / scrubbing ... code path. 
>>
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

